Friday, July 18, 2014

False GPU overheating warning on ESXi 5.5

Issue :
There are two clusters with 10 hosts each, for a total of 20 ESXi 5.5 BL460c Gen8 blades in a c7000 enclosure.
All of them show a temperature warning for “add in card 10 35 gpu 2” under Hardware Status.
There are no graphics cards in any of the servers.

What worked :
Clear the IPMI system event log (SEL) from the ESXi shell:
localcli hardware ipmi sel clear
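Since the warning shows up on all 20 blades, the SEL clear has to be repeated on every host. A minimal sketch of looping the fix over SSH, assuming root SSH access is enabled; the host names are placeholders for your own, and DRY_RUN defaults to previewing rather than connecting:

```shell
#!/bin/sh
# Hypothetical host names; replace with your 20 blades.
HOSTS="esx01 esx02 esx03"
# Keep DRY_RUN=1 to preview; set DRY_RUN=0 to actually connect.
DRY_RUN="${DRY_RUN:-1}"
CMD="localcli hardware ipmi sel clear"

clear_sel_all() {
  for h in $HOSTS; do
    if [ "$DRY_RUN" = "1" ]; then
      # Preview mode: show what would be run where.
      echo "would run on $h: $CMD"
    else
      # Live mode: clear the SEL on the remote host.
      ssh "root@$h" "$CMD"
    fi
  done
}
clear_sel_all
```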

What didn't work :
Since this is happening on all 20 hosts at the same time, and neither the OS nor the hardware can fail on every server simultaneously, it must be a false positive.
/etc/init.d/ stop
Disconnected and reconnected the host and refreshed the Hardware Status page, but no go.
Connected directly to one of the hosts via the vSphere Client, but no go.
Suggested installing the VMware ESXi 5.1 U1 image and then the Heartbleed bug fix as per the KB article, then monitoring for the error.
The error is not instantaneous.

Update 24/7/2014:
Apparently the fix was only temporary, and it seems we are still chasing the cause and solution.

Thursday, July 17, 2014

Mixed SIOC environment causes DRS migrations to fail

Issue : The DRS migration kept failing for one VM, causing it to freeze and power off.

Why ?: Apparently a few of the hosts in the cluster didn't have the Enterprise (or higher) license required to enable Storage I/O Control (SIOC), and DRS kept trying to migrate the VM to a host that, unfortunately, had SIOC disabled for that reason.
So if the source host the VM is moving from has SIOC enabled but the target host doesn't support it, you can hit this problem. I have yet to see more such incidents before concluding that DRS is not SIOC-aware; maybe some giants should shed some light on it. Obviously they weren't using SIOC on all of the datastores, but you can see the conflict of interest when some hosts rely on SIOC to make decisions on the same datastores for which SIOC isn't enabled on other hosts of the same cluster.

Resolution: Upgrade the remaining hosts to the Enterprise license and enable SIOC on them too.

ESXi 5.x datastore browser shows no data

Issue: The hosts are slowly being upgraded from 4.1 to 5.1 (clean install), but the upgraded hosts don't see any data in the datastore browser.
If we browse from a datastore browser launched from the vSphere Client connected directly to a host, the data shows up for all the hosts.
If we browse from a datastore browser launched through the vCenter Server, the data doesn't show up for a few of the 5.1 hosts.
The storage is being migrated from a CLARiiON to an IBM Storwize V7000 (FC).

Resolution : Disable ATS on the storage.
Disable VMFS3.HardwareAcceleratedLocking in the advanced settings.
Disable VAAI too.
Some of the volumes were mounted in ATS-only mode, which older arrays do not support.
To change this, the datastore has to be taken out of production.
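For reference, a sketch of the advanced settings involved, run from the ESXi shell. ATS hardware-assisted locking is one option and the two VAAI data-mover primitives are separate options; a value of 0 disables and 1 re-enables. This assumes your array vendor agrees these should be off; check their guidance before changing anything in production. These commands only take effect on the host they are run on:

```shell
# Show the current ATS (hardware-assisted locking) setting.
esxcli system settings advanced list --option /VMFS3/HardwareAcceleratedLocking

# Disable ATS locking.
esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking

# Disable the VAAI data-mover primitives (full copy and block zeroing).
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
```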

What didn't work:
Rescanning all the hosts hasn't helped.
There are no special characters in the folder names; they are all alphanumeric.
/sbin/services.sh restart
Refreshed the storage and tried browsing the datastores, but no go.
Disconnected and reconnected the host and rescanned all, but no go.
The drivers/firmware are up to date.
Re-added the hosts, which works temporarily, but the issue comes back after a reboot.

HP Blades with ESXi 5.x fail to display hardware status!

Issue : When you go to the Hardware Status tab of an ESXi 5.x host, you get the error "hardware monitoring service on this host is not responding or not available".

Cause : iLO 2 firmware version 2.07.

Resolution: Upgrade the iLO 2 firmware to version 2.09/2.15/2.25 or higher.

Image: HP Custom Esxi 5.x

What didn't work:
It is not a DMZ site.
/etc/init.d/sfcbd-watchdog restart
The command returned an error:
< sh: bad number
sh: you need to specify whom to kill >
Reconnected the host and checked the hardware status, but got the error
"hardware status communication error with the server".
Tried restarting the sfcbd-watchdog process again, but the command hung.
Checked the firewall settings under Configuration (tab) -> Security Profile -> Firewall; according to the Firewall page, the CIM server service runs on both TCP ports 5988 and 5989.
Disabled Symantec Endpoint Protection, restarted the Inventory Service, and did an update/refresh of the Hardware Status tab, but no go.
The Hardware Status plugin 5.5 is installed and enabled in the vCenter Server,
but no go.
Tried restarting the CIM server, but it failed with an error saying the remote server took too long.
Tried stopping the CIM server, but it failed with the same error.
The issue seems to be with the servers rather than vCenter, since vCenter works fine with two other hosts.
Applied all the HP patches and updates to the host via Update Manager, but no go.
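Since the Firewall page lists the CIM server on TCP 5988/5989, a quick reachability check from any admin box can separate a blocked port from a hung sfcbd. A sketch using bash's /dev/tcp; the host name is a placeholder for the affected ESXi host:

```shell
#!/bin/bash
# Placeholder host; point this at the affected ESXi host.
HOST="${HOST:-127.0.0.1}"

check_cim_ports() {
  # 5988 is CIM over HTTP, 5989 is CIM over HTTPS.
  for p in 5988 5989; do
    if timeout 2 bash -c ">/dev/tcp/$HOST/$p" 2>/dev/null; then
      echo "port $p open on $HOST"
    else
      echo "port $p closed on $HOST"
    fi
  done
}
check_cim_ports
```

If both ports answer but the Hardware Status tab still errors out, the problem is more likely a wedged sfcbd than the firewall.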

Tuesday, July 15, 2014

Redhat P2V produces an unstable VMware VM

Issue : A customer did a P2V of a Red Hat 6.x server, and the VM had trouble booting. The boot process failed with "no fstab.sys, mounting internal default".

Resolution : Disable SELinux, and the P2V produces a nice virtual copy of its physical Red Hat counterpart.
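Disabling SELinux persistently means editing /etc/selinux/config on the source server so the setting survives the reboots the converter performs. A minimal sketch, run here against a throwaway copy of the file so nothing on the system is touched:

```shell
#!/bin/sh
# Work on a scratch copy; on the real source server edit /etc/selinux/config.
cfg=/tmp/selinux-config-demo
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$cfg"

# Flip enforcing/permissive to disabled.
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$cfg"
grep '^SELINUX=' "$cfg"

# On the live system you can also drop enforcement immediately (until reboot):
#   setenforce 0
```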

Tuesday, July 8, 2014

vCenter Datastore Browser is empty

Issue : When a client of ours upgraded their hosts from 4.1 to 5.1, they found that they couldn't see the data inside the IBM V7000 FC storage datastores through the datastore browser, but they could see the data when connecting directly to the ESXi host with the vSphere Client.

What worked : Remove and re-add the host to vCenter.

Why? : The fact that it works when connected directly to the host but not through vCenter hints that the vCenter agent (vpxa) installed on the host is not sending proper information to vCenter. We somehow needed to reinstall it, and the only way to do that was to re-add the host to the vCenter Server.

What didn't work (but might work for you):
/sbin/services.sh restart
The above command restarts all services on the host, including vpxa (the vCenter agent).
Disconnect and reconnect the host to the vCenter Server. This reconfigures the vpxa agent on the host.

BTW, someone else resolved the same issue by powering back on all the hosts that DPM had put into standby mode... really crazy!

Monday, July 7, 2014

ESXi error saying /tmp is full

Issue : You get an ESXi error saying /tmp is full, but if you clear it, it fills up again after some time.
Cause : The storage adapter's logs fill up the space.
In my client's case the adapter was QLogic.
We updated the QLogic driver/firmware and, boom, the log cleared itself and all is well now.
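Before clearing /tmp yet again, it helps to see which files are actually growing. A sketch of ranking entries by size, demonstrated here against a scratch directory so it runs anywhere; on the ESXi host, point du at /tmp itself:

```shell
#!/bin/sh
# Demo directory with one large and one small file standing in for /tmp.
demo=/tmp/tmpfull-demo
mkdir -p "$demo"
dd if=/dev/zero of="$demo/adapter.log" bs=1024 count=512 2>/dev/null
dd if=/dev/zero of="$demo/small.log"   bs=1024 count=4   2>/dev/null

# Largest entries first; on ESXi run: du -ak /tmp | sort -rn | head
du -ak "$demo" | sort -rn | head -5
```

The first line is the directory total; the lines after it name the biggest offenders, which in the case above would have pointed straight at the adapter log.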

Fedora 20 or RHEL 7 Can't mount ntfs drives

Issue : My Windows 8 & Fedora 20 dual-boot adventure went sour when Fedora 20 wasn't mounting the Windows 8 NTFS drives.
What worked : ntfsfix /dev/sdx1
(of course, you need ntfs-3g and ntfsprogs installed in your Fedora for this to work)
What didn't work : disabling fast startup in Windows and doing a clean shutdown.
Mounting the drives as read-only.
Note :
If you are using Fedora 20 or higher, I highly recommend installing the following repositories first,
and RPM Fusion too if the above are not serving your thirst.
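Once ntfsfix has cleared the dirty flag, the partition can be mounted read-write through ntfs-3g. A sketch, where /dev/sdx1 and the mount point are placeholders for your own layout (check lsblk -f to find the right partition):

```shell
# Placeholders: replace /dev/sdx1 with your actual NTFS partition.
sudo ntfsfix /dev/sdx1                        # clear the dirty flag left by Windows
sudo mkdir -p /mnt/windows
sudo mount -t ntfs-3g /dev/sdx1 /mnt/windows  # mount read-write via ntfs-3g
```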

Friday, July 4, 2014

Change the default multipathing policy to Round Robin on ESXi 5.x

This is something for me to look back on and check if I need it in the future: how to change the default multipathing policy on the host for all existing and future LUNs/datastores.
Find out the SATPs on your ESXi host:
esxcli storage nmp satp list

Set the default path selection policy (PSP) to Round Robin:
esxcli storage nmp satp set --default-psp VMW_PSP_RR --satp VMW_SATP_EQL

Set the default PSP to Fixed:
esxcli storage nmp satp set --default-psp VMW_PSP_FIXED --satp VMW_SATP_EQL

Set the default PSP to MRU (Most Recently Used):
esxcli storage nmp satp set --default-psp VMW_PSP_MRU --satp VMW_SATP_EQL
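Note that the satp set commands above only change the default for LUNs claimed by that SATP (VMW_SATP_EQL is the EqualLogic plugin; pick yours from the satp list output), and they apply to newly claimed devices. Existing devices can be switched individually; a sketch, where the naa device ID is a hypothetical placeholder:

```shell
# List devices with their current SATP and PSP to find the one to change.
esxcli storage nmp device list

# Override the PSP for one existing device (the naa ID below is a placeholder).
esxcli storage nmp device set --device naa.60a98000572d54724a346557 --psp VMW_PSP_RR
```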