Friday, July 18, 2014

False GPU overheating warning on ESXi 5.5

Issue :
There are 2 clusters with 10 hosts each, for a total of 20 ESXi 5.5 hosts on BL460c Gen8 blades in c7000 enclosures.
All of them are showing a temperature warning for “add in card 10 35 gpu 2” under hardware status.
There are no graphics cards in any of these hosts.

What worked :
Clear the IPMI system event log (SEL) from the command line:
localcli hardware ipmi sel clear
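With 20 hosts affected, the same command can be pushed to every host in a loop. The following is only a rough sketch, assuming SSH is enabled on the hosts and root login is allowed; the host names are placeholders, and the DRY_RUN switch and function names are made up for illustration. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch: clear the IPMI SEL on several ESXi hosts over SSH.
# Assumptions: SSH is enabled on each host and root login is permitted.
# The host names below are hypothetical placeholders.
HOSTS="${HOSTS:-esx01.example.com esx02.example.com}"

# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Clear the SEL on a single host, or print the command in dry-run mode.
clear_sel() {
    host="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh root@$host localcli hardware ipmi sel clear"
    else
        ssh "root@$host" localcli hardware ipmi sel clear
    fi
}

# Walk the host list one host at a time.
clear_all() {
    for host in $HOSTS; do
        clear_sel "$host"
    done
}

clear_all
```

Run it once as is to review the printed commands, then run it again with DRY_RUN=0 and the real host list in HOSTS.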

What didn't work :
Since it is happening on 20 hosts at the same time, and neither the VMware OS nor the hardware can go bad on all the servers at once, it must be a false positive.
/etc/init.d/hp-ams.sh stop
Disconnected and reconnected the host and refreshed the hardware status page, but no go.
Connected directly to one of the hosts via the vSphere client, but no go.
https://communities.vmware.com/message/2379288
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2076665
The suggestion was to install VMware ESXi 5.1u1 and then the Heartbleed bug fix as per the KB article, and then monitor the hosts for the error.
The error does not show up immediately, so the hosts have to be watched for a while.

Update 24/7/2014
Apparently the fix was temporary, and we are still chasing the cause and a solution.