Thursday, June 12, 2014

VMware HA agent unreachable :(

Issue: one of the hosts shows an HA error message:
"The vSphere HA agent on the host cannot be reached.
This condition indicates that
1) a situation exists which is preventing the agent on the host from running or exiting the uninitialized state, or
2) vCenter Server is unable to connect to any of the agents running on the cluster hosts due to a networking failure or total cluster failure."

What really worked:
Disabled HA on the cluster.
Restarted all hosts in the cluster (one by one, after moving all the VMs off each host).
Removed the hosts from the cluster.
Enabled HA on the cluster and made sure SSL certificate checking was enabled.
Added the hosts back to the cluster.

What should have worked:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1019200
All hosts in the cluster have the same management network configuration.

It is a new installation (3 weeks old) and it has not worked properly since it was set up.
Forward and reverse nslookup work from the vCenter Server to the hosts.
Used telnet to confirm that port 902 is open from the vCenter Server to the ESXi hosts.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1001596
http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1003735
Updated the vCenter IP under Runtime Settings and reconnected the host, but the operation timed out.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1001493
The vpxa.cfg has the right IP addresses.
NTP and time sync are fine.
There are no advanced configurations set for HA.
Tried http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2011974 but no go.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2008609
The error message "[ClusterManagerImpl::IsBadIP] x.x.x.x is bad ip" shows up in /var/log/fdm.log on the ESXi hosts.
http://tech.zsoldier.com/2012/06/esxi-hosts-timing-out-during-ha-cluster.html
The VM and management networks are all on the same VLAN and there is no firewall configured between the hosts.
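
The nslookup and telnet checks above can be repeated for every host with a small script. This is only a minimal sketch: the function names are my own, it tests TCP only (just like telnet), and any host names you feed it are your own, nothing here comes from the KB articles.

```python
import socket


def check_dns(hostname):
    """Forward-resolve a host name, then reverse-resolve the resulting IP."""
    result = {"hostname": hostname, "ip": None, "reverse": None, "error": None}
    try:
        result["ip"] = socket.gethostbyname(hostname)
        # gethostbyaddr returns (primary_name, alias_list, ip_list)
        result["reverse"] = socket.gethostbyaddr(result["ip"])[0]
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        result["error"] = str(exc)
    return result


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

Run from the vCenter Server, something like check_dns("esx01.example.local") followed by port_open("esx01.example.local", 902) and port_open("esx01.example.local", 8182) covers the same ground as the manual checks (esx01.example.local is a made-up name).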

hostd.log entries
"http transaction failed on stream tcp (error:transport endpoint is not connected) with error n7vmacore15systemexceptione(connection reset by peer)"

fdm log entries
2014-06-09T13:50:32.006Z [FFEB9B90 verbose 'Cluster' opID=SWI-6058ed8] [ClusterManagerImpl::IsBadIP] x.x.x.x is bad ip.

Found SSL related errors:
2014-06-09T13:51:23.069Z [6DD59B90 error 'Message' opID=SWI-29e297b3] [MsgConnectionImpl::FinishSSLConnect] Error N7Vmacore16TimeoutExceptionE(Operation timed out) on handshake
2014-06-09T13:51:24.842Z [6DD18B90 error 'Message' opID=SWI-5992c13d] [MsgConnectionImpl::FinishSSLConnect] Error N7Vmacore16TimeoutExceptionE(Operation timed out) on handshake
2014-06-09T13:51:42.841Z [6DE1CB90 error 'Message' opID=SWI-2f2d0b51] [MsgConnectionImpl::FinishSSLConnect] Error N7Vmacore16TimeoutExceptionE(Operation timed out) on handshake
2014-06-09T13:51:43.071Z [6DD59B90 error 'Message' opID=SWI-29e297b3] [AcceptorImpl::FinishSSLAccept] Error N7Vmacore16TimeoutExceptionE(Operation timed out) creating ssl stream or doing handshake

2014-06-09T14:03:58.959Z [6DCD7B90 info 'Cluster' opID=SWI-4b7216e3] [ClusterManagerImpl::VerifyHost] Untrusted thumbprint (02:2D:63:09:48:E3:D8:7F:94:C1:7A:FB:11:12:B7:C7:EB:F5:20:3F) for host 10.1.100.233 - failing verify
2014-06-09T14:04:59.032Z [6DD18B90 info 'Cluster' opID=SWI-18eb3cb4] [ClusterManagerImpl::VerifyHost] Untrusted thumbprint (02:2D:63:09:48:E3:D8:7F:94:C1:7A:FB:11:12:B7:C7:EB:F5:20:3F) for host 10.1.100.233 - failing verify
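
The thumbprint in these entries is 20 bytes long, which is consistent with a SHA-1 digest of the host certificate. To compare it with what a host actually presents, a rough sketch like the one below may help. It assumes the certificate served on port 443 is the same one the HA agent uses, which I have not confirmed, and the function names are my own.

```python
import hashlib
import ssl


def sha1_thumbprint(der_bytes):
    """Format the SHA-1 digest of a DER-encoded certificate as AA:BB:CC:..."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def host_thumbprint(host, port=443):
    """Fetch the certificate a host presents (no validation) and fingerprint it."""
    # Assumption: port 443 serves the same certificate that the HA agent presents.
    pem = ssl.get_server_certificate((host, port))
    return sha1_thumbprint(ssl.PEM_cert_to_DER_cert(pem))
```

Calling host_thumbprint("10.1.100.233") and comparing the result with the value in fdm.log, and with the thumbprint vCenter has stored for that host, should show which side holds the stale certificate.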

2014-06-09T13:42:05.513Z [6DD9AB90 verbose 'HttpConnectionPool-000001'] [RemoveConnection] Connection removed; cnx: <SSL(<io_obj p:0x0d9062cc, h:-1, <TCP '0.0.0.0:0'>, <TCP '127.0.0.1:443'>>)>; pooled: 0
2014-06-09T13:24:30.312Z [FFC92B90 verbose 'HttpConnectionPool-000001'] [RemoveConnection] Connection removed; cnx: <SSL(<io_obj p:0x04d1117c, h:-1, <TCP '0.0.0.0:0'>, <TCP '127.0.0.1:443'>>)>; pooled: 0
2014-06-09T13:56:23.892Z [FFE15460 verbose 'HttpConnectionPool-000000'] [RemoveConnection] Connection removed; cnx: <SSL(<io_obj p:0x0d90316c, h:-1, <TCP '0.0.0.0:0'>, <TCP '127.0.0.1:443'>>)>; pooled: 2

2014-06-09T13:32:58.357Z [FFBEE460 error 'Message' opID=SWI-14a96433] [AcceptorImpl::FinishSSLAccept] Error N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:140000DB:SSL routines:SSL routines:short read) creating ssl stream or doing handshake --> * unable to get local issuer certificate) on handshake
2014-06-09T13:33:59.431Z [FFF5CB90 error 'Message' opID=SWI-77ccbfb7] [AcceptorImpl::FinishSSLAccept] Error N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:140000DB:SSL routines:SSL routines:short read) creating ssl stream or doing handshake

vpxd log:

During election:

2014-06-09T14:25:47.648+01:00 [05472 error 'DAS' opID=D428CBEC-00001580-9b-1d] [VpxdDasConfigLRO::Config] Timed out waiting for election to complete or for host to join existing master
2014-06-09T14:25:47.648+01:00 [05472 error 'DAS' opID=D428CBEC-00001580-9b-1d] [VpxdDasConfigLRO::Config] EnableDAS failed on host [vim.HostSystem:host-1476,uk-mal-esx-p05.dyson.global.corp]: class Vim::Fault::Timedout::Exception(vim.fault.Timedout)

FDM log:

2014-06-09T10:58:35.777Z [FFC63B90 error 'Cluster' opID=SWI-46c45c9d] [ClusterDatastore::AcquireTraditionalDatastore] open(/vmfs/volumes/5118d934-a159136a-43cd-d48564c61fed/.vSphere-HA/FDM-1D88A749-CC95-4D5C-BF5D-3CE3B8A5075D-73-603131e-UK-MAL-VC-P01/protectedlist) failed: Device or resource busy
2014-06-09T10:58:35.777Z [FFADEB90 error 'Cluster' opID=SWI-3bb36853] [ClusterDatastore::AcquireTraditionalDatastore] open(/vmfs/volumes/5118d96e-7feaf4e4-1c30-d48564c61fed/.vSphere-HA/FDM-1D88A749-CC95-4D5C-BF5D-3CE3B8A5075D-73-603131e-UK-MAL-VC-P01/protectedlist) failed: Device or resource busy
2014-06-09T10:59:05.819Z [FFD67B90 error 'Cluster' opID=SWI-6c77b0d1] [ClusterDatastore::AcquireTraditionalDatastore] open(/vmfs/volumes/5118d96e-7feaf4e4-1c30-d48564c61fed/.vSphere-HA/FDM-1D88A749-CC95-4D5C-BF5D-3CE3B8A5075D-73-603131e-UK-MAL-VC-P01/protectedlist) failed: Device or resource busy
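
With this many different errors spread over several logs, counting how often each one occurs helps to see which is the real problem. A minimal sketch, assuming the signatures below (copied from the excerpts in this post) are the ones worth counting; the category names are made up by me.

```python
import re
from collections import Counter

# Patterns taken from the log excerpts in this post; the category names are my own.
SIGNATURES = {
    "bad_ip": re.compile(r"\[ClusterManagerImpl::IsBadIP\]"),
    "ssl_handshake": re.compile(r"::FinishSSL(?:Connect|Accept)\]"),
    "untrusted_thumbprint": re.compile(r"Untrusted thumbprint"),
    "election_timeout": re.compile(r"Timed out waiting for election"),
    "datastore_busy": re.compile(
        r"AcquireTraditionalDatastore\].*Device or resource busy"
    ),
}


def classify(lines):
    """Count how many log lines match each known error signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

For example, opening a copy of /var/log/fdm.log and passing the file object to classify() returns a Counter with one total per category.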


http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2017233
Our action plan was to:
Review the SSL configuration and certificates in vCenter.
Disable the Denial-of-Service protection feature.
Review any security scans run against the ESXi hosts on the VMware HA agent port (port 8182).
Update the NIC adapter firmware to the latest version on all the hosts, since it was out of date.

We also did the following, but that did not work either:
1. Disable HA under Cluster settings.
2. Ensure that SSL Certificate Checking is enabled.

For vCenter Server 5.1 and later:
In the vSphere Web Client, navigate to the vCenter Server instance.
Click the Manage tab.
Under Settings, click General.
Click Edit and select SSL settings.

3. Select vCenter requires verified host SSL certificates. If there are hosts that require manual validation, these hosts appear in the host list at the bottom of the dialog.
4. Click OK.
5. Click OK. Hosts that you have not selected are now disconnected.
6. Reconnect the host to vCenter Server.
7. Enable HA under Cluster settings.

SSL certs have been validated – the certificates are valid and are issued from a template also used for ESX hosts which don’t have this issue.