This document provides a quick overview and troubleshooting steps that can be performed to identify the source of the problem when you see the "NFS all paths down" error message in the vCenter Server with which the HyperFlex cluster is integrated.
How are HX Datastores Mounted on ESXi?
HyperFlex datastores are mounted on the ESXi hosts as NFS mounts. In order to mount an NFS datastore, the host needs the NFS server IP, which in this case is the eth1:0 virtual floating interface.
The HyperFlex cluster uses virtual floating IPs for both management (eth0:mgmtip) and storage data (eth1:0), and each IP is assigned to one particular Storage Controller VM (StCtlVM). Note that they can end up on different StCtlVMs.
The importance of this is that the cluster storage data IP (eth1:0) is the one used to mount the datastore(s) created in the HyperFlex cluster. It is therefore essential that this IP is assigned and reachable from all the nodes of the cluster.
Note that if the StCtlVM that currently owns the eth1:0 virtual IP fails, the IP "migrates" to another available StCtlVM, working in a way similar to a First Hop Redundancy Protocol (FHRP).
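For reference, you can confirm how the HyperFlex datastores are presented on a given ESXi host and which NFS server IP backs them. This is only a quick sketch using standard ESXi commands; the exact columns shown depend on the ESXi version.
esxcli storage nfs list    # the Host column should show the cluster storage data IP (eth1:0)
esxcfg-nas -l              # older-style listing of the same NAS mounts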
All Paths Down
APD means that the host cannot reach the storage and there is no Permanent Device Loss (PDL) SCSI code returned from the storage array.
Because the host does not know whether the loss is temporary or permanent, it keeps trying to re-establish communication. After the default APD timeout of 140 seconds plus a 3-minute failover delay, the ESXi host begins to fail any non-virtual machine I/O traffic that is sent to the storage device.
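The 140-second value corresponds to the ESXi advanced settings Misc.APDHandlingEnable and Misc.APDTimeout. As a reference, you can verify the values currently in effect on an affected host; the sketch below assumes SSH access to the ESXi host.
esxcli system settings advanced list -o /Misc/APDHandlingEnable
esxcli system settings advanced list -o /Misc/APDTimeout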
A typical error message in vCenter will be as follows.
Once you see APD alerts on your hosts, gather the information below to better scope the problem:
Whether one, several, or all hosts are impacted and, if only some, which particular hosts are impacted
Whether any changes were performed recently (configuration, upgrade, and so on)
The timestamp of when the problem was first observed and whether the issue is recurrent
In order to troubleshoot APD, you need to look into three components: vCenter, the SCVMs, and the ESXi hosts.
These steps are a suggested workflow to pinpoint or narrow down the source of the All Paths Down symptom. This order does not have to be followed meticulously, and you can adapt it to the particular symptoms observed in the customer environment.
Checks in vCenter Server:
Connect to vCenter Server (VCS) and navigate to an affected host
Related Objects -> Virtual Machines and confirm the StCtlVM is up and running
Related Objects -> Datastores and confirm whether the NFS datastores show as "inaccessible". If the datastores appear to be accessible, you can try "Reset to Green" on the APD event from the Summary tab and later verify whether the alert pops back.
Monitor -> Issues and Monitor -> Events should provide information on when the APD was first spotted.
Checks in all StCtlVMs:
Connect to all the StCtlVMs and verify the points below; you can use software such as MobaXterm to open multiple sessions.
Verify that all StCtlVMs have the same time using date or ntpq -p. Time skew on the StCtlVMs can lead to issues with ZooKeeper database synchronization, so it is paramount to keep the time in sync across all StCtlVMs.
The asterisk in front of the NTP server denotes that the NTP of your SCVM is synced.
root@SpringpathControllerPZTMTRSH7K:~# date
Tue May 28 12:47:27 PDT 2019

root@SpringpathControllerPZTMTRSH7K:~# ntpq -p -4
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*abcdefghij      .GNSS.           1 u  429 1024  377  225.813   -1.436   0.176
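If the cluster has many nodes, a quick way to compare the clocks is to run date on every StCtlVM in one pass. The loop below is only a sketch: it assumes you can SSH as root to the controller management IPs and that scvm_ips.txt is a file you create with one IP per line.
for ip in $(cat scvm_ips.txt); do
  echo "== $ip =="
  ssh root@"$ip" date
done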
If the APD occurred during an upgrade, consider verifying which StCtlVMs have not been completely upgraded and, in particular, identify the one that failed last. It is possible that it was the one previously holding the eth1:0 IP.
Use dpkg -l | grep -i springpath to identify the StCtlVMs that are not completely upgraded, as they will have Springpath packages with mixed versions.
root@SpringpathControllerPZTMTRSH7K:~# dpkg -l | grep -i springpath
ii  storfs-appliance              4.0.1a-33028  amd64  Springpath Appliance
ii  storfs-asup                   4.0.1a-33028  amd64  Springpath ASUP and SCH
ii  storfs-core                   4.0.1a-33028  amd64  Springpath Distributed Filesystem
ii  storfs-fw                     4.0.1a-33028  amd64  Springpath Appliance
ii  storfs-mgmt                   4.0.1a-33028  amd64  Springpath Management Software
ii  storfs-mgmt-cli               4.0.1a-33028  amd64  Springpath Management Software
ii  storfs-mgmt-hypervcli         4.0.1a-33028  amd64  Springpath Management Software
ii  storfs-mgmt-ui                4.0.1a-33028  amd64  Springpath Management UI Module
ii  storfs-mgmt-vcplugin          4.0.1a-33028  amd64  Springpath Management UI and vCenter Plugin
ii  storfs-misc                   4.0.1a-33028  amd64  Springpath Configuration
ii  storfs-pam                    4.0.1a-33028  amd64  Springpath PAM related modules
ii  storfs-replication-services   4.0.1a-33028  amd64  Springpath Replication Services
ii  storfs-restapi                4.0.1a-33028  amd64  Springpath REST Api's
ii  storfs-robo                   4.0.1a-33028  amd64  Springpath Appliance
ii  storfs-support                4.0.1a-33028  amd64  Springpath Support
ii  storfs-translations           4.0.1a-33028  amd64  Springpath Translations
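As a quick way to spot a node with mixed package versions, you can reduce the listing to the unique version strings; a fully upgraded node shows a single version. This is only a convenience sketch built on the same dpkg output.
dpkg -l | grep -i springpath | awk '{print $3}' | sort -u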
Verify that all relevant services are running using service_status.sh:
Some of the main services are the Springpath File System (storfs), the SCVM Client (scvmclient), the System Management Service (stMgr), and the Cluster IP Monitor (cip-monitor).
root@SpringpathController5L0GTCR8SA:~# service_status.sh
Springpath File System ... Running
SCVM Client ... Running
System Management Service ... Running
HyperFlex Connect Server ... Running
HyperFlex Platform Agnostic Service ... Running
HyperFlex HyperV Service ... Not Running
HyperFlex Connect WebSocket Server ... Running
Platform Service ... Running
Replication Services ... Running
Data Service ... Running
Cluster IP Monitor ... Running
Replication Cluster IP Monitor ... Running
Single Sign On Manager ... Running
Stats Cache Service ... Running
Stats Aggregator Service ... Running
Stats Listener Service ... Running
Cluster Manager Service ... Running
Self Encrypting Drives Service ... Not Running
Event Listener Service ... Running
HX Device Connector ... Running
Web Server ... Running
Reverse Proxy Server ... Running
Job Scheduler ... Running
DNS and Name Server Service ... Running
Stats Web Server ... Running
If any of these or another relevant service is not up, start it using start <serviceName>, for example start storfs.
You can refer to the service_status.sh script to get the service names. Run head -n25 /bin/service_status.sh and identify the real name of each service.
root@SpringpathController5L0GTCR8SA:~# head -n25 /bin/service_status.sh
#!/bin/bash
declare -a upstart_services=("Springpath File System:storfs"\
        "SCVM Client:scvmclient"\
        "System Management Service:stMgr"\
        "HyperFlex Connect Server:hxmanager"\
        "HyperFlex Platform Agnostic Service:hxSvcMgr"\
        "HyperFlex HyperV Service:hxHyperVSvcMgr"\
        "HyperFlex Connect WebSocket Server:zkupdates"\
        "Platform Service:stNodeMgr"\
        "Replication Services:replsvc"\
        "Data Service:stDataSvcMgr"\
        "Cluster IP Monitor:cip-monitor"\
        "Replication Cluster IP Monitor:repl-cip-monitor"\
        "Single Sign On Manager:stSSOMgr"\
        "Stats Cache Service:carbon-cache"\
        "Stats Aggregator Service:carbon-aggregator"\
        "Stats Listener Service:statsd"\
        "Cluster Manager Service:exhibitor"\
        "Self Encrypting Drives Service:sedsvc"\
        "Event Listener Service:storfsevents"\
        "HX Device Connector:hx_device_connector");
declare -a other_services=("Web Server:tomcat8"\
        "Reverse Proxy Server:nginx"\
        "Job Scheduler:cron"\
        "DNS and Name Server Service:resolvconf");
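Since the entries under upstart_services map the friendly names to upstart job names, you can check and then start an individual job by its real name; storfs is used here only as an example and the same pattern applies to the other jobs.
status storfs    # show whether the upstart job is currently running
start storfs     # start it if it is stopped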
Identify which StCtlVM holds the storage cluster IP (eth1:0) using ifconfig -a
If no StCtlVM holds that IP, it is possible that storfs is not running on one or more nodes.
root@help:~# ifconfig
eth0:mgmtip Link encap:Ethernet  HWaddr 00:50:56:8b:4c:90
            inet addr:10.197.252.83  Bcast:10.197.252.95  Mask:255.255.255.224
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
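You can also query the eth1:0 alias directly on each StCtlVM; only the node that currently owns the cluster storage data IP returns an inet addr line, while the others report that the interface does not exist. This is a minimal check using the same ifconfig tool shown above.
ifconfig eth1:0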
Verify that the StCtlVM is in contact with the CRM master and that the ZooKeeper service is up and running
Run echo srvr | nc localhost 2181 and check whether the mode is leader, follower, or standalone, and whether connections > 0
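As a reference, the two fields of interest can be filtered out of the srvr response; this is only a sketch and the values depend on the node you run it on.
echo srvr | nc localhost 2181 | grep -i -E "mode|connections"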
Use stcli cluster info | less, or stcli cluster info | grep -i "active\|state\|unavailable" if you are trying to find which particular nodes appear with storage unavailable.
root@SpringpathControllerI51U7U6QZX:~# stcli cluster info | grep -i "active\|state\|unavailable"
locale: English (United States)
state: online
upgradeState: ok
healthState: healthy
state: online
state: 1
activeNodes: 3
state: online
Run stcli cluster storage-summary --detail to obtain the storage cluster details
root@SpringpathControllerI51U7U6QZX:~# stcli cluster storage-summary --detail
address: 10.197.252.106
name: HX-Demo
state: online
uptime: 185 days 12 hours 48 minutes 42 seconds
activeNodes: 3 of 3
compressionSavings: 85.45%
deduplicationSavings: 0.0%
freeCapacity: 4.9T
healingInfo:
    inProgress: False
resiliencyDetails:
    current ensemble size:3
    # of caching failures before cluster shuts down:3
    minimum cache copies remaining:3
    minimum data copies available for some user data:3
    minimum metadata copies available for cluster metadata:3
    # of unavailable nodes:0
    # of nodes failure tolerable for cluster to be available:1
    health state reason:storage cluster is healthy.
    # of node failures before cluster shuts down:3
    # of node failures before cluster goes into readonly:3
    # of persistent devices failures tolerable for cluster to be available:2
    # of node failures before cluster goes to enospace warn trying to move the existing data:na
    # of persistent devices failures before cluster shuts down:3
    # of persistent devices failures before cluster goes into readonly:3
    # of caching failures before cluster goes into readonly:na
    # of caching devices failures tolerable for cluster to be available:2
resiliencyInfo:
    messages:
        Storage cluster is healthy.
    state: 1
    nodeFailuresTolerable: 1
    cachingDeviceFailuresTolerable: 2
    persistentDeviceFailuresTolerable: 2
zoneResInfoList: None
spaceStatus: normal
totalCapacity: 5.0T
totalSavings: 85.45%
usedCapacity: 85.3G
zkHealth: online
clusterAccessPolicy: lenient
dataReplicationCompliance: compliant
dataReplicationFactor: 3
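If you only need the health and node-availability fields from this long output, a filtered view of the same command can help; this is only a convenience sketch.
stcli cluster storage-summary --detail | grep -i "state\|unavailable\|healthy"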