Health Check
This section describes the health checks for a GR setup.
-
All critical pods are in good condition to serve user traffic.
Use the following commands to check whether the GR-related and CDL-related pods are in the Running state.
kubectl get pods -n cn-cn1 -o wide | grep georeplication-pod
kubectl get pods -n cn-cn1 -o wide | grep cdl
kubectl get pods -n cn-cn1 -o wide | grep mirror-maker
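The pod check above can also be scripted. The following is a minimal sketch, assuming the text output of `kubectl get pods -n cn-cn1 -o wide` has been captured into a string and that the STATUS value is the third whitespace-separated column. The function name `pods_not_running` and the pod names in the sample are hypothetical and only illustrate the idea.

```python
def pods_not_running(kubectl_output, name_patterns):
    """Return (pod_name, status) pairs for matching pods that are not Running."""
    problems = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row printed by kubectl.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        # Same substring match that "grep <pattern>" performs on the pod name.
        if any(p in name for p in name_patterns) and status != "Running":
            problems.append((name, status))
    return problems


# Patterns taken from the grep commands in this section.
GR_CDL_PATTERNS = ("georeplication-pod", "cdl", "mirror-maker")

# Hypothetical sample output; real pod names and columns will differ.
sample = """\
NAME                      READY   STATUS             RESTARTS   AGE   IP           NODE
georeplication-pod-0      1/1     Running            0          3d    10.0.0.11    node-1
cdl-ep-session-c1-d0-abc  1/1     Running            0          3d    10.0.0.12    node-2
mirror-maker-0            0/1     CrashLoopBackOff   12         3d    10.0.0.13    node-1
"""

print(pods_not_running(sample, GR_CDL_PATTERNS))
# prints [('mirror-maker-0', 'CrashLoopBackOff')]
```

An empty list means every matching pod reported the Running status. The same function could be applied to the keepalived pods by passing a pattern that matches their names.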
-
Keepalived pods are in a healthy state to monitor all VIPs that are configured for check-interface/check-port.
Use the following command to check whether the keepalived pods in the “smi-vips” namespace are in the “Running” state.
kubectl get pods -n smi-vips
-
Health check of CDL-related pods: Check the status of the CDL db-endpoint, slots, and indexes. All must be in the STARTED or ONLINE state for both system IDs 1 and 2.
cdl show status
message params: {cmd:status mode:cli dbName:session sessionIn:{mapId:0 limit:500 key: purgeOnEval:0 filters:[] nextEvalTsStart:0 nextEvalTsEnd:0 allReplicas:false maxDataSize:4096} sliceName:}
db-endpoint {
    endpoint-site {
        system-id 1
        state STARTED
        total-sessions 4
        site-session-count 2
        total-reconciliation 0
        remote-connection-time 66h37m31.36054781s
        remote-connection-last-failure-time 2021-07-13 11:24:10.233825924 +0000 UTC
        slot-geo-replication-delay 2.025396ms
    }
    endpoint-site {
        system-id 2
        state STARTED
        total-sessions 4
        site-session-count 2
        total-reconciliation 0
        remote-connection-time 66h58m49.83449066s
        remote-connection-last-failure-time 2021-07-13 11:02:51.759971655 +0000 UTC
        slot-geo-replication-delay 1.561816ms
    }
}
slot {
    map {
        map-id 1
        instance {
            system-id 1
            instance-id 1
            records 4
            capacity 2500000
            state ONLINE
            avg-record-size-bytes 1
            up-time 89h38m37.335813523s
            sync-duration 9.298061ms
        }
        instance {
            system-id 1
            instance-id 2
            records 4
            capacity 2500000
            state ONLINE
            avg-record-size-bytes 1
            up-time 89h39m11.1268024s
            sync-duration 8.852556ms
        }
        instance {
            system-id 2
            instance-id 1
            records 4
            capacity 2500000
            state ONLINE
            avg-record-size-bytes 1
            up-time 89h28m38.274713022s
            sync-duration 8.37766ms
        }
        instance {
            system-id 2
            instance-id 2
            records 4
            capacity 2500000
            state ONLINE
            avg-record-size-bytes 1
            up-time 89h29m37.934345015s
            sync-duration 8.877442ms
        }
    }
}
index {
    map {
        map-id 1
        instance {
            system-id 1
            instance-id 1
            records 4
            capacity 60000000
            state ONLINE
            up-time 89h38m16.119032086s
            sync-duration 2.012281769s
            leader false
            geo-replication-delay 10.529821ms
        }
        instance {
            system-id 1
            instance-id 2
            records 4
            capacity 60000000
            state ONLINE
            up-time 89h39m8.47664588s
            sync-duration 2.011171261s
            leader true
            leader-time 89h38m53.761213379s
            geo-replication-delay 10.252683ms
        }
        instance {
            system-id 2
            instance-id 1
            records 4
            capacity 60000000
            state ONLINE
            up-time 89h28m29.5479133s
            sync-duration 2.012101957s
            leader false
            geo-replication-delay 15.974538ms
        }
        instance {
            system-id 2
            instance-id 2
            records 4
            capacity 60000000
            state ONLINE
            up-time 89h29m11.633496562s
            sync-duration 2.011566639s
            leader true
            leader-time 89h28m51.29928233s
            geo-replication-delay 16.213323ms
        }
    }
}
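Scanning the status output by eye is error-prone when there are many instances. The following is a rough sketch, assuming the `cdl show status` output has been saved as text and that every db-endpoint, slot, and index entry lists `system-id` before `state` inside its own brace block, as in the output above. The function name `unhealthy_cdl_entries` and the `OFFLINE` state in the sample are hypothetical and used only for illustration.

```python
import re

# States this section treats as healthy.
HEALTHY_STATES = {"STARTED", "ONLINE"}

# Matches "system-id N ... state S" without crossing a brace,
# so both values come from the same { ... } block.
ENTRY = re.compile(r"system-id\s+(\d+)\b[^{}]*?\bstate\s+(\w+)")


def unhealthy_cdl_entries(status_text, required_system_ids=(1, 2)):
    """Return (bad_entries, missing_system_ids) found in the status text."""
    entries = [(int(sid), state) for sid, state in ENTRY.findall(status_text)]
    bad = [(sid, state) for sid, state in entries if state not in HEALTHY_STATES]
    seen = {sid for sid, _ in entries}
    missing = [sid for sid in required_system_ids if sid not in seen]
    return bad, missing


# Shortened, hypothetical sample; OFFLINE is an invented unhealthy state.
sample = """
db-endpoint { endpoint-site { system-id 1 state STARTED total-sessions 4 }
              endpoint-site { system-id 2 state STARTED total-sessions 4 } }
slot { map { map-id 1
    instance { system-id 1 instance-id 1 records 4 state ONLINE }
    instance { system-id 2 instance-id 1 records 4 state OFFLINE } } }
"""

print(unhealthy_cdl_entries(sample))
# prints ([(2, 'OFFLINE')], [])
```

Two empty lists indicate that every parsed entry is in the STARTED or ONLINE state and that both system IDs appear in the output.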
-
CDL replication status
Check whether four gRPC connections are established between the CDL EP session pods (of each namespace) across the racks in the GRPC_Connections_to_RemoteSite panel of the CDL Replication Stats Grafana dashboard. Check Grafana on both racks.
-
Admin port status between the racks for geo-replication.
Check the heartbeat messages between the geo-replication pods across the racks in the Periodic_Heartbeat_to_Remote_Site panel of the GR Statistics Grafana dashboard.
-
BGP/BFD link status on rack
Check whether neighborship with the BGP peers is established in the BGP Peers panel of the BGP, BFD Statistics Grafana dashboard.
Check whether the BFD link is in the connected state in the BFD Link Status panel of the BGP, BFD Statistics Grafana dashboard.
-
The role of each instance is in a healthy state.
On each rack, check that no role is in the STANDBY_ERROR state at any point in time.
-
Active/Standby Model: Roles should be in the following states on each rack.
Rack-1:
show role instance-id 1
result "PRIMARY"
show role instance-id 2
result "PRIMARY"
Rack-2:
show role instance-id 1
result "STANDBY"
show role instance-id 2
result "STANDBY"
-
Active/Active Model: Roles should be in the following states on each rack.
Rack-1:
show role instance-id 1
result "PRIMARY"
show role instance-id 2
result "STANDBY"
Rack-2:
show role instance-id 1
result "STANDBY"
show role instance-id 2
result "PRIMARY"
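The expected role states for both models can be captured in a small lookup table and compared with the values reported by `show role instance-id`. This is only an illustrative sketch: the model keys, the `role_mismatches` function, and the way the observed roles are collected into a dictionary are assumptions, not part of the product CLI.

```python
# Expected roles per model, rack, and instance ID, as listed in this section.
EXPECTED_ROLES = {
    "active-standby": {
        "Rack-1": {1: "PRIMARY", 2: "PRIMARY"},
        "Rack-2": {1: "STANDBY", 2: "STANDBY"},
    },
    "active-active": {
        "Rack-1": {1: "PRIMARY", 2: "STANDBY"},
        "Rack-2": {1: "STANDBY", 2: "PRIMARY"},
    },
}


def role_mismatches(model, rack, observed):
    """Return {instance_id: (expected, observed)} for roles that differ.

    `observed` maps instance ID to the role string reported for it;
    a missing instance ID is reported with an observed value of None.
    """
    expected = EXPECTED_ROLES[model][rack]
    return {
        instance_id: (want, observed.get(instance_id))
        for instance_id, want in expected.items()
        if observed.get(instance_id) != want
    }


# Hypothetical observation on Rack-2 in the Active/Active model.
print(role_mismatches("active-active", "Rack-2", {1: "STANDBY", 2: "STANDBY_ERROR"}))
# prints {2: ('PRIMARY', 'STANDBY_ERROR')}
```

An empty dictionary means the rack matches the expected states. Any STANDBY_ERROR value appears as a mismatch because it is never an expected role.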
-