Health Check

This section describes how to check the health of a GR setup.

  • All critical pods are in good condition to serve user traffic.

    Use the following commands to check whether the GR and CDL related pods are in the Running state.

    kubectl get pods -n cn-cn1 -o wide | grep georeplication-pod
    kubectl get pods -n cn-cn1 -o wide | grep cdl
    kubectl get pods -n cn-cn1 -o wide | grep mirror-maker
  • Keepalived pods are in a healthy state to monitor all VIPs that are configured for check-interface/check-port.

    Use the following command to check whether the keepalived pods in the “smi-vips” namespace are in the “Running” state.

    kubectl get pods -n smi-vips

  • Health check of CDL-related pods: Check the status of the CDL db-endpoint, slots, and indexes. Each should be in the STARTED or ONLINE state for both system IDs 1 and 2.

    cdl show status 
    message params: {cmd:status mode:cli dbName:session sessionIn:{mapId:0 limit:500 key: purgeOnEval:0 filters:[] nextEvalTsStart:0 nextEvalTsEnd:0 allReplicas:false maxDataSize:4096} sliceName:}
    db-endpoint {
        endpoint-site {
            system-id 1
            state STARTED
            total-sessions 4
            site-session-count 2
            total-reconciliation 0
            remote-connection-time 66h37m31.36054781s
            remote-connection-last-failure-time 2021-07-13 11:24:10.233825924 +0000 UTC
            slot-geo-replication-delay 2.025396ms
        }
        endpoint-site {
            system-id 2
            state STARTED
            total-sessions 4
            site-session-count 2
            total-reconciliation 0
            remote-connection-time 66h58m49.83449066s
            remote-connection-last-failure-time 2021-07-13 11:02:51.759971655 +0000 UTC
            slot-geo-replication-delay 1.561816ms
        }
    }
    slot {
        map {
            map-id 1
            instance {
                system-id 1
                instance-id 1
                records 4
                capacity 2500000
                state ONLINE
                avg-record-size-bytes 1
                up-time 89h38m37.335813523s
                sync-duration 9.298061ms
            }
            instance {
                system-id 1
                instance-id 2
                records 4
                capacity 2500000
                state ONLINE
                avg-record-size-bytes 1
                up-time 89h39m11.1268024s
                sync-duration 8.852556ms
            }
            instance {
                system-id 2
                instance-id 1
                records 4
                capacity 2500000
                state ONLINE
                avg-record-size-bytes 1
                up-time 89h28m38.274713022s
                sync-duration 8.37766ms
            }
            instance {
                system-id 2
                instance-id 2
                records 4
                capacity 2500000
                state ONLINE
                avg-record-size-bytes 1
                up-time 89h29m37.934345015s
                sync-duration 8.877442ms
            }
        }
    }
    index {
        map {
            map-id 1
            instance {
                system-id 1
                instance-id 1
                records 4
                capacity 60000000
                state ONLINE
                up-time 89h38m16.119032086s
                sync-duration 2.012281769s
                leader false
                geo-replication-delay 10.529821ms
            }
            instance {
                system-id 1
                instance-id 2
                records 4
                capacity 60000000
                state ONLINE
                up-time 89h39m8.47664588s
                sync-duration 2.011171261s
                leader true
                leader-time 89h38m53.761213379s
                geo-replication-delay 10.252683ms
            }
            instance {
                system-id 2
                instance-id 1
                records 4
                capacity 60000000
                state ONLINE
                up-time 89h28m29.5479133s
                sync-duration 2.012101957s
                leader false
                geo-replication-delay 15.974538ms
            }
            instance {
                system-id 2
                instance-id 2
                records 4
                capacity 60000000
                state ONLINE
                up-time 89h29m11.633496562s
                sync-duration 2.011566639s
                leader true
                leader-time 89h28m51.29928233s
                geo-replication-delay 16.213323ms
            }
        }
    }
  • CDL replication status

    In the GRPC_Connections_to_RemoteSite panel of the CDL Replication Stats Grafana dashboard, check whether four gRPC connections are established between the CDL EP session pods (of each namespace) across the racks. Check Grafana on both racks.

  • Admin port status between the racks for geo-replication

    In the Periodic_Heartbeat_to_Remote_Site panel of the GR Statistics Grafana dashboard, check for heartbeat messages between the geo-replication pods across the racks.

  • BGP/BFD link status on rack

    Check whether neighborship with the BGP peers is established in the BGP Peers panel of the BGP, BFD Statistics Grafana dashboard.

    Check whether the BFD link is in the connected state in the BFD Link Status panel of the BGP, BFD Statistics Grafana dashboard.

  • Roles of each instance are in a healthy state

    Check that the roles in each rack are not in the STANDBY_ERROR state at any point in time.

    • Active/Standby Model: Roles should be in the following states on each rack.

      Rack-1:

      show role instance-id 1
      result "PRIMARY"
      show role instance-id 2
      result "PRIMARY"

      Rack-2:

      show role instance-id 1
      result "STANDBY"
      show role instance-id 2
      result "STANDBY"
    • Active/Active Model: Roles should be in the following states on each rack.

      Rack-1:

      show role instance-id 1
      result "PRIMARY"
      show role instance-id 2
      result "STANDBY"

      Rack-2:

      show role instance-id 1
      result "STANDBY"
      show role instance-id 2
      result "PRIMARY"
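
The CLI-based checks in this section (pod status, CDL state, and instance roles) can also be scripted. The following is a minimal sketch and not part of the product. The function names unhealthy_pods, cdl_unhealthy, and check_roles are hypothetical. Each function only parses command output supplied on standard input, and it assumes that the output has the same layout as the samples shown in this section.

```shell
#!/bin/sh
# Hypothetical helper functions for the GR health check.
# Each function reads previously captured command output on stdin.

# Print pods that are not Running or whose READY count is incomplete.
# Input: output of `kubectl get pods ...` (header row optional).
unhealthy_pods() {
    awk '
        $1 == "NAME" { next }              # skip the header row if present
        NF >= 3 {
            split($2, ready, "/")          # READY column, for example "1/1"
            if ($3 != "Running" || ready[1] != ready[2])
                print $1, $2, $3
        }
    '
}

# Print every CDL component whose state is neither STARTED nor ONLINE.
# Input: output of `cdl show status`.
cdl_unhealthy() {
    awk '
        $1 == "db-endpoint" || $1 == "slot" || $1 == "index" { section = $1 }
        $1 == "system-id"   { sys = $2; inst = "-" }   # endpoints have no instance-id
        $1 == "instance-id" { inst = $2 }
        $1 == "state" && $2 != "STARTED" && $2 != "ONLINE" {
            print section, "system-id=" sys, "instance-id=" inst, "state=" $2
        }
    '
}

# Compare `show role instance-id N` results with the expected roles.
# Usage: check_roles ROLE_FOR_INSTANCE_1 ROLE_FOR_INSTANCE_2 < transcript
# Returns non-zero if any role differs from the expected value.
check_roles() {
    awk -v e1="$1" -v e2="$2" '
        BEGIN { bad = 0 }
        /show role instance-id/ { id = $NF }   # works with or without a CLI prompt prefix
        $1 == "result" {
            role = $2
            gsub(/"/, "", role)                # strip the surrounding quotes
            want = (id == "1") ? e1 : e2
            if (role != want) {
                print "instance-id " id ": " role " (expected " want ")"
                bad = 1
            }
        }
        END { exit bad }
    '
}
```

For example, assuming the output of cdl show status and of the show role commands on Rack-1 was saved to the hypothetical files cdl_status.txt and rack1_roles.txt:

    kubectl get pods -n cn-cn1 -o wide | grep -E 'georeplication-pod|cdl|mirror-maker' | unhealthy_pods
    cdl_unhealthy < cdl_status.txt
    check_roles PRIMARY STANDBY < rack1_roles.txt

The first two commands print nothing when every pod and CDL component is healthy. The third command prints each mismatch and returns a non-zero exit status, so a role in the STANDBY_ERROR state is always reported.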