kube-controller-manager checks the node status periodically. Whenever a node's status is NotReady and the podEvictionTimeout is exceeded, all pods on that node are evicted and rescheduled to other nodes. The actual eviction speed is also affected by the eviction rate parameters, the cluster size, and so on. See kubernetes pod evictions.
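For reference, these are the kube-controller-manager flags that govern this behavior; the values below are the upstream Kubernetes defaults and are shown only as an illustration, since an RKE2 deployment may override them:
# kube-controller-manager flags (upstream defaults; illustrative only)
--pod-eviction-timeout=5m0s           # grace period before pods on a failed node are marked for deletion
--node-monitor-grace-period=40s       # how long a node may be unresponsive before it is marked NotReady
--node-eviction-rate=0.1              # nodes per second whose pods are evicted while the zone is healthy
--secondary-node-eviction-rate=0.01   # reduced eviction rate used when the zone is unhealthy
--large-cluster-size-threshold=50     # at or below this node count the secondary rate is forced to 0 (no eviction)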
2.2 Pod Deletion on the NotReady Node
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the ‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. See delete pods.
The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver.
Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
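For reference, the first and third options above correspond to the following commands (the names are placeholders):
$ kubectl delete node <node-name>                                        # option 1: remove the Node object; its pods are then garbage collected
$ kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force  # option 3: force delete a single pod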
2.3 Safely Drain a Node Before You Bring It Down
You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node (e.g. kernel upgrade, hardware maintenance, etc.). Safe evictions allow the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified. See use kubectl drain to remove a node from service.
When kubectl drain returns successfully, that indicates that all of the pods (except DaemonSet-managed pods and static/mirror pods, which kubectl drain does not evict) have been safely evicted, respecting the desired graceful termination period and the PodDisruptionBudgets you have defined. It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.
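As an illustration, a typical drain of one of the nodes in this cluster would look like the following; the flags shown are the common ones and may need adjusting for your workloads:
$ kubectl drain ds02.ecs.openstack.com --ignore-daemonsets --delete-emptydir-data
$ # ... perform the maintenance ...
$ kubectl uncordon ds02.ecs.openstack.com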
2.4 Cronjob pod-reaper introduced in PvC 1.4.1
The CronJob pod-reaper launches a reaper job every 10 minutes. The reaper job scans all namespaces and force deletes pods that have been stuck in the Terminating state for more than 10 minutes.
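Pods stuck in Terminating can also be found and cleaned up by hand; the following is roughly what the reaper job does for each overdue pod (pod name and namespace are placeholders):
$ kubectl get pods -A | grep Terminating
$ kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force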
3. Case #1. An ECS Server is unreachable
Check all ECS nodes. Before the failure, all six nodes are Ready:
$ kubectl get node
NAME STATUS ROLES AGE VERSION
ds01.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds02.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds03.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds04.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
ds05.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
ds06.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
After ds02 becomes unreachable, check again:
$ kubectl get node
NAME STATUS ROLES AGE VERSION
ds01.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds02.ecs.openstack.com NotReady control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds03.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds04.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
ds05.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
ds06.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
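To see why the node went NotReady, its conditions can be inspected, for example:
$ kubectl describe node ds02.ecs.openstack.com | grep -A6 Conditions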
The stderr.log on the ECS servers (ds02, ds03) indicated "Stopped tunnel to 192.168.8.142":
time="2022-12-29T17:13:37+08:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2022-12-29T17:13:37+08:00" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2022-12-29T17:13:42+08:00" level=info msg="Connecting to proxy" url="wss://192.168.8.142:9345/v1-rke2/connect"
time="2022-12-29T17:13:53+08:00" level=info msg="Stopped tunnel to 192.168.8.142:9345"
time="2022-12-29T17:14:47+08:00" level=error msg="Failed to connect to proxy" error="dial tcp 192.168.8.142:9345: connect: no route to host"
time="2022-12-29T17:14:47+08:00" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.8.142:9345: connect: no route to host"
The stderr.log on the ECS agents (ds04, ds05, ds06) indicated "Updating load balancer ... server addresses" and "Stopped tunnel to 192.168.8.142":
time="2022-12-29T17:13:37+08:00" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2022-12-29T17:13:42+08:00" level=info msg="Connecting to proxy" url="wss://192.168.8.142:9345/v1-rke2/connect"
time="2022-12-29T17:14:08+08:00" level=error msg="Tunnel endpoint watch channel closed: {ERROR &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding,Reason:InternalError,Details:&StatusDetails{Name:,Group:,Kind:,Causes:[]StatusCause{StatusCause{Type:UnexpectedServerResponse,Message:unable to decode an event from the watch stream: http2: client connection lost,Field:,},StatusCause{Type:ClientWatchDecoding,Message:unable to decode an event from the watch stream: http2: client connection lost,Field:,},},RetryAfterSeconds:0,UID:,},Code:500,}}"
time="2022-12-29T17:14:13+08:00" level=info msg="Updating load balancer rke2-api-server-agent-load-balancer server addresses ->[192.168.8.141:6443 192.168.8.143:6443 192.168.8.142:6443]"
time="2022-12-29T17:14:13+08:00" level=info msg="Updating load balancer rke2-agent-load-balancer server addresses ->[192.168.8.141:9345 192.168.8.143:9345]"time="2022-12-29T17:14:13+08:00" level=info msg="Stopped tunnel to 192.168.8.142:9345"
time="2022-12-29T17:14:45+08:00" level=error msg="Failed to connect to proxy" error="dial tcp 192.168.8.142:9345: connect: no route to host"
time="2022-12-29T17:14:45+08:00" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.8.142:9345: connect: no route to host"
As you can see from the HAProxy UI, all ports on ds02 (http port 80, https port 443) are in the Down state.
As you can see from the CM UI, ECS Server Health, Control Plane Health, Kubernetes Health, and Longhorn Health start to raise alarms.
You can also see many pod failures on the k8s web UI.
Most of the pods on ds02 are stuck in the Terminating state after 300 seconds (the pod eviction timeout).
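The pods still assigned to the failed node can be listed with a field selector, for example:
$ kubectl get pods -A -o wide --field-selector spec.nodeName=ds02.ecs.openstack.com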
The CDW cluster is abnormal, with the warning "Hive server service is not ready. Service endpoint may not be reachable! Error Code : undefined".
The pod-reaper job force deleted the pods stuck in the Terminating state about 25 minutes after the service interruption began.
$ kubectl logs -n pod-reaper pod-reaper-27871780-tpwzm
Thu Dec 29 09:40:01 UTC 2022 Starting pod-reaper [Reap older than: 10 minute(s)][Namespace regex: *UNKNOWN*]
Thu Dec 29 09:40:01 UTC 2022 processing namespace cml01
Thu Dec 29 09:40:01 UTC 2022 processing namespace cdp
Thu Dec 29 09:40:01 UTC 2022 processing namespace cml01-user-1
Thu Dec 29 09:40:01 UTC 2022 processing namespace compute-hive01
Thu Dec 29 09:40:01 UTC 2022 processing namespace default
Thu Dec 29 09:40:01 UTC 2022 processing namespace impala-impala01
Thu Dec 29 09:40:01 UTC 2022 processing namespace default-ad522d9e-log-router
Thu Dec 29 09:40:01 UTC 2022 processing namespace default-ad522d9e-monitoring-platform
Thu Dec 29 09:40:01 UTC 2022 processing namespace ecs-webhooks
Thu Dec 29 09:40:01 UTC 2022 processing namespace infra-prometheus
Thu Dec 29 09:40:02 UTC 2022 Force delete pod ecs-tolerations-webhook-7d9454f4f9-549qj in namespace ecs-webhooks with deletion timestamp: 2022-12-29T09:20:25Z
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Thu Dec 29 09:40:02 UTC 2022 Force delete pod hiveserver2-0 in namespace compute-hive01 with deletion timestamp: 2022-12-29T09:20:25Z
pod "ecs-tolerations-webhook-7d9454f4f9-549qj" force deleted
Thu Dec 29 09:40:02 UTC 2022 Force delete pod catalogd-0 in namespace impala-impala01 with deletion timestamp: 2022-12-29T09:20:25Z
Thu Dec 29 09:40:02 UTC 2022 Successfully force deleted pod ecs-tolerations-webhook-7d9454f4f9-549qj in namespace ecs-webhooks
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "hiveserver2-0" force deleted
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Thu Dec 29 09:40:02 UTC 2022 Successfully force deleted pod hiveserver2-0 in namespace compute-hive01
Thu Dec 29 09:40:02 UTC 2022 Force delete pod huebackend-0 in namespace compute-hive01 with deletion timestamp: 2022-12-29T09:20:25Z
pod "catalogd-0" force deleted
Thu Dec 29 09:40:02 UTC 2022 Successfully force deleted pod catalogd-0 in namespace impala-impala01
Thu Dec 29 09:40:02 UTC 2022 Force delete pod api-648d7bc885-xvv92 in namespace cml01 with deletion timestamp: 2022-12-29T09:20:25Z
Thu Dec 29 09:40:02 UTC 2022 Force delete pod huebackend-0 in namespace impala-impala01 with deletion timestamp: 2022-12-29T09:20:25Z
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Thu Dec 29 09:40:02 UTC 2022 Force delete pod cdp-release-alert-admin-service-68f5c6dd7c-glt9k in namespace cdp with deletion timestamp: 2022-12-29T09:20:25Z
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "huebackend-0" force deleted
Thu Dec 29 09:40:02 UTC 2022 Successfully force deleted pod huebackend-0 in namespace compute-hive01
......
The 13 pods remaining on ds02 are mostly of DaemonSet type; pod rke2-ingress-nginx-controller is an exception.
The CDW cluster then failed with "504 Gateway Time-out" errors.
The root cause is that pod vault-0 was automatically rescheduled to ds02 and its status is sealed.
$ kubectl get pod vault-0 -n vault-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vault-0 0/1 Running 0 8m23s 10.42.1.89 ds02.ecs.openstack.com <none> <none>
$ curl -k https://vault.localhost.localdomain/v1/sys/seal-status
{"type":"shamir","initialized":true,"sealed":false,"t":1,"n":1,"progress":0,"nonce":"","version":"1.9.0","migration":false,"cluster_name":"vault-cluster-15bfdf25","cluster_id":"f3e2c6f7-cffd-def5-830a-f4bef6522b01","recovery_seal":false,"storage_type":"file"}
You have to manually unseal Vault via the CM UI.
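If you prefer the command line, the seal state can also be checked from inside the pod (assuming the vault binary is available in the container image, as it is in the official Vault image):
$ kubectl exec -n vault-system vault-0 -- vault status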
The pod-reaper job force deleted pods stuck in the Terminating state about 18 minutes after the service interruption began.
$ kubectl logs -n pod-reaper pod-reaper-27871940-469b7
Thu Dec 29 12:20:01 UTC 2022 Starting pod-reaper [Reap older than: 10 minute(s)][Namespace regex: *UNKNOWN*]
Thu Dec 29 12:20:01 UTC 2022 processing namespace cdp
Thu Dec 29 12:20:01 UTC 2022 processing namespace compute-hive01
Thu Dec 29 12:20:01 UTC 2022 processing namespace cml01-user-1
Thu Dec 29 12:20:01 UTC 2022 processing namespace cml01
Thu Dec 29 12:20:01 UTC 2022 processing namespace default-ad522d9e-log-router
Thu Dec 29 12:20:01 UTC 2022 processing namespace impala-impala01
Thu Dec 29 12:20:01 UTC 2022 processing namespace infra-prometheus
Thu Dec 29 12:20:01 UTC 2022 processing namespace default-ad522d9e-monitoring-platform
Thu Dec 29 12:20:01 UTC 2022 processing namespace ecs-webhooks
Thu Dec 29 12:20:01 UTC 2022 processing namespace default
Thu Dec 29 12:20:01 UTC 2022 Force delete pod monitoring-logger-alert-receiver-7485d89576-vg6n4 in namespace default-ad522d9e-monitoring-platform with deletion timestamp: 2022-12-29T12:08:11Z
Thu Dec 29 12:20:01 UTC 2022 Force delete pod infra-prometheus-operator-operator-854bdc78b6-2whcf in namespace infra-prometheus with deletion timestamp: 2022-12-29T12:08:11Z
Thu Dec 29 12:20:01 UTC 2022 Force delete pod huebackend-0 in namespace compute-hive01 with deletion timestamp: 2022-12-29T12:08:11Z
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "monitoring-logger-alert-receiver-7485d89576-vg6n4" force deleted
pod "infra-prometheus-operator-operator-854bdc78b6-2whcf" force deleted
Thu Dec 29 12:20:01 UTC 2022 Successfully force deleted pod monitoring-logger-alert-receiver-7485d89576-vg6n4 in namespace default-ad522d9e-monitoring-platform
Thu Dec 29 12:20:01 UTC 2022 Successfully force deleted pod infra-prometheus-operator-operator-854bdc78b6-2whcf in namespace infra-prometheus
pod "huebackend-0" force deleted
......
The 9 pods remaining on ds04 are mostly of DaemonSet type; pod impala-executor-000-0 in namespace impala-impala01 is an exception.
We also see the same result on the k8s web UI, with only 9 failed pods.
Pod impala-executor-000-0 in namespace impala-impala01 has the local-storage PVC scratch-cache-volume-impala-executor-000-0 attached. If you are using local volumes and the node crashes, the pod cannot be rescheduled to a different node; by default it stays scheduled to the same node. That is the caveat of using local storage: the pod becomes bound forever to one specific node. Both the PVC and the pod must be force deleted, as described in issue 61620.
$ kubectl delete pvc scratch-cache-volume-impala-executor-000-0 -n impala-impala01
persistentvolumeclaim "scratch-cache-volume-impala-executor-000-0" deleted
$ kubectl get pvc scratch-cache-volume-impala-executor-000-0 -n impala-impala01
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
scratch-cache-volume-impala-executor-000-0 Terminating pvc-a88fb425-55e5-49b8-b3d5-6b371bb6cb7b 94Gi RWO local-path 34h
$ kubectl patch pvc scratch-cache-volume-impala-executor-000-0 -n impala-impala01 -p'{"metadata":{"finalizers":null}}'
persistentvolumeclaim/scratch-cache-volume-impala-executor-000-0 patched
$ kubectl get pvc scratch-cache-volume-impala-executor-000-0 -n impala-impala01
Error from server (NotFound): persistentvolumeclaims "scratch-cache-volume-impala-executor-000-0" not found
$ kubectl delete pod impala-executor-000-0 -n impala-impala01 --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "impala-executor-000-0" force deleted
Confirm that CP/CDW/CML work well.
5. Conclusion
When an ECS node goes down, the workload pods on it are forcibly deleted by the cronjob pod-reaper and rescheduled to other healthy nodes. However, there are two exceptions that require manual intervention:
Pod vault-0 can be automatically evicted, but you have to manually unseal Vault via the CM UI.
Pods using local storage (impala-executor / impala-coordinator / query-executor / query-coordinator) cannot be evicted; you must manually delete both the PVC and the pod.
The maximum service interruption time is about 25 minutes after a node crash (5-minute pod eviction timeout + up to 10 minutes until the next pod-reaper run + 10-minute REAP_OLDER_THAN threshold), broken down as follows:
No.  Timing Factor            Default Value  Description
1    podEvictionTimeout       5 min          Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after the pod eviction timeout expires
2    CronJob execution cycle  10 min         The CronJob pod-reaper launches a reaper job every 10 minutes
3    REAP_OLDER_THAN          10 min         The reaper job scans all namespaces and force deletes pods that have been in the Terminating state for more than 10 minutes
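The schedule and last run of the reaper CronJob can be checked with, for example:
$ kubectl get cronjob -n pod-reaper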