kube-controller-manager checks the node status periodically. Whenever a node's status is NotReady and the podEvictionTimeout is exceeded, all pods on that node are evicted and rescheduled to other nodes. The actual eviction speed is also affected by the eviction rate parameters, the cluster size, and so on. See kubernetes pod evictions.
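For reference, these are the kube-controller-manager flags that govern this behavior; the values below are the upstream Kubernetes defaults and are shown only as an illustration, since an RKE2 deployment may override them:
# kube-controller-manager flags (upstream defaults; illustrative only)
--pod-eviction-timeout=5m0s           # grace period before pods on a failed node are marked for deletion
--node-monitor-grace-period=40s       # how long a node may be unresponsive before it is marked NotReady
--node-eviction-rate=0.1              # nodes per second whose pods are evicted while the zone is healthy
--secondary-node-eviction-rate=0.01   # reduced eviction rate used when the zone is unhealthy
--large-cluster-size-threshold=50     # at or below this node count the secondary rate is forced to 0 (no eviction)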
2.2 Pod Deletion on the NotReady Node
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the ‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. See delete pods.
The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver.
Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
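For reference, the first and third options above correspond to the following commands (the names are placeholders):
$ kubectl delete node <node-name>                                        # option 1: remove the Node object; its pods are then garbage collected
$ kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force  # option 3: force delete a single pod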
2.3 Safely Drain a Node Before You Bring It Down
You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node (e.g. kernel upgrade, hardware maintenance, etc.). Safe evictions allow the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified. See use kubectl drain to remove a node from service.
When kubectl drain returns successfully, that indicates that all of the pods (except DaemonSet-managed pods and static/mirror pods, which kubectl drain does not evict) have been safely evicted, respecting the desired graceful termination period and the PodDisruptionBudgets you have defined. It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.
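As an illustration, a typical drain of one of the nodes in this cluster would look like the following; the flags shown are the common ones and may need adjusting for your workloads:
$ kubectl drain ds02.ecs.openstack.com --ignore-daemonsets --delete-emptydir-data
$ # ... perform the maintenance ...
$ kubectl uncordon ds02.ecs.openstack.com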
2.4 Cronjob pod-reaper introduced in PvC 1.4.1
The CronJob pod-reaper launches a reaper job every 10 minutes. The reaper job scans all namespaces and force deletes pods that have been stuck in the Terminating state for more than 10 minutes.
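Pods stuck in Terminating can also be found and cleaned up by hand; the following is roughly what the reaper job does for each overdue pod (pod name and namespace are placeholders):
$ kubectl get pods -A | grep Terminating
$ kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force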
3. Case #1. An ECS Server is unreachable
Check all ECS nodes. Before the failure, all six nodes are Ready:
$ kubectl get node
NAME STATUS ROLES AGE VERSION
ds01.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds02.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds03.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds04.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
ds05.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
ds06.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
After ds02 becomes unreachable, check again:
$ kubectl get node
NAME STATUS ROLES AGE VERSION
ds01.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds02.ecs.openstack.com NotReady control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds03.ecs.openstack.com Ready control-plane,etcd,master 2d4h v1.21.14+rke2r1
ds04.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
ds05.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
ds06.ecs.openstack.com Ready <none> 2d4h v1.21.14+rke2r1
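To see why the node went NotReady, its conditions can be inspected, for example:
$ kubectl describe node ds02.ecs.openstack.com | grep -A6 Conditions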
The stderr.log on the ECS servers (ds02, ds03) indicated "Stopped tunnel to 192.168.8.142":
time="2022-12-29T17:13:37+08:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2022-12-29T17:13:37+08:00" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2022-12-29T17:13:42+08:00" level=info msg="Connecting to proxy" url="wss://192.168.8.142:9345/v1-rke2/connect"
time="2022-12-29T17:13:53+08:00" level=info msg="Stopped tunnel to 192.168.8.142:9345"
time="2022-12-29T17:14:47+08:00" level=error msg="Failed to connect to proxy" error="dial tcp 192.168.8.142:9345: connect: no route to host"
time="2022-12-29T17:14:47+08:00" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.8.142:9345: connect: no route to host"
The stderr.log on the ECS agents (ds04, ds05, ds06) indicated "Updating load balancer ... server addresses" and "Stopped tunnel to 192.168.8.142":
time="2022-12-29T17:13:37+08:00" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2022-12-29T17:13:42+08:00" level=info msg="Connecting to proxy" url="wss://192.168.8.142:9345/v1-rke2/connect"
time="2022-12-29T17:14:08+08:00" level=error msg="Tunnel endpoint watch channel closed: {ERROR &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding,Reason:InternalError,Details:&StatusDetails{Name:,Group:,Kind:,Causes:[]StatusCause{StatusCause{Type:UnexpectedServerResponse,Message:unable to decode an event from the watch stream: http2: client connection lost,Field:,},StatusCause{Type:ClientWatchDecoding,Message:unable to decode an event from the watch stream: http2: client connection lost,Field:,},},RetryAfterSeconds:0,UID:,},Code:500,}}"
time="2022-12-29T17:14:13+08:00" level=info msg="Updating load balancer rke2-api-server-agent-load-balancer server addresses ->[192.168.8.141:6443 192.168.8.143:6443 192.168.8.142:6443]"
time="2022-12-29T17:14:13+08:00" level=info msg="Updating load balancer rke2-agent-load-balancer server addresses ->[192.168.8.141:9345 192.168.8.143:9345]"time="2022-12-29T17:14:13+08:00" level=info msg="Stopped tunnel to 192.168.8.142:9345"
time="2022-12-29T17:14:45+08:00" level=error msg="Failed to connect to proxy" error="dial tcp 192.168.8.142:9345: connect: no route to host"
time="2022-12-29T17:14:45+08:00" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.8.142:9345: connect: no route to host"
As you can see from the HAProxy UI, all ports on ds02 (http port 80, https port 443) are in the Down state.
As you can see from the CM UI, ECS Server Health, Control Plane Health, Kubernetes Health, and Longhorn Health start to raise alarms.
You can also see many pod failures on the k8s web UI.
Most of the pods on ds02 are stuck in the Terminating state after 300 seconds (the pod eviction timeout).
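The pods still assigned to the failed node can be listed with a field selector, for example:
$ kubectl get pods -A -o wide --field-selector spec.nodeName=ds02.ecs.openstack.com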
The CDW cluster is abnormal, with the warning "Hive server service is not ready. Service endpoint may not be reachable! Error Code : undefined".
The pod-reaper job force deleted the pods stuck in the Terminating state about 25 minutes after the service interruption began.
$ kubectl logs -n pod-reaper pod-reaper-27871780-tpwzm
Thu Dec 29 09:40:01 UTC 2022 Starting pod-reaper [Reap older than: 10 minute(s)][Namespace regex: *UNKNOWN*]
Thu Dec 29 09:40:01 UTC 2022 processing namespace cml01
Thu Dec 29 09:40:01 UTC 2022 processing namespace cdp
Thu Dec 29 09:40:01 UTC 2022 processing namespace cml01-user-1
Thu Dec 29 09:40:01 UTC 2022 processing namespace compute-hive01
Thu Dec 29 09:40:01 UTC 2022 processing namespace default
Thu Dec 29 09:40:01 UTC 2022 processing namespace impala-impala01
Thu Dec 29 09:40:01 UTC 2022 processing namespace default-ad522d9e-log-router
Thu Dec 29 09:40:01 UTC 2022 processing namespace default-ad522d9e-monitoring-platform
Thu Dec 29 09:40:01 UTC 2022 processing namespace ecs-webhooks
Thu Dec 29 09:40:01 UTC 2022 processing namespace infra-prometheus
Thu Dec 29 09:40:02 UTC 2022 Force delete pod ecs-tolerations-webhook-7d9454f4f9-549qj in namespace ecs-webhooks with deletion timestamp: 2022-12-29T09:20:25Z
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Thu Dec 29 09:40:02 UTC 2022 Force delete pod hiveserver2-0 in namespace compute-hive01 with deletion timestamp: 2022-12-29T09:20:25Z
pod "ecs-tolerations-webhook-7d9454f4f9-549qj" force deleted
Thu Dec 29 09:40:02 UTC 2022 Force delete pod catalogd-0 in namespace impala-impala01 with deletion timestamp: 2022-12-29T09:20:25Z
Thu Dec 29 09:40:02 UTC 2022 Successfully force deleted pod ecs-tolerations-webhook-7d9454f4f9-549qj in namespace ecs-webhooks
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "hiveserver2-0" force deleted
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Thu Dec 29 09:40:02 UTC 2022 Successfully force deleted pod hiveserver2-0 in namespace compute-hive01
Thu Dec 29 09:40:02 UTC 2022 Force delete pod huebackend-0 in namespace compute-hive01 with deletion timestamp: 2022-12-29T09:20:25Z
pod "catalogd-0" force deleted
Thu Dec 29 09:40:02 UTC 2022 Successfully force deleted pod catalogd-0 in namespace impala-impala01
Thu Dec 29 09:40:02 UTC 2022 Force delete pod api-648d7bc885-xvv92 in namespace cml01 with deletion timestamp: 2022-12-29T09:20:25Z
Thu Dec 29 09:40:02 UTC 2022 Force delete pod huebackend-0 in namespace impala-impala01 with deletion timestamp: 2022-12-29T09:20:25Z
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Thu Dec 29 09:40:02 UTC 2022 Force delete pod cdp-release-alert-admin-service-68f5c6dd7c-glt9k in namespace cdp with deletion timestamp: 2022-12-29T09:20:25Z
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "huebackend-0" force deleted
Thu Dec 29 09:40:02 UTC 2022 Successfully force deleted pod huebackend-0 in namespace compute-hive01
......
The 13 pods remaining on ds02 are mostly of DaemonSet type; pod rke2-ingress-nginx-controller is an exception.
The CDW cluster then failed with "504 Gateway Time-out" errors.
The root cause is that pod vault-0 was automatically rescheduled to ds02 and its status is sealed.
$ kubectl get pod vault-0 -n vault-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vault-0 0/1 Running 0 8m23s 10.42.1.89 ds02.ecs.openstack.com <none> <none>
$ curl -k https://vault.localhost.localdomain/v1/sys/seal-status
{"type":"shamir","initialized":true,"sealed":false,"t":1,"n":1,"progress":0,"nonce":"","version":"1.9.0","migration":false,"cluster_name":"vault-cluster-15bfdf25","cluster_id":"f3e2c6f7-cffd-def5-830a-f4bef6522b01","recovery_seal":false,"storage_type":"file"}
You have to manually unseal Vault via the CM UI.
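If you prefer the command line, the seal state can also be checked from inside the pod (assuming the vault binary is available in the container image, as it is in the official Vault image):
$ kubectl exec -n vault-system vault-0 -- vault status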
The pod-reaper job force deleted pods stuck in the Terminating state about 18 minutes after the service interruption began.
$ kubectl logs -n pod-reaper pod-reaper-27871940-469b7
Thu Dec 29 12:20:01 UTC 2022 Starting pod-reaper [Reap older than: 10 minute(s)][Namespace regex: *UNKNOWN*]
Thu Dec 29 12:20:01 UTC 2022 processing namespace cdp
Thu Dec 29 12:20:01 UTC 2022 processing namespace compute-hive01
Thu Dec 29 12:20:01 UTC 2022 processing namespace cml01-user-1
Thu Dec 29 12:20:01 UTC 2022 processing namespace cml01
Thu Dec 29 12:20:01 UTC 2022 processing namespace default-ad522d9e-log-router
Thu Dec 29 12:20:01 UTC 2022 processing namespace impala-impala01
Thu Dec 29 12:20:01 UTC 2022 processing namespace infra-prometheus
Thu Dec 29 12:20:01 UTC 2022 processing namespace default-ad522d9e-monitoring-platform
Thu Dec 29 12:20:01 UTC 2022 processing namespace ecs-webhooks
Thu Dec 29 12:20:01 UTC 2022 processing namespace default
Thu Dec 29 12:20:01 UTC 2022 Force delete pod monitoring-logger-alert-receiver-7485d89576-vg6n4 in namespace default-ad522d9e-monitoring-platform with deletion timestamp: 2022-12-29T12:08:11Z
Thu Dec 29 12:20:01 UTC 2022 Force delete pod infra-prometheus-operator-operator-854bdc78b6-2whcf in namespace infra-prometheus with deletion timestamp: 2022-12-29T12:08:11Z
Thu Dec 29 12:20:01 UTC 2022 Force delete pod huebackend-0 in namespace compute-hive01 with deletion timestamp: 2022-12-29T12:08:11Z
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "monitoring-logger-alert-receiver-7485d89576-vg6n4" force deleted
pod "infra-prometheus-operator-operator-854bdc78b6-2whcf" force deleted
Thu Dec 29 12:20:01 UTC 2022 Successfully force deleted pod monitoring-logger-alert-receiver-7485d89576-vg6n4 in namespace default-ad522d9e-monitoring-platform
Thu Dec 29 12:20:01 UTC 2022 Successfully force deleted pod infra-prometheus-operator-operator-854bdc78b6-2whcf in namespace infra-prometheus
pod "huebackend-0" force deleted
......
The 9 pods remaining on ds04 are mostly of DaemonSet type; pod impala-executor-000-0 in namespace impala-impala01 is an exception.
We also see the same result on the k8s web UI, with only 9 failed pods.
Pod impala-executor-000-0 in namespace impala-impala01 has the local-storage PVC scratch-cache-volume-impala-executor-000-0 attached. If you are using local volumes and the node crashes, the pod cannot be rescheduled to a different node; by default it stays scheduled to the same node. That is the caveat of using local storage: the pod becomes bound forever to one specific node. Both the PVC and the pod must be force deleted, as described in issue 61620.
$ kubectl delete pvc scratch-cache-volume-impala-executor-000-0 -n impala-impala01
persistentvolumeclaim "scratch-cache-volume-impala-executor-000-0" deleted
$ kubectl get pvc scratch-cache-volume-impala-executor-000-0 -n impala-impala01
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
scratch-cache-volume-impala-executor-000-0 Terminating pvc-a88fb425-55e5-49b8-b3d5-6b371bb6cb7b 94Gi RWO local-path 34h
$ kubectl patch pvc scratch-cache-volume-impala-executor-000-0 -n impala-impala01 -p'{"metadata":{"finalizers":null}}'
persistentvolumeclaim/scratch-cache-volume-impala-executor-000-0 patched
$ kubectl get pvc scratch-cache-volume-impala-executor-000-0 -n impala-impala01
Error from server (NotFound): persistentvolumeclaims "scratch-cache-volume-impala-executor-000-0" not found
$ kubectl delete pod impala-executor-000-0 -n impala-impala01 --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "impala-executor-000-0" force deleted
Confirm that CP/CDW/CML work well.
5. Conclusion
When an ECS node goes down, the workload pods on it are forcibly deleted by the cronjob pod-reaper and rescheduled to other healthy nodes. However, there are two exceptions that require manual intervention:
Pod vault-0 can be automatically evicted, but you have to manually unseal Vault via the CM UI.
Pods using local storage (impala-executor / impala-coordinator / query-executor / query-coordinator) cannot be evicted; you must manually delete both the PVC and the pod.
The maximum service interruption time is about 25 minutes after a node crash (5-minute pod eviction timeout + up to 10 minutes until the next pod-reaper run + 10-minute REAP_OLDER_THAN threshold), broken down as follows:
No.  Timing Factor            Default Value  Description
1    podEvictionTimeout       5 min          Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after the pod eviction timeout expires
2    CronJob execution cycle  10 min         The CronJob pod-reaper launches a reaper job every 10 minutes
3    REAP_OLDER_THAN          10 min         The reaper job scans all namespaces and force deletes pods that have been in the Terminating state for more than 10 minutes
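The schedule and last run of the reaper CronJob can be checked with, for example:
$ kubectl get cronjob -n pod-reaper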