OpenShift 4: Master nodes and ETCD cluster recovery without backup
Some time ago I lost an OpenShift cluster and didn’t have backups (shit happens u_u). The events were as follows:
- The team renewed the certificates
- The API server wasn't completely up
- I lost a master node
- All API Server pods crashed
- The ETCD Cluster was corrupted
- I recovered the lost node
- I followed the steps that Red Hat recommends to reset the cluster, but since there was no snapshot, it didn’t work.
- The whole cluster died
So it’s clear that we need a snapshot, but if we don’t have one, there is something that is not officially documented: each master node keeps a snapshot of the ETCD member that runs on it.
The path in my case was /var/home/core/assets/backup/etcd/member/snap. We need the latest snap; in this case, the latest snap is “db”.
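If there are several files in that directory, a quick way to confirm which snapshot is the most recent one (newest files are listed first) is:
$ sudo ls -lt /var/home/core/assets/backup/etcd/member/snap/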
1. Move the snapshot to /var/home/core; the Red Hat script will look for the snapshot there.
$ sudo cp /var/home/core/assets/backup/etcd/member/snap/db /var/home/core/snapshot-recovery.db
We can use a name other than “snapshot-recovery.db”; the important thing is the .db file extension.
2. We need a tar of manifests-stopped; this tar contains the YAMLs for the API Server configuration. All the API Server definitions for your cluster are here.
$ cd /var/home/core/assets
$ tar -czvf static-pod-resources manifests-stopped
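Optionally, we can double-check that the YAMLs were actually archived by listing the contents of the tar we just created:
$ tar -tzvf static-pod-resources | head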
3. Move the tar to /var/home/core
$ mv /var/home/core/assets/static-pod-resources /var/home/core
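At this point /var/home/core should contain both files (the names below match the ones used in the previous steps):
$ ls -l /var/home/core/snapshot-recovery.db /var/home/core/static-pod-resources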
4. Set the value of the INITIAL_CLUSTER variable; the first part is the ETCD member name and the second part is the FQDN/IP of the node.
$ export INITIAL_CLUSTER="etcd-member-master-0=https://etcd-0.app.example:2380"
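If you are not sure about these values: in my cluster the member name follows the etcd-member-&lt;hostname&gt; pattern and the peer URL points to the etcd-N record of the cluster domain on port 2380. Assuming dig is available on the node and that your cluster domain is app.example, you can double-check both like this:
$ hostname
$ dig +short -t SRV _etcd-server-ssl._tcp.app.example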
5. Execute the recovery script
$ sudo /usr/local/bin/etcd-snapshot-restore.sh /var/home/core $INITIAL_CLUSTER
6. Validate the kubelet status
$ sudo systemctl status kubelet
7. Start the kubelet if it is stopped; restarting it is recommended anyway.
$ sudo systemctl start kubelet
8. Validate that the ETCD member is up and listening with these commands:
$ netstat -anlp | grep 2379
$ sudo crictl ps | grep etcd
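If the ETCD container is not listed or keeps restarting, its logs usually tell you why; a quick way to pull them (including stopped containers, hence the -a flag) is:
$ sudo crictl logs $(sudo crictl ps -a --name etcd-member | awk 'FNR==2{ print $1 }')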
9. Access the ETCD pod
id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1 }') && sudo crictl exec -it $id /bin/sh
10. In the ETCD container, export the variables needed to connect to ETCD
sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name "*peer*crt") ETCDCTL_KEY=$(find /etc/ssl/ -name "*peer*key")
11. In the ETCD container, execute etcdctl member list and verify that the member is listed.
sh-4.3# etcdctl member list -w table
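As an extra sanity check, still inside the container and with the same variables exported, you can ask ETCD for the health of the local endpoint (2379 is the client port). If it complains about certificates, rely on the member list output from the previous step.
sh-4.3# etcdctl endpoint health --endpoints=https://127.0.0.1:2379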
12. If all your ETCD members are working, that’s all, but if any of them have failed, you must remove the failing ETCD members and add them back as new members, as sketched below.
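A minimal sketch of that removal/re-addition, run from the same ETCD container. The member ID comes from the member list output above, and the name and URL here (etcd-member-master-1 / etcd-1.app.example) are just examples following the same pattern as INITIAL_CLUSTER; use the values of the failing node.
sh-4.3# etcdctl member remove <MEMBER_ID>
sh-4.3# etcdctl member add etcd-member-master-1 --peer-urls=https://etcd-1.app.example:2380
The new member then has to be started on its node so it can join the cluster and sync from the leader.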
I hope this helps you; if not, please let me know so I can improve the post.
Chao!