OpenShift 4: Master nodes and ETCD cluster recovery without backup
Some time ago I lost an OpenShift cluster and didn’t have backups (shit happens u_u). The events were as follows:
- The team renewed the certificates
- The API server wasn't completely up
- I lost a master node
- All API Server pods crashed
- The ETCD Cluster was corrupted
- I recovered the lost node
- I followed the steps that Red Hat recommends to reset the cluster, but since there was no snapshot, it didn’t work.
- The whole cluster died
So it’s clear that we need a snapshot, but if we don’t have one, there is something that is not officially documented: each master node keeps a snapshot of the ETCD member that runs on it.
The path in my case was /var/home/core/assets/backup/etcd/member/snap. We need the latest snap; in this case, the latest snap is “db”.
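If there are several files in that directory, a quick way to confirm which snapshot is the most recent one (newest files are listed first) is:
$ sudo ls -lt /var/home/core/assets/backup/etcd/member/snap/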
1. Move the snapshot to /var/home/core; the Red Hat script will look for the snapshot there.
$ sudo cp /var/home/core/assets/backup/etcd/member/snap/db /var/home/core/snapshot-recovery.db
We can use a name other than “snapshot-recovery.db”; the important thing is the .db file extension.
2. We need a tar of manifests-stopped; this tar contains the YAMLs for the API Server configuration. All the API Server definitions for your cluster are here.
$ cd /var/home/core/assets
$ tar -czvf static-pod-resources manifests-stopped
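Optionally, we can double-check that the YAMLs were actually archived by listing the contents of the tar we just created:
$ tar -tzvf static-pod-resources | head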
3. Move the tar to /var/home/core
$ mv /var/home/core/assets/static-pod-resources /var/home/core
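At this point /var/home/core should contain both files (the names below match the ones used in the previous steps):
$ ls -l /var/home/core/snapshot-recovery.db /var/home/core/static-pod-resources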
4. Set the value of the INITIAL_CLUSTER variable; the first part is the ETCD member name and the second part is the FQDN/IP of the node.
$ export INITIAL_CLUSTER="etcd-member-master-0=https://etcd-0.app.example:2380"
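If you are not sure about these values: in my cluster the member name follows the etcd-member-&lt;hostname&gt; pattern and the peer URL points to the etcd-N record of the cluster domain on port 2380. Assuming dig is available on the node and that your cluster domain is app.example, you can double-check both like this:
$ hostname
$ dig +short -t SRV _etcd-server-ssl._tcp.app.example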
5. Execute the recovery script
$ sudo /usr/local/bin/etcd-snapshot-restore.sh /var/home/core $INITIAL_CLUSTER
6. Validate the kubelet status
$ sudo systemctl status kubelet
7. Start the kubelet if it is stopped; restarting it is recommended anyway.
$ sudo systemctl start kubelet
8. Validate that the ETCD member is up and listening with these commands:
$ netstat -anlp | grep 2379
$ sudo crictl ps | grep etcd
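If the ETCD container is not listed or keeps restarting, its logs usually tell you why; a quick way to pull them (including stopped containers, hence the -a flag) is:
$ sudo crictl logs $(sudo crictl ps -a --name etcd-member | awk 'FNR==2{ print $1 }')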
9. Access the ETCD pod
id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1 }') && sudo crictl exec -it $id /bin/sh
10. In the ETCD container, export the variables needed to connect to ETCD
sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name "*peer*crt") ETCDCTL_KEY=$(find /etc/ssl/ -name "*peer*key")
11. In the ETCD container, execute etcdctl member list and verify that the member is listed.
sh-4.3# etcdctl member list -w table
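As an extra sanity check, still inside the container and with the same variables exported, you can ask ETCD for the health of the local endpoint (2379 is the client port). If it complains about certificates, rely on the member list output from the previous step.
sh-4.3# etcdctl endpoint health --endpoints=https://127.0.0.1:2379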
12. If all your ETCD members are working, that’s all, but if any of them have failed, you must remove the failing ETCD members and add them back as new members, as sketched below.
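A minimal sketch of that removal/re-addition, run from the same ETCD container. The member ID comes from the member list output above, and the name and URL here (etcd-member-master-1 / etcd-1.app.example) are just examples following the same pattern as INITIAL_CLUSTER; use the values of the failing node.
sh-4.3# etcdctl member remove <MEMBER_ID>
sh-4.3# etcdctl member add etcd-member-master-1 --peer-urls=https://etcd-1.app.example:2380
The new member then has to be started on its node so it can join the cluster and sync from the leader.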
I hope this helps you; if not, please let me know so I can improve the post.
Chao!