Appendix B: Node Reboot and Health Check
From time to time, you need to manually reboot all of the nodes (VMs and hypervisors). Updating the packages does not automatically trigger a reboot; you must reboot the nodes manually. The reboot sequence presented in this appendix helps ensure minimal disruption to services. It is equally important to check the health of each service as it is brought back up, so that high-availability clusters maintain a quorum of healthy nodes.
The following procedure details the proper sequence and commands for rebooting and checking each service and node. The procedure covers both Juniper Contrail components and Red Hat components for HA environments.
Prerequisites
Before you reboot your nodes:
Deploy the software update by running the Contrail Cloud update scripts, role by role.
Update and Reboot the Control Plane Nodes
Run the procedure on the nodes without a virtual IP (VIP) first, and then repeat the procedure on the nodes that host a VIP.
The sample output shown here is deployment-specific; your output reflects your own Contrail Cloud deployment.
All SSH connections to the control hosts must be made from the jump host to ensure that the proper SSH keys are used.
Reboot the undercloud and the jump host:
You need to reboot the undercloud and jump host after rebooting the overcloud.
Shut down your undercloud. From the undercloud:
sudo shutdown -h 0
Shut down Contrail Command as the contrail user from the jump host:
ssh command sudo shutdown -h 0
Shut down and reboot the jump host as the root user:
sudo shutdown -r 0
All services come back up after the jump host is rebooted.
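As an optional sanity check (our suggestion, mirroring the failed-units check used later in this appendix), you can confirm that nothing failed to start on the jump host after the reboot; an empty list indicates a healthy boot:
sudo systemctl list-units --state=failed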
Verify Quorum and Node Health
Purpose
It is important to verify the cluster high availability state and the health state of all the control nodes in the cluster. This sequence verifies that there is a quorum of healthy nodes and that the cluster returns to normal operating status.
Action
This procedure needs to be performed on the target host. Establish an SSH connection to the node, complete the steps, and then exit back out.
To verify all components of an HA cluster:
After rebooting the hypervisor, run a system status check from the jump host as the contrail user. Repeat this check until a healthy status is returned:
ssh <control host> sudo systemctl list-units --state=failed
Verify that all VMs have started properly and that no VM is in a failed status:
ssh <control host> sudo virsh list --all
Run a service check from the undercloud for each item on the node until a healthy status is returned:
Check the OpenStack Controller.
Verify the health of the systemd services:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell -a "sudo systemctl list-units --state=failed 'openstack*' 'neutron*' 'httpd' 'docker' 'ceph*'"
Verify the health of Pacemaker from the OpenStack controller:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell -a "sudo pcs status"
Verify that:
All cluster nodes are online.
No resources are stopped on any of the cluster nodes.
There are no failed Pacemaker actions.
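As an optional shortcut (our suggestion rather than part of the documented procedure), you can filter the pcs status output for stopped resources or failed actions; any matching lines warrant a closer look, and the trailing || true keeps Ansible from treating an empty match as a task failure:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell -a "sudo pcs status | grep -iE 'stopped|failed' || true"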
Verify the status of Ceph from the OpenStack controller:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell -a "sudo ceph status"
Verify the status of the Contrail Controller:
ansible -i /usr/bin/tripleo-ansible-inventory ContrailController -m shell -a "sudo contrail-status"
Verify the status of Contrail Analytics:
ansible -i /usr/bin/tripleo-ansible-inventory ContrailAnalytics -m shell -a "sudo contrail-status"
Verify the status of Contrail Analytics Database:
ansible -i /usr/bin/tripleo-ansible-inventory ContrailAnalyticsDatabase -m shell -a "sudo contrail-status"
Verify the status of AppFormix:
ansible -i /usr/bin/tripleo-ansible-inventory AppformixController -m shell -a "curl -s http://127.0.0.1:9000/appformix/controller/v2.0/status"
When successful, this request returns HTTP code 200. Treat any other returned code as a failure.
The API output also contains the AppFormix version. This is helpful to verify that the correct version has been installed. See the following sample:
{ "Version": "2.19.10-65aa34f7ad", "DBVersion": "70" }
Note: AppFormix and MongoDB containers need about 60 seconds after a reboot to properly come up and synchronize with each other.
To check the MongoDB cluster status:
ansible -i /usr/bin/tripleo-ansible-inventory AppformixController --become -m shell -a 'source /opt/appformix/mongo/config/credentials.js ; echo "rs.status();" | docker exec -i appformix-mongo mongo admin --username "${USERNAME}" --password "${PASSWORD}"'
Verify the status of the ToR services node (TSN):
Note: The TSN node is an optional feature and might not exist in your environment.
ansible -i /usr/bin/tripleo-ansible-inventory ContrailTsn -m shell -a "sudo contrail-status"
Repeat this process after each subsequent hypervisor reboot cycle.
Note: Always verify that the services and nodes are healthy before proceeding with the reboot procedure for the next hypervisor.
Start the Pacemaker cluster that was previously stopped.
Starting the Pacemaker cluster happens last, after all updates and reboots have been performed. All nodes must be up at this point so that Pacemaker detects a quorum, and all verifications and health checks must have returned the proper status.
Start the Pacemaker cluster (overcloudXXXctrl-N):
ssh <address> sudo pcs cluster start --all
Enable Pacemaker services:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell --become -a 'pcs status | awk "/stonith/{print \$1}" | xargs -n 1 pcs resource enable'
Reboot Ceph Storage Nodes
Follow this procedure to properly reboot your Ceph storage nodes.
First, disable the Ceph storage cluster, select a storage node, and then reboot it. After the reboot, verify the status of the node. Repeat the reboot process for all Ceph storage nodes. Enable the cluster and then run a final status report to verify that the cluster health is OK.
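A minimal command sketch of that sequence, assuming the cluster is quiesced with the standard noout and norebalance flags (the exact flags and the <storage node> names depend on your deployment; run the ceph commands from a node with Ceph admin access, such as an OpenStack controller):
sudo ceph osd set noout
sudo ceph osd set norebalance
ssh <storage node> sudo shutdown -r 0
ssh <storage node> sudo systemctl list-units --state=failed
sudo ceph status
Repeat the reboot and checks for each remaining Ceph storage node, then re-enable rebalancing and run a final status check:
sudo ceph osd unset norebalance
sudo ceph osd unset noout
sudo ceph status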
Reboot Compute Nodes
Follow this procedure to properly reboot your compute nodes.
First, select the compute node that you want to reboot, and then disable it so that it doesn’t provision new instances. Migrate all instances to another compute node. Reboot the compute node, and enable it once the reboot is complete.
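A minimal sketch of that sequence using standard OpenStack client commands, run from a host where your overcloud credentials are sourced (the <compute node>, <target compute node>, and <instance> values are placeholders, and newer clients accept --live-migration in place of --live <target compute node>):
source ~/overcloudrc
openstack compute service set --disable <compute node> nova-compute
openstack server list --host <compute node> --all-projects
openstack server migrate --live <target compute node> <instance>
ssh <compute node> sudo shutdown -r 0
openstack compute service set --enable <compute node> nova-compute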