Appendix B: Node Reboot and Health Check
From time to time, you need to manually reboot all of the nodes (VMs and hypervisors). Updating the packages does not automatically trigger a reboot; you must reboot the nodes manually. The reboot sequence presented in this appendix helps ensure minimal disruption to services. It is equally important to check the health of each service as it is brought back up, so that high-availability clusters maintain a quorum of healthy nodes.
The following procedure details the proper sequence and commands for rebooting and checking each service and node. The procedure covers both Juniper Contrail components and Red Hat components for HA environments.
Prerequisites
Before you reboot your nodes:
Deploy the software update by running the Contrail Cloud update scripts, role by role.
Update and Reboot the Control Plane Nodes
Run the procedure on the nodes without a virtual IP (VIP) first, and then repeat the procedure on the nodes that host a VIP.
The sample output shown here is deployment-specific; your output reflects your own Contrail Cloud deployment.
All SSH connections to the control hosts must be made from the jump host to ensure that the proper SSH keys are used.
Reboot the undercloud and the jump host:
You need to reboot the undercloud and jump host after rebooting the overcloud.
Shut down your undercloud. From the undercloud:
sudo shutdown -h 0
Shut down Contrail Command as the contrail user from the jump host:
ssh command sudo shutdown -h 0
Shut down and reboot the jump host as the root user:
sudo shutdown -r 0
All services come back up after the jump host is rebooted.
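As an optional sanity check (our suggestion, mirroring the failed-units check used later in this appendix), you can confirm that nothing failed to start on the jump host after the reboot; an empty list indicates a healthy boot:
sudo systemctl list-units --state=failed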
Verify Quorum and Node Health
Purpose
It is important to verify the cluster high availability state and the health state of all the control nodes in the cluster. This sequence verifies that there is a quorum of healthy nodes and that the cluster returns to normal operating status.
Action
This procedure needs to be performed on the target host. Establish an SSH connection to the node, complete the steps, and then exit back out.
To verify all components of an HA cluster:
After rebooting the hypervisor, run a system status check from the jump host as the contrail user. Repeat this check until a healthy status is returned:
ssh <control host> sudo systemctl list-units --state=failed
Verify that all VMs have started properly and that no VM is in a failed status:
ssh <control host> sudo virsh list --all
Run a service check from the undercloud for each item on the node until a healthy status is returned:
Check the OpenStack Controller.
Verify the health of the systemd services:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell -a "sudo systemctl list-units --state=failed 'openstack*' 'neutron*' 'httpd' 'docker' 'ceph*'"
Verify the health of Pacemaker from the OpenStack controller:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell -a "sudo pcs status"
Verify that:
All cluster nodes are online.
No resources are stopped on any of the cluster nodes.
There are no failed Pacemaker actions.
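As an optional shortcut (our suggestion rather than part of the documented procedure), you can filter the pcs status output for stopped resources or failed actions; any matching lines warrant a closer look, and the trailing || true keeps Ansible from treating an empty match as a task failure:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell -a "sudo pcs status | grep -iE 'stopped|failed' || true"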
Verify the status of Ceph from the OpenStack controller:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell -a "sudo ceph status"
Verify the status of the Contrail Controller:
ansible -i /usr/bin/tripleo-ansible-inventory ContrailController -m shell -a "sudo contrail-status"
Verify the status of Contrail Analytics:
ansible -i /usr/bin/tripleo-ansible-inventory ContrailAnalytics -m shell -a "sudo contrail-status"
Verify the status of Contrail Analytics Database:
ansible -i /usr/bin/tripleo-ansible-inventory ContrailAnalyticsDatabase -m shell -a "sudo contrail-status"
Verify the status of AppFormix:
ansible -i /usr/bin/tripleo-ansible-inventory AppformixController -m shell -a "curl -s http://127.0.0.1:9000/appformix/controller/v2.0/status"
When successful, this request returns HTTP code 200. Treat any other returned code as a failure.
The API output also contains the AppFormix version. This is helpful to verify that the correct version has been installed. See the following sample:
{ "Version": "2.19.10-65aa34f7ad", "DBVersion": "70" }
Note: AppFormix and MongoDB containers need about 60 seconds after a reboot to properly come up and synchronize with each other.
To check the MongoDB cluster status:
ansible -i /usr/bin/tripleo-ansible-inventory AppformixController --become -m shell -a 'source /opt/appformix/mongo/config/credentials.js ; echo "rs.status();" | docker exec -i appformix-mongo mongo admin --username "${USERNAME}" --password "${PASSWORD}"'
Verify the status of the ToR services node (TSN):
Note: The TSN node is an optional feature and might not exist in your environment.
ansible -i /usr/bin/tripleo-ansible-inventory ContrailTsn -m shell -a "sudo contrail-status"
Repeat this process after each subsequent hypervisor reboot cycle.
Note: Always verify that the services and nodes are healthy before proceeding with the reboot procedure for the next hypervisor.
Start the Pacemaker cluster that was previously stopped.
Starting the Pacemaker cluster happens last, after all updates and reboots have been performed. All nodes must be up at this point so that Pacemaker detects a quorum, and all verifications and health checks must have returned the proper status.
Start the Pacemaker cluster (overcloudXXXctrl-N):
ssh <address> sudo pcs cluster start --all
Enable Pacemaker services:
ansible -i /usr/bin/tripleo-ansible-inventory Controller -m shell --become -a 'pcs status | awk "/stonith/{print \$1}" | xargs -n 1 pcs resource enable'
Reboot Ceph Storage Nodes
Follow this procedure to properly reboot your Ceph storage nodes.
First, disable the Ceph storage cluster, select a storage node, and then reboot it. After the reboot, verify the status of the node. Repeat the reboot process for all Ceph storage nodes. Enable the cluster and then run a final status report to verify that the cluster health is OK.
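A minimal command sketch of that sequence, assuming the cluster is quiesced with the standard noout and norebalance flags (the exact flags and the <storage node> names depend on your deployment; run the ceph commands from a node with Ceph admin access, such as an OpenStack controller):
sudo ceph osd set noout
sudo ceph osd set norebalance
ssh <storage node> sudo shutdown -r 0
ssh <storage node> sudo systemctl list-units --state=failed
sudo ceph status
Repeat the reboot and checks for each remaining Ceph storage node, then re-enable rebalancing and run a final status check:
sudo ceph osd unset norebalance
sudo ceph osd unset noout
sudo ceph status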
Reboot Compute Nodes
Follow this procedure to properly reboot your compute nodes.
First, select the compute node that you want to reboot, and then disable it so that it doesn’t provision new instances. Migrate all instances to another compute node. Reboot the compute node, and enable it once the reboot is complete.
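A minimal sketch of that sequence using standard OpenStack client commands, run from a host where your overcloud credentials are sourced (the <compute node>, <target compute node>, and <instance> values are placeholders, and newer clients accept --live-migration in place of --live <target compute node>):
source ~/overcloudrc
openstack compute service set --disable <compute node> nova-compute
openstack server list --host <compute node> --all-projects
openstack server migrate --live <target compute node> <instance>
ssh <compute node> sudo shutdown -r 0
openstack compute service set --enable <compute node> nova-compute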