Understanding High-Availability Failover Scenarios
The following sections describe possible high-availability failure scenarios: how each failure is detected, the recovery action that is taken or required, and, if applicable, the impact of the failure on the system.
Active VIP Node Crashes
Detection
The heartbeat service running on the standby VIP node detects the crash within 10 seconds of not receiving any heartbeat messages from its peer. The JBoss clustering mechanism enables JBoss servers on the other nodes to detect that the JBoss server on the failed node is unresponsive in about 52 seconds.
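To make the detection mechanism concrete, the following Python sketch shows the general idea of a heartbeat dead timer: the standby declares its peer down once no heartbeat message has arrived within a fixed deadtime (10 seconds in the scenario above). This is an illustrative sketch only, not the Junos Space heartbeat implementation; the class and method names are hypothetical.

# Illustrative sketch of a heartbeat dead timer; not the Junos Space implementation.
import time

class PeerMonitor:
    def __init__(self, deadtime=10.0):
        self.deadtime = deadtime                 # seconds of silence before the peer is declared down
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self):
        # Called whenever a heartbeat message arrives from the peer node.
        self.last_heartbeat = time.monotonic()

    def peer_is_down(self):
        # True once no heartbeat has been seen for longer than the deadtime.
        return (time.monotonic() - self.last_heartbeat) > self.deadtime

if __name__ == "__main__":
    monitor = PeerMonitor(deadtime=1.0)          # shortened deadtime for the demo
    time.sleep(1.5)                              # simulate a silent (crashed) peer
    print("peer down:", monitor.peer_is_down())  # True; this is what triggers VIP takeover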
Recovery
The standby node immediately takes over the VIP address.
Device connections served by the failed node are migrated to the remaining nodes in the cluster. This process starts in about one minute after the JBoss cluster members detect that the JBoss server on the failed node is down. The time it takes for the process to complete depends on the number of device connections to be migrated, the load on the remaining nodes, and so on. Typically, the process is completed within a few minutes.
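The following Python sketch illustrates one way such a migration could distribute the orphaned device connections, favoring the least-loaded surviving nodes. The function, node names, and counts are hypothetical and only illustrate the balancing idea described above; this is not the Junos Space migration code.

# Conceptual sketch: spread the failed node's device connections across the
# surviving nodes so that the least-loaded nodes absorb the most connections.
import heapq

def migrate_connections(orphaned_devices, node_loads):
    # orphaned_devices: device IDs previously served by the failed node
    # node_loads: surviving node name -> current connection count
    heap = [(load, node) for node, load in node_loads.items()]
    heapq.heapify(heap)
    assignments = {}
    for device in orphaned_devices:
        load, node = heapq.heappop(heap)       # current least-loaded node
        assignments[device] = node
        heapq.heappush(heap, (load + 1, node))
    return assignments

if __name__ == "__main__":
    orphans = ["dev-%02d" % i for i in range(6)]
    print(migrate_connections(orphans, {"node-2": 40, "node-3": 35}))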
Impact
The VIP address becomes unavailable for about 10 seconds until it is taken over by the standby node. The GUI or API client access during this period encounters transient errors. In addition, any SNMP traps sent by the devices to the VIP address during this interval are lost.
Device connectivity is down for a few minutes for devices whose connections were being served by the JBoss server on the failed node.
Any jobs that were in progress on the failed node are marked as failed and the reason is indicated.
Starting in Junos Space Network Management Platform Release 21.1R1, to perform a manual failover instead of rebooting the node, run the following commands from the CLI of the VIP node:
systemctl restart corosync
systemctl restart pacemaker
Standby VIP Node Crashes
Detection
The JBoss clustering mechanism enables JBoss servers on the other nodes to detect that the JBoss server on the failed node is unresponsive in about 52 seconds.
Recovery
Device connections served by the failed node are migrated to the remaining nodes in the cluster. This process starts in about one minute after the JBoss cluster members detect that the JBoss server on the failed node is down. The process completion time depends on the number of device connections to be migrated, the load on the remaining nodes, and so on. Typically, this process is completed within a few minutes.
Impact
Device connectivity is down for a few minutes for devices whose connections were being served by the JBoss server on the failed node.
Any jobs that were in progress on the failed node are marked as failed and the reason is indicated.
eth0 on the Active VIP Node Goes Down
Detection
The heartbeat service running on the standby VIP node detects the failure within 10 seconds of not receiving any heartbeat messages from its peer. The JBoss clustering mechanism enables JBoss servers on the other nodes to detect that the JBoss server on the failed node is unresponsive in about 52 seconds.
Recovery
The standby node immediately takes over the VIP address.
Device connections served by the failed node are migrated to the remaining nodes in the cluster. This process starts in about one minute after the JBoss cluster members detect that the JBoss server on the failed node is down. The time it takes for the process to complete depends on the number of device connections to be migrated, the load on the remaining nodes, and so on. Typically, the process is completed within a few minutes.
Impact
The VIP address becomes unavailable for about 10 seconds until it is taken over by the standby node. The GUI or API client access during this period encounters transient errors. In addition, any SNMP traps sent by the devices to the VIP address during this interval are lost.
Device connectivity is down for a few minutes for the devices whose connections were being served by the JBoss server on the failed node.
Any jobs that were in progress on the failed node are marked as failed and the reason is indicated.
eth0 on the Standby VIP Node Goes Down
Detection
The JBoss clustering mechanism enables JBoss servers on the other nodes to detect that the JBoss server on the failed node is unresponsive in about 52 seconds.
Recovery
Device connections served by the failed node are migrated to the remaining nodes in the cluster. This process starts in about one minute after the JBoss cluster members detect that the JBoss server on the failed node is down. The process completion time depends on the number of device connections to be migrated, the load on the remaining nodes, and so on. Typically, this process is completed within a few minutes.
Impact
Device connectivity is down for a few minutes for the devices whose connections were being served by the JBoss server on the failed node.
Any jobs that were in progress on the failed node are marked as failed and the reason is indicated.
A Non-VIP Node Crashes
Detection
The JBoss clustering mechanism enables JBoss servers on the other nodes to detect that the JBoss server on the failed node is unresponsive in about 52 seconds.
Recovery
Device connections served by the failed node are migrated to the remaining nodes in the cluster. This process starts in about one minute after the JBoss cluster members detect that the JBoss server on the failed node is down. The time it takes for the process to complete depends on the number of device connections to be migrated, the load on the remaining nodes, and so on. Typically, this process is completed in a few minutes.
Impact
Device connectivity is down for a few minutes for devices whose connections were served by the JBoss server on the failed node. Any jobs that were in progress on the failed node are marked as failed and the reason is indicated.
eth0 on a Non-VIP Node Goes Down
Detection
The JBoss clustering mechanism enables JBoss servers on the other nodes to detect that the JBoss server on the failed node is unresponsive in about 52 seconds.
Recovery
Device connections served by the failed node are migrated to the remaining nodes in the cluster. This process starts in about one minute after the JBoss cluster members detect that the JBoss server on the failed node is down. The process completion time depends on the number of device connections to be migrated, the load on the remaining nodes, and so on. Typically, this process is completed in a few minutes.
Impact
Device connectivity is down for a few minutes for the devices whose connections were being served by the JBoss server on the failed node.
Any jobs that were in progress on the failed node are marked as failed and the reason is indicated.
eth3 on a Non-VIP Node Goes Down
Detection
The device keepalive monitor detects within 15 minutes that all device connections served by this node are down and marks the connection status of these devices as Down.
Recovery
For connections initiated by Junos Space, Junos Space attempts to reconnect with these devices. Each attempt is made from the cluster node that is determined to be the least loaded in terms of the number of devices it manages. If other nodes in the cluster are significantly less loaded than this node according to this load-balancing check, reconnection attempts are made from those nodes and they succeed. In this case, connectivity for these devices comes back up in a few minutes. If this node happens to be the least loaded, then all reconnection attempts are made from this node, and these attempts continue to fail as long as eth3 remains down.
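The load-balancing check described above can be pictured with the short Python sketch below: every reconnection attempt originates from the node that currently manages the fewest devices, so if the node with the failed eth3 interface is itself the least loaded, it keeps being selected and the attempts keep failing. The node names and counts are hypothetical; this is not the Junos Space implementation.

# Sketch of the least-loaded selection described above.
def pick_reconnect_node(device_counts):
    # Return the node that manages the fewest devices (ties broken by name).
    return min(device_counts, key=lambda node: (device_counts[node], node))

if __name__ == "__main__":
    # node-1 has the failed eth3 but is also the least loaded, so it is chosen
    # again and again, and each reconnection attempt fails until eth3 is restored.
    print(pick_reconnect_node({"node-1": 20, "node-2": 45, "node-3": 50}))  # node-1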
In the case of device‐initiated connections, the device detects a connection failure in about 15 minutes, and then reconnects with another node in the cluster in the next few seconds.
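A device-initiated reconnection can be sketched as follows: once the device declares its current connection dead, it simply tries the other cluster nodes until one accepts. This is a generic illustration under stated assumptions; the node list is hypothetical and the port number is only a placeholder for the device-management port.

# Generic sketch of a device walking the cluster nodes until one accepts.
import socket

def reconnect(cluster_nodes, port=7804, timeout=5.0):
    # port 7804 is used here only as a placeholder for the management port.
    for node in cluster_nodes:
        try:
            return socket.create_connection((node, port), timeout=timeout)
        except OSError:
            continue                      # node unreachable; try the next one
    raise ConnectionError("no cluster node reachable")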
Impact
Device connectivity is down for devices whose connections were being served by this node. Connectivity might be down for 15 minutes (best case) or until eth3 is brought back up (worst case). In addition, the outage time might vary from device to device depending on which node is chosen to attempt a reconnection for that device. In the case of device‐initiated connections, the outage lasts for a little more than 15 minutes.
eth3 on the Active VIP Node Goes Down
Detection
The device keepalive monitor detects within 15 minutes that all device connections served by this node are down and marks the connection status of these devices as Down.
Recovery
For connections initiated by Junos Space, Junos Space attempts to reconnect with these devices. Each attempt is made from the cluster node that is determined to be the least loaded in terms of the number of devices it manages. If other nodes in the cluster are significantly less loaded than this node according to this load-balancing check, reconnection attempts are made from those nodes and they succeed. In this case, connectivity for these devices comes back up in a few minutes. If this node happens to be the least loaded, then all reconnection attempts are made from this node, and these attempts continue to fail as long as eth3 remains down.
In the case of device‐initiated connections, the device detects a connection failure in about 15 minutes and then reconnects with another node in the cluster in the next few seconds.
Impact
Device connectivity is down for the devices whose connections were being served by this node. Connectivity might be down for 15 minutes (best case) or until eth3 is brought back up (worst case). In addition, the outage time might vary from device to device depending on which node is chosen to attempt a reconnection for that device. In the case of device‐initiated connections, the outage lasts for a little more than 15 minutes.
JBoss Server on a Node Goes Down
Detection
When the JBoss server on a node goes down, other nodes in the JBoss cluster detect the failure in about two seconds because their TCP connections to the failed JBoss server are closed by the operating system.
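The reason the detection is fast can be seen in the small Python sketch below: when the peer process dies, the operating system closes the TCP connection, and the surviving side sees the closure on its very next read instead of waiting for a long keepalive timeout. This is a generic socket example, not JBoss clustering code; the port number is arbitrary.

# Generic sketch: a closed peer connection is observed immediately on recv().
import socket, threading, time

def accept_then_close(port):
    srv = socket.create_server(("127.0.0.1", port))
    conn, _ = srv.accept()
    conn.close()                # simulate the peer process going away
    srv.close()

if __name__ == "__main__":
    threading.Thread(target=accept_then_close, args=(9099,), daemon=True).start()
    time.sleep(0.2)             # give the listener time to start
    client = socket.create_connection(("127.0.0.1", 9099))
    data = client.recv(1024)    # returns b"" as soon as the peer closes
    print("peer closed connection:", data == b"")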
Recovery
Device connections served by the failed JBoss server are migrated to the other nodes in the cluster. This process starts in about one minute after the JBoss cluster members detect that the JBoss server on the failed node is down. The time it takes for the process to complete depends on the number of device connections to be migrated, the load on the remaining nodes, and so on. Typically, the process is completed within a few minutes.
The watchdog service (jmp-watchdog) running on the node detects that the JBoss server is down and restarts it automatically. When the JBoss server comes back up, it is automatically discovered by other cluster members and added to the cluster. It then synchronizes its cache from the other nodes in the cluster. The typical restart time for JBoss is two to five minutes. However, it can take more time depending on the number of applications installed, the number of devices being managed, the number of DMI schema versions installed, and so on.
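A watchdog of this kind boils down to a poll-and-restart loop, as in the hedged Python sketch below. It is not the actual jmp-watchdog code; is_jboss_running() and start_jboss() are hypothetical placeholders, and the "jboss" process pattern and service name are assumptions.

# Simplified watchdog loop; not the actual jmp-watchdog implementation.
import subprocess, time

def is_jboss_running():
    # Placeholder check; a real watchdog would track the service PID or query
    # the init system rather than relying on pgrep.
    return subprocess.call(["pgrep", "-f", "jboss"], stdout=subprocess.DEVNULL) == 0

def start_jboss():
    # Placeholder restart action; assumes a "jboss" service unit exists.
    subprocess.call(["systemctl", "start", "jboss"])

def watchdog_loop(poll_interval=5.0):
    while True:
        if not is_jboss_running():
            start_jboss()       # JBoss then rejoins the cluster and resyncs its cache
        time.sleep(poll_interval)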
Impact
Device connectivity is down for a few minutes for devices whose connections were being served by the JBoss server that went down.
Any jobs that were in progress on the crashed JBoss server are marked as failed and the reason is indicated.
MySQL Server on the Active VIP Node Goes Down
Detection
If the MySQL server on the active VIP node goes down, the watchdog service running on that node detects the failure in about one to two seconds.
Recovery
The watchdog service immediately restarts the MySQL server on the node. When restarted, the MySQL server comes up in around 20 to 60 seconds.
Impact
The MySQL server on the VIP node is the active database servicing all requests from all JBoss servers in the cluster. This effectively means that a brief database outage could be experienced by JBoss on all nodes for this duration (20 to 60 seconds). Any requests that require database access fail during this period. This results in failures encountered by GUI or API clients on their requests, which internally require database access during this period. This also results in failures of jobs that require database access during this period.
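Clients that cannot tolerate these transient failures can ride out the restart window with a simple retry, as in the hedged sketch below. The send_request callable is a hypothetical stand-in for any GUI or API request that needs database access; the retry budget is sized to comfortably exceed the 20-to-60-second outage.

# Hedged sketch: retry a request across the brief MySQL restart window.
import time

def call_with_retry(send_request, attempts=6, delay=15.0):
    last_error = None
    for _ in range(attempts):
        try:
            return send_request()
        except Exception as error:   # e.g. an HTTP 5xx surfaced as an exception
            last_error = error
            time.sleep(delay)        # total wait ~90 s, longer than the worst-case outage
    raise last_error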
MySQL Server on the Standby VIP Node Goes Down
Detection
If the MySQL server on the standby VIP node goes down, the watchdog service running on that node detects the failure in about one to two seconds.
Recovery
The watchdog service immediately restarts the MySQL server on the node. When restarted, it takes around 20 to 60 seconds for the MySQL server to come up. After it is back up, this server resynchronizes with the primary server in the background and the resynchronization time depends on the number of changes that happened during the outage.
Impact
Since the MySQL server on the standby VIP node is not accessed by JBoss, its downtime does not cause any adverse impact that is noticed by the rest of the system or users of the system.
Primary Database Node Crashes
Detection
The heartbeat service running on the secondary database node detects the crash within 10 seconds of not receiving any heartbeat messages from the primary database node.
Recovery
The database VIP address is transferred to the secondary database node within 10 to 20 seconds. The JBoss servers on other nodes can access the database after the database VIP address is taken over by the secondary database node.
Impact
The database VIP address becomes unavailable for about 10 to 20 seconds until it is taken over by the secondary database node. The MySQL server on the primary database node is the active database servicing all requests from all JBoss servers in the cluster. This effectively means that a brief database outage could be experienced by JBoss on all nodes for this duration (10 to 20 seconds). Any requests that require database access fail during this period. This results in failures encountered by GUI and API clients on their requests that internally require database access during this period. This also results in failures of jobs that require database access during this period.
Secondary Database Node Crashes
Detection
The heartbeat service running on the primary database node detects the crash within 10 seconds of not receiving any heartbeat messages from the secondary database node.
Recovery
The node can be deleted and a new node can be added to the Junos Space cluster as a secondary database node to maintain database high availability.
Impact
Because the MySQL server on the secondary database node is not accessed by JBoss, its downtime does not cause any adverse impact that is noticed by the rest of the system or users of the system.
MySQL Server on the Primary Database Node Goes Down
Detection
If the MySQL server on the primary database node goes down, the watchdog service running on that node detects the failure in about one to two seconds.
Recovery
The watchdog service immediately restarts the MySQL server on the node. When restarted, the MySQL server comes up in around 20 to 60 seconds.
Impact
The MySQL server on the primary database node is the active database servicing all requests from all JBoss servers in the cluster. This effectively means that a brief database outage could be experienced by JBoss on all nodes for this duration (20 to 60 seconds). Any requests that require database access fail during this period. This results in failures encountered by GUI and API clients on their requests that internally require database access during this period. This also results in failures of jobs that require database access during this period.
MySQL Server on the Secondary Database Node Goes Down
Detection
If the MySQL server on the secondary database node goes down, the watchdog service running on that node detects the failure in about one to two seconds.
Recovery
The watchdog service immediately restarts the MySQL server on the node. When restarted, it takes around 20 to 60 seconds for the MySQL server to come up. After it is back up, this server resynchronizes with the primary database node in the background. The resynchronization time depends on the number of changes that happened during the outage.
Impact
Because the MySQL server on the secondary database node is not accessed by JBoss, its downtime does not cause any adverse impact that is noticed by the rest of the system or users of the system.
Apache HTTP Server on the Active VIP Node Goes Down
Detection
If the Apache HTTP server on the active VIP node goes down, the watchdog service running on that node detects the failure in about one to two seconds.
Recovery
The watchdog service immediately restarts the Apache HTTP server on the node and it becomes ready for service in one second.
Impact
A brief service outage could be experienced by GUI and NBI clients until the Apache HTTP server is restarted. However, this outage is only for a few seconds (typically, two seconds) and is hardly noticed by the clients.
Apache HTTP Server on the Standby VIP Node Goes Down
Detection
If the Apache HTTP server on the standby VIP node goes down, the watchdog service running on that node detects the failure in about one to two seconds.
Recovery
The watchdog service immediately restarts the Apache HTTP server on the node and it becomes ready for service in one second.
Impact
No impact.
Dedicated Cassandra Node Crashes
Detection
If the Cassandra node goes down, the watchdog service detects that the Cassandra service is down on that node in about one to two seconds.
Recovery
The Cassandra node that is down must be deleted from the fabric.
Impact
Files cannot be uploaded to or deleted from the Cassandra database until the node that is down is deleted from the fabric.
Cassandra Service on a JBoss Node Goes Down
Detection
If the Cassandra service on a JBoss node goes down, the watchdog service detects that the Cassandra service is down on that node in about one to two seconds.
Recovery
The Cassandra service on the node must be disabled.
Impact
Files cannot be uploaded to or deleted from the Cassandra database until the Cassandra service is disabled on the node.