Asset Connection Resiliency
The SSR software provides path resiliency to a destination when multiple paths exist between peered SSRs. However, a destination may not have another SSR peer in its path. For example, an SSR Conductor may be deployed on the public internet without a front-end SSR.
Refer to the BCP on Conductor Deployment for recommendations on how to deploy a Conductor.
Without a peering SSR, or without the use of dynamic routing protocols, the SSR treats all next hops as equivalent, even though some of them may not lead to the destination. The following diagram depicts a case where two of the three paths to the Conductor are not available, even though the next hop gateway is responsive.
One of the connections an SSR establishes to a Conductor is that of the salt-minion (router) connecting to a salt-master (Conductor). The salt-minion and salt-master maintain a persistent TCP connection on port 4505 which the salt-master uses to publish requests to all salt-minions. When a salt-minion completes a request, it creates a transient connection on port 4506 back to the salt-master, sends an event with the result, then closes the connection.
The effect on salt is that if an event is sent to a next hop that cannot reach the Conductor, the result never makes it back to the salt-master. Since the SSR load balances Conductor traffic between all possible paths, some percentage of salt events will fail to be returned to the salt-master, resulting in state inconsistencies between router and Conductor.
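As a rough illustration of the scale of the problem, the sketch below estimates the share of salt events lost when traffic is spread over several next hops and only some of them reach the Conductor. It assumes an even split across paths, which is a simplification made for this example and not a statement about how the SSR actually weights paths; the function name is hypothetical.

```python
from fractions import Fraction


def expected_loss_fraction(total_paths: int, working_paths: int) -> Fraction:
    """Fraction of events sent toward a next hop that cannot reach the Conductor.

    Assumption (illustration only): events are spread evenly across all paths.
    """
    if total_paths <= 0:
        raise ValueError("total_paths must be positive")
    if not 0 <= working_paths <= total_paths:
        raise ValueError("working_paths must be between 0 and total_paths")
    return Fraction(total_paths - working_paths, total_paths)


# Three next hops, only one of which actually reaches the Conductor.
loss = expected_loss_fraction(total_paths=3, working_paths=1)
print(f"{float(loss):.0%} of salt events would be lost")
```

Under that assumption, the three-path case described above loses two out of every three events, which is why the failures appear intermittent rather than total.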
To avoid these intermittent failures, the SSR can optionally create a long-lived SSH session to the salt-master using FIPS OpenSSH. A long-lived SSH session mitigates downstream network failures by ensuring that the session is established on at least one working path before salt events are transmitted. When a network failure occurs, the SSH session is reestablished on the next available working path. Salt transactions are carried within the long-lived SSH connection instead. The SSR creates SSH sessions for both the salt publish port (4505) and the return port (4506).
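The reestablishment behavior can be pictured with a small conceptual model. This is a sketch only: the SSR's actual path-selection logic is not described here, and the path names, the `try_connect` callback, and the function name are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def establish_on_first_working_path(
    paths: Iterable[str],
    try_connect: Callable[[str], bool],
) -> Optional[str]:
    """Return the first path on which a session could be established, else None."""
    for path in paths:
        if try_connect(path):
            return path
    return None


# Hypothetical path names; initially only "wan3" reaches the Conductor.
reachable = {"wan3"}
chosen = establish_on_first_working_path(
    ["wan1", "wan2", "wan3"], lambda p: p in reachable
)
print(chosen)

# "wan3" then fails while "wan2" recovers, so the session moves to "wan2".
reachable = {"wan2"}
chosen = establish_on_first_working_path(
    ["wan1", "wan2", "wan3"], lambda p: p in reachable
)
print(chosen)
```

The point of the model is that salt traffic is only sent once a session exists, so an unreachable next hop costs a connection attempt rather than a lost event.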
Configuration
Asset connection resiliency is configured at the authority > asset-connection-resiliency level, where it applies to every router within the authority, and at the router > asset-connection-resiliency level, where it overrides the value set at the authority. These SSH connections are unnecessary if the network topology already affords reliable salt connectivity.
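The override relationship between the two levels can be sketched as follows. This is a minimal model that assumes the setting behaves like a single on/off value and that an unset router-level value falls back to the authority; the function and parameter names are hypothetical and are not SSR configuration fields.

```python
from typing import Optional


def effective_resiliency(
    authority_enabled: bool, router_enabled: Optional[bool]
) -> bool:
    """Return the value a router would use.

    Assumption: a router-level value, when set, overrides the authority-level
    value; None models "not configured at the router".
    """
    return authority_enabled if router_enabled is None else router_enabled


# Authority enables the feature; one router opts out, another inherits.
print(effective_resiliency(authority_enabled=True, router_enabled=False))
print(effective_resiliency(authority_enabled=True, router_enabled=None))
```

A router on a topology that already provides reliable salt connectivity could, under this model, opt out without changing the authority-wide setting.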
Troubleshooting
To determine whether the long-lived SSH sessions are established, run show system services, which shows whether each service is running properly:
Retrieving system services...
================== ==================================== ==============
Node               Service                              Active State
================== ==================================== ==============
T114_DUT3.Router   128TWeb.service                      active
T114_DUT3.Router   auditd.service                       active
T114_DUT3.Router   prank.service                        active
T114_DUT3.Router   salt-ssh@192.168.1.14-4505.service   active
T114_DUT3.Router   salt-ssh@192.168.1.14-4506.service   active
T114_DUT3.Router   salt-ssh@192.168.1.6-4505.service    active
T114_DUT3.Router   salt-ssh@192.168.1.6-4506.service    active
T114_DUT3.Router   t128-process-metrics.service         active
T114_DUT3.Router   tank.service                         active
T114_DUT4.Router   128TWeb.service                      active
T114_DUT4.Router   auditd.service                       active
T114_DUT4.Router   prank.service                        active
T114_DUT4.Router   salt-ssh@192.168.1.14-4505.service   active
T114_DUT4.Router   salt-ssh@192.168.1.14-4506.service   active
T114_DUT4.Router   salt-ssh@192.168.1.6-4505.service    active
T114_DUT4.Router   salt-ssh@192.168.1.6-4506.service    active
T114_DUT4.Router   t128-process-metrics.service         active
T114_DUT4.Router   tank.service                         active
Completed in 0.04 seconds
This output shows four SSH connections on each node of an HA router: one SSH connection per salt port (4505 and 4506) for each of the two addresses listed in the service names (192.168.1.14 and 192.168.1.6).
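When this check needs to be repeated across many routers, the output can be scanned programmatically. The sketch below assumes the text of show system services has already been captured as a string and that service names follow the salt-ssh@<address>-<port>.service pattern seen above; the function names are hypothetical, and any state other than "active" is treated as a problem.

```python
import re
from collections import defaultdict

# Matches rows such as:
#   T114_DUT3.Router salt-ssh@192.168.1.14-4505.service active
SALT_SSH_LINE = re.compile(
    r"^(?P<node>\S+)\s+salt-ssh@(?P<addr>\S+)-(?P<port>\d+)\.service"
    r"\s+(?P<state>\S+)\s*$"
)


def salt_ssh_sessions(output: str) -> dict:
    """Map each node to a list of (address, port, state) for its salt-ssh services."""
    sessions = defaultdict(list)
    for line in output.splitlines():
        match = SALT_SSH_LINE.match(line.strip())
        if match:
            sessions[match["node"]].append(
                (match["addr"], int(match["port"]), match["state"])
            )
    return dict(sessions)


def inactive_sessions(sessions: dict) -> list:
    """Return (node, address, port, state) for every session not reported as active."""
    return [
        (node, addr, port, state)
        for node, entries in sessions.items()
        for addr, port, state in entries
        if state != "active"
    ]


# A few rows taken from the output above.
sample = """\
T114_DUT3.Router 128TWeb.service active
T114_DUT3.Router salt-ssh@192.168.1.14-4505.service active
T114_DUT3.Router salt-ssh@192.168.1.14-4506.service active
T114_DUT4.Router salt-ssh@192.168.1.6-4505.service active
"""

sessions = salt_ssh_sessions(sample)
for node, entries in sessions.items():
    print(node, entries)
print("inactive:", inactive_sessions(sessions))
```

Rows for unrelated services, headers, and separator lines do not match the pattern and are skipped, so the full command output can be passed in unmodified.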