Understanding High Availability Features on Juniper Networks Routers

For Juniper Networks routing platforms running the Junos operating system (Junos OS), high availability refers to the hardware and software components that provide redundancy and reliability for packet-based communications. This topic provides brief overviews of the following high availability features:

Routing Engine Redundancy

Redundant Routing Engines are two Routing Engines that are installed in the same routing platform. One functions as the primary, while the other stands by as a backup should the primary Routing Engine fail. On routing platforms with dual Routing Engines, network reconvergence takes place more quickly than on routing platforms with a single Routing Engine.

Graceful Routing Engine Switchover

Graceful Routing Engine switchover (GRES) enables a routing platform with redundant Routing Engines to continue forwarding packets, even if one Routing Engine fails. Graceful Routing Engine switchover preserves interface and kernel information. Traffic is not interrupted. However, graceful Routing Engine switchover does not preserve the control plane. Neighboring routers detect that the router has experienced a restart and react to the event in a manner prescribed by individual routing protocol specifications.

Note:

To preserve routing during a switchover, graceful Routing Engine switchover must be combined with either graceful restart protocol extensions or nonstop active routing. For more information, see Understanding Graceful Routing Engine Switchover and Nonstop Active Routing Concepts.

Note:

In T Series routers, TX Matrix routers, and TX Matrix Plus routers, the control plane is preserved in case of GRES with NSR, and 75% of line rate worth of traffic per Packet Forwarding Engine remains uninterrupted during GRES.

Nonstop Bridging

Nonstop bridging enables an MX Series 5G Universal Routing Platform with redundant Routing Engines to switch from a primary Routing Engine to a backup Routing Engine without losing Layer 2 Control Protocol (L2CP) information. Nonstop bridging uses the same infrastructure as graceful Routing Engine switchover to preserve interface and kernel information. However, nonstop bridging also saves L2CP information by running the Layer 2 Control Protocol process (l2cpd) on the backup Routing Engine.

Note:

To use nonstop bridging, you must first enable graceful Routing Engine switchover.

Nonstop bridging is supported for the following Layer 2 control protocols:

  • Spanning Tree Protocol (STP)

  • Rapid Spanning Tree Protocol (RSTP)

  • Multiple Spanning Tree Protocol (MSTP)

  • VLAN Spanning Tree Protocol (VSTP)

For more information, see Nonstop Bridging Concepts.

Nonstop Active Routing

Nonstop active routing (NSR) enables a routing platform with redundant Routing Engines to switch from a primary Routing Engine to a backup Routing Engine without alerting peer nodes that a change has occurred. Nonstop active routing uses the same infrastructure as graceful Routing Engine switchover to preserve interface and kernel information. However, nonstop active routing also preserves routing information and protocol sessions by running the routing protocol process (rpd) on both Routing Engines. In addition, nonstop active routing preserves TCP connections maintained in the kernel.

Note:

To use nonstop active routing, you must also configure graceful Routing Engine switchover.

For a list of protocols and features supported by nonstop active routing, see Nonstop Active Routing Protocol and Feature Support.

For more information about nonstop active routing, see Nonstop Active Routing Concepts.

Graceful Restart

With routing protocols, any service interruption requires an affected router to recalculate adjacencies with neighboring routers, restore routing table entries, and update other protocol-specific information. An unprotected restart of a router can result in forwarding delays, route flapping, wait times stemming from protocol reconvergence, and even dropped packets. To alleviate this situation, graceful restart provides extensions to routing protocols. These protocol extensions define two roles for a router—restarting and helper. The extensions signal neighboring routers about a router undergoing a restart and prevent the neighbors from propagating the change in state to the network during a graceful restart wait interval. The main benefits of graceful restart are uninterrupted packet forwarding and temporary suppression of all routing protocol updates. Graceful restart enables a router to pass through intermediate convergence states that are hidden from the rest of the network.

When a router is running graceful restart and the router stops sending and replying to protocol liveness messages (hellos), the adjacencies assume a graceful restart and begin running a timer to monitor the restarting router. During this interval, helper routers do not process an adjacency change for the router that they assume is restarting, but continue active routing with the rest of the network. The helper routers assume that the router can continue stateful forwarding based on the last preserved routing state during the restart.

If the router was actually restarting and is back up before the graceful timer period expires in all of the helper routers, the helper routers provide the router with the routing table, topology table, or label table (depending on the protocol), exit the graceful period, and return to normal network routing.

If the router does not complete its negotiation with helper routers before the graceful timer period expires in all of the helper routers, the helper routers process the router's change in state and send routing updates, so that convergence occurs across the network. If a helper router detects a link failure from the router, the topology change causes the helper router to exit the graceful wait period and to send routing updates, so that network convergence occurs.
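The helper-side behavior described above can be summarized as a small decision function. This is only an illustrative model, not how Junos OS implements graceful restart: the names `HelperAction` and `helper_action` and the 120-second timer in the usage lines are assumptions made for the example.

```python
from enum import Enum


class HelperAction(Enum):
    KEEP_WAITING = "keep the adjacency and continue waiting"
    RESYNC = "send routing state to the restarted router and exit the graceful period"
    CONVERGE = "process the adjacency change and send routing updates"


def helper_action(elapsed_s, timer_s, restart_complete, link_failure):
    """Decide what a helper does for a neighbor it assumes is restarting."""
    # A link failure detected from the router ends the wait period at once.
    if link_failure:
        return HelperAction.CONVERGE
    # The graceful timer expired before the router finished negotiating.
    if elapsed_s >= timer_s:
        return HelperAction.CONVERGE
    # The router came back before the timer expired.
    if restart_complete:
        return HelperAction.RESYNC
    return HelperAction.KEEP_WAITING


# Hypothetical 120-second graceful timer.
still_waiting = helper_action(10, 120, restart_complete=False, link_failure=False)
back_in_time = helper_action(30, 120, restart_complete=True, link_failure=False)
```

In this sketch a link failure takes priority over every other condition, which mirrors the statement that a topology change causes the helper to exit the graceful wait period.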

To enable a router to undergo a graceful restart, you must include the graceful-restart statement at the global [edit routing-options] or [edit routing-instances instance-name routing-options] hierarchy level. You can, optionally, modify the global settings at the individual protocol level. When a routing session is started, a router that is configured with graceful restart must negotiate with its neighbors to support it when it undergoes a graceful restart. A neighboring router will accept the negotiation and support helper mode without requiring graceful restart to be configured on the neighboring router.

Note:

A Routing Engine switchover event on a helper router that is in graceful wait state causes the router to drop the wait state and to propagate the adjacency’s state change to the network.

Graceful restart is supported for the following protocols and applications:

  • BGP

  • ES-IS

  • IS-IS

  • OSPF/OSPFv3

  • PIM sparse mode

  • RIP/RIPng

  • MPLS-related protocols, including:

    • Label Distribution Protocol (LDP)

    • Resource Reservation Protocol (RSVP)

    • Circuit cross-connect (CCC)

    • Translational cross-connect (TCC)

  • Layer 2 and Layer 3 virtual private networks (VPNs)

For more information, see Graceful Restart Concepts.

Nonstop Active Routing Versus Graceful Restart

Nonstop active routing and graceful restart are two different methods of maintaining high availability. Graceful restart requires a router restart. A router undergoing a graceful restart relies on its neighbors (or helpers) to restore its routing protocol information. The restart is the mechanism by which helpers are signaled to exit the wait interval and start providing routing information to the restarting router. For more information, see Graceful Restart Concepts.

In contrast, nonstop active routing does not involve a router restart. Both the primary and backup Routing Engines are running the routing protocol process (rpd) and exchanging updates with neighbors. When one Routing Engine fails, the router simply switches to the active Routing Engine to exchange routing information with neighbors. Because of these feature differences, nonstop active routing and graceful restart are mutually exclusive. Nonstop active routing cannot be enabled when the router is configured as a graceful restarting router. If you include the graceful-restart statement at any hierarchy level and the nonstop-routing statement at the [edit routing-options] hierarchy level and try to commit the configuration, the commit request fails. For more information, see Nonstop Active Routing Concepts.
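The two constraints stated in this topic (nonstop active routing requires graceful Routing Engine switchover, and nonstop active routing cannot be combined with graceful restart) can be expressed as a simple pre-commit check. The following is a sketch only: the function name, its parameters, and the error strings are hypothetical and are not Junos OS commit messages.

```python
def commit_check(graceful_switchover, nonstop_routing, graceful_restart_levels):
    """Return a list of problems; an empty list means this toy check passes.

    graceful_restart_levels is a list of hierarchy levels (plain strings)
    at which the graceful-restart statement appears.
    """
    errors = []
    # NSR uses the GRES infrastructure, so GRES must also be configured.
    if nonstop_routing and not graceful_switchover:
        errors.append("nonstop-routing requires graceful Routing Engine switchover")
    # graceful-restart at any hierarchy level conflicts with nonstop-routing.
    if nonstop_routing and graceful_restart_levels:
        errors.append("nonstop-routing and graceful-restart are mutually exclusive")
    return errors


# GRES plus NSR passes; adding graceful-restart anywhere makes the check fail.
ok = commit_check(True, True, [])
conflict = commit_check(True, True, ["[edit routing-options]"])
```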

Effects of a Routing Engine Switchover

Effects of a Routing Engine Switchover describes the effects of a Routing Engine switchover when no high availability features are enabled and when graceful Routing Engine switchover, graceful restart, and nonstop active routing features are enabled.

VRRP

The Virtual Router Redundancy Protocol (VRRP) enables hosts on a LAN to make use of redundant routing platforms (primary and backup pairs) on the LAN, requiring only the static configuration of a single default route on the hosts.

The VRRP routing platform pairs share the IP address corresponding to the default route configured on the hosts. At any time, one of the VRRP routing platforms is the primary (active) and the others are backups. If the primary fails, one of the backup routers or switches becomes the new primary router.

VRRP has advantages in ease of administration and network throughput and reliability:

  • It provides a virtual default routing platform.

  • It enables traffic on the LAN to be routed without a single point of failure.

  • A virtual backup router can take over a failed default router:

    • Within a few seconds.

    • With a minimum of VRRP traffic.

    • Without any interaction with the hosts.

Devices running VRRP dynamically elect primary and backup routers. You can also force assignment of primary and backup routers using priorities from 1 through 255, with 255 being the highest priority.

In VRRP operation, the default primary router sends advertisements to backup routers at regular intervals (default 1 second). If a backup router does not receive an advertisement for a set period, the backup router with the next highest priority takes over as primary and begins forwarding packets.
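As a rough illustration of priority-based takeover, the sketch below picks the live router with the highest priority as primary. It is a simplified model, not the VRRP state machine: the `VrrpRouter` class and `elect_primary` function are names invented for this example, advertisement timers are reduced to an `alive` flag, and tie-breaking between equal priorities is not modeled.

```python
from dataclasses import dataclass


@dataclass
class VrrpRouter:
    name: str
    priority: int       # 1 through 255; a higher value wins
    alive: bool = True  # False once its advertisements stop arriving


def elect_primary(routers):
    """Return the live router with the highest priority, or None if none is alive."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.priority)


group = [VrrpRouter("r1", 200), VrrpRouter("r2", 100), VrrpRouter("r3", 150)]
first = elect_primary(group)      # r1 has the highest priority
group[0].alive = False            # backups stop receiving r1's advertisements
successor = elect_primary(group)  # r3 is the next highest priority
```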

As of Junos OS Release 13.2, VRRP nonstop active routing (NSR) is enabled only when you configure the nonstop-routing statement at the [edit routing-options] or [edit logical-systems logical-system-name routing-options] hierarchy level.

For more information, see Understanding VRRP.

Unified ISSU

A unified in-service software upgrade (unified ISSU) enables you to upgrade between two different Junos OS Releases with no disruption on the control plane and with minimal disruption of traffic. Unified ISSU is only supported by dual Routing Engine platforms. In addition, graceful Routing Engine switchover (GRES) and nonstop active routing (NSR) must be enabled.

With a unified ISSU, you can eliminate network downtime, reduce operating costs, and deliver higher service levels. For more information, see Getting Started with Unified In-Service Software Upgrade.

Interchassis Redundancy for MX Series Routers Using Virtual Chassis

Interchassis redundancy is a high availability feature that can span equipment located across multiple geographies to prevent network outages and protect routers against access link failures, uplink failures, and wholesale chassis failures without visibly disrupting the attached subscribers or increasing the network management burden for service providers. As more high-priority voice and video traffic is carried on the network, interchassis redundancy has become a requirement for providing stateful redundancy on broadband subscriber management equipment such as broadband services routers, broadband network gateways, and broadband remote access servers. Interchassis redundancy support enables service providers to fulfill strict service-level agreements (SLAs) and avoid unplanned network outages to better meet the needs of their customers.

To provide a stateful interchassis redundancy solution for MX Series 5G Universal Routing Platforms, you can configure a Virtual Chassis. A Virtual Chassis configuration interconnects two MX Series routers into a logical system that you can manage as a single network element. The member routers in a Virtual Chassis are designated as the primary router (also known as the protocol primary) and the backup router (also known as the protocol backup). The member routers are interconnected by means of dedicated Virtual Chassis ports that you configure on Trio Modular Port Concentrator/Modular Interface Card (MPC/MIC) interfaces.

An MX Series Virtual Chassis is managed by the Virtual Chassis Control Protocol (VCCP), which is a dedicated control protocol based on IS-IS. VCCP runs on the Virtual Chassis port interfaces and is responsible for building the Virtual Chassis topology, electing the Virtual Chassis primary router, and establishing the interchassis routing table to route traffic within the Virtual Chassis.

Starting with Junos OS Release 11.2, Virtual Chassis configurations are supported on MX240, MX480, and MX960 Universal Routing Platforms with Trio MPC/MIC interfaces and dual Routing Engines. In addition, graceful Routing Engine switchover (GRES) and nonstop active routing (NSR) must be enabled on both member routers in the Virtual Chassis.

Platform-Specific High Availability Behavior on ACX7000 Series

The hardware architecture of ACX7000 series devices differs from that of PTX and MX series devices. In PTX and MX series devices, the FPC hosts both the datapath PFE and the WAN-facing ports (PIC/MIC), and each FPC includes its own CPU compute resource to manage the FPC components.

On ACX7000 series devices, the Forwarding Engine Board (FEB) FRU contains only the PFE complex, and the Routing Engine contains the CPU compute complex. The Routing Engine FRU executes both Routing Engine and line-card applications.

Due to single FEB and ASIC complexities, ACX7332 and ACX7348 devices have a few constraints that you need to be aware of during switchover.

The following table shows the attributes and feature support for high availability on ACX7000 series of devices:

Table 1: High availability attributes and features on ACX7000 series

High availability attributes and features | ACX7509 | ACX7332 and ACX7348
------------------------------------------+---------+--------------------
Control plane (RE) redundancy             | Yes     | Yes
Data plane (PFE) redundancy               | Yes     | No
GRES+GR                                   | Yes     | Yes
GRES+NSR                                  | Yes     | Yes

The following ACX7000 series of devices support routing redundancy:

  • ACX7509 supports dual FEBs and dual Routing Engines.
  • ACX7332 and ACX7348 support a single FEB and dual Routing Engines. Routing Engine switchover impacts data traffic.

ACX7348 supports GRES, graceful restart (GR), and nonstop active routing (NSR), but does not support PFE redundancy. If graceful restart is enabled, existing flows experience no transit traffic loss when a Routing Engine switchover occurs.

Note:

If you alter an existing flow or introduce a new flow during the Routing Engine switchover, convergence does not take place until the switchover completes. Topology changes during the switchover are applied only after the switchover. Traffic loss and minor statistics loss are expected during switchover.

GRES is enabled by default on Junos OS Evolved and cannot be disabled.

To preserve routing during a switchover, GRES must be combined with either:

  • Graceful restart (GR) protocol extensions
  • Nonstop active routing (NSR) and Nonstop Bridging (NSB)

On ACX7348, GRES is not supported for the following features:

  • Broadband network gateway (BNG)
  • VXLAN
  • sFlow
  • J-Flow
  • Port mirroring
  • SRv6 dynamic SID
  • SRv6-TE
  • MACsec
  • Timing
  • LLDP
  • TWAMP
  • RFC 2544-based benchmarking tests
  • EVPN-VXLAN Multicast

Before issuing any switchover command from the primary Routing Engine, check the status of the backup Routing Engine using the show system switchover command on the backup Routing Engine. If the switchover status is ready, issue the switchover command.

The switchover command can be issued even if the backup Routing Engine is not ready. In that case, the primary Routing Engine still switches over, and the system behavior is indeterminate.

Routing Engine switchover results in statistics accounting loss for the duration of switchover time.

ACX7509 supports Routing Engine redundancy as mentioned in the following table.

Table 2: ACX7509 Routing Engine Redundancy

System configuration   | Redundancy
-----------------------+---------------------------------------------------
Single RE / single FEB | Not applicable. System works in non-redundant mode
Dual RE / dual FEB     | Supported
Dual RE / single FEB   | Not supported. System works in non-redundant mode
Single RE / dual FEB   | Not supported. System works in non-redundant mode
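The ACX7509 redundancy rules in Table 2 reduce to one condition: redundancy is supported only when both the Routing Engine and the FEB are duplicated. A minimal sketch of that lookup follows; the function name `acx7509_redundancy` is hypothetical and exists only for this example.

```python
def acx7509_redundancy(re_count, feb_count):
    """Map an ACX7509 RE/FEB population to the redundancy status in Table 2."""
    if re_count not in (1, 2) or feb_count not in (1, 2):
        raise ValueError("expected 1 or 2 Routing Engines and 1 or 2 FEBs")
    if re_count == 2 and feb_count == 2:
        return "Supported"
    if re_count == 1 and feb_count == 1:
        return "Not applicable"
    # Mixed populations run in non-redundant mode.
    return "Not supported"


status = acx7509_redundancy(2, 1)  # dual RE with a single FEB
```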

The modular chassis design of ACX7332 and ACX7348 devices supports redundant Routing Engines. The Forwarding Engine Board (FEB) is a built-in card that hosts the forwarding ASIC. Each Routing Engine in the chassis has its own CPU complex. The FEB and FPC cards have no CPU or memory. The I/O (FPC) cards are used only to provide different port fan-outs.

All applications (both Routing Engine and FPC applications) run on the Routing Engine CPU. At any point in time, only one Routing Engine (active or backup) can access the BCM ASIC on the FEB card. Either RE0 or RE1 owns the control and data plane, depending on mastership.

For Graceful RE switchover conditions, traffic loss is characterized by:

  • Zero traffic loss for flows already configured before switchover. These flows should see no traffic loss because the FEB is already programmed before the switchover. Any update (or configuration change) that happens while the switchover is in progress is applied once the switchover is complete.
  • Host path traffic loss (keepalive): a maximum of 1 second. Protocols that tolerate up to 1 second of loss should not flap. Protocols with a lower tolerance might flap and reconverge after the switchover completes. Centralized (Routing Engine-based) BFD sessions with a timer of 300 ms may flap, so a higher timer is recommended for such sessions. Distributed or inline BFD sessions are not affected.
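The host path rule above can be checked with a one-line comparison against the 1-second maximum loss. This is a sketch under stated assumptions: `may_flap` is a hypothetical helper, "tolerance" is treated as a single number of seconds, and a tolerance of exactly 1 second is assumed not to flap.

```python
HOST_PATH_LOSS_S = 1.0  # maximum host path traffic loss during a graceful switchover


def may_flap(tolerance_s, distributed=False):
    """Return True if a session could flap during a graceful RE switchover."""
    # Distributed or inline sessions do not depend on the host path.
    if distributed:
        return False
    # Sessions that tolerate less than the host path loss window might flap.
    return tolerance_s < HOST_PATH_LOSS_S


centralized_bfd = may_flap(0.3)                   # 300 ms centralized BFD session
inline_bfd = may_flap(0.3, distributed=True)      # same timer, distributed or inline
```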

The following statistics losses are expected during Routing Engine switchover:

  • The loss of statistics is limited to 1 second's worth of traffic. However, if a counter rollover happens during this window, the loss may be higher. The affected statistics can include the following:
    • IFL statistics
    • Firewall statistics
    • VoQ statistics (Interface queue stats)
    • LSP statistics, etc.
  • Host path, debug, and error statistics are not retained after switchover. These statistics are shown when you issue the show pfe statistics command. After switchover, they start from 0 on the new active Routing Engine.

Timing applications run only on the primary Routing Engine, not on the backup Routing Engine, and they restart on Routing Engine switchover. During a Routing Engine switchover, whether graceful or non-graceful, PTP, GM, and SYNCE lose lock and the device goes to the FREERUN state. The PTP packet path within the hardware is broken. All downstream devices switch to an alternate primary device in the network. If no alternate primary is present, all downstream devices go to the HOLDOVER state.

The behavior of the request chassis routing-engine master switch command differs on these devices. The primary Routing Engine switchover completes within a certain time, rather than immediately as on other platforms that support high availability. The switchover may complete sooner, depending on the system load, any ongoing network topology changes, configuration changes, and so on. If the switchover takes longer than the published time, the system performs a cold boot of the PFE ASIC, which causes packet drops, protocol flaps, interface flaps, and so on.

On pressing the Online/Offline button of the primary Routing Engine, the switchover to backup Routing Engine happens gracefully. It is safe to remove the Routing Engine card after the Routing Engine LEDs are turned off. Pressing the button on the backup Routing Engine has no effect on the primary Routing Engine.

When you issue the request node reboot <re0 | re1> command, the primary Routing Engine switches over gracefully to the backup Routing Engine. A node reboot of the backup Routing Engine does not cause a switchover. There might be a delay in executing this CLI command because the software syncs to the backup before switching over. The in and at options may affect whether the switchover is graceful, so these options are not recommended for GRES. On a single Routing Engine node (no redundancy), the options work as expected.