Skip to main content

Packet forwarding stops in AWS after several hours

Packet forwarding on the SSR stops after several hours of runtime in AWS when using Elastic Network Adapter (ENA) driver.

Issue ID: I95-56233
Last Updated: 2024-04-26
Introduced in SSR Version: 6.1.7

Problem

The ENA command-queue becomes unresponsive (device not responding) and the interface no longer passes traffic.

Indications of the symptoms can be observed by viewing the output of show device-interface extended-statistics. There will be an excessive amount of rx errors indicated by the system.

 Extended Statistics:
bw_in_allowance_exceeded: 0
bw_out_allowance_exceeded: 0
conntrack_allowance_exceeded: 0
dev_start: 0
dev_stop: 0
device_port: 4
dpdk_port: 0
linklocal_allowance_exceeded: 0
pps_allowance_exceeded: 0
rx_errors: 139836994071144
rx_good_bytes: 0
rx_good_packets: 0
rx_mbuf_allocation_errors: 139838439323150
rx_missed_errors: 139837561586960
rx_q0_bytes: 1876012
rx_q0_cnt: 0
rx_q0_errors: 2734322
rx_q0_l3_csum_bad: 0
rx_q0_l4_csum_bad: 0
rx_q0_l4_csum_good: 0
rx_q0_packets: 875075952
rx_q0_refill_partial: 0
rx_q1_bytes: 1862808
rx_q1_cnt: 0
rx_q1_errors: 2100224

Additionally, the following log messages provide evidence of the failure event: DPDK.log

Apr 02 23:47:22.351 [DPDK| -- ] ERROR (deviceManagerTP) [ENA_COM: ena_com_wait_and_process_admin_cq_interrupts]Timeout waiting for comp_ctx->wait_event
Apr 02 23:47:22.351 [DPDK| -- ] ERROR (deviceManagerTP) [ENA_COM: ena_com_wait_and_process_admin_cq_interrupts]The ena device didn't send a completion for the admin cmd 11 status 0
Apr 02 23:47:22.351 [DPDK| -- ] ERROR (deviceManagerTP) [ENA_COM: ena_get_dev_stats]Failed to get stats. error: -62
Apr 02 23:47:22.351 [DPDK| -- ] ERROR (deviceManagerTP) ena_stats_get(): Could not retrieve statistics from ENA
Apr 02 23:47:25.352 [DPDK| -- ] ERROR (deviceManagerTP) [ENA_COM: ena_com_wait_and_process_admin_cq_interrupts]Timeout waiting for comp_ctx->wait_event
Apr 02 23:47:25.352 [DPDK| -- ] ERROR (deviceManagerTP) [ENA_COM: ena_com_wait_and_process_admin_cq_interrupts]The ena device didn't send a completion for the admin cmd 11 status 0
Apr 02 23:47:25.352 [DPDK| -- ] ERROR (deviceManagerTP) [ENA_COM: ena_get_dev_stats]Failed to get stats. error: -62
Apr 02 23:47:25.352 [DPDK| -- ] ERROR (deviceManagerTP) ena_stats_get(): Could not retrieve statistics from ENA
Apr 02 23:47:26.353 [DPDK| -- ] ERROR (deviceManagerTP) [ENA_COM: ena_get_dev_stats]Failed to get stats. error: -19
Apr 02 23:47:26.353 [DPDK| -- ] ERROR (deviceManagerTP) ena_stats_get(): Could not retrieve statistics from ENA
Apr 02 23:47:26.354 [DPDK| -- ] ERROR (deviceManagerTP) [ENA_COM: ena_get_dev_stats]Failed to get stats. error: -19

journal.log

Apr 03 08:26:02.642583 relayroutereast001 highway[9604]: Failed to queue packet; No ARP for x.x.x.1, interface 1.0, ex: ArpEntryPendException: Pend list max size of 5 exceeded
Apr 03 08:26:02.642595 relayroutereast001 highway[9604]: Send packet failed for session remote-host: ip-address: x.x.x.1; local-ip x.x.x.242; local-intf: 1.0; peer-name Router11-W

highway.log

Apr 03 09:25:35.173 [LINK|BFDS] ERROR (bfdAgentTP     ) Failed to queue packet; No ARP for x.x.x.195, interface 1.0, ex: ArpEntryPendException: Pend list max size of 5 exceeded
Apr 03 09:25:35.173 [LINK|BFDS] ERROR (bfdAgentTP ) Send packet failed for session remote-host: ip-address: x.x.x.195; local-ip x.x.x.242; local-intf: 1.0; peer-name RelayRouter001

Release Notes

The ENA command-queue becomes unresponsive (device not responding) and the interface no longer passes traffic.

Updated ENA device driver from DPDK/Amazon (upgrading driver from 2.6.0 to 2.9.0). ENA release notes

ENA supports health checks which can be used to detect faulty behavior of the hardware and the driver. As such, an ENA driver watchdog has been developed to check for the following reset reasons:

  • HW unresponsiveness – Driver detected that the periodic keep alive signals stopped arriving from the HW through the Asynchronous Event Notification Queue (AENQ).
  • Faulty admin queue behavior – The admin queue entered a faulty state.
  • Faulty behavior on the IO path – The device returned invalid descriptors or caused the driver to enter the invalid state.
  • Missing Tx completions exceeds a dynamically calculated threshold - Prevent Tx ring stalls in case Tx descriptors were not returned by the HW.

Severity

Details

The potential impact of a software defect if encountered. Severity levels are:

  • Critical: Could severely affect service, capacity/traffic, and maintenance capabilities. May have a prolonged impact to the entire system.
  • Major: Could seriously affect system operation, maintenance, administration and related tasks.
  • Minor: Would not significantly impair the functioning or affect service.

Critical

Status

Open

Resolved In

Product

SSR

Functional Area

Datapath

Workaround

info

Juniper may provide a method to temporarily circumvent a problem; workarounds do not exist for all issues.

Disable enhanced networking in AWS instance.

A restart of the SSR is necessary to remediate this condition: systemctl restart 128T