NorthStar Controller Troubleshooting Guide

02-Nov-23

This document includes strategies for identifying whether an apparent problem stems from the NorthStar Controller or from the router, and provides troubleshooting techniques for those problems that are identified as stemming from the NorthStar Controller.

Before you begin any troubleshooting investigation, confirm that all system processes are up and running. A sample list of processes is shown below. Your actual list of processes could be different.

[root@node-1 ~]# supervisorctl status
bmp:bmpMonitor                   RUNNING   pid 2957, uptime 0:58:02
collector:worker1                RUNNING   pid 19921, uptime 0:01:42
collector:worker2                RUNNING   pid 19923, uptime 0:01:42
collector:worker3                RUNNING   pid 19922, uptime 0:01:42
collector:worker4                RUNNING   pid 19924, uptime 0:01:42
collector_main:beat_scheduler    RUNNING   pid 19770, uptime 0:01:53
collector_main:es_publisher      RUNNING   pid 19771, uptime 0:01:53
collector_main:task_scheduler    RUNNING   pid 19772, uptime 0:01:53
config:cmgd                      RUNNING   pid 22087, uptime 0:01:53 
config:cmgd-rest                 RUNNING   pid 22088, uptime 0:01:53
docker:dockerd                   RUNNING   pid 4368, uptime 0:57:34
epe:epeplanner                   RUNNING   pid 9047, uptime 0:50:34
infra:cassandra                  RUNNING   pid 2971, uptime 0:58:02
infra:ha_agent                   RUNNING   pid 9009, uptime 0:50:45
infra:healthmonitor              RUNNING   pid 9172, uptime 0:49:40
infra:license_monitor            RUNNING   pid 2968, uptime 0:58:02
infra:prunedb                    RUNNING   pid 19770, uptime 0:01:53
infra:rabbitmq                   RUNNING   pid 7712, uptime 0:52:03
infra:redis_server               RUNNING   pid 2970, uptime 0:58:02
infra:zookeeper                  RUNNING   pid 2965, uptime 0:58:02
ipe:ipe_app                      RUNNING   pid 2956, uptime 0:58:02
listener1:listener1_00           RUNNING   pid 9212, uptime 0:49:29
netconf:netconfd_00              RUNNING   pid 19768, uptime 0:01:53
northstar:anycastGrouper         RUNNING   pid 19762, uptime 0:01:53
northstar:configServer           RUNNING   pid 19767, uptime 0:01:53
northstar:mladapter              RUNNING   pid 19765, uptime 0:01:53
northstar:npat                   RUNNING   pid 19766, uptime 0:01:53
northstar:pceserver              RUNNING   pid 19441, uptime 0:02:59
northstar:privatet1vproxy        RUNNING   pid 19432, uptime 0:02:59
northstar:prpdclient             RUNNING   pid 19763, uptime 0:01:53
northstar:scheduler              RUNNING   pid 19764, uptime 0:01:53
northstar:topologyfilter         RUNNING   pid 19760, uptime 0:01:53
northstar:toposerver             RUNNING   pid 19762, uptime 0:01:53
northstar_pcs:PCServer           RUNNING   pid 19487, uptime 0:02:49
northstar_pcs:PCViewer           RUNNING   pid 19486, uptime 0:02:49
web:app                          RUNNING   pid 19273, uptime 0:03:18
web:gui                          RUNNING   pid 19280, uptime 0:03:18
web:notification                 RUNNING   pid 19272, uptime 0:03:18
web:proxy                        RUNNING   pid 19275, uptime 0:03:18
web:restconf                     RUNNING   pid 19271, uptime 0:03:18
web:resthandler                  RUNNING   pid 19275, uptime 0:03:18

Restart any processes that display as STOPPED instead of RUNNING.

Note:

To stop, start, or restart all processes, use the service northstar stop, service northstar start, and service northstar restart commands.
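The status check described above can be scripted. The following is a minimal sketch, assuming the `supervisorctl status` output keeps the column layout shown in the sample (process name first, state second). The function names and the STOPPED sample line are illustrative and are not part of NorthStar.

```python
import subprocess


def find_not_running(status_text):
    """Return (process, state) pairs for every process whose state is not RUNNING."""
    problems = []
    for line in status_text.splitlines():
        fields = line.split()
        # A status line has at least a process name and a state column.
        if len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        if state != "RUNNING":
            problems.append((name, state))
    return problems


def read_supervisor_status():
    """Run `supervisorctl status` on the NorthStar server and return its text."""
    # The exit code is ignored on purpose: it may be non-zero when a process is down.
    result = subprocess.run(["supervisorctl", "status"], capture_output=True, text=True)
    return result.stdout


# Illustrative input; the STOPPED line is made up for this example.
SAMPLE = (
    "infra:cassandra                  RUNNING   pid 2971, uptime 0:58:02\n"
    "northstar:pceserver              STOPPED   Not started\n"
    "web:app                          RUNNING   pid 19273, uptime 0:03:18\n"
)

for name, state in find_not_running(SAMPLE):
    print(f"{name} is {state}; try: supervisorctl restart {name}")
```

On a live server, pass `read_supervisor_status()` instead of `SAMPLE`.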

To access system process status information from the NorthStar Controller Web UI, navigate to More Options > Administration and select System Health.

The current CPU %, memory usage, virtual memory usage, and other statistics for each system process are displayed. Figure 1 shows an example.

Note:

Only processes that are running are included in this display.

Figure 1: Process Status Display

Table 1 describes each field displayed in the Process Status table.

Table 1: Descriptions of Process Status Fields
Field Description

Process

The name of the NorthStar Controller process.

PID

The Process ID number.

User

The NorthStar Controller user permissions required to access information about this process.

Group

The NorthStar Controller user group permissions required to access information about this process.

CPU%

The current percentage of CPU in use by this process.

Memory

The current percentage of memory in use by this process.

Virtual Memory

The current virtual memory in use by this process.

CPU Time

The amount of time the CPU was used to process instructions for this process.

CMD

Displays the specific command options for the system process.

The troubleshooting information is presented in the following sections:

NorthStar Controller Log Files

Throughout your troubleshooting efforts, it can be helpful to view various NorthStar Controller log files. To access log files:

  1. Log in to the NorthStar Controller Web UI.

  2. Navigate to More Options > Administration and select Logs.

    A list of NorthStar system log and message files is displayed, a truncated example of which is shown in Figure 2.

    Figure 2: Sample of System Log and Message Files
  3. Click the log file or message file that you want to view.

    The log file contents are displayed in a pop-up window.

  4. To open the file in a separate browser window or tab, click View Raw Log in the pop-up window.

  5. To close the pop-up window and return to the list of log and message files, click X in the upper right corner of the pop-up window.

Table 2 lists the NorthStar Controller log files most commonly used to identify and troubleshoot issues with the PCS and PCE.

Table 2: Top NorthStar Controller Troubleshooting Log Files

Log File

Description

Location

pcep_server.log

Log entries related to the PCEP server. The PCEP server maintains the PCEP session. The log contains information about communication between the PCC and the PCE in both directions.

To configure verbose PCEP server logging:

  1. From the NorthStar Controller CLI, run pcep_cli.

  2. Type set log-level all.

  3. Press CTRL-C to exit.

/var/log/jnc

pcs.log

Log entries related to the PCS. The PCS is responsible for path computation. This log includes events received by the PCS from the Toposerver, including provisioning orders. It also contains notification of communication errors and issues that prevent the PCS from starting up properly.

/opt/northstar/logs

toposerver.log

Log entries related to the topology server. The topology server is responsible for maintaining the topology. These logs contain the record of the events between the PCS and the Toposerver, the Toposerver and NTAD, and the Toposerver and the PCE server.

/opt/northstar/logs

Table 3 lists additional log files that can also be helpful for troubleshooting. All of the log files in Table 3 are located under the /opt/northstar/logs directory.

Table 3: Additional Log Files for Troubleshooting NorthStar Controller
Log Files Description

cassandra.msg

Log events related to the Cassandra database.

ha_agent.msg

HA coordinator log.

mlAdaptor.log

Interface to transport controller log.

net_setup.log

Configuration script log.

nodejs.msg

Log events related to nodejs.

pcep_server.log

Log entries related to communication between the PCC and the PCE in both directions.

pcs.log

Log entries related to the PCS, including events exchanged between the PCS and the Toposerver, such as provisioning orders. This log also contains communication errors and any issues that prevent the PCS from starting up properly.

rest_api.log

Log entries for REST API requests.

toposerver.log

Log entries related to the topology server.

Contains the record of the events between the PCS and the topology server, the topology server and NTAD, and the topology server and the PCE server.

Note:

Any message forwarded to the pcshandler.log file is also forwarded to the pcs.log file.

To see logs related to the Junos VM, you must establish a telnet session to the Junos VM. The default IP address for the Junos VM is 172.16.16.2. The Junos VM is responsible for maintaining the necessary BGP, ISIS, or OSPF sessions.

Empty Topology

Figure 3 illustrates the flow of information from the router to the Toposerver that results in the topology display in the NorthStar Controller UI. When the topology display is empty, it is likely this flow has been interrupted. Finding out where the flow was interrupted can guide your problem resolution process.

Figure 3: Topology Information Flow

The topology originates at the routers. For NorthStar Controller to receive the topology, there must be a BGP-LS, ISIS, or OSPF session from one of the routers in the network to the Junos VM. There must also be an established Network Topology Abstractor Daemon (NTAD) session between the Junos VM and the Toposerver.

To check these connections:

  1. Using the NorthStar Controller CLI, verify that the NTAD connection between the Toposerver and the Junos VM was successfully established as shown in this example:

    [root@northstar ~]# netstat -na | grep :450
    
    tcp        0      0 172.16.16.1:55752           172.16.16.2:450             ESTABLISHED
    
    Note:

    Port 450 is the port used for Junos VM to Toposerver connections.

    In the following example, the NTAD connection has not been established:

    [root@northstar ~]# netstat -na | grep :450
    
    tcp        0      0 172.16.16.1:55752           172.16.16.2:450             LISTENING
    
  2. Log in to the Junos VM to confirm whether NTAD is configured to enable topology export. The grep command below gives you the IP address of the Junos VM.

    [root@northstar ~]# grep "ntad_host" /opt/northstar/data/northstar.cfg
    
    ntad_host=172.16.16.2
    [root@northstar ~]# telnet 172.16.16.2
    
    Trying 172.16.16.2...
    Connected to 172.16.16.2.
    Escape character is '^]'.
    
    northstar_junosvm (ttyp0)
    
    
    login: northstar
    Password:
    
    --- JUNOS 14.2R4.9 built 2015-08-25 21:01:39 UTC
    
    This JunOS VM is running in non-persistent mode.
    If you make any changes on this JunOS VM,
    Please make sure you save to the Host using net_setup.py utility, otherwise the config will be lost if this VM is restarted.
    
    northstar@northstar_junosvm> show configuration protocols | display set
    
    set protocols topology-export
    

    If the topology-export statement is missing, the Junos VM cannot export data to the Toposerver.

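  3. Use Junos OS show commands to confirm whether the BGP, ISIS, or OSPF relationship between the Junos VM and the router is up (for BGP, the session must be in the Established state; a BGP state of Active means the session is still trying to connect). If the session is not up, the topology information cannot be sent to the Junos VM.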

  4. On the Junos VM, verify whether the lsdist.0 routing table has any entries:

    northstar@northstar_junosvm> show route table lsdist.0 terse | match lsdist.0
    
    lsdist.0: 54 destinations, 54 routes (54 active, 0 holddown, 0 hidden)
    

    If you see only zeros in the lsdist.0 routing table, there is no topology that can be sent. Review the NorthStar Controller Getting Started Guide sections on configuring topology acquisition.

  5. Ensure that there is at least one link in the lsdist.0 routing table. The Toposerver can only generate an initial topology if it receives at least one NTAD link event. A network that consists of a single node with no IGP adjacency to other nodes (as is possible in a lab environment, for example) does not enable the Toposerver to generate a topology. Figure 4 illustrates the Toposerver’s logic process for creating the initial topology.

    Figure 4: Logic Process for Initial Topology Creation

    If an initial topology cannot be created for this reason, the toposerver.log generates an entry similar to the following example:

    Dec  9 16:03:57.788514 fe-cluster-03 TopoServer  Did not send the topology because no links were found.
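The connection and routing-table checks in the steps above can be sketched as two small parsers. This is an illustration only: it assumes the `netstat -na` and `show route table lsdist.0 terse` output has been captured as text and follows the layout of the samples shown. The function names are hypothetical.

```python
import re


def ntad_established(netstat_text, port=450):
    """Return True if a TCP socket to remote port `port` is ESTABLISHED."""
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local addr, foreign addr, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        if foreign.endswith(f":{port}") and state == "ESTABLISHED":
            return True
    return False


def lsdist_destinations(summary_line):
    """Extract the destination count from the lsdist.0 summary line, or None."""
    match = re.search(r"lsdist\.0:\s+(\d+) destinations", summary_line)
    return int(match.group(1)) if match else None


up = "tcp        0      0 172.16.16.1:55752           172.16.16.2:450             ESTABLISHED"
summary = "lsdist.0: 54 destinations, 54 routes (54 active, 0 holddown, 0 hidden)"

print(ntad_established(up))
print(lsdist_destinations(summary))
```

A result of `False` from the first function, or `0` or `None` from the second, points to the step where the topology flow is likely interrupted.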
    

NTAD Version

If SR LSPs have not been provisioned and the pcs.log shows messages similar to the following example, the NTAD version might be incorrect:

2020 Apr 27 15:05:36.430366 ns1-site1-q-pod07 PCServer [NorthStar][PCServer][Routing] msg=0x0000300b Provided path is not valid for SR for sean427@0110.0000.0101 path=sean427, node 0110.0000.0104 has no NodeIndex

See Installing the NorthStar Controller for information on NTAD versions.

Incorrect Topology

One important function of the Toposerver is to correlate the unidirectional link (interface) information from the routers into bidirectional links by matching source and destination IPv4 Link_Identifiers from NTAD link events. When the topology displayed in the NorthStar UI does not appear to be correct, it can be helpful to understand how the Toposerver handles the generation and maintenance of the bidirectional links.

Generation and maintenance of bidirectional links is a complex process, but here are some key points:

  • For the two nodes constituting each bidirectional link, the Node ID that was assigned first (and therefore has the lower Node ID number) is given the Node A designation, and the other node is given the Node Z designation.

    Note:

    The Node ID is assigned when the Toposerver first receives the Node event from NTAD.

  • Whenever a Node ID is cleared and reassigned (such as during a Toposerver restart or network model reset), the Node IDs, and therefore the A and Z designations, can change.

  • The Toposerver receives a Link Update message when a link in the network is added or modified.

  • The Toposerver receives a Link Withdraw message when a link is removed from the network.

  • The Link Update and Link Withdraw messages affect the operational status of the nodes.

  • The node operational status, together with the protocol (IGP versus IGP plus MPLS), determines whether a link can be used to route LSPs. For a link to be used to route LSPs, it must have both an operational status of UP and the MPLS protocol active.
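The pairing and A/Z designation rules above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the Toposerver implementation: `HalfLink`, `assign_node_ids`, and `pair_links` are hypothetical names, and each unidirectional link is reduced to a reporting node plus local and remote IPv4 addresses.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for an NTAD link event.
HalfLink = namedtuple("HalfLink", "node local_ip remote_ip")


def assign_node_ids(node_events):
    """Number nodes in the order their node events first arrive (1, 2, 3, ...)."""
    node_ids = {}
    for node in node_events:
        node_ids.setdefault(node, len(node_ids) + 1)
    return node_ids


def pair_links(half_links, node_ids):
    """Match unidirectional links into bidirectional (node_a, node_z) pairs."""
    by_endpoints = {(h.local_ip, h.remote_ip): h for h in half_links}
    pairs, seen = [], set()
    for h in half_links:
        reverse = by_endpoints.get((h.remote_ip, h.local_ip))
        key = frozenset((h.local_ip, h.remote_ip))
        if reverse is None or key in seen:
            continue  # no matching reverse direction yet, or already paired
        seen.add(key)
        # The node whose ID was assigned first (lower number) becomes Node A.
        node_a, node_z = sorted((h.node, reverse.node), key=node_ids.__getitem__)
        pairs.append((node_a, node_z))
    return pairs


links = [
    HalfLink("R1", "10.0.0.1", "10.0.0.2"),
    HalfLink("R2", "10.0.0.2", "10.0.0.1"),
]
print(pair_links(links, assign_node_ids(["R2", "R1"])))
print(pair_links(links, assign_node_ids(["R1", "R2"])))
```

The two calls use the same links but a different node event order, which shows how the A and Z designations can swap after Node IDs are reassigned.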

Missing LSPs

When your topology is displaying correctly, but you have missing LSPs, take a look at the flow of information from the PCC to the Toposerver that results in tunnels being added to the NorthStar Controller UI, as illustrated in Figure 5. The flow begins with the configuration at the PCC, from which an LSP Update message is passed to the PCEP server by way of a PCEP session and then to the Toposerver by way of an Advanced Message Queuing Protocol (AMQP) connection.

Figure 5: LSP Information Flow

To check these connections:

  1. Look at the toposerver.log. The log prints a message every 15 seconds when it detects that its connection with the PCEP server has been lost or was never successfully established. Note that in the following example, the connection between the Toposerver and the PCEP server is marked as down.

    Toposerver log:
    Apr 22 16:21:35.016721 user-PCS TopoServer  Warning, did not receive the PCE beacon within 15 seconds, marking it as down. Last up: Fri Apr 22 16:21:05 2016
    Apr 22 16:21:35.016901 user-PCS TopoServer  [->PCS] PCE Down: Warning, did not receive the PCE beacon within 15 seconds, marking it as down. Last up: Fri Apr 22 16:21:05 2016
    Apr 22 16:21:50.030592 user-PCS TopoServer  Warning, did not receive the PCE beacon within 15 seconds, marking it as down. Last up: Fri Apr 22 16:21:05 2016
    Apr 22 16:21:50.031268 user-PCS TopoServer  [->PCS] PCE Down: Warning, did not receive the PCE beacon within 15 seconds, marking it as down. Last up: Fri Apr 22 16:21:05 2016
    
  2. Using the NorthStar Controller CLI, verify that the PCEP session between the PCC and the PCEP server was successfully established as shown in this example:

    [root@northstar ~]# netstat -na | grep :4189
    tcp        0      0 0.0.0.0:4189                0.0.0.0:*                   LISTEN
    tcp        0      0 172.25.152.42:4189          172.25.155.50:59143         ESTABLISHED
    tcp        0      0 172.25.152.42:4189          172.25.155.48:65083         ESTABLISHED
    
    Note:

    Port 4189 is the port used for PCC to PCEP server connections.

    Knowing that the session has been established is useful, but it does not necessarily mean that any data was transferred.

  3. Verify whether the PCEP server learned about any LSPs from the PCC.

    [root@user-PCS ~]# pcep_cli
    # show lsp all list
    2016-04-22 17:09:39.696061(19661)[DEBUG]: pcc_lsp_table.begin:
    2016-04-22 17:09:39.696101(19661)[DEBUG]: pcc-id:1033771436/172.25.158.61, state: 0
    
    2016-04-22 17:09:39.696112(19661)[DEBUG]: START of LSP-NAME-TABLE
    …
    2016-04-22 17:09:39.705358(19661)[DEBUG]: Summary pcc_lsp_table:
    2016-04-22 17:09:39.705366(19661)[DEBUG]:   Summary LSP name tabl:
    2016-04-22 17:09:39.705375(19661)[DEBUG]:     client_id:1033771436/172.25.158.61, state:0,num LSPs:13
    2016-04-22 17:09:39.705388(19661)[DEBUG]:     client_id:1100880300/172.25.158.65, state:0,num LSPs:6
    2016-04-22 17:09:39.705399(19661)[DEBUG]:     client_id:1117657516/172.25.158.66, state:0,num LSPs:23
    2016-04-22 17:09:39.705410(19661)[DEBUG]:     client_id:1134434732/172.25.158.67, state:0,num LSPs:4
    2016-04-22 17:09:39.705420(19661)[DEBUG]:   Summary LSP id table:
    2016-04-22 17:09:39.705429(19661)[DEBUG]:     client_id:1033771436/172.25.158.61, state:0, num LSPs:13
    2016-04-22 17:09:39.705440(19661)[DEBUG]:     client_id:1100880300/172.25.158.65, state:0, num LSPs:6
    2016-04-22 17:09:39.705451(19661)[DEBUG]:     client_id:1117657516/172.25.158.66, state:0, num LSPs:23
    2016-04-22 17:09:39.705461(19661)[DEBUG]:     client_id:1134434732/172.25.158.67, state:0, num LSPs:4
    

    In the far right column of the output, you see the number of LSPs that were learned. If this number is 0, no LSP information was sent to the PCEP server. In that case, check the configuration on the PCC side, as described in the NorthStar Controller Getting Started Guide.
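The LSP counts in the summary lines can be extracted programmatically. The following sketch assumes the `show lsp all list` output has been saved as text and that the summary lines keep the `client_id:<id>/<ip>, state:<n>, num LSPs:<n>` shape shown above. The function names and the zero-count sample line are illustrative.

```python
import re

# Tolerates the optional space after the comma seen in the two summary tables.
SUMMARY_RE = re.compile(
    r"client_id:(\d+)/([\d.]+),\s*state:\s*(\d+),\s*num LSPs:\s*(\d+)"
)


def lsp_counts(pcep_cli_output):
    """Map each PCC IP address to the number of LSPs reported for it."""
    counts = {}
    for match in SUMMARY_RE.finditer(pcep_cli_output):
        counts[match.group(2)] = int(match.group(4))
    return counts


def pccs_without_lsps(pcep_cli_output):
    """Return the PCC IP addresses that reported zero LSPs."""
    return sorted(ip for ip, n in lsp_counts(pcep_cli_output).items() if n == 0)


sample = """\
2016-04-22 17:09:39.705375(19661)[DEBUG]:     client_id:1033771436/172.25.158.61, state:0,num LSPs:13
2016-04-22 17:09:39.705429(19661)[DEBUG]:     client_id:1100880300/172.25.158.65, state:0, num LSPs:6
2016-04-22 17:09:39.705500(19661)[DEBUG]:     client_id:1/192.0.2.9, state:0, num LSPs:0
"""  # the last line is a made-up example of a PCC with no LSPs

print(lsp_counts(sample))
print(pccs_without_lsps(sample))
```

Any address returned by `pccs_without_lsps` is a candidate for the PCC-side configuration check described above.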

LSP Controller Statuses

You can view the controller status of LSPs in the Controller Status column in the Tunnels tab of the Network Information table (in the NorthStar Controller GUI).

Table 4 lists the various controller statuses and their descriptions.

Table 4: LSP Controller Statuses

Controller Status

Indicates That

FAILED

The NorthStar Controller has failed to provision the LSP.

PENDING

The PCS has sent an LSP provisioning order to the PCEP server. The PCS is awaiting a response from the PCEP server.

PCC_PENDING

The PCEP server has sent an LSP provisioning order to the PCC. The PCS is awaiting a response from the PCC.

NETCONF_PENDING

The PCS has sent an LSP provisioning order to netconfd. The PCS is awaiting a response from netconfd.

PRPD_PENDING

The PCS has sent an LSP provisioning order to the PRPD client to provision a BGP route. The PCS is awaiting a response from the PRPD client.

SCHEDULED_DELETE

The PCS has scheduled the LSP to be deleted; the PCS will send the deletion provisioning order to the PCC.

SCHEDULED_DISCONNECT

The PCS has scheduled the LSP to be disconnected. The LSP will be moved to Shutdown status; the LSP is retained in the NorthStar datastore with a Persist state associated with it and is not used in CSPF calculations.

NoRoute_Rescheduled

The PCS has not found a path for the LSP. The PCS periodically scans for LSPs that have not been routed, tries to find a path for them, and then schedules their reprovisioning.

FRR_DETOUR_Rescheduled

The PCS has detoured the LSP and rescheduled the LSP’s re-provisioning.

Provision_Rescheduled

The PCS has scheduled the LSP to be provisioned.

Maint_NotHandled

The LSP is not part of the ongoing maintenance event as the LSP is not controlled by NorthStar.

Maint_Rerouted

The PCS has rerouted the LSP due to maintenance.

Callsetup_Scheduled

The PCS must provision the LSP when the event starts.

Disconnect_Scheduled

The PCS must disconnect the LSP when the event ends.

No path found

The PCS was unable to find a path for the LSP.

Path found on down LSP

The PCEP server has reported that the LSP is Down but the PCS has found a path for the LSP.

Path include loops

The SR-LSP has one or more loops.

Maint_NotReroute_DivPathUp

The LSP is not rerouted due to the maintenance event as there’s a standby path already up and running.

Maint_NotReroute_NodeDown

The LSP is not rerouted as the maintenance event is for the endpoints of the LSP.

PLANNED_LSP

The LSP must be provisioned but is not in the provisioning queue yet.

PLANNED_DISCONNECT

The LSP must be disconnected but is not in the provisioning queue yet.

PLANNED_DELETE

The LSP must be deleted but is not in the provisioning queue yet.

Candidate_ReOptimization

The PCS has selected the LSP as a candidate for reoptimization.

Activated(used_by_primary)

Secondary path for the LSP is activated.

Time_Expired

Scheduled window for the LSP has expired.

PCEP_Capability_not_supported

PCEP might not be supported on the device or, if supported, might be unconfigured, disabled, or misconfigured on the device.

De-activated

NorthStar Controller has deactivated the secondary LSP.

NS_ERR_NCC_NOT_FOUND

The NorthStar Controller is unable to use the Netconf Connection Client (NCC) to establish a Netconf connection to the device. Workaround: Restart Netconf on the NorthStar server.

[root@pcs-1 templates]# supervisorctl restart netconf
netconf:netconf: stopped
netconf:netconf: started

SR LSP provisioning requires LSP statefull SR capability

To provision the SR LSP, you must configure the following command on the Junos device through the CLI:

set protocols pcep pce <name> spring-capability

PCC That is Not PCEP-Enabled

The Toposerver associates the PCEP sessions with the nodes in the topology from the TED in order to make a node PCEP-enabled. This Toposerver function is hindered if the IP address used by the PCC to establish the PCEP session was not the one automatically learned by the Toposerver from the TED. For example, if a PCEP session is established using the management IP address, the Toposerver will not receive that IP address from the TED.

When the PCC successfully establishes a PCEP session, it sends a PCC_SYNC_COMPLETE message to the Toposerver. This message indicates to NorthStar that synchronization is complete. The following is a sample of the corresponding toposerver log entries, showing both the PCC_SYNC_COMPLETE message and the PCEP IP address that NorthStar might or might not recognize:

Dec  9 17:12:11.610225 fe-cluster-03 TopoServer  NSTopo::updateNode (PCCNodeEvent) ip: 172.25.155.26 pcc_ip: 172.25.155.26 evt_type: PCC_SYNC_COMPLETE
Dec  9 17:12:11.610230 fe-cluster-03 TopoServer  Adding PCEP flag to pcep_ip: 172.25.155.26 node_id: 0880.0000.0026 router_ID: 88.0.0.26 protocols: 4
Dec  9 17:12:11.610232 fe-cluster-03 TopoServer  Setting live pcep_ip: 172.25.155.26 for router_ID: 88.0.0.26

Some options for correcting the problem of an unrecognized IP address are:

  • Manually input the unrecognized IP address in the device profile in the NorthStar Web UI by navigating to More Options > Administration > Device Profile.

  • Ensure there is at least one LSP originating on the router, which allows the Toposerver to associate the PCEP session with the node in the TED.

Once the IP address problem is resolved, and the Toposerver is able to successfully associate the PCEP session with the node in the topology, it adds the PCEP IP address to the node attributes as can be seen in the PCS log:

Dec  9 17:12:11.611392 fe-cluster-03 PCServer  [<-TopoServer] routing_key = ns_node_update_key
Dec  9 17:12:11.611394 fe-cluster-03 PCServer  [<-TopoServer] NODE UPDATE(Live): ID=0880.0000.0026 protocols=(20)ISIS2,PCEP status=UNKNOWN hostname=skynet_26 router_ID=88.0.0.26 iso=0880.0000.0026 isis_area=490001 AS=41 mgmt_ip=172.25.155.26 source=NTAD Hostname=skynet_26 pcep_ip=172.25.155.26
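To confirm from the PCS log that a node has become PCEP-enabled, the NODE UPDATE entry can be parsed into its attributes. The sketch below assumes the `key=value` layout of the sample entry. The function names are hypothetical, and the reading of `protocols=(20)ISIS2,PCEP` as a numeric prefix followed by a comma-separated protocol list is an assumption based on that sample.

```python
import re


def parse_node_update(log_line):
    """Return the key=value attributes of a NODE UPDATE log entry as a dict."""
    return dict(re.findall(r"(\w+)=(\S+)", log_line))


def is_pcep_enabled(attrs):
    """True when the protocol list includes PCEP and a pcep_ip is recorded."""
    # Drop the "(20)"-style prefix, then split the protocol names.
    protocols = attrs.get("protocols", "").split(")")[-1].split(",")
    return "PCEP" in protocols and "pcep_ip" in attrs


line = (
    "Dec  9 17:12:11.611394 fe-cluster-03 PCServer  [<-TopoServer] "
    "NODE UPDATE(Live): ID=0880.0000.0026 protocols=(20)ISIS2,PCEP "
    "status=UNKNOWN hostname=skynet_26 router_ID=88.0.0.26 "
    "iso=0880.0000.0026 isis_area=490001 AS=41 mgmt_ip=172.25.155.26 "
    "source=NTAD Hostname=skynet_26 pcep_ip=172.25.155.26"
)

attrs = parse_node_update(line)
print(attrs["router_ID"], attrs["pcep_ip"], is_pcep_enabled(attrs))
```

If `is_pcep_enabled` returns `False` for a router that has an established PCEP session, the unrecognized IP address case described in this section is a likely cause.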

LSP Stuck in PENDING or PCC_PENDING State

Once nodes are correctly established as PCEP-enabled, you can start provisioning LSPs. It is possible for the LSP controller status to indicate PENDING or PCC_PENDING as seen in the Tunnels tab of the Web UI network information table (Controller Status column). This section explains how to interpret those statuses.

When an LSP is being provisioned, the PCS computes a path that satisfies all the requirements for the LSP, and then sends a provisioning order to the PCEP server. Log messages similar to the following example appear in the PCS log while this process is taking place:

Apr 25 10:06:44.798336 user-PCS PCServer  [->TopoServer] push lsp configlet, action=ADD
Apr 25 10:06:44.798341 user-PCS PCServer  {#012"lsps":[#012{"request-id":928380025,"name":"JTAC","from":"10.0.0.102",
 "to":"10.0.0.104","pcc":"172.25.158.66","bandwidth":"100000","metric":0,"local-protection":false,"type":"primary",
"association-group-id":0,"path-attributes":{"admin-group":{"exclude":0,"include-all":0, "include-any":0},"setup-priority":
7,"reservation-priority":7,"ero":[{"ipv4-address":"10.102.105.2"},{"ipv4-address":"10.105.107.2"}, {"ipv4-address":
"10.114.117.1"}]}}#012]#012}
Apr 25 10:06:44.802500 user-PCS PCServer  provisioning order sent, status = SUCCESS
Apr 25 10:06:44.802519 user-PCS PCServer  [->TopoServer] Save LSP action, id=928380025 event=Provisioning Order(ADD) sent request_id=928380025
Apr 25 10:06:44.802534 user-PCS PCServer   lsp action=ADD JTAC@10.0.0.102 path= controller_state=PENDING

The LSP controller status is PENDING at this point, meaning that the provisioning order has been sent to the PCEP server, but an acknowledgement has not yet been received. If an LSP is stuck at PENDING, it suggests that the problem lies with the PCEP server. You can log in to the PCEP server and enable verbose logging, which can provide additional troubleshooting information:

pcep_cli
set log-level all

There are also a variety of show commands on the PCEP server that can display useful information. Just as with Junos OS syntax, you can enter show ? to see the show command options.
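When reviewing the "push lsp configlet" entry in the PCS log shown earlier in this section, it can help to decode the provisioning order into structured data. The following sketch assumes that `#012` in the log is the syslog escape for a line feed (octal 012) and that the payload has been joined back into a single string. The function names are hypothetical.

```python
import json


def decode_provisioning_order(payload):
    """Turn the escaped JSON payload of a provisioning order into a dict."""
    # Restore the escaped line feeds so the text is plain JSON again.
    return json.loads(payload.replace("#012", "\n"))


def summarize(order):
    """Yield (name, from, to, ERO hops) for each LSP in a decoded order."""
    for lsp in order["lsps"]:
        ero = [hop["ipv4-address"] for hop in lsp["path-attributes"].get("ero", [])]
        yield lsp["name"], lsp["from"], lsp["to"], ero


payload = (
    '{#012"lsps":[#012{"request-id":928380025,"name":"JTAC","from":"10.0.0.102",'
    '"to":"10.0.0.104","pcc":"172.25.158.66","bandwidth":"100000","metric":0,'
    '"local-protection":false,"type":"primary","association-group-id":0,'
    '"path-attributes":{"admin-group":{"exclude":0,"include-all":0, "include-any":0},'
    '"setup-priority":7,"reservation-priority":7,"ero":[{"ipv4-address":"10.102.105.2"},'
    '{"ipv4-address":"10.105.107.2"}, {"ipv4-address":"10.114.117.1"}]}}#012]#012}'
)

order = decode_provisioning_order(payload)
for name, src, dst, ero in summarize(order):
    print(name, src, dst, " -> ".join(ero))
```

The `request-id` value in the decoded order is the same identifier that appears in the later acknowledgement entries, which makes it a convenient key for following one order through the logs.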

If the PCEP server successfully receives the provisioning order, it performs two actions:

  • It forwards the order to the PCC.

  • It sends an acknowledgement back to the PCS.

The PCEP server log would show an entry similar to the following example:

2016-04-25 10:06:45.196263(27897)[EVENT]: 172.25.158.66:JTAC UPD RCVD FROM PCC, ack 928380025
2016-04-25 10:06:45.196517(27897)[EVENT]: 172.25.158.66:JTAC ADD SENT TO   PCS 928380025, UP

The LSP controller status changes to PCC_PENDING, indicating that the PCEP server received the provisioning order and forwarded it on to the PCC, but the PCC has not yet responded. If an LSP is stuck at PCC_PENDING, it suggests that the problem lies with the PCC.

If the PCC receives the provisioning order successfully, it sends a response to the PCEP server, which, in turn, forwards the response to the PCS. When the PCS receives this response, it clears the LSP controller status completely, indicating that the LSP is fully provisioned and is not waiting for action from the PCEP server or PCC. The operational status (Op Status column) then becomes the indicator for the condition of the tunnel.

The PCS log would show an entry similar to the following example:

Apr 25 10:06:45.203909 user-PCS PCServer  [<-TopoServer] JTAC@10.0.0.102, LSP event=(0)CREATE request_id=928380025 tunnel_id=9513 lsp_id=1 report_type=ACK
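The interpretation of the pending statuses in this section, together with the related entries in Table 4, can be summarized as a small lookup. This is only a reading aid. The function name and hint wording are illustrative.

```python
def next_place_to_look(controller_status):
    """Suggest which component to investigate for a given controller status."""
    hints = {
        "PENDING": "PCEP server: the PCS is awaiting its response",
        "PCC_PENDING": "PCC: the PCEP server forwarded the order, no response yet",
        "NETCONF_PENDING": "netconfd: the PCS is awaiting its response",
        "PRPD_PENDING": "PRPD client: the PCS is awaiting its response",
    }
    if not controller_status:
        # A cleared status means provisioning finished; Op Status takes over.
        return "Controller status cleared: check the Op Status column"
    return hints.get(controller_status, "See the LSP controller status table")


for status in ("PENDING", "PCC_PENDING", ""):
    print(repr(status), "->", next_place_to_look(status))
```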

LSP That is Not Active

If an LSP provisioning order is successfully sent and acknowledged, and the controller status is cleared, it is still possible that the LSP is not up and running. If the operational status of the LSP is DOWN, the PCC cannot signal the LSP. This section explores some of the possible reasons for the LSP operational status to be DOWN.

Utilization is a key concept related to LSPs that are stuck in DOWN. There are two types of utilization, and they can be different from each other at any specific time:

  • Live utilization—This type is used by the routers in the network to signal an LSP path. This type of utilization is learned from the TED by way of NTAD. You might see PCS log entries such as those in the following example. In particular, note the reservable bandwidth (reservable_bw) entries that advertise the RSVP utilization on the link:

    Apr 25 10:10:11.475686 user-PCS PCServer  [<-TopoServer] LINK UPDATE: ID=L10.105.107.1_10.105.107.2 status=UP nodeA=0110.0000.0105 nodeZ=0110.0000.0107 protocols=(260)ISIS2,MPLS
    Apr 25 10:10:11.475690 user-PCS PCServer    [A->Z] ID=L10.105.107.1_10.105.107.2 IP address=10.105.107.1 bw=10000000000 max_rsvp_bw=10000000000 te_metric=10 color=0 reservable_bw={9599699968 8599699456 7599699456 7599699456 7599699456 7599699456 7599699456 7099599360 }
    Apr 25 10:10:11.475694 user-PCS PCServer    [Z->A] ID=L10.105.107.1_10.105.107.2 IP address=10.105.107.2 bw=10000000000 max_rsvp_bw=10000000000 te_metric=10 color=0 reservable_bw={10000000000 10000000000 10000000000 8999999488 7899999232 7899999232 7899999232 7899999232 }
    
  • Planned utilization—This type is used within NorthStar Controller for path computation. This utilization is learned from PCEP when the router advertises the LSP and communicates to NorthStar the LSP bandwidth and the path the LSP is to use. You might see PCS log entries such as those in the following example. In particular, note the bandwidth (bw) and record route object (RRO) entries that advertise the RSVP utilization on the link:

    Apr 25 10:06:45.208021 ns-PCS PCServer  [<-TopoServer] routing_key = ns_lsp_link_key
    Apr 25 10:06:45.208034 ns-PCS PCServer  [<-TopoServer] JTAC@10.0.0.102, LSP event=(2)UPDATE request_id=0 tunnel_id=9513 lsp_id=1 report_type=STATE_CHANGE
    Apr 25 10:06:45.208039 ns-PCS PCServer   JTAC@10.0.0.102, lsp add/update event lsp_state=ACTIVE admin_state=UP, delegated=true
    Apr 25 10:06:45.208042 ns-PCS PCServer   from=10.0.0.102 to=10.0.0.104
    Apr 25 10:06:45.208046 ns-PCS PCServer   primary path
    Apr 25 10:06:45.208049 ns-PCS PCServer   association.group_id=128 association_type=1
    Apr 25 10:06:45.208052 ns-PCS PCServer    priority=7/7 bw=100000 metric=30
    Apr 25 10:06:45.208056 ns-PCS PCServer    admin group bits exclude=0 include_any=0 include_all=0
    Apr 25 10:06:45.208059 ns-PCS PCServer    PCE initiated
    Apr 25 10:06:45.208062 ns-PCS PCServer    ERO=0110.0000.0102--10.102.105.2--10.105.107.2--10.114.117.1
    Apr 25 10:06:45.208065 ns-PCS PCServer    RRO=0110.0000.0102--10.102.105.2--10.105.107.2--10.114.117.1
    Apr 25 10:06:45.208068 ns-PCS PCServer    samepath, state changed
    

The two utilizations can differ enough to interfere with successful computation or signaling of the path. For example, if the planned utilization is higher than the live utilization, the PCS might be unable to compute the path because it thinks there is no room for it, even though the lower live utilization means there may very well be room.

It’s also possible for the planned utilization to be lower than the live utilization. In that case, the PCC does not signal the path because it thinks there is no room for it.
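The two failure modes described above can be summarized in a short sketch. The function name, the tolerance parameter, and the sample values are assumptions for illustration only; this is not a NorthStar API.

```python
def classify_utilization_mismatch(planned_bw, live_bw, tolerance=0.0):
    """Compare planned and live RSVP utilization for one link.

    Hypothetical helper: both values must use the same unit, and
    `tolerance` is an assumed allowance for small differences.
    """
    difference = planned_bw - live_bw
    if abs(difference) <= tolerance:
        return "in sync"
    if difference > 0:
        # The PCS believes more bandwidth is reserved than the routers
        # report, so path computation can fail despite available room.
        return "planned higher: PCS may fail to compute a path"
    # The routers report more reserved bandwidth than the PCS expects,
    # so the PCC may not signal a path that the PCS computed.
    return "planned lower: PCC may fail to signal the path"


if __name__ == "__main__":
    print(classify_utilization_mismatch(planned_bw=900_000, live_bw=400_000))
    print(classify_utilization_mismatch(planned_bw=100_000, live_bw=400_000))
```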

To view utilization in the Web UI topology map, navigate to Options in the left pane of the Topology view. If you select RSVP Live Utilization, the topology map reflects the live utilization reported by the routers. If you select RSVP Utilization, the topology map reflects the planned utilization, which the NorthStar Controller computes from planned properties.

A better troubleshooting tool in the Web UI is the Network Model Audit widget in the Dashboard view. The Link RSVP Utilization line item reflects whether there are any mismatches between the live and the planned utilizations. If there are, you can try executing Sync Network Model from the Web UI by navigating to Administration > System Settings, and then clicking Advanced Settings in the upper right corner of the resulting window.

Note:

The upper right corner button toggles between General Settings and Advanced Settings.

PCS Out of Sync with Toposerver

If the PCS becomes out of sync with Toposerver such that they do not agree on the state of LSPs, you must deactivate and reactivate the PCEP protocol in order to restore synchronization. Perform the following steps on the NorthStar server.

CAUTION:

Be aware that following this procedure:

  • Kills the PCEP sessions for all PCCs, not just the one with which there is a problem.

  • Results in the loss of all user data which then needs to be repopulated.

  • Has an impact on a production system due to the resynchronization.

  1. Stop the PCE server and wait 10 seconds to allow the PCC to remove all lingering LSPs.

    supervisorctl stop northstar:pceserver
  2. Restart the PCE server.

    supervisorctl start northstar:pceserver
  3. Restart Toposerver.

    supervisorctl restart northstar:toposerver
    Note:

    An alternative way to restart Toposerver is to perform a Reset Network Model from the NorthStar Controller web UI (Administration > System Settings, Advanced). See the Disappearing Changes section for more information about the Sync Network Model and Reset Network Model operations.
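If you run this procedure often, the three steps can be scripted. The following is a minimal sketch, not a supported tool: the supervisorctl commands are copied from the steps above, while the function name and the injectable run and sleep parameters are assumptions added so that the sequence can be dry-run without touching a production system. The cautions listed before the procedure still apply.

```python
import subprocess
import time

# Commands copied from the procedure above, in execution order.
STOP_PCE = ["supervisorctl", "stop", "northstar:pceserver"]
START_PCE = ["supervisorctl", "start", "northstar:pceserver"]
RESTART_TOPO = ["supervisorctl", "restart", "northstar:toposerver"]


def _run_checked(command):
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(command, check=True)


def resync_pcs_with_toposerver(run=_run_checked, sleep=time.sleep, wait_seconds=10):
    """Stop the PCE server, wait, start it again, then restart Toposerver.

    Returns the commands that were issued, as strings. `run` and `sleep`
    are hypothetical hooks that make a dry run possible.
    """
    issued = []
    steps = ((STOP_PCE, wait_seconds), (START_PCE, 0), (RESTART_TOPO, 0))
    for command, pause in steps:
        run(command)
        issued.append(" ".join(command))
        if pause:
            # Give the PCC time to remove lingering LSPs.
            sleep(pause)
    return issued


if __name__ == "__main__":
    # Dry run: list the commands without executing anything.
    for line in resync_pcs_with_toposerver(run=lambda c: None, sleep=lambda s: None):
        print(line)
```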

Disappearing Changes

Two options are available in the Web UI for synchronizing the topology with the live network. These options are only available to the system administrator, and can be accessed by first navigating to Administration > System Settings, and then clicking Advanced Settings in the upper right corner of the resulting window.

Note:

The upper right corner button toggles between General Settings and Advanced Settings.

Figure 6 shows the two options that are displayed.

Figure 6: Synchronization Operations

It is important to be aware that if you execute Reset Network Model in the Web UI, you will lose changes that you’ve made to the database. In a multi-user environment, one user might reset the network model without the knowledge of the other users. When a reset is requested, the request goes from the PCS to the Toposerver, and the PCS log reflects:

Apr 25 10:54:50.385008 user-PCS PCServer  [->TopoServer] Request topology reset

The Toposerver log then reflects that database elements are being removed:

Apr 25 10:54:50.386912 user-PCS TopoServer  Truncating pcs.links...
Apr 25 10:54:50.469722 user-PCS TopoServer  Truncating pcs.nodes...
Apr 25 10:54:50.517501 user-PCS TopoServer  Truncating pcs.lsps...
Apr 25 10:54:50.753705 user-PCS TopoServer  Truncating pcs.interfaces...
Apr 25 10:54:50.806737 user-PCS TopoServer  Truncating pcs.facilities...

The Toposerver then requests synchronization both with the Junos VM, to retrieve the topology nodes and links, and with the PCEP server, to retrieve the LSPs. In this way, the Toposerver relearns the topology, but any user updates are missing. Figure 7 illustrates the flow from the topology reset request to the request for synchronization with the Junos VM and the PCEP Server.

Figure 7: Reset Model Request

Upon receipt of the synchronization requests, Junos VM and the PCEP server return topology updates that reflect the current live network. The PCS log shows this information being added to the database:

Apr 25 10:54:52.237882 user-PCS PCServer  [<-TopoServer] Update Topology
Apr 25 10:54:52.237894 user-PCS PCServer  [<-TopoServer] Update Topology Persisted Nodes (0)
Apr 25 10:54:52.238957 user-PCS PCServer  [<-TopoServer] Update Topology Live Nodes (7)
Apr 25 10:54:52.242336 user-PCS PCServer  [<-TopoServer] Update Topology Persisted Links (0)
Apr 25 10:54:52.242372 user-PCS PCServer  [<-TopoServer] Update Topology live Links (10)
Apr 25 10:54:52.242556 user-PCS PCServer  [<-TopoServer] Update Topology Persisted Facilities (1)
Apr 25 10:54:52.242674 user-PCS PCServer  [<-TopoServer] Update Topology Persisted LSPs (0)
Apr 25 10:54:52.279716 user-PCS PCServer  [<-TopoServer] Update Topology Live LSPs (47)
Apr 25 10:54:52.279765 user-PCS PCServer  [<-TopoServer] Update Topology Finished
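To confirm at a glance that a reset discarded persisted data, you can tally the counts from these entries. The sketch below assumes the log lines look exactly like the sample above; the pattern and function name are illustrative, and other releases might format these entries differently.

```python
import re

# Matches entries such as "... Update Topology Live Nodes (7)".
# Case-insensitive because the sample contains both "Live" and "live".
COUNT_PATTERN = re.compile(
    r"Update Topology (Persisted|Live) (\w+) \((\d+)\)", re.IGNORECASE
)


def topology_counts(log_lines):
    """Return {(source, element): count}, for example {("live", "nodes"): 7}."""
    counts = {}
    for line in log_lines:
        match = COUNT_PATTERN.search(line)
        if match:
            source, element, count = match.groups()
            counts[(source.lower(), element.lower())] = int(count)
    return counts


if __name__ == "__main__":
    sample = [
        "Apr 25 10:54:52.237894 user-PCS PCServer  [<-TopoServer] Update Topology Persisted Nodes (0)",
        "Apr 25 10:54:52.238957 user-PCS PCServer  [<-TopoServer] Update Topology Live Nodes (7)",
        "Apr 25 10:54:52.279765 user-PCS PCServer  [<-TopoServer] Update Topology Finished",
    ]
    for key, value in topology_counts(sample).items():
        print(key, value)
```

After a reset, persisted counts of zero for nodes, links, and LSPs alongside nonzero live counts are consistent with the behavior described above.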

Figure 8 illustrates the return of topology updates from the Junos VM and the PCEP Server to the Toposerver and the PCS.

Figure 8: Model Updates Using Reset Network Model

Use the Reset Network Model operation when you want to start over from scratch with your topology. If you don’t want to lose user planning data when synchronizing with the live network, execute the Sync Network Model operation instead. With this operation, the PCS still requests a topology synchronization, but the Toposerver does not delete the existing elements. Figure 9 illustrates the flow from the PCS to the Junos VM and PCEP server, and the updates coming back to the Toposerver.

Figure 9: Synchronization Request and Model Updates Using Sync Network Model

Investigating Client Side Issues

If you are looking for the source of a problem and cannot find it on the server side of the system, a debugging flag can help you find it on the client side. The flag enables detailed messages on the web browser console about what has been exchanged between the server and the client. For example, you might notice that an update is not reflected in the Web UI. Using these detailed messages, you can identify possible miscommunication between the server and the client, such as the server not actually sending the update.

To enable this debug flag, modify the URL you use to launch the Web UI as follows:

https://server_address:8443/client/app.html?debug=true
Note:

If you are already in the Web UI, it is not necessary to log out; simply add ?debug=true to the URL and press Enter. The UI reloads.
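If you build Web UI links in a script, the flag can be appended safely even when the URL already carries a query string or a fragment. This is a minimal sketch that uses only the Python standard library; the function name is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug_flag(url):
    """Return `url` with debug=true set in its query string."""
    parts = urlsplit(url)
    # Keep any existing parameters and overwrite a previous debug value.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["debug"] = "true"
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_debug_flag("https://server_address:8443/client/app.html"))
```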

Figure 10 shows an example of the web browser console with detailed debugging messages.

Figure 10: Web Browser Console with Debugging Messages

Accessing the console varies by browser. Figure 11 shows an example: accessing the console on Google Chrome.

Figure 11: Accessing the Google Chrome Console

Incomplete Results of the Bandwidth Sizing Scheduled Task

If execution of the bandwidth sizing scheduled task does not result in publishing statistics for all the bandwidth sizing-enabled LSPs, check whether traffic statistics are being collected for all of those LSPs for the scheduled duration. If traffic statistics are not available for an LSP, that LSP cannot be resized.

You can use the NorthStar Collector web UI to determine whether traffic statistics are being collected:

  1. Open the Tunnel tab in the network information table.

  2. Select the LSPs that have not been resized.

  3. Right-click and select View LSP Traffic.

  4. Click custom in the upper left corner, provide the schedule duration, and click Submit.

Troubleshooting NorthStar Integration with HealthBot

If updating devices to HealthBot is failing in NorthStar, first check for errors in the NorthStar web application server logs:

[root@ns1-site1 ~]# tail -f /opt/northstar/logs/web_app.msg

2019 Oct 15 02:46:49.824 - info: Request:  User:admin (full):http:GET:127.0.0.1:/NorthStar/API/v1/tenant/1/RouterProfiles/vendorList
2019 Oct 15 02:46:52.165 - info: Request:  User:admin (full):http:GET:127.0.0.1:/NorthStar/API/v1/tenant/1/RouterProfiles/liveNetwork
2019 Oct 15 02:47:10.466 - info: Request:  User:admin (full):http:POST:127.0.0.1:/NorthStar/API/v2/tenant/1/RouterProfiles/healthbot/updateDevices
req: {}
2019 Oct 15 02:47:17.084 - debug: Devices updated, Healthbot response body = ""
2019 Oct 15 02:47:17.512 - info: Request:  User:admin (full):http:POST:127.0.0.1:/NorthStar/API/v2/tenant/1/RouterProfiles/healthbot/updateDeviceGroup
req: {"devices":["vmx104","vmx101","vmx107","vmx103","vmx106","vmx105","vmx102"]}
2019 Oct 15 02:47:18.453 - debug: Device Group updated, Healthbot response body = ""
2019 Oct 15 02:47:18.860 - info: Request:  User:admin (full):http:POST:127.0.0.1:/NorthStar/API/v2/tenant/1/RouterProfiles/healthbot/commitConfigs
2019 Oct 15 02:47:18.935 - debug: Commit completed, Healthbot response body = "{\n  \"detail\": \"Committing the configuration.\",\n  \"status\": 202,\n  \"url\": \"/api/v1/configuration/jobs/?job_id=c6be7387-bfbf-45e4-97c8-993f27bcbe09\"\n}\n"
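When the log is long, it can help to pull out just the HealthBot responses. The sketch below assumes that each debug line quotes the response body as a JSON-style string, as the sample above appears to do; the pattern and function name are illustrative and are not part of NorthStar.

```python
import json
import re

# Captures the quoted body at the end of a "Healthbot response body" line.
BODY_PATTERN = re.compile(r'Healthbot response body = (".*")\s*$')


def healthbot_response(line):
    """Decode the response body from one web_app.msg debug line.

    Returns None if the line carries no response body, "" for an empty
    body, the parsed JSON value when the body is JSON, else the raw text.
    """
    match = BODY_PATTERN.search(line)
    if not match:
        return None
    quoted = match.group(1)
    try:
        # Assumption: the log escapes the body like a JSON string literal.
        text = json.loads(quoted)
    except json.JSONDecodeError:
        return quoted
    if not text.strip():
        return ""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text
```

For the commit entry in the sample, the decoded body is a dictionary whose status field is 202, which matches the "Committing the configuration" detail text.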

The HealthBot API server logs might also provide helpful information if updating devices to HealthBot is failing:

root@healthbot-vm1:~# healthbot logs --device-group healthbot -s api_server
docker logs 1557243a5b 2>&1 | vi -
Vim: Reading from stdin...

To determine if RPM probe data and LDP demands statistics collection is working, access the IAgent container logs. IAgent is used for RPM data (link latency) and LDP demands statistics collection.

root@healthbot-vm1:~# docker ps | grep iagent | grep northstar
3492c1f3774f        healthbot_iagent:2.1.0-beta-custom       "/entrypoint.sh salt…"   23 hours ago        Up 23 hours                                                                                                                                  device-group-northstar_device-group-northstar-iagent_1

root@healthbot-vm1:~# docker exec -it 3492c1f3774f  bash

root@3492c1f3774f:/# tail -f /tmp/inter-packet-export.log
2019-10-15 07:19:15,329 inter-packet.ns_link_latency Aggregates sent for 4 objects for node=vmx106

2019-10-15 07:19:24,546 inter-packet.ns_demand aggregates sent for 6 objects for node=vmx102
2019-10-15 07:19:27,522 inter-packet.ns_demand aggregates sent for 6 objects for node=vmx101
2019-10-15 07:19:33,788 inter-packet.ns_demand aggregates sent for 6 objects for node=vmx105
2019-10-15 07:19:38,110 inter-packet.ns_demand aggregates sent for 6 objects for node=vmx104
2019-10-15 07:19:39,251 inter-packet.ns_demand aggregates sent for 6 objects for node=vmx103
2019-10-15 07:20:04,654 inter-packet.ns_link_latency Aggregates sent for 2 objects for node=vmx104

2019-10-15 07:20:05,878 inter-packet.ns_link_latency Aggregates sent for 4 objects for node=vmx105

2019-10-15 07:20:06,535 inter-packet.ns_link_latency Aggregates sent for 1 objects for node=vmx103

2019-10-15 07:20:07,537 inter-packet.ns_link_latency Aggregates sent for 3 objects for node=vmx101

2019-10-15 07:20:09,479 inter-packet.ns_link_latency Aggregates sent for 4 objects for node=vmx102

2019-10-15 07:20:15,332 inter-packet.ns_link_latency Aggregates sent for 4 objects for node=vmx106

2019-10-15 07:21:04,657 inter-packet.ns_link_latency Aggregates sent for 2 objects for node=vmx104

2019-10-15 07:21:05,881 inter-packet.ns_link_latency Aggregates sent for 4 objects for node=vmx105

2019-10-15 07:21:06,538 inter-packet.ns_link_latency Aggregates sent for 1 objects for node=vmx103

2019-10-15 07:21:07,540 inter-packet.ns_link_latency Aggregates sent for 3 objects for node=vmx101

2019-10-15 07:21:09,484 inter-packet.ns_link_latency Aggregates sent for 4 objects for node=vm

To determine if JTI LSP and interface statistics data collection is working, access the fluentd container logs. Native GPB is used for JTI data collection.

root@healthbot-vm1:~# docker ps | grep fluentd | grep northstar
5fa268d0410b        healthbot_fluentd:2.1.0-beta-custom      "/fluentd/etc/startu…"   20 hours ago        Up 20 hours         5140/tcp, 0.0.0.0:4000->4000/tcp, 0.0.0.0:4000->4000/udp, 24224/tcp                                                      device-group-northstar_device-group-northstar-fluentd_1

root@healthbot-vm1:~# docker exec -it 5fa268d0410b  bash

root@5fa268d0410b:/# tail -f /tmp/inter-packet-export.log
2019-10-15 06:00:01,241 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
2019-10-15 06:01:01,245 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
2019-10-15 06:02:01,248 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
2019-10-15 06:03:01,255 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
2019-10-15 06:04:01,259 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
2019-10-15 06:05:01,265 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
2019-10-15 06:06:01,269 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
2019-10-15 06:07:01,274 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
2019-10-15 06:08:01,279 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
2019-10-15 06:09:01,285 inter-packet.ns_interface_traffic aggregates sent for 24 objects for node=vmx105
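The export logs in both containers share one line format, so a short script can show which nodes are reporting each measurement. This is a sketch based only on the sample lines above; the pattern, the function names, and the idea of passing a list of expected nodes are assumptions.

```python
import re
from collections import defaultdict

# Matches "inter-packet.<measurement> aggregates sent for N objects for node=<node>".
# Case-insensitive because the samples use both "Aggregates" and "aggregates".
EXPORT_PATTERN = re.compile(
    r"inter-packet\.(?P<measurement>\w+) aggregates sent for "
    r"(?P<objects>\d+) objects for node=(?P<node>\S+)",
    re.IGNORECASE,
)


def nodes_by_measurement(log_lines):
    """Map each measurement name to the set of nodes that reported it."""
    seen = defaultdict(set)
    for line in log_lines:
        match = EXPORT_PATTERN.search(line)
        if match:
            seen[match["measurement"]].add(match["node"])
    return dict(seen)


def missing_nodes(log_lines, expected_nodes):
    """For each measurement seen, list expected nodes with no log entry."""
    expected = set(expected_nodes)
    return {
        measurement: sorted(expected - nodes)
        for measurement, nodes in nodes_by_measurement(log_lines).items()
    }
```

For example, the fluentd sample above contains ns_interface_traffic entries only for vmx105, so any other node you pass in expected_nodes would be reported as missing for that measurement over the captured interval.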

To determine whether statistics data is being sent from the HealthBot server to the PCS, check the PCS logs for live statistics notifications:

[root@ns1-site1-q-pod21 ~]# tail -f /opt/northstar/logs/pcs.log
2019 Oct 15 00:09:19.221768 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Traffic] msg=0x00005002  ge-0/0/5.3@vmx102 out=0 in=-1
2019 Oct 15 00:09:19.221783 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Traffic] msg=0x00005002  ge-0/0/1.0@vmx102 out=0 in=-1
2019 Oct 15 00:09:19.221798 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Traffic] msg=0x00005002  ge-0/0/5.200@vmx102 out=0 in=-1
2019 Oct 15 00:09:19.221812 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Traffic] msg=0x00005002  ge-0/0/5.301@vmx102 out=0 in=-1
2019 Oct 15 00:09:19.880395 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][<-AMQP] msg=0x00004018 exchange=controller.wan.stats routing_key=ns_tunnel_traffic
2019 Oct 15 00:09:19.880456 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Traffic] msg=0x00005004  test1_102_105-1@vmx102  3836219
2019 Oct 15 00:09:19.880463 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Traffic] msg=0x00005004  rsvp-102-105@vmx102  0
2019 Oct 15 00:09:19.880469 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Traffic] msg=0x00005004  Silver-102-101@vmx102  1041649
2019 Oct 15 00:09:19.880479 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Traffic] msg=0x00005004  Silver-102-104@vmx102  3390530
2019 Oct 15 00:09:19.880483 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Traffic] msg=0x00005004  Silver-102-103@vmx102  4261408

2019 Oct 15 00:09:26.795447 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][<-AMQP] msg=0x00004018 exchange=controller.wan.stats routing_key=ns_link_latency
2019 Oct 15 00:09:26.795453 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Latency] msg=0x00007002  ge-0/1/8.0@vmx103 20.00 ms, packet_loss=0.00%
2019 Oct 15 00:09:26.795462 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Latency] msg=0x00007002  ge-0/0/6.0@vmx101 4.00 ms, packet_loss=0.00%
2019 Oct 15 00:09:26.795471 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Latency] msg=0x00007002  ge-0/0/5.0@vmx101 3.00 ms, packet_loss=0.00%
2019 Oct 15 00:09:26.795473 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Latency] msg=0x00007002  ge-0/1/1.0@vmx101 19.00 ms, packet_loss=0.00%
2019 Oct 15 00:09:26.795476 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Latency] msg=0x00007002  ge-0/1/9.0@vmx104 10.00 ms, packet_loss=0.00%
2019 Oct 15 00:09:26.795479 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][Latency] msg=0x00007002  ge-0/1/7.0@vmx104 0.00 ms, packet_loss=0.00%


2019 Oct 15 00:09:27.710072 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][<-AMQP] msg=0x00004018 exchange=controller.wan.stats routing_key=ns_demand
2019 Oct 15 00:09:27.710264 ns1-site1-q-pod21 PCServer  [Debug][PCServer] node:vmx102 prefix:10.0.0.101/32 bit_rate:0 demand_name=vmx102_10.0.0.101/32 to=10.0.0.101/32 SNMP_ifIndex:0 next_hope=
2019 Oct 15 00:09:27.710599 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][->pcs_tunnel_event] msg=0x00004002 LSP action, UPDATE id=3718607015 event=demand update
2019 Oct 15 00:09:27.710667 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][tunnelEvent] msg=0x00004027 LSP action, UPDATE id=3718607015 event=demand update
2019 Oct 15 00:09:27.710697 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][tunnelEvent] msg=0x0000400a   vmx102_10.0.0.101/32@10.0.0.102 pathname=10.0.0.101 to=10.0.0.101 bw=0 pri=7 pre=7 type=R,A2Z,PATH(10.0.0.101) path= op_state=ACTIVE ns_lsp_id =42 demand=true prefix=10.0.0.101/32
2019 Oct 15 00:09:27.710724 ns1-site1-q-pod21 PCServer  [Debug][PCServer] Redis Obj Save: Topology 1 OBJ: ns:1:pcs_lsp:id:int:obj 42 {buf} index:ns:1:pcs_lsp:indexes id_str:
2019 Oct 15 00:09:27.711440 ns1-site1-q-pod21 PCServer  [Debug][PCServer] Redis Obj Save: Done
2019 Oct 15 00:09:27.711450 ns1-site1-q-pod21 PCServer  [Debug][PCServer] node:vmx102 prefix:10.0.0.105/32 bit_rate:0 demand_name=vmx102_10.0.0.105/32 to=10.0.0.105/32 SNMP_ifIndex:0 next_hope=
2019 Oct 15 00:09:27.711454 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][->pcs_tunnel_event] msg=0x00004002 LSP action, UPDATE id=3718607015 event=demand update
2019 Oct 15 00:09:27.711457 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][tunnelEvent] msg=0x00004027 LSP action, UPDATE id=3718607015 event=demand update
2019 Oct 15 00:09:27.711461 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][tunnelEvent] msg=0x0000400a   vmx102_10.0.0.105/32@10.0.0.102 pathname=10.0.0.105 to=10.0.0.105 bw=0 pri=7 pre=7 type=R,A2Z,PATH(10.0.0.105) path= op_state=ACTIVE ns_lsp_id =44 demand=true prefix=10.0.0.105/32
2019 Oct 15 00:09:27.711464 ns1-site1-q-pod21 PCServer  [Debug][PCServer] Redis Obj Save: Topology 1 OBJ: ns:1:pcs_lsp:id:int:obj 44 {buf} index:ns:1:pcs_lsp:indexes id_str:
2019 Oct 15 00:09:27.712010 ns1-site1-q-pod21 PCServer  [Debug][PCServer] Redis Obj Save: Done
2019 Oct 15 00:09:27.712033 ns1-site1-q-pod21 PCServer  [Debug][PCServer] node:vmx102 prefix:10.0.0.103/32 bit_rate:0 demand_name=vmx102_10.0.0.103/32 to=10.0.0.103/32 SNMP_ifIndex:0 next_hope=
2019 Oct 15 00:09:27.712039 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][->pcs_tunnel_event] msg=0x00004002 LSP action, UPDATE id=3718607015 event=demand update
2019 Oct 15 00:09:27.712042 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][tunnelEvent] msg=0x00004027 LSP action, UPDATE id=3718607015 event=demand update
2019 Oct 15 00:09:27.712048 ns1-site1-q-pod21 PCServer  [NorthStar][PCServer][tunnelEvent] msg=0x0000400a   vmx102_10.0.0.103/32@10.0.0.102 pathname=10.0.0.103 to=10.0.0.103 bw=0 pri=7 pre=7 type=R,A2Z,PATH(10.0.0.103) path= op_state=ACTIVE ns_lsp_id =48 demand=true prefix=10.0.0.103/32
2019 Oct 15 00:09:27.712808 ns1-site1-q-pod21 PCServer  [Debug][PCServer] Redis Obj Save: Topology 1 OBJ: ns:1:pcs_lsp:id:int:obj 48 {buf} index:ns:1:pcs_lsp:indexes id_str:
2019 Oct 15 00:09:27.713209 ns1-site1-q-pod21 PCServer  [Debug][PCServer] Redis Obj Save: Done
2019 Oct 15 00:09:27.713219 ns1-site1-q-pod21 PCServer  [Debug][PCServer] node:vmx102 prefix:10.0.0.104/32 bit_rate:0 demand_name=vmx102_10.0.0.104/32 to=10.0.0.104/32 SNMP_ifIndex:0 next_hope=
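To review the link latency notifications without reading the whole log, you can extract the [Latency] entries. The following sketch is derived only from the sample lines above; the pattern, the function name, and the returned field names are assumptions.

```python
import re

# Matches entries such as:
# "[Latency] msg=0x00007002  ge-0/1/8.0@vmx103 20.00 ms, packet_loss=0.00%"
LATENCY_PATTERN = re.compile(
    r"\[Latency\]\s+msg=\S+\s+(?P<interface>\S+)@(?P<node>\S+)\s+"
    r"(?P<latency_ms>[\d.]+) ms, packet_loss=(?P<loss_pct>[\d.]+)%"
)


def parse_latency(line):
    """Return the latency fields from one PCS log line, or None."""
    match = LATENCY_PATTERN.search(line)
    if not match:
        return None
    return {
        "interface": match["interface"],
        "node": match["node"],
        "latency_ms": float(match["latency_ms"]),
        "packet_loss_pct": float(match["loss_pct"]),
    }


def latency_entries(log_lines):
    """Return the parsed latency entries found in an iterable of log lines."""
    parsed = (parse_latency(line) for line in log_lines)
    return [entry for entry in parsed if entry is not None]
```

If this returns an empty list for a recent stretch of pcs.log, no latency notifications were recorded in that stretch, which points the investigation back at the IAgent container logs.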

Collecting NorthStar Controller Debug Files

If you are unable to resolve a problem with the NorthStar Controller, we recommend that you forward the debug files generated by the NorthStar Controller debugging utility to JTAC for evaluation. Currently, all debug files are located in subdirectories under the /u/wandl/tmp directory.

To collect debug files, log in to the NorthStar Controller CLI, and execute the command /u/wandl/bin/system-diagnostic.sh filename.

The output is generated and is available from the /tmp directory in the filename.tbz2 debug file.

Remote Syslog

Most NorthStar processes use rsyslog, which is configured in /etc/rsyslog.conf. For detailed information about using rsyslog, refer to http://www.rsyslog.com/doc for the specific rsyslog version running on your Linux system.

Increasing the Scale of SNMP Collection

  1. To increase the scale of SNMP collection within a polling interval of 5 minutes, perform the following tasks:

    1. By using a text editing tool like vi, open the supervisord_snmp_slave.conf file for editing.

      The configuration file opens.

      vi /opt/northstar/data/supervisord/supervisord_snmp_slave.conf
    2. Add the following command to increase the number of threads from 100 to 200:

      /opt/northstar/thirdparty/python3/bin/celery -A collector.celery -Q netsnmp -n worker2@%%n worker -P threads -c 200 --loglevel=info
    3. Add more workers (for example, worker3) by duplicating the preceding worker:

      [program:worker3]
      command=/opt/northstar/thirdparty/python3/bin/celery -A collector.celery -Q netsnmp -n worker3@%%n worker -P threads -c 200 --loglevel=info
      process_name=%(program_name)s
      numprocs=1
      ;directory=/tmp
      ;umask=022
      priority=999
      autostart=true
      autorestart=true
      startsecs=10
      startretries=3
      exitcodes=0,2
      stopsignal=TERM
      stopwaitsecs=10
      user=pcs
      stopasgroup=true
      killasgroup=true
      redirect_stderr=true
      stdout_logfile=/opt/northstar/logs/celery_worker3.msg
      stdout_logfile_maxbytes=10MB
      stdout_logfile_backups=10
      stdout_capture_maxbytes=10MB
      stderr_logfile=/opt/northstar/logs/celery_worker3.err
      stderr_logfile_maxbytes=10MB
      stderr_logfile_backups=10
      stderr_capture_maxbytes=10MB
      environment=PYTHONPATH="/opt/northstar/snmp-collector",LD_LIBRARY_PATH="/opt/northstar/lib"
      ;environment=A="1",B="2"
      ;serverurl=AUTO
    4. Add the workers in the group statement:

      Best Practice:

      The number of workers that you can add should be less than or equal to the number of cores in the CPU.

      [group:collector]
      programs=worker1,worker2,worker3
    5. Restart the collector:* group in supervisord:

      supervisorctl reread
      supervisorctl update
    6. View the supervisorctl status of worker1, worker2, and worker3 to confirm that they are up and running:

      supervisorctl status
    7. Ensure that you see a few worker1 processes in the output but only one parent process each for worker2 and worker3:

      ps -ef | grep celery
  2. To increase the number of threads for higher scalability, perform the following tasks:

    1. By using a text editing tool like vi, open the data_gateway.py file for editing.

      The configuration file opens.

      vi /opt/northstar/snmp-collector/collector/data_gateway.py
    2. Increase the number of threads in the pool from 10 to 20:

      pool_size = 20
    3. Stop the collector_main:data_gateway process and restart the process:

      supervisorctl stop collector_main:data_gateway
      supervisorctl restart collector_main:data_gateway
  3. To increase the throughput for higher scalability, perform the following tasks:

    1. By using a text editing tool like vi, open the es_publisher.cfg file for editing.

      The configuration file opens.

      vi /opt/northstar/data/es_publisher/es_publisher.cfg
    2. Configure the following parameters:

      polling_interval=5
      batch_size=5000
      pool_size=20
      Note:

      The maximum number of records to be sent in a single operation to the ElasticSearch database (batch_size) is 5000, while the maximum number of threads (in a thread pool) that can be run to collect SNMP statistics (pool_size) is 20.

  4. To collect data from a larger number of router interfaces per poll, perform the following tasks:

    1. Navigate to the Device Profile (Administration > Device Profile) page in the NorthStar Controller GUI.

    2. In the Device List, select a router and click Modify.

      The Modify Device(s) page appears.

    3. In the Name column of the User-defined Properties tab, specify the name of the property as bulk_size. In the Value column, configure the bulk size as 100.

      Bulk size indicates the number of interfaces collected each time the network is polled.

    4. Click Modify.

      You are redirected to the Device Profile page, where a confirmation message appears, indicating that the changes are saved.
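If you add several SNMP workers, generating the supervisord entries can reduce copy-and-paste mistakes. The sketch below is an illustration under stated assumptions, not a NorthStar utility: the celery command line and log file paths follow the worker3 example in this section, the stanza contains only a subset of the settings shown there, and the function names and the cpu_cores parameter are hypothetical. The check against the number of CPU cores follows the best practice noted above.

```python
import os

# Command line taken from the worker examples in this section.
COMMAND_TEMPLATE = (
    "/opt/northstar/thirdparty/python3/bin/celery -A collector.celery "
    "-Q netsnmp -n {name}@%%n worker -P threads -c {threads} --loglevel=info"
)


def worker_stanza(index, threads=200):
    """Return a reduced [program:workerN] stanza for supervisord.

    Only a subset of the settings from the worker3 example is emitted;
    copy the remaining settings from an existing worker as needed.
    """
    name = f"worker{index}"
    return "\n".join([
        f"[program:{name}]",
        "command=" + COMMAND_TEMPLATE.format(name=name, threads=threads),
        "process_name=%(program_name)s",
        "numprocs=1",
        "autostart=true",
        "autorestart=true",
        "user=pcs",
        f"stdout_logfile=/opt/northstar/logs/celery_{name}.msg",
        f"stderr_logfile=/opt/northstar/logs/celery_{name}.err",
    ])


def group_stanza(worker_count, cpu_cores=None):
    """Return the [group:collector] stanza for worker1..workerN.

    Raises ValueError if the worker count exceeds the number of CPU cores.
    """
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1
    if worker_count > cpu_cores:
        raise ValueError(
            f"{worker_count} workers exceed the {cpu_cores} available CPU cores"
        )
    names = ",".join(f"worker{i}" for i in range(1, worker_count + 1))
    return f"[group:collector]\nprograms={names}"


if __name__ == "__main__":
    print(worker_stanza(3))
    print(group_stanza(3, cpu_cores=4))
```

Review the generated text before merging it into supervisord_snmp_slave.conf, and then reload supervisord with the reread and update commands shown earlier.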
