Contrail Networking Analytics
Overview: Analytics
Analytics is an optional feature set in Juniper® Cloud-Native Contrail Networking (CN2) Release 22.1. It is packaged separately from the CN2 core Container Network Interface (CNI) components and has its own installation procedure. The package combines open-source software with Juniper-developed software that integrates with CN2.
The analytics features fit into the following high-level functional areas:
- Metrics: Statistical time series data collected from the Contrail Networking components and the base Kubernetes system
- Flow and Session Records: Network traffic information collected from the CN2 vRouter
- Sandesh User Visible Entities (UVE): Records representing the system-wide state of externally visible objects that are collected from the CN2 vRouter and control node components
- Logs: Log messages collected from Kubernetes pods
- Introspect: A diagnostic utility that provides the ability to browse the internal state of the CN2 components
Metrics
Data Model
Metric information is based on a numerical time series data model. Each data point in a series is a sample of some system state, collected at a regular interval. Each sampled value is recorded with the timestamp at which it was collected. A sample record can also contain an optional set of key-value pairs called labels. Labels add dimensions to a metric: each distinct combination of label values for the same metric name identifies a separate instance of that metric. For example, a metric named api_http_requests_total can use labels to provide visibility into request counts per URL and method type. In the following example, the metric record for a sample value of 10 includes labels that indicate the type of request.
api_http_requests_total{method="POST", handler="/messages"} 10
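The way labels split one metric name into several series can be illustrated with a minimal Python sketch. The in-memory `store`, and the helper names `series_key` and `record`, are hypothetical and exist only for illustration; they are not part of CN2 or Prometheus.

```python
from collections import defaultdict


def series_key(name, labels):
    """Build a hashable identity from a metric name and its labels.

    Labels are sorted so that the order in which they are written
    does not change the identity of the series.
    """
    return (name, tuple(sorted(labels.items())))


# Hypothetical store: series identity -> list of (timestamp, value) samples.
store = defaultdict(list)


def record(name, labels, timestamp, value):
    """Append one sample to the series identified by name and labels."""
    store[series_key(name, labels)].append((timestamp, value))


# Two samples of the POST series (label order differs) and one GET sample.
record("api_http_requests_total", {"method": "POST", "handler": "/messages"}, 1000, 10)
record("api_http_requests_total", {"handler": "/messages", "method": "POST"}, 1015, 12)
record("api_http_requests_total", {"method": "GET", "handler": "/messages"}, 1000, 7)

for key, samples in store.items():
    print(key, samples)
```

Both POST samples land in the same series, while the GET sample forms a second series under the same metric name.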
Metric Data Types
Although all metric sample values are just numbers, the concept of data type exists within this numerical data model. A metric can be one of the following types:
- Counter: A cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
- Gauge: A metric that represents a single numerical value that can arbitrarily go up and down.
- Histogram: A histogram samples observations (such as request durations or response sizes) and counts them in configurable buckets. The histogram also provides a sum of all observed values.
- Summary: Similar to a histogram, a summary samples observations (such as request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, the summary calculates configurable quantiles over a sliding time window.
The metric functionality in CN2 is implemented by Prometheus. For additional details about the metric data model, see the Prometheus documentation.
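Because a counter only increases or resets to zero, a consumer can recover the true increase even across a restart. The following Python sketch shows one simplified way to do that, assuming evenly spaced samples. The function names are hypothetical, and the logic omits the extrapolation that the Prometheus rate() function performs.

```python
def counter_increase(samples):
    """Return the total increase across consecutive counter samples.

    A drop in value is treated as a restart from zero, so the new
    value itself is counted as the increase for that step.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def per_second_rate(samples, interval_seconds):
    """Average per-second increase, assuming a fixed sampling interval."""
    elapsed = interval_seconds * (len(samples) - 1)
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) / elapsed


# The counter restarts between the third and fourth samples.
samples = [100, 130, 160, 20, 50]
print(counter_increase(samples))
print(per_second_rate(samples, 15))
```

A gauge needs no such handling, because a decrease in a gauge is a legitimate value change rather than a reset.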
Supported Metrics
The analytics solution supports the following sets of metrics:
- Contrail Networking Metric List: Metrics collected from the vRouter and control node components.
- Kubernetes Metric List: Metrics collected from various Kubernetes components, such as apiserver, etcd, kubelet, and so on.
- Cluster Node Metrics: Host-level metrics collected from the Kubernetes cluster nodes.
Alerts
Alerts are generated based on an analysis of collected metric data. Every supported alert type is based on a rule definition that contains the following information:
- Alert Name: A unique string identifier for the alert type
- Condition Expression: A Prometheus query language expression that gets evaluated against collected metric values to determine if the alert condition exists
- Condition Duration: The amount of time the problematic condition has to exist for the alert to be generated
- Severity: The alert level (critical, major, warning, or info)
- Summary: A short description of the problematic condition
- Description: A detailed description of the problematic condition
The CN2 analytics solution installs a set of predefined alert rules. You can also define your own custom alert rules. To define a custom alert, create a PrometheusRule Kubernetes resource in the namespace where the analytics Helm chart is deployed. The following is an example of a custom alert rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: acme-corp-rules
spec:
  groups:
  - name: acme-corp.rules
    rules:
    - alert: HostUnusualNetworkThroughputOut
      expr: "sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100"
      labels:
        severity: warning
      annotations:
        summary: "Host unusual network throughput out (instance {{ $labels.instance }})"
        description: "Host network interfaces are sending too much data (> 100 MB/s)\n VALUE = {{ $value }}"
Prometheus stores generated alerts as records that can be viewed in the Grafana UI. The AlertManager component supports integration with external systems, such as PagerDuty, OpsGenie, or email for alert notification.
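The Condition Duration in a rule definition corresponds to the `for` field of a Prometheus alerting rule. The following Python sketch models how that duration delays an alert. The function name and the fixed evaluation interval are assumptions for illustration; the real evaluation is done by Prometheus.

```python
def alert_states(condition_results, eval_interval, duration):
    """Return the alert state after each rule evaluation.

    condition_results: booleans in time order, one per evaluation,
        True when the condition expression matched.
    eval_interval: seconds between evaluations (assumed constant).
    duration: seconds the condition must hold before the alert fires.
    """
    states = []
    active_for = None  # seconds the condition has held without interruption
    for holds in condition_results:
        if not holds:
            # Any evaluation that fails the condition resets the timer.
            active_for = None
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + eval_interval
        states.append("firing" if active_for >= duration else "pending")
    return states


# Evaluate every 60 seconds with a 120-second condition duration.
print(alert_states([True, True, False, True, True, True], 60, 120))
```

In this model the alert stays pending while the condition has held for less than the duration, and a single evaluation where the condition is false starts the wait over.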
Architecture
As shown in Figure 1, Prometheus is the core component of the metrics architecture. Prometheus implements the following functionality:
- Collection: A periodic polling mechanism that invokes API calls against other components (exporters) to pull values for a set of metrics
- Storage: A time series database that provides persistence for the metrics collected from the exporters
- Query: An API supporting an expression language called PromQL (Prometheus query language) that allows the historical metric information to be retrieved from the database
- Alerting: A framework providing an ability to define rules that produce alerts when certain conditions are observed in the collected metric data
The other components of the metrics architecture are:
- Grafana: A service that provides a web UI allowing the user to visualize the metric data in graphs.
- AlertManager: An integration service that notifies external systems of alerts generated by Prometheus.
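The query function of Prometheus is reachable over its HTTP API, where an instant query is a request to the `/api/v1/query` endpoint with the PromQL expression in the `query` parameter. The following Python sketch only builds such a URL. The base address `http://prometheus.example:9090` is a placeholder, because how the Prometheus service is exposed in a CN2 analytics deployment is not described here, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, time=None):
    """Build a URL for a Prometheus instant query.

    base_url: address of the Prometheus server (placeholder in this sketch).
    promql: the PromQL expression to evaluate.
    time: optional evaluation timestamp in Unix seconds.
    """
    params = {"query": promql}
    if time is not None:
        params["time"] = str(time)
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode(params)


# Assumed address; substitute the one used in your environment.
url = instant_query_url(
    "http://prometheus.example:9090",
    "sum by (instance) (rate(node_network_transmit_bytes_total[2m]))",
)
print(url)
```

The resulting URL can be fetched with any HTTP client that is able to reach the Prometheus service; the response is a JSON document containing the query result.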
Configuration
The metrics functionality does not require any configuration by the end user. The installation of analytics takes care of configuring Prometheus to collect metrics from the exporters that provide all of the metrics described in Supported Metrics. A group of default alerting rules is also automatically set up as part of the installation. You can extend functionality through additional configuration after the installation. For example, you can define customer-specific alerting rules. You can also configure the AlertManager to integrate with any of the supported external systems in your environment.
The configuration of Prometheus and AlertManager involves an additional architectural component called the Prometheus Operator. As shown in Figure 2, configuration is specified as Kubernetes custom resources. The Prometheus Operator translates the contents of these resources into the native configuration that the Prometheus components recognize. The Operator also updates the components accordingly and restarts them whenever a configuration change requires a restart.
Documentation for the full set of resources that the Prometheus Operator supports is available at Prometheus Operator API. Juniper Networks recommends that you limit your configurations to the subset of resource types related to alert rule definition and external system integration.
Grafana
The main UI for viewing metric data and alerts is Grafana. The analytics installation sets up the Grafana service and configures it with Prometheus as a data source. A set of default dashboards is also created.
Access the Grafana Web UI at https://<k8sClusterIP>/grafana/login.
The default login credentials are user admin and password prom-operator.