Configuring Contrail Insights Alarms using Contrail Command
With Contrail Insights Alarms, you can configure an alarm to be generated when a condition is met in the infrastructure. Contrail Insights performs distributed analysis of metrics at the point of collection for efficient and responsive detection of events that match an alarm. Contrail Insights has two types of alarms:
Static | User-provided static threshold is used for comparison. |
Dynamic | Dynamically-learned adaptive threshold is used for comparison. |
For Contrail Insights releases prior to 3.2.6: In order to configure alarms, your Contrail Insights license subscription must be active.
Contrail Insights Alarms Overview
For both static and dynamic alarms, Contrail Insights Agent continuously collects measurements of metrics (see Metrics Collected by Contrail Insights) for different entities, such as hosts, instances, and network devices. Beyond simple collection, the agent also analyzes the stream of metrics at the time of collection to identify alarm rules that match. For a particular alarm, the agent aggregates the samples according to a user-specified function (average, standard deviation, min, max, sum) and produces a single measurement for each user-specified measurement interval. For a given measurement interval, the agent compares each measurement to a threshold. For an alarm with a static threshold, a measurement is compared to a fixed value using a user-specified comparison function (above, below, equal). For dynamic thresholds, a measurement is compared with a value learned by Contrail Insights over time.
You can further configure alarm parameters that require multiple intervals to match. This allows you to configure alarms to match sustained conditions, while also detecting performance over small time periods. Maximum values over a wide time range can exaggerate conditions, whereas averages can dilute the information. A better balance is achieved by measuring over small intervals and watching for repeated matches in multiple intervals. For example, to monitor CPU usage over a three-minute period, an alarm may be configured to compare average CPU utilization over five-second intervals, yet only raise an alarm when 36 (or some subset of 36) intervals match the alarm condition. This provides better visibility into sustained performance conditions than a simple average or maximum over three minutes.
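The trade-off between a wide-range maximum, a wide-range average, and a count of matching short intervals can be illustrated with a small sketch. The CPU values and the 80% threshold below are hypothetical, chosen only to show how one brief spike affects each approach over a three-minute period split into five-second intervals.

```python
from statistics import mean

WINDOW_SECONDS = 180    # three-minute monitoring period
INTERVAL_SECONDS = 5    # length of one measurement interval
interval_count = WINDOW_SECONDS // INTERVAL_SECONDS  # 36 intervals

# Hypothetical per-interval average CPU usage: steady 40% with one short spike.
interval_averages = [40.0] * interval_count
interval_averages[10] = 100.0

window_max = max(interval_averages)    # dominated by the single spike
window_avg = mean(interval_averages)   # spike is diluted by the other intervals
matching = sum(avg > 80.0 for avg in interval_averages)  # intervals above 80%

print(interval_count, window_max, round(window_avg, 1), matching)
```

In this sketch the maximum suggests saturation and the average hides the spike entirely, while the count of matching intervals shows that only one of the 36 intervals exceeded the threshold.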
Dynamic thresholds enable outlier detection in resource consumption based on historical trends. Resource consumption may vary significantly at various hours of the day and days of the week. This makes it difficult to set a static threshold for a metric. For example, 70% CPU usage may be considered normal for Monday mornings between 10:00 AM and 12:00 PM, but the same amount of CPU usage may be considered abnormally high for Saturday nights between 9:00 PM and 10:00 PM.
With dynamic thresholds, Contrail Insights learns trends in metrics across all resources in scope to which an alarm applies. For example, if an alarm is configured for a host aggregate, Contrail Insights learns a baseline from metric values collected for hosts in that aggregate. Similarly, an alarm with a dynamic threshold configured for a project learns a baseline from metric values collected for instances in that project. Then, the agent generates an alarm when a measurement deviates from the baseline value learned for a particular time period.
When creating an alarm with a dynamic threshold, you select a metric, a period of time over which to establish a baseline, and the sensitivity to measurements that deviate from the baseline. The sensitivity can be configured as high, medium, or low. Higher sensitivity will report smaller deviations from the baseline and vice versa.
Contrail Insights Alarms Operation
Contrail Insights Agent performs distributed, real-time statistical analysis on a time-series data stream. The agent analyzes metrics over multiple measurement intervals using a configurable sliding window mechanism. An alarm is generated when the agent determines that metric data matches the alarm criteria over a configurable number of measurement intervals. The type of sample aggregation and the threshold for an alarm are configurable. Two types of alarms are supported: static and dynamic. The difference is how the threshold is determined and used to compare measured metric data. The following sections describe the overall sliding window analysis and explain the details of the static thresholds and dynamic baselines used by the analysis.
Sliding Window Analysis
Contrail Insights Agent evaluates alarms using sliding window analysis. The sliding window analysis compares a stream of metrics within a configurable measurement interval to a static threshold or dynamic baseline. The length of each measurement interval is configurable to one-second granularity. In each measurement interval, raw time-series data samples are combined using an aggregation function, such as average, max, and min. The aggregated value is compared against the static threshold or dynamic baseline using a configurable comparison function, such as above or below. Multiple measurement intervals comprise a sliding window. A configurable number of intervals in the sliding window must match the rule criteria for the agent to generate a notification for the alarm.
Figure 1 shows an example in which the sliding window consists of six adjacent measurement intervals (i1 to i6), as specified by the Interval Count parameter. In measurement interval i1, the average of samples S1, S2, S3 is computed as Savg. Depending on the alarm type (static or dynamic), Savg is then compared with the configured static threshold or dynamically learned baseline using a user-specified comparison function such as above or below. The output of the comparison determines whether a specific measurement interval is marked as an interval with exception. This evaluation is repeated for each measurement interval within the sliding window (for example, i1 to i6).
In the example in Figure 1, the agent determines that two intervals, i2 and i5, are intervals with exception by comparing the aggregate value for the measurement interval with a static threshold or dynamic baseline, depending on alarm type. Assuming interval i1 is the first interval for which the alarm is configured, the alarm becomes active at the end of interval i6, when Contrail Insights Agent determines that at least two out of the most recent six measurement intervals are marked as exceptions. When an alarm is configured using the Dashboard, Interval Count and Intervals with Exception are both set to 1 by default. As a result, the agent can generate an alarm after processing data for one measurement interval.
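The sliding window evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the agent's implementation: the function name, parameter names, and sample values are hypothetical, and it assumes an average aggregation with an "above" comparison against a static threshold of 80.

```python
from collections import deque
from statistics import mean


def sliding_window_alarm(interval_samples, threshold,
                         interval_count=6, intervals_with_exception=2):
    """Yield None until the window is full, then True/False per interval."""
    window = deque(maxlen=interval_count)  # exception flags of recent intervals
    for samples in interval_samples:
        s_avg = mean(samples)               # aggregate one measurement interval
        window.append(s_avg > threshold)    # mark interval with exception
        if len(window) < interval_count:
            yield None                      # not enough intervals seen yet
        else:
            yield sum(window) >= intervals_with_exception


# Six intervals (i1 to i6); only i2 and i5 average above the threshold.
intervals = [
    [70, 72, 71],  # i1
    [85, 90, 88],  # i2, exception
    [60, 65, 62],  # i3
    [75, 70, 78],  # i4
    [91, 89, 93],  # i5, exception
    [66, 64, 68],  # i6
]
states = list(sliding_window_alarm(intervals, threshold=80))
print(states)  # [None, None, None, None, None, True]
```

With both parameters set to 1, as in the Dashboard defaults, the same function would report a result after the first interval.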
Static Alarm
A static alarm threshold is provided at the time of alarm definition. Figure 2 depicts an example of a static alarm definition, followed by the equivalent JSON used for API configuration of an alarm. The condition defined in the example is to evaluate an average of host.cpu.usage samples over a 60 second measurement interval. The measured value is compared against a static threshold of 80% to determine if a given measurement interval matches the alarm rule. Figure 2 identifies the components in a static alarm definition.
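As a rough illustration of the condition in this example, the sketch below models a static rule as a small Python data class. The class and field names are hypothetical and are not the Contrail Insights API schema; only the metric name, the 60-second interval, and the 80% threshold come from the example above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class StaticAlarmRule:
    # Field names are illustrative only, not the product's API schema.
    metric: str
    interval_seconds: int
    threshold: float

    def interval_matches(self, samples):
        """Return True if the interval average is above the static threshold."""
        return mean(samples) > self.threshold


rule = StaticAlarmRule(metric="host.cpu.usage", interval_seconds=60, threshold=80.0)
print(rule.interval_matches([85, 90, 75]))  # average is about 83.3
print(rule.interval_matches([70, 75, 80]))  # average is 75
```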
Dynamic Alarm
A dynamic alarm threshold is learned by Contrail Insights using historical data for the set of entities for which an alarm is configured. Figure 3 shows an example of a dynamic alarm definition and identifies the components in a dynamic alarm definition.
When using a dynamic threshold, you do not configure a static threshold value. Instead, you specify three parameters that control how the learning is performed. The learning algorithm produces a baseline across the entities. The baseline is comprised of a mean value and a standard deviation. The baseline is updated continuously as additional metric data is collected.
Following is a list of the three learning parameters and information about how they work:
BaselineAnalysisAlgorithm | Selects the machine learning algorithm used for determining the dynamic threshold. The following algorithms are available:
|
LearningPeriodDuration | A dynamic baseline is determined using the historical data. This parameter determines the length of time period from which most recent historical data is used to compute a dynamic baseline. For example, 1 hour, 1 day, or 1 week. At the time of rule configuration, Contrail Insights might not yet have enough historical data for a given entity. In this case, learning is performed as data becomes available. Alarm evaluation begins after one Learning Period of data is available and baselines are generated. |
Sensitivity | The sensitivity of a dynamic alarm controls the allowable magnitude of deviation from the learned mean. The sensitivity parameter controls a multiplier of the learned standard deviation. You can select low, medium, or high as sensitivity. Contrail Insights Agent compares real-time measurements to the range defined by:
|
Alarm Definition
Figure 2 shows an example of a static alarm definition. Every alarm definition has the following components shown in Table 1.
Item | Options | Description |
---|---|---|
Module | Alarms, Service Alarms | When Alarms is selected, you can configure alarms for entities such as hosts, instances, and network devices. When Service Alarms is selected, you can configure alarms for services such as RabbitMQ, MySQL, ScaleIO, and OpenStack services. |
Alarm Rule Type | Static, Dynamic | Determines the type of threshold that the alarm uses to decide whether an alarm is generated. Two types are supported: static and dynamic. |
Name | Alarm name | A name that identifies the alarm. The name is displayed in the Dashboard and is the user-facing identifier for external notification systems. |
Scope | Host, Instance, Network Device, Virtual Network | Type of entity, such as host, instance, or network device, to which the alarm applies. For example, if Scope is Instance, you can apply the rule to all instances in the infrastructure, or to instances in a specific project or aggregate. |
Service | RabbitMQ, MySQL, Ceph, OpenStack, Cassandra, Contrail, ScaleIO | When selected, you can configure alarms for RabbitMQ, MySQL, Ceph, OpenStack, Cassandra, Contrail, and ScaleIO services. |
Metric Scope | Cluster, Node, Queue | Select the metric scope that you want to monitor, such as cluster, node, or queue, and then the metric to monitor. |
Object | Options dependent on Metric Scope selection. | Object that will be monitored. |
Generate | Event, Alarm | When conditions for the alarm are met, generate an event or alarm. |
For Metric | cpu.usage, memory.usage | Metrics that will be monitored. For example, host.cpu.usage or instance.cpu.usage. |
When | Value | — |
Interval (seconds) | Value in seconds | The duration of one measurement interval in seconds. Depending on the sampling frequency of the metric under observation, one or more raw samples might be received within an interval. All raw samples received within the interval are processed using aggregation functions such as average, sum, max, min, and std-dev. |
Is | Value | Example: When Value Is Above Threshold -8. Italics in example represent variables. |
Threshold | Threshold value | A numeric value to which measurements are compared. Contrail Insights supports two types of thresholds: static and dynamic. Table 2 describes the required parameters for a dynamic alarm and the supported options. |
Baseline Analysis Algorithm | k-means, ewma | Table 2 describes these options. See Figure 6 and Figure 7 for baseline analysis examples. |
Learning Period Duration | 1 week, 1 month | Table 2 describes these options. |
Sensitivity | Low, medium, high | Table 2 describes these options. |
Severity | None, information, warning, error, critical | Indicates seriousness of the alarm. Critical indicates a major alarm. Information indicates a minor alarm. |
Advanced | When selected, includes Intervals with Exception, Interval Count, and Status. | — |
Aggregate/Project | All hosts, all instances. AggregateId, ProjectId | Select the set of entities an alarm will monitor. If Scope is Instance, you can configure an alarm for the set of instances present in a specific project, aggregate, or all instances in the infrastructure. If Scope is Host, you can configure an alarm for a set of hosts present in a specific aggregate or all hosts in the infrastructure. |
Alarm Mode | Alert, Event | Mode can be configured as an alert or event. |
Aggregation Function | Average, Max, Min, Sum, Std-dev | Determines how data samples received in one measurement interval are processed to generate an aggregated value for comparison. The agent collects multiple samples of a metric during a measurement interval and combines them according to the aggregation function to determine a single value for comparison with the threshold (static or dynamic) in that interval. Table 5 lists and describes the aggregation functions for alarm processing. |
Comparison Function | Above, Below, Equal, Increasing-at-a-minimum-rate-of, Decreasing-at-a-minimum-rate-of | Determines how to compare the output of the Aggregation Function with the static or dynamic threshold. Table 6 shows the comparison functions supported for Contrail Insights alarms. Figure 4 and Figure 5 show examples of the Comparison Function, showing both increases and decreases at a minimum rate. |
Static Threshold | When alarm rule type is “static” | — |
Alarm Severity | None, information, warning, error, critical | Indicates seriousness of the alarm. Critical indicates a major alarm. Information indicates a minor alarm. |
Notification | None, PagerDuty, Custom Service, Service Now, Slack | Methods of notification alerting you to conditions of operation. |
Intervals with Exception | For example, “2” | The minimum number of measurement intervals within the sliding window for which the alarm condition must be met to raise the alarm. In Figure 1, there are two intervals with exception: i2 and i5. When configuring an alarm in the Dashboard, Intervals with Exception is set to 1 by default. Intervals with Exception can be specified in the Dashboard by selecting Monitoring > Alarms > Add Rule. Intervals with Exception cannot be greater than the Interval Count. |
Interval Count | For example, “3” | Maximum number of adjacent measurement intervals over which statistical analysis is performed before deciding whether an alarm is generated. In Figure 1, there are six measurement intervals (i1 to i6) in the sliding window. Each measurement interval has the duration specified by the Interval (seconds) parameter. When configuring an alarm in the Dashboard, Interval Count is set to 1 by default. The Interval Count can be specified in the Dashboard by selecting Monitoring > Alarms > Add Rule. |
Status | Enable, Disable | Sets and shows the status of the alarm rule: enabled or disabled. |
- Required Parameters for Dynamic Alarms
- States for Alarm Mode
- Aggregation Functions for Alarm Processing
- Comparison Functions for Alarm Processing
- Dynamic Baseline Examples
Required Parameters for Dynamic Alarms
Table 2 describes the required parameters for a dynamic alarm and the supported options.
Required Parameters for Dynamic Threshold | Description | Supported Options |
---|---|---|
Baseline Analysis Algorithm | The Baseline Analysis Algorithm is used to perform unsupervised learning on historical data. The baseline analysis is performed continuously as new data is received. | |
Learning Period Duration | The Learning Period Duration specifies the amount of historical data used by the Baseline Analysis Algorithm to determine a baseline. The dynamic baseline is continuously updated using data from the most recent Learning Period Duration. When a dynamic alarm is configured, baseline analysis is performed using data from the most recent Learning Period Duration, if available. If sufficient data is not available, Contrail Insights Agent evaluates metrics as soon as enough data is present to learn the first set of baselines. Example: When the Learning Period Duration is 1 day, the agent compares metrics to per-hour baselines for the last 24 hours. Example: When the Learning Period Duration is 1 week, the agent compares metrics to per-hour baselines for the last 7 x 24 hours. | |
Sensitivity | The dynamic baseline provides a normal operating region of a given metric for a given scope. As seen in Figure 6, the dynamic baseline is a tuple of the mean and std-dev applicable to a specific hour of the day. The sensitivity factor determines the allowable band of operation. Measurements outside the band of operation cause an interval with exception. For example, if the baseline mean is 20 and std-dev is 2, then the normal operating region is between 18 and 22. When sensitivity is low, the normal operating region is widened to between 10 (mean - 5*std-dev) and 30 (mean + 5*std-dev). In this case, if the measured average of a metric is between 10 and 30, then no alarm is raised. In contrast, if the average is 5 or 35, then an alarm is raised. | |
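The sensitivity example in Table 2 can be reproduced with a short calculation. This sketch uses hypothetical function names; the multiplier of 5 for low sensitivity is taken from the example above, and the multipliers for medium and high sensitivity are not stated here, so they are not modeled.

```python
def operating_band(mean, std_dev, multiplier):
    """Return the (low, high) bounds of the allowable band of operation."""
    return mean - multiplier * std_dev, mean + multiplier * std_dev


def is_exception(measurement, mean, std_dev, multiplier):
    """Return True if the measurement falls outside the allowable band."""
    low, high = operating_band(mean, std_dev, multiplier)
    return not (low <= measurement <= high)


LOW_SENSITIVITY_MULTIPLIER = 5  # from the example: mean +/- 5*std-dev

print(operating_band(20, 2, LOW_SENSITIVITY_MULTIPLIER))    # (10, 30)
print(is_exception(25, 20, 2, LOW_SENSITIVITY_MULTIPLIER))  # False
print(is_exception(35, 20, 2, LOW_SENSITIVITY_MULTIPLIER))  # True
```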
States for Alarm Mode
Table 3 shows all possible states for an alarm with the mode configured as alert.
State | Description |
---|---|
Learning | This is the initial state of each alarm. In this state, the alarm processes real-time data, and it remains in this state until sufficient data has been processed to decide whether an alarm should be generated. The duration of the learning period depends on the sliding window parameters. |
Active | The condition specified by an alarm is met. The alarm stays in this state as long as the alarm conditions are satisfied. |
Inactive | The condition specified by an alarm is not met. For example, after the learning state, an alarm transitions from the active to the inactive state because CPU usage dropped below the set threshold. |
Disabled | The agent is not actively analyzing data for this alarm. The alarm is either deleted or temporarily disabled by the user. |
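One way to read the alert-mode states is as a function of the sliding window parameters. The sketch below is an interpretation, not the agent's logic: the function and argument names are hypothetical, and it assumes that the learning state lasts until a full window of Interval Count intervals has been processed.

```python
def alert_state(enabled, intervals_seen, interval_count,
                exceptions_in_window, intervals_with_exception):
    """Return the alert-mode state name for the current window contents."""
    if not enabled:
        return "disabled"   # rule deleted or temporarily disabled
    if intervals_seen < interval_count:
        return "learning"   # assumed: window not yet full
    if exceptions_in_window >= intervals_with_exception:
        return "active"     # alarm condition met
    return "inactive"       # alarm condition not met


print(alert_state(True, 3, 6, 1, 2))   # learning
print(alert_state(True, 6, 6, 2, 2))   # active
print(alert_state(True, 9, 6, 1, 2))   # inactive
print(alert_state(False, 9, 6, 2, 2))  # disabled
```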
Table 4 shows all possible states for an alarm with the mode configured as event.
State | Description |
---|---|
Enabled | This is the initial state of an alarm with the mode set to Event when a rule is configured. The alarm stays in this state until conditions are met to generate an alarm. |
Triggered | When conditions for alarm generation are satisfied, an alarm is generated with a state of triggered. Alarm generation is logged at the end of each measurement interval as long as the conditions for the alarm continue to be met. |
Disabled | The agent is not actively analyzing data for this alarm. The alarm is either deleted or has been temporarily disabled by the user. |
Aggregation Functions for Alarm Processing
Table 5 lists and describes the aggregation functions for alarm processing.
Aggregation Function | Description |
---|---|
Average | Statistical average of all data samples received within one measurement interval. Example: Generate Host Alert when Cpu-Usage Average during a 60 seconds interval is Above 80% in 2 of the last 3 measurement intervals. In this example, the measurement interval is 60 seconds. An alarm is generated if the average of the CPU usage samples exceeds 80% in any 2 measurement intervals out of 3 adjacent measurement intervals. |
Sum | Sum of all data samples received within one measurement interval. Example: Generate Host Alert when Cpu-Usage Sum during a 60 seconds interval is Above 250% in 2 of the last 3 measurement intervals. In this example, an alarm is generated if the CPU usage sum is above 250% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration. |
Max | Maximum sample value observed within one measurement interval. Example: Generate Host Alert when Cpu-Usage Max during a 60 seconds interval is Above 95% in 2 of the last 3 measurement intervals. In this example, the alarm is generated if the maximum CPU usage is above 95% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration. |
Min | Minimum sample value observed within one measurement interval. Example: Generate Host Alert when Cpu-Usage Min during a 60 seconds interval is Below 5% in 2 of the last 3 measurement intervals. In this example, the alarm is generated if the minimum CPU usage is below 5% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration. |
Std-Dev | Standard deviation of the time-series data, determined from the samples received up to the current measurement interval. Example: Generate Host Alert when Cpu-Usage std-dev during a 60 seconds interval is Above 2 sigma in 2 of the last 3 measurement intervals. In this example, the alarm is generated when the raw time series samples are above |
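The aggregation functions in Table 5 map directly onto standard library functions. The following is a minimal sketch with a hypothetical `aggregate` helper and made-up samples; it assumes a population standard deviation, which is a choice made here and not something stated in this document. Because Std-Dev is described as using the samples received up to the current interval, the caller would pass the accumulated samples in that case.

```python
from statistics import mean, pstdev

# Map each aggregation function name to a standard library implementation.
AGGREGATIONS = {
    "average": mean,
    "sum": sum,
    "max": max,
    "min": min,
    "std-dev": pstdev,  # assumption: population standard deviation
}


def aggregate(function, samples):
    """Reduce the samples of one measurement interval to a single value."""
    return AGGREGATIONS[function](samples)


cpu_samples = [70, 80, 90]  # hypothetical CPU usage samples in one interval
for name in AGGREGATIONS:
    print(name, aggregate(name, cpu_samples))
```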
Comparison Functions for Alarm Processing
Figure 4 and Figure 5 show examples of the Comparison Function, showing both increases and decreases at a minimum rate.
Table 6 shows different comparison functions supported for Contrail Insights alarms.
Comparison Operator | Description |
---|---|
Above | Determines whether the result of the aggregation function within a given measurement interval is above the threshold. Note: For a dynamic threshold, Above checks whether the result of the aggregation function is outside of the normal operating region (mean +/- sigma*sensitivity). |
Below | Determines whether the result of the aggregation function for a given measurement interval is below the threshold. Note: For a dynamic threshold, Below checks whether the result of the aggregation function is within the normal operating region (mean +/- sigma*sensitivity). |
Equal | Determines whether the result of the aggregation function is equal to the threshold. |
Increasing-at-a-minimum-rate-of | This comparison function is useful when you are interested in tracking a sudden increase in the value of a given metric instead of its absolute value. For example, if ingress or egress network bandwidth starts increasing within short intervals, you might want to raise an alarm. Figure 4 shows a sudden increase in the metric average between measurement intervals i1 and i2. Similarly, a sudden increase is observed in the metric average between measurement intervals i4 and i5. Example: Generate Host Alert when the host.network.ingress.bit_rate average during a 60 seconds interval is increasing-at-a-minimum-rate-of 25% in 2 of the last 3 measurement intervals. In the example, if the mean ingress bit rate increases by at least 25% in 2 measurement intervals out of 3, then an alarm is raised. |
Decreasing-at-a-minimum-rate-of | This comparison function is useful when you are interested in tracking a sudden decrease in the value of a given metric instead of its absolute value. For example, if egress network bandwidth starts decreasing within short intervals, you might want to raise an alarm to investigate the root cause. Figure 5 shows a sudden decrease in the metric average between measurement intervals i1 and i2. Similarly, a sudden decrease is observed in the metric average between measurement intervals i3 and i4. Example: Generate Host Alert when the host.network.egress.bit_rate average during a 60 seconds interval is decreasing-at-a-minimum-rate-of 25% in 2 of the last 3 measurement intervals. In the example, if the mean egress bit rate decreases by at least 25% in 2 measurement intervals out of 3, then an alarm is raised. |
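The comparison functions in Table 6 can be sketched as a single Python helper. This is an illustration under stated assumptions: the function names are hypothetical, the rate-based comparisons are assumed to measure the percentage change relative to the previous interval's aggregated value, and the bit-rate averages are made up.

```python
def percent_change(previous, current):
    """Percentage change from the previous aggregated value to the current one."""
    return (current - previous) / previous * 100.0


def compare(function, value, threshold, previous=None):
    """Return True if the aggregated value matches the comparison function."""
    if function == "above":
        return value > threshold
    if function == "below":
        return value < threshold
    if function == "equal":
        return value == threshold
    if function not in ("increasing-at-a-minimum-rate-of",
                        "decreasing-at-a-minimum-rate-of"):
        raise ValueError(f"unknown comparison function: {function}")
    if not previous:
        return False  # no usable previous interval to compare against
    change = percent_change(previous, value)
    if function == "increasing-at-a-minimum-rate-of":
        return change >= threshold
    return -change >= threshold


# Hypothetical per-interval averages of an ingress bit rate.
averages = [100.0, 130.0, 135.0, 180.0]
flags = [
    compare("increasing-at-a-minimum-rate-of", cur, 25, previous=prev)
    for prev, cur in zip(averages, averages[1:])
]
print(flags, sum(flags) >= 2)  # two of the last three intervals rose by >= 25%
```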
Dynamic Baseline Examples
Figure 6 shows the dynamic baseline computed from 24 hours of data using the k-means clustering algorithm. For a given hour of the day, the blue dot is the mean; the green bar is the mean + std-dev; the purple bar is the mean - std-dev.
Figure 7 shows the dynamic baseline computed from 24 hours of historical data using the EWMA algorithm. This baseline is used for alarm generation for the next hour, until it is updated again using the most recent 24 hours of data.
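For readers unfamiliar with EWMA (exponentially weighted moving average), the sketch below shows a common textbook formulation that yields a mean and a standard deviation from a series of values. It is not the algorithm used by Contrail Insights, whose details are not described here; the function name, the smoothing factor `alpha`, and the sample data are all assumptions.

```python
from math import sqrt


def ewma_baseline(values, alpha=0.3):
    """Return (mean, std_dev) using exponentially weighted moving estimates."""
    if not values:
        raise ValueError("at least one value is required")
    mean = float(values[0])
    variance = 0.0
    for x in values[1:]:
        diff = x - mean
        mean += alpha * diff  # recent values weigh more than older ones
        variance = (1 - alpha) * (variance + alpha * diff * diff)
    return mean, sqrt(variance)


# Hypothetical hourly CPU usage averages for the most recent 24 hours.
hourly_cpu = [30, 32, 31, 29, 35, 40, 55, 70, 72, 68, 65, 60,
              62, 66, 71, 69, 58, 50, 45, 40, 36, 33, 31, 30]
baseline_mean, baseline_std = ewma_baseline(hourly_cpu)
print(round(baseline_mean, 1), round(baseline_std, 1))
```

A larger `alpha` makes the baseline follow recent data more closely, while a smaller value smooths over short-lived changes.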
Configuring an Alarm Rule
To configure an alarm:
Select Monitoring > Alarms.
In the Alarm Rules panel, click Add Rule to create a new rule to trigger an alarm when a user-defined condition is met on one of the selected entities in the network.
Figure 8: Alarm Active Alerts and Alarm Rules Panel in Contrail Command
For Module, select one of the following options. The available fields differ based on your selection.
Alarms: When Alarms is selected, you can configure alarms for entities such as hosts, instances, and network devices.
Service Alarms: When Service Alarms is selected, you can configure alarms for services in your environment, such as RabbitMQ, MySQL, ScaleIO, and OpenStack services.
Figure 9: Create and Configure an Alarm in Contrail Command
Select Alarm Rule Type.
Static: When an alarm is defined as static, the rule definition should include a predefined static threshold determined by the user.
Dynamic: When an alarm is defined as dynamic, the threshold is dynamically determined by the baseline algorithm, which can be either k-means or ewma.
Select the metric for the rule and specify interval when the rule should trigger an alarm. For other parameters, see Table 1 and descriptions in section "Alarm Definition."
Click Create to save the alarm.