Alarms
With Contrail Insights Alarms, you can configure an alarm to be generated when a condition is met in the infrastructure. Contrail Insights performs distributed analysis of metrics at the point of collection for efficient and responsive detection of events that match an alarm. Contrail Insights has two types of alarms:
Static—User-provided static threshold is used for comparison.
Dynamic—Dynamically-learned adaptive threshold is used for comparison.
Contrail Insights Alarms Overview
For both static and dynamic alarms, Contrail Insights Agent continuously collects measurements of metrics for different entities, such as hosts, instances, and network devices. Beyond simple collection, the agent also analyzes the stream of metrics at the time of collection to identify alarm rules that match. For a particular alarm, the agent aggregates the samples according to a user-specified function (average, standard deviation, min, max, sum) and produces a single measurement for each user-specified measurement interval. For a given measurement interval, the agent compares each measurement to a threshold. For an alarm with a static threshold, a measurement is compared to a fixed value using a user-specified comparison function (above, below, equal). For dynamic thresholds, a measurement is compared with a value learned by Contrail Insights over time.
You can further configure alarm parameters that require multiple intervals to match. This allows you to configure alarms to match sustained conditions, while also detecting performance over small time periods. Maximum values over a wide time range can exaggerate conditions, while averages can dilute the information. A better balance is achieved by measuring over small intervals and watching for repeated matches in multiple intervals. For example, to monitor CPU usage over a three-minute period, an alarm may be configured to compare average CPU utilization over five-second intervals, yet only raise an alarm when 36 (or some subset of 36) intervals match the alarm condition. This provides better visibility into sustained performance conditions than a simple average or maximum over three minutes.
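The trade-off in the three-minute example can be illustrated with a short Python sketch. The sample values and the 80% threshold are hypothetical and chosen only to show how a maximum, an average, and a per-interval match count describe the same data differently.

```python
from statistics import mean

interval_seconds = 5
window_seconds = 3 * 60
interval_count = window_seconds // interval_seconds  # 36 intervals in three minutes

# Hypothetical per-interval CPU averages (percent): one short spike, otherwise idle.
interval_averages = [10.0] * interval_count
interval_averages[7] = 100.0

threshold = 80.0
matches = sum(1 for value in interval_averages if value > threshold)

print(interval_count)           # 36
print(max(interval_averages))   # 100.0, a single spike dominates the maximum
print(mean(interval_averages))  # 12.5, the average hides the spike
print(matches)                  # 1, only one of the 36 intervals matched
```

An alarm that requires, say, 30 of the 36 intervals to match would stay quiet here, while an alarm on the three-minute maximum would fire.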
Dynamic thresholds enable outlier detection in resource consumption based on historical trends. Resource consumption may vary significantly at various hours of the day and days of the week. This makes it difficult to set a static threshold for a metric. For example, 70% CPU usage may be considered normal for Monday mornings between 10:00 AM and 12:00 PM, but the same amount of CPU usage may be considered abnormally high for Saturday nights between 9:00 PM and 10:00 PM.
With dynamic thresholds, Contrail Insights learns trends in metrics across all resources in scope to which an alarm applies. For example, if an alarm is configured for a host aggregate, Contrail Insights learns a baseline from metric values collected for hosts in that aggregate. Similarly, an alarm with a dynamic threshold configured for a project learns a baseline from metric values collected for instances in that project. Then, the agent generates an alarm when a measurement deviates from the baseline value learned for a particular time period.
When creating an alarm with a dynamic threshold, you select a metric, a period of time over which to establish a baseline, and the sensitivity to measurements that deviate from the baseline. The sensitivity can be configured as high, medium, or low. Higher sensitivity will report smaller deviations from the baseline and vice versa.
Contrail Insights Alarms Operation
Contrail Insights Agent performs distributed, real-time statistical analysis on a time-series data stream. The agent analyzes metrics over multiple measurement intervals using a configurable sliding window mechanism. An alarm is generated when the agent determines that metric data matches the alarm criteria over a configurable number of measurement intervals. The type of sample aggregation and the threshold for an alarm are configurable. Two types of alarms are supported: static and dynamic. The difference is how the threshold is determined and used to compare measured metric data. The following sections describe the overall sliding window analysis and explain the details of static thresholds and dynamic baselines used by the analysis.
Sliding Window Analysis
Contrail Insights Agent evaluates alarms using sliding window analysis. The sliding window analysis compares a stream of metrics within a configurable measurement interval to a static threshold or dynamic baseline. The length of each measurement interval is configurable to one-second granularity. In each measurement interval, raw time-series data samples are combined using an aggregation function, such as average, max, and min. The aggregated value is compared against the static threshold or dynamic baseline using a configurable comparison function, such as above or below. Multiple measurement intervals comprise a sliding window. A configurable number of intervals in the sliding window must match the rule criteria for the agent to generate a notification for the alarm.
Figure 1 shows an example in which the sliding window consists of six adjacent measurement intervals (i1 to i6), as specified by the Interval Count parameter. In measurement interval i1, the average of samples S1, S2, and S3 is computed as Savg. Depending on the alarm type (static or dynamic), Savg is then compared with the configured static threshold or the dynamically learned baseline using a user-specified comparison function such as above or below. The output of the comparison determines whether a specific measurement interval is marked as an interval with exception. This evaluation is repeated for each measurement interval within the sliding window (for example, i1 to i6).
In the example in Figure 1, the agent determines that two intervals, i2 and i5, are intervals with exception by comparing the aggregate value for the measurement interval with a static threshold or dynamic baseline, depending on alarm type. Assuming interval i1 is the first interval for which the alarm is configured, the alarm becomes active at the end of interval i6, when Contrail Insights Agent determines that at least two out of the most recent six measurement intervals are marked as exceptions. When an alarm is configured using the Dashboard, Interval Count and Intervals with Exception are both set to 1 by default. As a result, the agent can generate an alarm after processing data for one measurement interval.
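The sliding window evaluation can be sketched in Python as follows. This is a minimal illustration, not the agent's implementation: the class name, the sample values, and the choice to wait for a full window before reporting (inferred from the walkthrough above, where the alarm becomes active at the end of i6) are assumptions. The aggregation function is fixed to average and the comparison function to above.

```python
from collections import deque
from statistics import mean


class SlidingWindowAlarm:
    """Tracks the most recent intervals and reports whether the alarm is active."""

    def __init__(self, threshold, interval_count=1, intervals_with_exception=1):
        if intervals_with_exception > interval_count:
            raise ValueError("intervals_with_exception cannot exceed interval_count")
        self.threshold = threshold
        self.intervals_with_exception = intervals_with_exception
        # deque(maxlen=...) drops the oldest interval automatically.
        self.window = deque(maxlen=interval_count)

    def add_interval(self, samples):
        """Aggregate one interval's samples and return True if the alarm is active."""
        aggregated = mean(samples)                        # aggregation: average
        self.window.append(aggregated > self.threshold)   # comparison: above
        window_full = len(self.window) == self.window.maxlen
        return window_full and sum(self.window) >= self.intervals_with_exception


# Six intervals of hypothetical CPU samples; i2 and i5 average above 80.
alarm = SlidingWindowAlarm(threshold=80, interval_count=6, intervals_with_exception=2)
intervals = [
    [50, 55, 60], [85, 90, 95], [40, 45, 50],
    [60, 65, 70], [90, 92, 94], [30, 35, 40],
]
states = [alarm.add_interval(samples) for samples in intervals]
print(states)  # [False, False, False, False, False, True]
```

With the Dashboard defaults of 1 for both parameters, the same class reports an active alarm after the first matching interval.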
Static Alarm
A static alarm threshold is provided at the time of alarm definition. Figure 2 depicts an example of a static alarm definition, followed by the equivalent JSON used for API configuration of an alarm. The condition defined in the example is to evaluate an average of host.cpu.usage samples over a 60-second measurement interval. The measured value is compared against a static threshold of 80% to determine if a given measurement interval matches the alarm rule. Figure 2 identifies the components in a static alarm definition.
The following example shows the JSON equivalent to the static alarm definition shown in Figure 2:
    "EventRule": {
        "Name": "Host-CPU-usage",
        "EventRuleType": "static",
        "EventRuleScope": "host",
        "MetricType": "cpu.usage",
        "Mode": "alert",
        "AggregationFunction": "average",
        "IntervalDuration": "60",
        "ComparisonFunction": "above",
        "Threshold": 80,
        "IntervalsWithException": 2,
        "IntervalCount": 6,
        "DisplayEvent": true,
        "Status": "enabled",
        "Module": "alarms",
        "Severity": "warning"
    }
Dynamic Alarm
A dynamic alarm threshold is learned by Contrail Insights using historical data for the set of entities for which an alarm is configured. Figure 3 shows an example of a dynamic alarm definition, followed by the equivalent JSON used for API configuration of an alarm. Figure 3 identifies the components in a dynamic alarm definition.
The following example shows the JSON equivalent to the dynamic alarm definition shown in Figure 3:
    "EventRule": {
        "Name": "Host-CPU-usage",
        "EventRuleType": "dynamic",
        "EventRuleScope": "host",
        "MetricType": "cpu.usage",
        "Mode": "alert",
        "AggregationFunction": "average",
        "IntervalDuration": "60",
        "ComparisonFunction": "above",
        "BaselineAnalysisAlgorithm": "k-means",
        "LearningPeriodDuration": "1d",
        "Sensitivity": "medium",
        "IntervalsWithException": 2,
        "IntervalCount": 6,
        "DisplayEvent": true,
        "Status": "enabled",
        "Module": "alarms",
        "Severity": "warning"
    }
When using a dynamic threshold, you do not configure a static threshold value. Instead, you specify three parameters that control how the learning is performed. The learning algorithm produces a baseline across the entities. The baseline consists of a mean value and a standard deviation, and it is updated continuously as additional metric data is collected.
The three learning parameters work as follows:
Parameter | Description |
---|---|
BaselineAnalysisAlgorithm | Selects the machine learning algorithm used for determining the dynamic threshold. The examples in this topic use k-means (Figure 9) and EWMA (Figure 10). |
LearningPeriodDuration | A dynamic baseline is determined using historical data. This parameter determines the length of the time period from which the most recent historical data is used to compute a dynamic baseline, for example, 1 hour, 1 day, or 1 week. At the time of rule configuration, Contrail Insights might not yet have enough historical data for a given entity. In this case, learning is performed as data becomes available. Alarm evaluation begins after one learning period of data is available and baselines are generated. |
Sensitivity | The sensitivity of a dynamic alarm controls the allowable magnitude of deviation from the learned mean. The sensitivity parameter controls a multiplier of the learned standard deviation. You can select low, medium, or high as sensitivity. Contrail Insights Agent compares real-time measurements to the range mean +/- sigma*sensitivity, where sigma is the learned standard deviation. |
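The range check behind a dynamic threshold can be sketched in Python as follows. The function names are hypothetical. The baseline values (mean 20, standard deviation 2) and the multiplier of 5 for low sensitivity come from the sensitivity example later in this topic; the multipliers for medium and high sensitivity are not stated here, so the multiplier is left as a parameter.

```python
def operating_range(mean, std_dev, multiplier):
    """Return the (low, high) bounds of the normal operating region."""
    return mean - multiplier * std_dev, mean + multiplier * std_dev


def is_exception(measurement, mean, std_dev, multiplier):
    """Return True when a measurement falls outside the normal operating region."""
    low, high = operating_range(mean, std_dev, multiplier)
    return measurement < low or measurement > high


baseline_mean, baseline_std = 20, 2
low_sensitivity_multiplier = 5  # value given for low sensitivity in this topic

print(operating_range(baseline_mean, baseline_std, low_sensitivity_multiplier))   # (10, 30)
print(is_exception(25, baseline_mean, baseline_std, low_sensitivity_multiplier))  # False
print(is_exception(35, baseline_mean, baseline_std, low_sensitivity_multiplier))  # True
print(is_exception(5, baseline_mean, baseline_std, low_sensitivity_multiplier))   # True
```

A smaller multiplier narrows the range, which corresponds to a higher sensitivity that reports smaller deviations.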
Alarm Definition
Figure 2 shows an example of a static alarm definition and is followed by the JSON for the same rule. Every alarm definition has the components shown in Figure 4.
The listed components for alarm definition are numbered and described in the following text:
1. Name | A name identifies the alarm. The name is displayed in the Dashboard and is the user-facing identifier for external notification systems. |
2. Module | When Alarms is selected, you can configure alarms for entities such as hosts, instances, and network devices. When Service Alarms is selected, you can configure alarms for services such as RabbitMQ, MySQL, ScaleIO, and OpenStack services. |
3. Alarm Rule Type | Determines the type of threshold that the alarm uses to decide whether an alarm is generated. Two types are supported: static (a user-provided fixed threshold) and dynamic (a dynamically learned adaptive threshold). |
4. Event Rule Scope | Type of entity, such as host, instance, or network device, to which the alarm applies. For example, if the scope is Instance, you can apply the rule to all instances in the infrastructure, or to the instances in a specific project or aggregate. |
5. Aggregate | Select the set of entities an alarm will monitor. If Scope is Instance, you can configure an alarm for the set of instances present in a specific project, aggregate, or all instances in the infrastructure. If Scope is Host, you can configure an alarm for a set of hosts present in a specific aggregate or all hosts in the infrastructure. |
6. Alarm Mode | Mode can be configured as alert or event. |
Figure 5: Alarm State Transition with Mode as Alert for Cpu.usage Static Threshold = 50%
Figure 6: Alarm State Transition with Mode as Event
7. Metric Name | The metric collected by Contrail Insights that the alarm monitors. For example, host.cpu.usage or instance.cpu.usage. |
8. Aggregation Function | Determines how data samples received in one measurement interval are processed to generate an aggregated value for comparison. The agent collects multiple samples of a metric during a measurement interval and combines them according to the aggregation function to produce a single value for comparison with the threshold (static or dynamic) in that interval. Table 3 lists and describes the aggregation functions for alarm processing. |
Aggregation Function | Description |
---|---|
Average | Statistical average of all data samples received within one measurement interval. Example: Generate Host Alert when Cpu-Usage Average during a 60-second interval is Above 80% in 2 of the last 3 measurement intervals. In this example, the measurement interval is 60 seconds. An alarm is generated if the average of the CPU usage samples exceeds 80% in any 2 measurement intervals out of 3 adjacent measurement intervals. |
Sum | Sum of all data samples received within one measurement interval. Example: Generate Host Alert when Cpu-Usage Sum during a 60-second interval is Above 250% in 2 of the last 3 measurement intervals. In this example, an alarm is generated if the CPU usage sum is above 250% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration. |
Max | Maximum sample value observed within one measurement interval. Example: Generate Host Alert when Cpu-Usage Max during a 60-second interval is Above 95% in 2 of the last 3 measurement intervals. In this example, the alarm is generated if the maximum CPU usage is above 95% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration. |
Min | Minimum sample value observed within one measurement interval. Example: Generate Host Alert when Cpu-Usage Min during a 60-second interval is Below 5% in 2 of the last 3 measurement intervals. In this example, the alarm is generated if the minimum CPU usage is below 5% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration. |
Std-Dev | Standard deviation of the time-series data, determined from the samples received up to the current measurement interval. Example: Generate Host Alert when Cpu-Usage Std-Dev during a 60-second interval is Above 2 sigma in 2 of the last 3 measurement intervals. In this example, the alarm is generated when the raw time-series samples are above 2 sigma in any 2 measurement intervals out of 3 adjacent measurement intervals. |
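The per-interval aggregation functions can be sketched in Python as follows. The `AGGREGATIONS` mapping and the sample values are hypothetical. Std-Dev is left out because it is described as using the samples received up to the current interval rather than a single interval's samples.

```python
from statistics import mean

# Maps an aggregation function name to the callable that reduces one
# interval's samples to a single value.
AGGREGATIONS = {"average": mean, "sum": sum, "max": max, "min": min}


def aggregate(samples, function_name):
    """Reduce the samples of one measurement interval to one value."""
    return AGGREGATIONS[function_name](samples)


# Hypothetical CPU usage samples (percent) collected in one 60-second interval.
samples = [70.0, 85.0, 95.0, 90.0]
results = {name: aggregate(samples, name) for name in AGGREGATIONS}
print(results)  # {'average': 85.0, 'sum': 340.0, 'max': 95.0, 'min': 70.0}
```

With these samples, an Above 80% rule on Average and an Above 250% rule on Sum would both mark the interval as an interval with exception, while an Above 95% rule on Max would not.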
9. Comparison Function | Determines how to compare output of the Aggregation Function with the static or dynamic threshold. Table 4 shows different comparison functions supported for Contrail Insights alarms. Figure 7 and Figure 8 show examples of the Comparison Function, showing both increases and decreases at a minimum rate. |
Comparison Operator | Description |
---|---|
Above | Determines if the result of the aggregation function within a given measurement interval is above the threshold. Note: For a dynamic threshold, Above checks whether the result of the aggregation function is outside the normal operating region (mean +/- sigma*sensitivity). |
Below | Determines if the result of the aggregation function for a given measurement interval is below the threshold. Note: For a dynamic threshold, Below checks whether the result of the aggregation function is within the normal operating region (mean +/- sigma*sensitivity). |
Equal | Determines if the result of the aggregation function is equal to the threshold. |
Increasing-at-a-minimum-rate-of | This comparison function is useful when you are interested in tracking a sudden increase in the value of a given metric instead of its absolute value. For example, if ingress or egress network bandwidth starts increasing within short intervals, you might want to raise an alarm. Figure 7 shows a sudden increase in the metric average between measurement intervals i1 and i2. Similarly, a sudden increase is observed in the metric average between measurement intervals i4 and i5. Example: Generate Host Alert when the host.network.ingress.bit_rate average during a 60-second interval is increasing-at-a-minimum-rate-of 25% in 2 of the last 3 measurement intervals. In the example, if the mean ingress bit rate increases by at least 25% in 2 measurement intervals out of 3, then an alarm is raised. |
Decreasing-at-a-minimum-rate-of | This comparison function is useful when you are interested in tracking a sudden decrease in the value of a given metric instead of its absolute value. For example, if egress network bandwidth starts decreasing within short intervals, you might want to raise an alarm to investigate the root cause. Figure 8 shows a sudden decrease in the metric average between measurement intervals i1 and i2. Similarly, a sudden decrease is observed in the metric average between measurement intervals i3 and i4. Example: Generate Host Alert when the host.network.egress.bit_rate average during a 60-second interval is decreasing-at-a-minimum-rate-of 25% in 2 of the last 3 measurement intervals. In the example, if the mean egress bit rate decreases by at least 25% in 2 measurement intervals out of 3, then an alarm is raised. |
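The two rate-based comparison functions can be sketched in Python as follows. The sketch assumes the rate is the relative change between the aggregated values of two adjacent measurement intervals; that definition, the function name, and the sample values are assumptions made for illustration.

```python
def rate_of_change_flags(interval_values, min_rate, direction="increasing"):
    """Flag each interval whose value changed by at least min_rate versus the previous one.

    min_rate is a fraction (0.25 means 25%). The first interval has no
    predecessor, so the result has one entry fewer than interval_values.
    """
    flags = []
    for previous, current in zip(interval_values, interval_values[1:]):
        if previous == 0:
            flags.append(False)  # relative change is undefined; treated as no match here
            continue
        change = (current - previous) / previous
        if direction == "decreasing":
            change = -change
        flags.append(change >= min_rate)
    return flags


# Hypothetical per-interval averages of an ingress bit rate (i1 to i6).
ingress = [100, 130, 135, 140, 180, 185]
print(rate_of_change_flags(ingress, 0.25))  # [True, False, False, True, False]

# Hypothetical per-interval averages of an egress bit rate (i1 to i5).
egress = [200, 140, 138, 100, 99]
print(rate_of_change_flags(egress, 0.25, direction="decreasing"))  # [True, False, True, False]
```

The flagged intervals would then be counted against Intervals with Exception in the same way as for the Above and Below comparisons.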
10. Threshold | A numeric value to which measurements are compared. Contrail Insights supports two types of thresholds: static and dynamic. |
Figure 9 shows the dynamic baseline computed from 24 hours of data using the k-means clustering algorithm. For a given hour of the day, the blue dot is the mean, the green bar is the mean + std-dev, and the purple bar is the mean - std-dev.
Figure 10 shows the dynamic baseline computed from 24 hours of historical data using the EWMA algorithm. This baseline is used for alarm generation for the next hour, until it is updated again using the most recent 24 hours of data.
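For readers unfamiliar with EWMA (exponentially weighted moving average), the following Python sketch shows the standard recurrence for an exponentially weighted mean and standard deviation. How Contrail Insights applies EWMA internally is not described in this topic, so the function name, the smoothing factor `alpha`, and the choice to seed the mean with the first sample are all assumptions.

```python
import math


def ewma_baseline(samples, alpha=0.3):
    """Return (mean, std_dev) from an exponentially weighted moving average.

    alpha is the weight given to the newest sample (0 < alpha <= 1).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the range (0, 1]")
    mean = float(samples[0])  # seed the mean with the first sample
    variance = 0.0
    for value in samples[1:]:
        diff = value - mean
        incr = alpha * diff
        mean += incr                                        # weighted mean update
        variance = (1 - alpha) * (variance + diff * incr)   # weighted variance update
    return mean, math.sqrt(variance)


mean, std_dev = ewma_baseline([10, 20], alpha=0.5)
print(mean, std_dev)  # 15.0 5.0
```

A larger `alpha` makes the baseline follow recent samples more closely, while a smaller one gives more weight to older history.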
Figure 11 shows the mandatory parameters that must be specified to configure a dynamic alarm.
Table 5 describes the required parameters for a dynamic alarm and the supported options.
Required Parameters for Dynamic Threshold | Description | Supported Options |
---|---|---|
Baseline Analysis Algorithm | The Baseline Analysis Algorithm is used to perform unsupervised learning on historical data. The baseline analysis is performed continuously as new data is received. | k-means, EWMA (the algorithms used in Figure 9 and Figure 10) |
Learning Period Duration | The Learning Period Duration specifies the amount of historical data used by the Baseline Analysis Algorithm to determine a baseline. The dynamic baseline is continuously updated using data from the most recent Learning Period Duration. When a dynamic alarm is configured, baseline analysis is performed using data from the most recent Learning Period Duration, if available. If there is not sufficient data available, Contrail Insights Agent evaluates metrics as soon as enough data is present to learn the first set of baselines. Example: When the Learning Period Duration is 1 day, the agent compares metrics to per-hour baselines for the last 24 hours. Example: When the Learning Period Duration is 1 week, the agent compares metrics to per-hour baselines for the last 7 x 24 hours. | For example: 1 hour, 1 day, 1 week |
Sensitivity | The dynamic baseline provides a normal operating region of a given metric for a given scope. As seen in Figure 9, the dynamic baseline is a tuple of the mean and std-dev that apply to a specific hour of the day. The sensitivity factor determines the allowable band of operation. Measurements outside of the band of operation cause an interval with exception. For example, if the baseline mean is 20 and std-dev is 2, then the normal operating region is between 18 and 22. When sensitivity is low, the normal operating region is treated as the range from 10 (mean - 5*std-dev) to 30 (mean + 5*std-dev). In this case, if the measured average of a metric is between 10 and 30, then no alarm is raised. In contrast, if the average is 5 or 35, then an alarm is raised. | low, medium, high |
11. Alarm Severity | Indicates seriousness of the alarm. Critical indicates a major alarm. Information indicates a minor alarm. |
12. Notification | Methods of notification alerting you to conditions of operation. |
13. Interval Duration | The duration of one measurement interval in seconds. Depending on the sampling frequency of a metric under observation, one or more raw samples might be received within an interval duration. All raw samples received within Interval duration are processed using aggregation functions such as average, sum, max, min, and std-dev. |
14. Intervals with Exception | The minimum number of measurement intervals within the sliding window for which the alarm condition must be met to raise the alarm. In Figure 1, there are two intervals with exception: i2 and i5. When configuring an alarm in the Dashboard, Intervals with Exception is set to 1 by default. To change it in the Dashboard, select Alarms > Add New Rule, then select Advanced to view the Advanced settings. Intervals with Exception cannot be greater than the Interval Count. |
15. Interval Count | Maximum number of adjacent measurement intervals over which statistical analysis is performed before deciding whether an alarm is generated. In Figure 1, there are 6 measurement intervals (i1 to i6) in the sliding window. Each measurement interval has the duration specified by the Interval Duration parameter. When configuring an alarm in the Dashboard, Interval Count is set to 1 by default. To change it in the Dashboard, select Alarms > Add New Rule, then select Advanced to view the Advanced settings. |
16. Status | Sets and shows the status of the alarm rule. The status is either enabled or disabled. |
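Some of the constraints described in this list can be checked before a rule is submitted. The following Python sketch reuses the static rule from the Static Alarm section, wrapped in braces so that it parses as a JSON object. The `check_rule` helper is hypothetical and is not part of any Contrail Insights API; it only encodes constraints stated in this topic.

```python
import json

# Static rule from the Static Alarm example, wrapped in braces to form a JSON object.
rule_text = """
{
  "EventRule": {
    "Name": "Host-CPU-usage",
    "EventRuleType": "static",
    "EventRuleScope": "host",
    "MetricType": "cpu.usage",
    "Mode": "alert",
    "AggregationFunction": "average",
    "IntervalDuration": "60",
    "ComparisonFunction": "above",
    "Threshold": 80,
    "IntervalsWithException": 2,
    "IntervalCount": 6,
    "DisplayEvent": true,
    "Status": "enabled",
    "Module": "alarms",
    "Severity": "warning"
  }
}
"""


def check_rule(rule):
    """Return a list of problems found in an EventRule dictionary (hypothetical helper)."""
    problems = []
    if rule["IntervalsWithException"] > rule["IntervalCount"]:
        problems.append("IntervalsWithException cannot be greater than IntervalCount")
    if rule["EventRuleType"] == "static" and "Threshold" not in rule:
        problems.append("a static rule needs a Threshold")
    if rule["Status"] not in ("enabled", "disabled"):
        problems.append("Status must be enabled or disabled")
    return problems


rule = json.loads(rule_text)["EventRule"]
# Length of the sliding window: interval duration times interval count.
window_seconds = int(rule["IntervalDuration"]) * rule["IntervalCount"]

print(check_rule(rule))  # []
print(window_seconds)    # 360
```

For this rule, the sliding window covers 6 intervals of 60 seconds each, so the alarm condition is evaluated over the most recent 360 seconds.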