Alarms
With AppFormix Alarms, you can configure an alarm to be generated when a condition is met in the infrastructure. AppFormix performs distributed analysis of metrics at the point of collection for efficient and responsive detection of events that match an alarm. AppFormix has two types of alarms:
Static—User-provided static threshold is used for comparison.
Dynamic—Dynamically-learned adaptive threshold is used for comparison.
AppFormix Alarms Overview
For both static and dynamic alarms, AppFormix Agent continuously collects measurements of metrics (see Metrics) for different entities, such as hosts, instances, and network devices. Beyond simple collection, the agent also analyzes the stream of metrics at the time of collection to identify alarm rules that match. For a particular alarm, the agent aggregates the samples according to a user-specified function (average, standard deviation, min, max, sum) and produces a single measurement for each user-specified measurement interval. For a given measurement interval, the agent compares each measurement to a threshold. For an alarm with a static threshold, a measurement is compared to a fixed value using a user-specified comparison function (above, below, equal). For dynamic thresholds, a measurement is compared with a value learned by AppFormix over time.
You can further configure alarm parameters to require matches in multiple intervals. This allows you to configure alarms that match sustained conditions while still detecting performance over small time periods. Maximum values over a wide time range can exaggerate conditions, yet averages can dilute the information. A better balance is achieved by measuring over small intervals and watching for repeated matches in multiple intervals. For example, to monitor CPU usage over a three-minute period, an alarm may be configured to compare average CPU utilization over five-second intervals, yet only raise an alarm when 36 (or some subset of 36) intervals match the alarm condition. This provides better visibility into sustained performance conditions than a simple average or maximum over three minutes.
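The trade-off described above can be illustrated with a minimal Python sketch. The data is synthetic (one sample per second), and the function names and the 80% threshold are assumptions made for illustration, not part of AppFormix.

```python
from statistics import mean

INTERVAL_SECONDS = 5   # measurement interval from the example above
THRESHOLD = 80.0       # hypothetical CPU threshold, in percent


def interval_averages(samples, interval=INTERVAL_SECONDS):
    """Average consecutive groups of `interval` one-per-second samples."""
    return [mean(samples[i:i + interval]) for i in range(0, len(samples), interval)]


def matching_intervals(samples, threshold=THRESHOLD):
    """Count the measurement intervals whose average is above the threshold."""
    return sum(avg > threshold for avg in interval_averages(samples))


# Synthetic series covering three minutes (180 samples, 36 intervals each).
brief_spike = [20.0] * 170 + [100.0] * 10   # 10-second burst at the end
sustained = [90.0] * 180                    # high load for the whole window

for name, series in (("brief spike", brief_spike), ("sustained", sustained)):
    # Columns: three-minute maximum, three-minute average, matching intervals
    print(name, max(series), round(mean(series), 1), matching_intervals(series))
```

In this sketch the three-minute maximum rates the brief spike (100) above the sustained load (90), and the three-minute average (about 24.4) hides the spike entirely. Counting matching intervals separates the two cases: 2 of 36 intervals for the spike and 36 of 36 for the sustained load.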
Dynamic thresholds enable outlier detection in resource consumption based on historical trends. Resource consumption may vary significantly at various hours of the day and days of the week. This makes it difficult to set a static threshold for a metric. For example, 70% CPU usage may be considered normal for Monday mornings between 10:00 AM and 12:00 PM, but the same amount of CPU usage may be considered abnormally high for Saturday nights between 9:00 PM and 10:00 PM.
With dynamic thresholds, AppFormix learns trends in metrics across all resources in scope to which an alarm applies. For example, if an alarm is configured for a host aggregate, AppFormix learns a baseline from metric values collected for hosts in that aggregate. Similarly, an alarm with a dynamic threshold configured for a project learns a baseline from metric values collected for instances in that project. Then, the agent generates an alarm when a measurement deviates from the baseline value learned for a particular time period.
When creating an alarm with a dynamic threshold, you select a metric, a period of time over which to establish a baseline, and the sensitivity to measurements that deviate from the baseline. The sensitivity can be configured as high, medium, or low. Higher sensitivity will report smaller deviations from the baseline and vice versa.
AppFormix Alarms Operation
AppFormix Agent performs distributed, real-time statistical analysis on a time-series data stream. The agent analyzes metrics over multiple measurement intervals using a configurable sliding window mechanism. An alarm is generated when the agent determines that metric data matches the alarm criteria over a configurable number of measurement intervals. The type of sample aggregation and the threshold for an alarm are configurable. Two types of alarms are supported: static and dynamic. They differ in how the threshold is determined and used to compare measured metric data. The following sections describe the overall sliding window analysis and explain the details of the static thresholds and dynamic baselines used by the analysis.
Sliding Window Analysis
AppFormix Agent evaluates alarms using sliding window analysis. The sliding window analysis compares a stream of metrics within a configurable measurement interval to a static threshold or dynamic baseline. The length of each measurement interval is configurable to one-second granularity. In each measurement interval, raw time-series data samples are combined using an aggregation function, such as average, max, and min. The aggregated value is compared against the static threshold or dynamic baseline using a configurable comparison function, such as above or below. Multiple measurement intervals comprise a sliding window. A configurable number of intervals in the sliding window must match the rule criteria for the agent to generate a notification for the alarm.
![Alarm Generation Mechanics](/documentation/images/s043698.jpg)
Figure 1 shows an example in which the sliding window consists of six adjacent measurement intervals (i1 to i6), as specified by the Interval Count parameter. In measurement interval i1, the average of samples S1, S2, S3 is computed as Savg. Depending on the alarm type (static or dynamic), Savg is then compared with the configured static threshold or the dynamically learned baseline using a user-specified comparison function such as above or below. The output of the comparison determines whether a specific measurement interval is marked as an interval with exception. This evaluation is repeated for each measurement interval within the sliding window (for example, i1 to i6).
In the example in Figure 1, the agent determines that two intervals, i2 and i5, are intervals with exception by comparing the aggregate value for the measurement interval with a static threshold or dynamic baseline, depending on alarm type. Assuming interval i1 is the first interval for which the alarm is configured, the alarm becomes active at the end of interval i6, when AppFormix Agent determines that at least two out of the most recent six measurement intervals are marked as exceptions. When an alarm is configured using the Dashboard, Interval Count and Intervals with Exception are both set to 1 by default. As a result, the agent can generate an alarm after processing data for one measurement interval.
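The sliding window evaluation described above can be sketched in Python as follows. This is an illustrative sketch, not the agent's implementation: the class name, the fixed "average" and "above" functions, and the rule that the window must be full before the alarm can become active are assumptions drawn from the Figure 1 walkthrough.

```python
from collections import deque
from statistics import mean


class SlidingWindowAlarm:
    """Track exception flags for the most recent `interval_count` intervals."""

    def __init__(self, threshold, interval_count=1, intervals_with_exception=1):
        self.threshold = threshold
        self.intervals_with_exception = intervals_with_exception
        # A deque with maxlen drops the oldest interval as the window slides.
        self.window = deque(maxlen=interval_count)

    def add_interval(self, samples):
        """Aggregate one interval's samples; return True if the alarm is active."""
        exceeded = mean(samples) > self.threshold  # "average" compared with "above"
        self.window.append(exceeded)
        window_full = len(self.window) == self.window.maxlen
        return window_full and sum(self.window) >= self.intervals_with_exception


# Six intervals of three samples each; only i2 and i5 average above 80.
alarm = SlidingWindowAlarm(threshold=80, interval_count=6, intervals_with_exception=2)
intervals = [
    [50, 55, 60], [85, 90, 95], [40, 45, 50],
    [60, 65, 70], [90, 85, 95], [30, 35, 40],
]
states = [alarm.add_interval(samples) for samples in intervals]
print(states)  # [False, False, False, False, False, True]
```

With the default values of 1 for both window parameters, the same class reports an active alarm after a single matching interval, which mirrors the Dashboard defaults described above.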
Static Alarm
A static alarm threshold is provided at the time of alarm definition. Figure 2 depicts an example of a static alarm definition and identifies its components. It is followed by the equivalent JSON used for API configuration of an alarm. The condition defined in the example is to evaluate an average of host.cpu.usage samples over a 60-second measurement interval. The measured value is compared against a static threshold of 80% to determine if a given measurement interval matches the alarm rule.
![Static Alarm Definition](/documentation/images/s043699.jpg)
The following example shows the JSON equivalent to the static alarm definition shown in Figure 2:
    "EventRule": {
        "Name": "Host-CPU-usage",
        "EventRuleType": "static",
        "EventRuleScope": "host",
        "MetricType": "cpu.usage",
        "Mode": "alert",
        "AggregationFunction": "average",
        "IntervalDuration": "60",
        "ComparisonFunction": "above",
        "Threshold": 80,
        "IntervalsWithException": 2,
        "IntervalCount": 6,
        "DisplayEvent": true,
        "Status": "enabled",
        "Module": "alarms",
        "Severity": "warning"
    }
Dynamic Alarm
A dynamic alarm threshold is learned by AppFormix using historical data for the set of entities for which an alarm is configured. Figure 3 shows an example of a dynamic alarm definition, followed by the equivalent JSON used for API configuration of an alarm. Figure 3 identifies the components in a dynamic alarm definition.
![Dynamic Alarm Definition](/documentation/images/s043700.jpg)
The following example shows the JSON equivalent to the dynamic alarm definition shown in Figure 3:

    "EventRule": {
        "Name": "Host-CPU-usage",
        "EventRuleType": "dynamic",
        "EventRuleScope": "host",
        "MetricType": "cpu.usage",
        "Mode": "alert",
        "AggregationFunction": "average",
        "IntervalDuration": "60",
        "ComparisonFunction": "above",
        "BaselineAnalysisAlgorithm": "k-means",
        "LearningPeriodDuration": "1d",
        "Sensitivity": "medium",
        "IntervalsWithException": 2,
        "IntervalCount": 6,
        "DisplayEvent": true,
        "Status": "enabled",
        "Module": "alarms",
        "Severity": "warning"
    }
When using a dynamic threshold, you do not configure a static threshold value. Instead, you specify three parameters that control how the learning is performed. The learning algorithm produces a baseline across the entities. The baseline consists of a mean value and a standard deviation, and it is updated continuously as additional metric data is collected.
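The difference between the two definitions can be checked with a short Python sketch. The field names come from the JSON examples in this topic. Treating every field as required, the function name `missing_fields`, and wrapping the rule in a standalone JSON object are assumptions made for illustration, and the sketch does not call any AppFormix API.

```python
import json

# Fields that appear in both example definitions. Treating all of them as
# required is an assumption of this sketch.
COMMON_FIELDS = {
    "Name", "EventRuleType", "EventRuleScope", "MetricType", "Mode",
    "AggregationFunction", "IntervalDuration", "ComparisonFunction",
    "IntervalsWithException", "IntervalCount",
}
STATIC_FIELDS = {"Threshold"}
DYNAMIC_FIELDS = {"BaselineAnalysisAlgorithm", "LearningPeriodDuration", "Sensitivity"}


def missing_fields(rule):
    """Return the sorted field names the rule lacks for its EventRuleType."""
    required = set(COMMON_FIELDS)
    if rule.get("EventRuleType") == "dynamic":
        required |= DYNAMIC_FIELDS   # learning parameters replace the threshold
    else:
        required |= STATIC_FIELDS    # anything else is treated as static here
    return sorted(required - set(rule))


dynamic_rule = json.loads("""
{
  "Name": "Host-CPU-usage",
  "EventRuleType": "dynamic",
  "EventRuleScope": "host",
  "MetricType": "cpu.usage",
  "Mode": "alert",
  "AggregationFunction": "average",
  "IntervalDuration": "60",
  "ComparisonFunction": "above",
  "BaselineAnalysisAlgorithm": "k-means",
  "LearningPeriodDuration": "1d",
  "Sensitivity": "medium",
  "IntervalsWithException": 2,
  "IntervalCount": 6
}
""")

print(missing_fields(dynamic_rule))                                 # []
print(missing_fields({**dynamic_rule, "EventRuleType": "static"}))  # ['Threshold']
```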
The three learning parameters are the baseline analysis algorithm, the learning period duration, and the sensitivity (see Table 5). The following examples illustrate how they work:
For example, a k-means algorithm may learn a dynamic baseline of 80% +/- 10% for 1:00 PM - 2:00 PM, whereas the baseline for 3:00 AM - 4:00 AM may be 20% +/- 5%. An alarm is raised if the measured metric value is 75% between 3:00 AM and 4:00 AM, but the same measurement is acceptable during the 1:00 PM - 2:00 PM time period.
For example, an EWMA algorithm can learn a dynamic baseline of 60% +/- 10% from data over the last 24 hours. This baseline is used for the next 1-hour interval to determine if real-time data deviates from the normal operating region. After every 1-hour interval, the EWMA baseline is updated and a new updated baseline is used for alarm generation in the future.
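The sensitivity determines the width of the normal operating region around the baseline. A measurement x is within the normal operating region when:
mean - sensitivity * std_dev < x < mean + sensitivity * std_dev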
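The inequality above, combined with the k-means example values, can be sketched in Python as follows. The function names and the `baselines` dictionary are hypothetical, and the sketch takes a numeric sensitivity multiplier with a default of 1.0. This topic does not state the multipliers behind the high and medium settings, so that default is an assumption.

```python
def within_normal_region(x, mean, std_dev, sensitivity=1.0):
    """True when mean - sensitivity*std_dev < x < mean + sensitivity*std_dev."""
    return mean - sensitivity * std_dev < x < mean + sensitivity * std_dev


# Hypothetical per-hour baselines (hour of day -> (mean, std_dev)), using the
# k-means example values from the text.
baselines = {
    13: (80.0, 10.0),  # 1:00 PM - 2:00 PM
    3: (20.0, 5.0),    # 3:00 AM - 4:00 AM
}


def is_exception(hour, x, sensitivity=1.0):
    """True when the measurement falls outside the baseline band for that hour."""
    mean, std_dev = baselines[hour]
    return not within_normal_region(x, mean, std_dev, sensitivity)


print(is_exception(3, 75.0))   # True: 75% is far outside 20% +/- 5%
print(is_exception(13, 75.0))  # False: 75% is inside 80% +/- 10%
```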
Alarm Definition
Figure 2 shows an example of a static alarm definition and is followed by the JSON for the same rule. Every alarm definition has the following components shown in Figure 4.
![Static Alarm Rule Configuration Example](/documentation/images/s043708.png)
The listed components for alarm definition are numbered and described in the following text:
Static—When an alarm is defined as static, the rule definition should include a predefined static threshold. For example, the cpu.usage static threshold can be 80%.
Dynamic—When an alarm is defined as dynamic, the baseline is learned using historical data. Additional parameters are required, such as the baseline analysis algorithm, learning period duration, and sensitivity.
Alert—An alarm with the mode set to Alert has state. Events are generated and recorded only for changes in the state of the alarm. Table 1 shows all possible states for an alarm with the mode configured as alert. Figure 5 shows an example of different state transitions for an alarm for the cpu.usage metric with a static threshold of 50%.
Event—An alarm with the mode set to Event is evaluated similarly to an alarm with the mode set to Alert. The key difference is that an alarm with the mode set to Event keeps generating notifications with a state of triggered for each interval in which the condition for the alarm is satisfied. When the conditions for an alarm are not satisfied, the agent stops generating notifications about the alarm. As shown in Figure 6, an alarm with the mode set to Event generates significantly more notifications than an alarm with the mode set to Alert.
![Alarm State Transition with Mode as Alert for Cpu.usage Static Threshold = 50%](/documentation/images/s043701.png)
Table 1: States for Alarm Mode Defined as Alert
State | Description |
---|---|
Learning | This is the initial state of each alarm. In this state, the alarm is processing real-time data. The alarm stays in this state until sufficient data has been processed to decide whether an alarm should be generated. The duration of the learning period depends on the sliding window parameters. Figure 5 shows the learning state when the rule is configured in the system. |
Active | The condition specified by an alarm is met. The alarm stays in this state as long as the alarm conditions are satisfied. Figure 5 shows the active state when CPU usage is detected as 76.05%. |
Inactive | The condition specified by an alarm is not met. In Figure 5, after the learning state, the alarm transitions to the inactive state because CPU usage was 13.5% (below the 50% threshold). The alarm transitions from the active state to the inactive state when CPU usage drops to 15.65%. |
Disabled | Agent is not actively analyzing data for this alarm. The alarm is either deleted or temporarily disabled by the user. |
![Alarm State Transition with Mode as Event](/documentation/images/s043702.png)
Table 2: States for Alarm Mode Defined as Event
State | Description |
---|---|
Enabled | This is the initial state of an alarm with the mode set to Event when a rule is configured. It stays in this state until conditions are met to generate an alarm. Figure 6 shows that the enabled state is logged when an alarm with the mode set to Event is configured. |
Triggered | When the conditions for alarm generation are satisfied, an alarm is generated with a state of triggered. Alarm generation is logged at the end of each measurement interval as long as the conditions for the alarm continue to be met. In Figure 6, seven alarm events are generated for the duration when cpu.usage stays above 50%. |
Disabled | Agent is not actively analyzing data for this alarm. The alarm is either deleted or has been temporarily disabled by the user. |
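The difference in notification volume between the two modes can be sketched in Python as follows. The function names are hypothetical, the input is one match flag per evaluated measurement interval, and the sketch simplifies the state tables by leaving out the disabled and enabled states and by assuming that the learning state ends at the first evaluation.

```python
def alert_notifications(matches):
    """Alert mode: emit a notification only when the alarm state changes."""
    events, state = [], "learning"
    for index, matched in enumerate(matches):
        new_state = "active" if matched else "inactive"
        if new_state != state:
            events.append((index, new_state))
            state = new_state
    return events


def event_notifications(matches):
    """Event mode: emit 'triggered' for every interval that matches."""
    return [(index, "triggered") for index, matched in enumerate(matches) if matched]


# One flag per measurement interval: the condition holds for seven intervals.
matches = [False] + [True] * 7 + [False] * 2

print(alert_notifications(matches))       # [(0, 'inactive'), (1, 'active'), (8, 'inactive')]
print(len(event_notifications(matches)))  # 7
```

For the same seven matching intervals, the Alert sketch records three state changes while the Event sketch records seven triggered notifications.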
Table 3: Aggregation Functions for Alarm Processing
Aggregation Function | Description |
---|---|
Average | Statistical average of all data samples received within one measurement interval. Example: Generate Host Alert when Cpu-Usage Average during a 60-second interval is Above 80% in 2 of the last 3 measurement intervals. In this example, the measurement interval is 60 seconds. An alarm is generated if the average of the CPU usage samples exceeds 80% in any 2 measurement intervals out of 3 adjacent measurement intervals. |
Sum | Sum of all data samples received within one measurement interval. Example: Generate Host Alert when Cpu-Usage Sum during a 60-second interval is Above 250% in 2 of the last 3 measurement intervals. In this example, an alarm is generated if the CPU usage sum is above 250% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration. |
Max | Maximum sample value observed within one measurement interval. Example: Generate Host Alert when Cpu-Usage Max during a 60-second interval is Above 95% in 2 of the last 3 measurement intervals. In this example, the alarm is generated if the maximum CPU usage is above 95% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration. |
Min | Minimum sample value observed within one measurement interval. Example: Generate Host Alert when Cpu-Usage Min during a 60-second interval is Below 5% in 2 of the last 3 measurement intervals. In this example, the alarm is generated if the minimum CPU usage is below 5% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration. |
Std-Dev | Standard deviation of the time-series data, determined from the samples received up to the current measurement interval. Example: Generate Host Alert when Cpu-Usage Std-Dev during a 60-second interval is Above 2 sigma in 2 of the last 3 measurement intervals. In this example, the alarm is generated when the raw time-series samples are above 2 sigma in any 2 measurement intervals out of 3 adjacent measurement intervals. |
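The aggregation functions in Table 3 map directly onto Python's standard library, as the following minimal sketch shows. The lookup table and function name are hypothetical. The sketch applies every function to the samples of a single interval, and it uses the population standard deviation, which is an assumption. Table 3 describes Std-Dev as computed over the samples received up to the current interval, which this sketch does not reproduce.

```python
from statistics import mean, pstdev

# Hypothetical lookup from aggregation function name to implementation.
AGGREGATIONS = {
    "average": mean,
    "sum": sum,
    "max": max,
    "min": min,
    "std-dev": pstdev,  # population standard deviation (an assumption)
}


def aggregate(samples, function):
    """Reduce one measurement interval's samples to a single measurement."""
    return AGGREGATIONS[function](samples)


samples = [85, 90, 95]  # CPU usage samples (percent) from one 60-second interval
for name in ("average", "sum", "max", "min"):
    print(name, aggregate(samples, name))
# The sum (270) exceeds the 250% threshold used in the Sum example.
```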
![Comparison Function Showing Increasing-at-a-minimum-rate-of](/documentation/images/s043703.png)
![Comparison Function Showing Decreasing-at-a-minimum-rate-of](/documentation/images/s043704.png)
Table 4: Comparison Functions for Alarm Processing
Comparison Operator | Description |
---|---|
Above | Determine if the result of the aggregation function within a given measurement interval is above the threshold. Note: For a dynamic threshold, Above checks whether the result of the aggregation function is outside of the normal operating region (mean +/- sigma*sensitivity). |
Below | Determine if the result of the aggregation function for a given measurement interval is below the threshold. Note: For a dynamic threshold, Below checks whether the result of the aggregation function is within the normal operating region (mean +/- sigma*sensitivity). |
Equal | Determine if the result of the aggregation function is equal to the threshold. |
Increasing-at-a-minimum-rate-of | This comparison function is useful when you are interested in tracking a sudden increase in the value of a given metric instead of its absolute value. For example, if ingress or egress network bandwidth starts increasing within short intervals, you might want to raise an alarm. Figure 7 shows a sudden increase in the metric average between measurement intervals i1 and i2. Similarly, a sudden increase is observed in the metric average between measurement intervals i4 and i5. Example: Generate Host Alert when the host.network.ingress.bit_rate average during a 60-second interval is increasing-at-a-minimum-rate-of 25% in 2 of the last 3 measurement intervals. In the example, if the mean ingress bit rate increases by at least 25% in 2 measurement intervals out of 3, then an alarm is raised. |
Decreasing-at-a-minimum-rate-of | This comparison function is useful when you are interested in tracking a sudden decrease in the value of a given metric instead of its absolute value. For example, if egress network bandwidth starts decreasing within short intervals, you might want to raise an alarm to investigate the root cause. Figure 8 shows a sudden decrease in the metric average between measurement intervals i1 and i2. Similarly, a sudden decrease is observed in the metric average between measurement intervals i3 and i4. Example: Generate Host Alert when the host.network.egress.bit_rate average during a 60-second interval is decreasing-at-a-minimum-rate-of 25% in 2 of the last 3 measurement intervals. In the example, if the mean egress bit rate decreases by at least 25% in 2 measurement intervals out of 3, then an alarm is raised. |
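The comparison functions in Table 4 can be sketched in Python for a static threshold. The function name `compare` is hypothetical. Computing the rate of change as a percentage of the previous interval's aggregate is an assumption, because this topic does not give the exact formula.

```python
# Direction multiplier for the two rate-based comparison functions.
RATE_FUNCTIONS = {
    "increasing-at-a-minimum-rate-of": 1,
    "decreasing-at-a-minimum-rate-of": -1,
}


def compare(function, current, threshold, previous=None):
    """Apply a comparison function to the current interval's aggregate."""
    if function == "above":
        return current > threshold
    if function == "below":
        return current < threshold
    if function == "equal":
        return current == threshold
    if function in RATE_FUNCTIONS:
        if not previous:
            return False  # a rate of change needs a non-zero prior value
        change = (current - previous) / previous * 100.0
        return RATE_FUNCTIONS[function] * change >= threshold
    raise ValueError(f"unknown comparison function: {function}")


# Average ingress bit rate (Mbps) in four consecutive measurement intervals.
averages = [100.0, 130.0, 135.0, 180.0]
flags = [
    compare("increasing-at-a-minimum-rate-of", cur, 25, previous=prev)
    for prev, cur in zip(averages, averages[1:])
]
print(flags)  # [True, False, True]
```

Here two of the three most recent intervals show an increase of at least 25%, so a rule that requires 2 of the last 3 intervals would match.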
Static Threshold—A fixed value that is specified when an alarm is configured. For example, host.cpu.usage above 90%, where 90% is the static threshold.
Dynamic Threshold—The threshold is learned dynamically by the system. Unsupervised learning is used to learn about historical trends to determine the dynamic threshold. For example, if an event rule is defined for Host aggregate, then the dynamic baseline is determined for the aggregate by applying the baseline analysis algorithm to data received from all member hosts of the aggregate. Figure 9 shows the dynamic baseline determined using the most recent 24-hour time frame of historical data and k-means clustering algorithm. This baseline is used for the next 24 hours for alarm generation while considering the hour of the day and its corresponding baseline mean and standard deviation. For example, on Tuesday 8:00 AM - 9:00 AM, a baseline computed for Monday 8:00 AM - 9:00 AM is used as a reference threshold for alarm generation.
Figure 9 shows the dynamic baseline computed from 24 hours of data using the k-means clustering algorithm. For a given hour of the day, the blue dot is the mean, the green bar is mean + std-dev, and the purple bar is mean - std-dev.
![Dynamic Baseline Determined by Last 24 Hours of Data and K-Means Clustering Algorithm](/documentation/images/s043705.png)
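The shape of a per-hour baseline like the one in Figure 9 can be illustrated with a simplified Python sketch. It does not implement k-means clustering. It only groups samples from all member hosts by hour of day and computes a plain mean and population standard deviation, and the function name and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baseline(samples):
    """Map hour of day -> (mean, std_dev) from (hour, value) samples."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(values), pstdev(values)) for hour, values in by_hour.items()}


# Hypothetical CPU usage samples collected on Monday from hosts in an aggregate.
monday = [(8, 60), (8, 70), (8, 80), (3, 18), (3, 22)]
baseline = hourly_baseline(monday)

# On Tuesday 8:00 AM - 9:00 AM, the Monday 8:00 AM - 9:00 AM entry is the reference.
reference_mean, reference_std = baseline[8]
print(reference_mean, round(reference_std, 2))
```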
Figure 10 shows the dynamic baseline computed by 24 hours of historical data using the EWMA algorithm. This baseline is used for the next 1 hour for alarm generation until it is updated again using the most recent 24 hours of data.
![Dynamic Baseline Determined by Last 24 Hours of Historical Data Using EWMA](/documentation/images/s043706.png)
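An exponentially weighted moving average (EWMA) baseline of the kind shown in Figure 10 can be sketched in Python as follows. This is a common incremental formulation for an exponentially weighted mean and variance. The formula AppFormix uses is not given in this topic, and the function name and the smoothing factor `alpha` are assumptions.

```python
def ewma_baseline(values, alpha=0.3):
    """Return (mean, std_dev) as exponentially weighted moving estimates."""
    mean = float(values[0])
    variance = 0.0
    for x in values[1:]:
        delta = x - mean
        mean += alpha * delta  # move the mean toward the new sample
        variance = (1 - alpha) * (variance + alpha * delta * delta)
    return mean, variance ** 0.5


# Recomputed every hour from the most recent 24 hours of data (values are made up).
last_24_hours = [55, 60, 65, 58, 62, 61, 59, 63]
mean_value, std_dev = ewma_baseline(last_24_hours)
print(round(mean_value, 1), round(std_dev, 1))
```

A larger `alpha` weights recent samples more heavily, so the baseline follows recent behavior more closely.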
Figure 11 shows the mandatory parameters that must be specified to configure a dynamic alarm.
![Required Parameters for the Dynamic Threshold in the Alarm Definition](/documentation/images/s043707.png)
Table 5 describes the required parameters for a dynamic alarm and the supported options.
Table 5: Required Parameters for Dynamic Alarm
Required Parameters for Dynamic Threshold | Description | Supported Options |
---|---|---|
Baseline Analysis Algorithm | The Baseline Analysis Algorithm is used to perform unsupervised learning on historical data. The baseline analysis is performed continuously as new data is received. | k-means, EWMA |
Learning Period Duration | The Learning Period Duration specifies the amount of historical data used by the Baseline Analysis Algorithm to determine a baseline. The dynamic baseline is continuously updated using data from the most recent Learning Period Duration. When a dynamic alarm is configured, baseline analysis is performed using data from the most recent Learning Period Duration, if available. If there is not sufficient data available, AppFormix Agent evaluates metrics as soon as enough data is present to learn the first set of baselines. Example: When the Learning Period Duration is 1 day, the agent compares metrics to per-hour baselines for the last 24 hours. Example: When the Learning Period Duration is 1 week, the agent compares metrics to per-hour baselines for the last 7 x 24 hours. | Examples: 1 day, 1 week |
Sensitivity | The dynamic baseline provides a normal operating region of a given metric for a given scope. As seen in Figure 9, the dynamic baseline is a tuple of mean and std-dev that applies to a specific hour of the day. The sensitivity factor determines the allowable band of operation. Measurements outside of the band of operation cause an interval with exception. For example, if the baseline mean is 20 and std-dev is 2, then the normal operating region is between 18 and 22. When sensitivity is low, the normal operating region is widened to between 10 (mean - 5*std-dev) and 30 (mean + 5*std-dev). In this case, if the measured average of a metric is between 10 and 30, then no alarm is raised. In contrast, if the average is 5 or 35, then an alarm is raised. | High, medium, low |