Alert
An alert is a set of consecutive named queries calculated on a regular basis.
The set of queries is calculated once a minute. The result of the query specified in the settings is compared to the preset threshold values.
If the result reaches a threshold, Monitoring changes the alert status to Alarm or Warning and notifies the user via a notification channel.
Alert statuses
An alert can have one of the following statuses:
Status | Description |
---|---|
OK | The metric value is within the specified normal range. |
Warning | The metric value has reached the Warning threshold. |
Alarm | The metric value has reached the critical Alarm threshold. |
No data | There is not enough metric data to calculate the alert function. |
Error | The alert value cannot be calculated. |
Alert evaluation history
Alert evaluation history is represented as a chart that consists of columns colored depending on the alert status as of its calculation.
To navigate through history, you can choose one of the preset output scales:
- 1h: 1 hour
- 1d: 1 day
- 1w: 1 week
- 1m: 1 month
The minimum history output scale is 1h: each column in the chart shows the alert status for the respective minute. For larger output scales, the column color combines the statuses calculated within the interval.
Clicking a column opens the alert settings as of the selected evaluation point.
Note
When drawing data from the evaluation history, the alert status is re-evaluated and presented in the Evaluation status field. The alert status in the history may differ from the current evaluation result due to the specifics of historical data decimation or delays in data delivery to Monitoring.
Alert settings
Alert settings are configured when creating an alert. You can edit them after you save the alert.
Queries
This is a set of queries that return a line or multiple lines.
You can:
- Disable query calculation by clicking the corresponding icon and selecting Deactivate. Links to queries that are not calculated will return errors.
- Hide query calculation results from the chart by clicking the corresponding icon.
- Display query calculation results on the chart by clicking the corresponding icon.
Alert triggers
Test query
Name of the query whose calculation result the aggregation function is applied to.
Aggregation function
The aggregation function is applied to the test query calculation results.
Aggregation function | Description |
---|---|
At least one value | At least one metric value in the query exceeds the thresholds set in the specified period. |
All values | All metric values in the query exceed the thresholds set in the specified period. |
Average | Calculates an average value for each metric in the specified period. For example, if a query returns two metrics, Monitoring calculates an average value for each of them in the specified window. |
Count | Calculates the number of metric values in the specified period. |
Last value | Uses the latest metric value in the specified period. If Yandex Monitoring cannot obtain the metric value, it changes the alert status to No data. |
Maximum | Uses the maximum metric value in the specified period. |
Minimum | Uses the minimum metric value in the specified period. |
Sum | Calculates the sum of values for each metric in the specified period. |
For example, to track the latest metric value within the last 15 minutes, select the Last value function and set the evaluation window to 15m.
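The value-producing functions in the table can be sketched in Python as plain reducers over the points of one metric in the evaluation window. This is an illustration only, not how Monitoring implements them: the `AGGREGATIONS` mapping and the `aggregate` helper are hypothetical names, and returning `None` for an empty window is an assumption standing in for the no-data case.

```python
from statistics import mean

# Hypothetical mapping from the function names in the table to Python reducers.
# "At least one value" and "All values" are omitted: they describe how the
# threshold comparison is applied rather than producing a single number.
AGGREGATIONS = {
    "Average": mean,
    "Count": len,
    "Last value": lambda values: values[-1],
    "Maximum": max,
    "Minimum": min,
    "Sum": sum,
}


def aggregate(function_name, values):
    """Reduce the window values of one metric to a single number.

    For an empty window, Count yields 0 and every other function yields None
    (an assumption of this sketch, standing in for "no data").
    """
    if not values:
        return 0 if function_name == "Count" else None
    return AGGREGATIONS[function_name](values)


window = [12.0, 15.0, 9.0]  # made-up points within a 15m window
print(aggregate("Average", window))
print(aggregate("Last value", window))
```

If a query returns several metrics, the same reducer would be applied to each metric separately, as the Average row describes.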
Comparison function
A comparison function compares the aggregation function result with the Warning and Alarm threshold values. If the aggregated value satisfies the comparison against a threshold, Monitoring changes the alert status accordingly.
Warning
Threshold value upon which the alert status changes to Warning.
Alarm
Threshold value upon which the alert status changes to Alarm.
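The interplay of the comparison function and the two thresholds can be illustrated with a short Python sketch. The comparison names (`GT`, `LT`, and so on) and the `evaluate_threshold` helper are hypothetical, since the available comparison functions are not listed here. Checking the Alarm threshold first is also an assumption, chosen so that the more severe status wins when both thresholds are satisfied.

```python
import operator

# Hypothetical comparison names; the actual set offered by Monitoring may differ.
COMPARISONS = {
    "GT": operator.gt,
    "GTE": operator.ge,
    "LT": operator.lt,
    "LTE": operator.le,
    "EQ": operator.eq,
    "NE": operator.ne,
}


def evaluate_threshold(value, comparison, warning, alarm):
    """Return "Alarm", "Warning", or "OK" for one aggregated value."""
    compare = COMPARISONS[comparison]
    if compare(value, alarm):      # the more severe threshold is checked first
        return "Alarm"
    if compare(value, warning):
        return "Warning"
    return "OK"


# Made-up example: a metric with Warning at 80 and Alarm at 90.
print(evaluate_threshold(85, "GT", warning=80, alarm=90))
```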
Evaluation window
Time interval for which the aggregation function is calculated. The window lets you filter out sudden changes in metric values by responding only to changes over a longer period.
You can select a preset value or specify your own in the following format:
- 1h: 1 hour
- 1m: 1 minute
- 1s: 1 second
For example, 3m 45s sets an evaluation window of 3 minutes 45 seconds.
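To make the window format concrete, here is a minimal Python sketch that converts such a string to seconds. The `parse_window` helper is hypothetical, and it assumes that components are separated by spaces and that only the `h`, `m`, and `s` units shown above are accepted. The real parser in Monitoring may be more permissive.

```python
import re

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_window(text):
    """Convert a window such as "3m 45s" to a number of seconds."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty evaluation window")
    total = 0
    for token in tokens:
        match = re.fullmatch(r"(\d+)([hms])", token)
        if match is None:
            raise ValueError(f"unsupported window component: {token!r}")
        total += int(match.group(1)) * UNIT_SECONDS[match.group(2)]
    return total


print(parse_window("3m 45s"))  # 3 * 60 + 45 = 225
```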
Evaluation delay
Shifts the evaluation window back in time by the specified number of seconds. The default value is 0. The delay helps avoid unexpected alert triggering when a query uses metrics collected at different intervals. You can select a preset value or specify your own, same as for the evaluation window.
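The effect of the delay on the evaluated interval can be sketched in Python as follows. The `window_bounds` helper is hypothetical, and the sketch assumes the delay simply moves both ends of the window into the past. Whether the bounds are inclusive is not specified here.

```python
from datetime import datetime, timedelta


def window_bounds(now, window_seconds, delay_seconds=0):
    """Return (start, end) of the evaluation window shifted back by the delay."""
    end = now - timedelta(seconds=delay_seconds)
    start = end - timedelta(seconds=window_seconds)
    return start, end


# Made-up example: a 5-minute window evaluated at 12:00:00 with a 60-second delay.
start, end = window_bounds(datetime(2024, 1, 1, 12, 0, 0), 300, 60)
print(start.time(), end.time())  # 11:54:00 11:59:00
```

With a delay of 0, the window ends at the evaluation moment itself, so points that arrive late may be missing from it.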
No data policy
Policies set the alert status when there is no data or metrics in the evaluation window for a given criterion. Policies are applied prior to verifying trigger conditions.
When there are no metrics or points in an evaluation window, you can handle this scenario in two ways:
- Change the alert status using the No metrics by selector or No points in evaluation window policies.
- Handle it manually.
Note
To make alerts switch to the No data status when there are no metrics or points in the evaluation window, set the policy value for all alert types to No data. Avoid using the Default and Manual values as they require extra manual handling.
No selector metrics
The policy determines the alert status if no metrics were found for at least one selector. For example, these metrics do not exist or they were deleted after their TTL expired.
The possible values are:
- Default: No data for all alert types.
- OK: Changes the alert status to OK.
- Warn: Changes the alert status to Warning.
- Alarm: Changes the alert status to Alarm.
- No data: Changes the alert status to No data.
- Manual: Supported only for alerts with the Expression type; gives control to the alert program for manual handling.

  If no metrics were found for such an alert for at least one selector and no status management function was triggered, the alert will switch to the OK state.
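The policy values above can be summarized as a small decision function. The Python sketch below is an illustration under stated assumptions: `status_without_metrics`, its parameters, and the `"Threshold"` type label are hypothetical names, and `program_status` stands for whatever status a status management function in the alert program has set, if any.

```python
# Policy values that map directly to a status.
POLICY_STATUS = {
    "Default": "No data",
    "OK": "OK",
    "Warn": "Warning",
    "Alarm": "Alarm",
    "No data": "No data",
}


def status_without_metrics(policy, alert_type="Threshold", program_status=None):
    """Resolve the alert status when a selector found no metrics."""
    if policy == "Manual":
        if alert_type != "Expression":
            raise ValueError("Manual is supported only for Expression alerts")
        # If the alert program triggered no status management function,
        # the alert switches to OK.
        return program_status if program_status is not None else "OK"
    return POLICY_STATUS[policy]


print(status_without_metrics("Default"))
print(status_without_metrics("Manual", alert_type="Expression"))
```

The second call shows why Manual needs care: an Expression alert whose program does not handle the missing metrics ends up in OK rather than No data.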
No points in evaluation window
The policy determines the alert status if at least one of the metrics in the evaluation window has no points.
For threshold alerts that monitor several metrics, predicates are checked for each metric independently. The final alert status is the most severe of the per-metric statuses, ranked in the following order: No data < OK < Warning < Error < Alarm. For example, if one line has no points and the No points in evaluation window policy sets the status of a threshold alert to Warning, while for another line the predicate that changes the alert to Alarm is true, the resulting alert status will be Alarm.
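The ordering above amounts to taking the maximum over per-metric statuses. Here is a minimal Python sketch of that rule; the `SEVERITY` list and the `combine_statuses` helper are hypothetical names used only for illustration.

```python
# Statuses listed from least to most severe, following the order
# No data < OK < Warning < Error < Alarm.
SEVERITY = ["No data", "OK", "Warning", "Error", "Alarm"]


def combine_statuses(statuses):
    """Return the most severe status among the per-metric statuses."""
    return max(statuses, key=SEVERITY.index)


# One line has no points (policy result: Warning), another satisfies the Alarm predicate.
print(combine_statuses(["Warning", "Alarm"]))  # Alarm
```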
The possible values are:
- Default: No data for all threshold alerts and Manual for Expression alerts.
- OK: Changes the alert status to OK.
- Warning: Changes the alert status to Warning.
- Alarm: Changes the alert status to Alarm.
- No data: Changes the alert status to No data.
- Manual: Gives control to the predicates or alert program for manual handling.
If the No points in evaluation window parameter is set to Manual or Default for an Expression alert, while its evaluation window has no points and no status management function has been triggered in the alert program, the alert status will change to OK.

We recommend using the No data value for the No points in evaluation window policy. If you set the policy value to Manual or Default, the no-points situation is handled manually.
Manual processing of no data
Setting the Manual value for any policy will give control to the alert predicates or program.
Avoid the Manual value as it complicates the alert program. The No data policy value should cover most cases.
No metrics
To manually handle Expression alerts with no metrics, use the size function together with the status management functions: size returns the number of metrics for the selector you specify.
Example of a program that will change the alert to the OK status if the selector specified in the xx5 variable has retrieved 0 metrics:
let xx5 = {
    project="solomon",
    cluster="production",
    service="gateway",
    endpoint="*",
    code="5*",
    method="*",
    host="cluster",
    name="http.server.requests.status"
};
...
ok_if(size(xx5) == 0);
...
No points
To manually handle the no-points situation, use the count function together with the status management functions: count returns the number of points for the specified selector in the evaluation window.
Example of a program that will change the alert to No data if there are no points for the metric specified in the source variable within the alert's evaluation window:
let source = {
    project="solomon",
    cluster="production",
    service="alerting",
    endpoint="api.telegram.org/send*",
    host="cluster",
    name="http.client.call.inFlight"
};
...
no_data_if(count(source) == 0);
...