Add "user_was_notified_of_alert_groups" metric (#2334)

This PR adds a new metric, "user_was_notified_of_alert_groups", to the Prometheus
exporter. It counts how many alert groups each user was notified of.

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
Commit f1fcb41fb4 (parent 8f5441b4c6), authored by Yulya Artyukhina on 2023-06-28 10:15:19 +02:00 and committed by GitHub.
12 changed files with 522 additions and 74 deletions

@ -5,6 +5,12 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Unreleased
### Added
- Add metric "how many alert groups user was notified of" to Prometheus exporter ([#2334](https://github.com/grafana/oncall/pull/2334/))
## v1.3.2
### Changed

@ -21,6 +21,8 @@ It is a gauge, and its name has the suffix `alert_groups_total`
- Response time on alert groups for each integration (mean time between the start and first action of all alert groups
for the last 7 days in selected period). It is a histogram, and its name has the suffix `alert_groups_response_time`
with the histogram suffixes such as `_bucket`, `_sum` and `_count`
- A total count of alert groups a user was notified of, for each user. It is a counter, and its name has the suffix
`user_was_notified_of_alert_groups_total`
You can find more information about metric types in the [Prometheus documentation](https://prometheus.io/docs/concepts/metric_types).
@ -31,15 +33,17 @@ To retrieve Prometheus metrics use PromQL. If you are not familiar with PromQL,
OnCall application metrics are collected in preinstalled `grafanacloud_usage` datasource and are available for every
cloud instance.
Metrics have prefix `grafanacloud_oncall_instance`, e.g. `grafanacloud_oncall_instance_alert_groups_total` and
`grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket`.
Metrics have prefix `grafanacloud_oncall_instance`, e.g. `grafanacloud_oncall_instance_alert_groups_total`,
`grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket` and
`grafanacloud_oncall_instance_user_was_notified_of_alert_groups_total`.
### For open source customers
To collect OnCall application metrics you need to set up Prometheus and add it to your Grafana instance as a datasource.
You can find more information about Prometheus setup in the [OSS documentation](https://github.com/grafana/oncall#readme)
Metrics will have the prefix `oncall`, e.g. `oncall_alert_groups_total` and `oncall_alert_groups_response_time_seconds_bucket`.
Metrics will have the prefix `oncall`, e.g. `oncall_alert_groups_total`, `oncall_alert_groups_response_time_seconds_bucket`
and `oncall_user_was_notified_of_alert_groups_total`.
Your metrics may also have additional labels, such as `pod`, `instance`, `container`, depending on your Prometheus setup.
@ -86,17 +90,37 @@ in Grafana stack "test_stack":
grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket{slug="test_stack", integration="Grafana Alerting", le="600"}
```
### Metric Alert groups user was notified of
This metric has the following labels:
| Label Name | Description |
|---------------|:-----------------------------------------------------------------------------:|
| `id` | ID of Grafana instance (stack) |
| `slug` | Slug of Grafana instance (stack) |
| `org_id` | ID of Grafana organization |
| `username`    | Username of the user                                                            |
**Query example:**
Get the number of alert groups the user with username "alex" was notified of in Grafana stack "test_stack":
```promql
grafanacloud_oncall_instance_user_was_notified_of_alert_groups_total{slug="test_stack", username="alex"}
```
### Dashboard
To import OnCall metrics dashboard go to `Administration` -> `Plugins` page, find OnCall in the plugins list, open
`Dashboards` tab at the OnCall plugin settings page and click "Import" near "OnCall metrics". After that you can find
the "OnCall metrics" dashboard in your dashboards list. In the datasource dropdown select your Prometheus datasource
(for Cloud customers it's `grafanacloud_usage`). You can filter data by your Grafana instances, teams and integrations.
You can find the "OnCall Metrics" dashboard in the list of your dashboards in the folder `General`, it has the tag
`oncall`. In the datasource dropdown select your Prometheus datasource (for Cloud customers it's `grafanacloud_usage`).
You can filter data by your Grafana instances, teams and integrations.
To update the dashboard to the newest version go to `Dashboards` tab at the OnCall plugin settings page and click
“Re-import”.
Be aware: if you have made changes to the dashboard, they will be deleted after re-importing. To save your changes go
to the dashboard settings, click "Save as" and save a copy of the dashboard.
To re-import OnCall metrics dashboard go to `Administration` -> `Plugins` page, find OnCall in the plugins list, open
`Dashboards` tab at the OnCall plugin settings page and click "Re-import" near "OnCall Metrics". After that you can find
the "OnCall Metrics" dashboard in your dashboards list.
Be aware: if you have made changes to the dashboard, they will be lost after re-importing or after the plugin update.
To save your changes go to the "OnCall Metrics" dashboard settings, click "Save as" and save a copy of the dashboard.
## Insight Logs

@ -9,6 +9,7 @@ from kombu.utils.uuid import uuid as celery_uuid
from apps.alerts.constants import NEXT_ESCALATION_DELAY
from apps.alerts.signals import user_notification_action_triggered_signal
from apps.base.messaging import get_messaging_backend_from_id
from apps.metrics_exporter.helpers import metrics_update_user_cache
from apps.phone_notifications.phone_backend import PhoneBackend
from common.custom_celery_tasks import shared_dedicated_queue_retry_task
@ -186,6 +187,17 @@ def notify_user_task(
notification_channel=notification_policy.notify_by,
)
if log_record: # log_record is None if user notification policy step is unspecified
# if this is the first notification step, and user hasn't been notified for this alert group - update metric
if (
log_record.type == UserNotificationPolicyLogRecord.TYPE_PERSONAL_NOTIFICATION_TRIGGERED
and previous_notification_policy_pk is None
and not user.personal_log_records.filter(
type=UserNotificationPolicyLogRecord.TYPE_PERSONAL_NOTIFICATION_TRIGGERED,
alert_group_id=alert_group_pk,
).exists()
):
metrics_update_user_cache(user)
log_record.save()
if notify_user_task.request.retries == 0:
transaction.on_commit(lambda: send_user_notification_signal.apply_async((log_record.pk,)))

@ -1,6 +1,8 @@
import datetime
import typing
from django.conf import settings
class AlertGroupsTotalMetricsDict(typing.TypedDict):
integration_name: str
@ -25,6 +27,14 @@ class AlertGroupsResponseTimeMetricsDict(typing.TypedDict):
response_time: list
class UserWasNotifiedOfAlertGroupsMetricsDict(typing.TypedDict):
user_username: str
org_id: int
slug: str
id: int
counter: int
class RecalculateMetricsTimer(typing.TypedDict):
recalculate_timeout: int
forced_started: bool
@ -35,6 +45,9 @@ class RecalculateOrgMetricsDict(typing.TypedDict):
force: bool
METRICS_PREFIX = "oncall_" if settings.IS_OPEN_SOURCE else "grafanacloud_oncall_instance_"
USER_WAS_NOTIFIED_OF_ALERT_GROUPS = METRICS_PREFIX + "user_was_notified_of_alert_groups"
ALERT_GROUPS_TOTAL = "oncall_alert_groups_total"
ALERT_GROUPS_RESPONSE_TIME = "oncall_alert_groups_response_time_seconds"

@ -17,8 +17,11 @@ from apps.metrics_exporter.constants import (
METRICS_RECALCULATION_CACHE_TIMEOUT,
METRICS_RECALCULATION_CACHE_TIMEOUT_DISPERSE,
METRICS_RESPONSE_TIME_CALCULATION_PERIOD,
USER_WAS_NOTIFIED_OF_ALERT_GROUPS,
AlertGroupsResponseTimeMetricsDict,
AlertGroupsTotalMetricsDict,
RecalculateMetricsTimer,
UserWasNotifiedOfAlertGroupsMetricsDict,
)
if typing.TYPE_CHECKING:
@ -46,6 +49,27 @@ def get_organization_ids():
return organizations_ids
def is_allowed_to_start_metrics_calculation(organization_id, force=False) -> bool:
"""Check if metrics_cache_timer doesn't exist or if recalculation was started by force."""
recalculate_timeout = get_metrics_recalculation_timeout()
metrics_cache_timer_key = get_metrics_cache_timer_key(organization_id)
metrics_cache_timer = cache.get(metrics_cache_timer_key)
if metrics_cache_timer:
if not force or metrics_cache_timer.get("forced_started", False):
return False
else:
metrics_cache_timer["forced_started"] = True
else:
metrics_cache_timer: RecalculateMetricsTimer = {
"recalculate_timeout": recalculate_timeout,
"forced_started": force,
}
metrics_cache_timer["recalculate_timeout"] = recalculate_timeout
cache.set(metrics_cache_timer_key, metrics_cache_timer, timeout=recalculate_timeout)
return True
def get_response_time_period() -> datetime.datetime:
"""Returns period for response time calculation"""
return timezone.now() - METRICS_RESPONSE_TIME_CALCULATION_PERIOD
@ -87,6 +111,14 @@ def get_metric_alert_groups_response_time_key(organization_id) -> str:
return f"{ALERT_GROUPS_RESPONSE_TIME}_{organization_id}"
def get_metric_user_was_notified_of_alert_groups_key(organization_id) -> str:
return f"{USER_WAS_NOTIFIED_OF_ALERT_GROUPS}_{organization_id}"
def get_metric_calculation_started_key(metric_name) -> str:
return f"calculation_started_for_{metric_name}"
def metrics_update_integration_cache(integration: "AlertReceiveChannel") -> None:
"""Update integration data in metrics cache"""
metrics_cache_timeout = get_metrics_cache_timeout(integration.organization_id)
@ -239,3 +271,25 @@ def metrics_update_alert_groups_response_time_cache(integrations_response_time,
continue
integration_response_time_metrics["response_time"].extend(integration_response_time)
cache.set(metric_alert_groups_response_time_key, metric_alert_groups_response_time, timeout=metrics_cache_timeout)
def metrics_update_user_cache(user):
"""Update "user_was_notified_of_alert_groups" metric cache."""
metrics_cache_timeout = get_metrics_cache_timeout(user.organization_id)
metric_user_was_notified_key = get_metric_user_was_notified_of_alert_groups_key(user.organization_id)
metric_user_was_notified: typing.Dict[int, UserWasNotifiedOfAlertGroupsMetricsDict] = cache.get(
metric_user_was_notified_key, {}
)
metric_user_was_notified.setdefault(
user.id,
{
"user_username": user.username,
"org_id": user.organization.org_id,
"slug": user.organization.stack_slug,
"id": user.organization.stack_id,
"counter": 0,
},
)["counter"] += 1
cache.set(metric_user_was_notified_key, metric_user_was_notified, timeout=metrics_cache_timeout)
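
Side note: the setdefault-then-increment pattern above is what keeps the per-user counter monotonic; the default entry starts at 0, so the first notification yields 1. A minimal standalone sketch of the same pattern with a plain dict (illustrative only, no Django cache involved; `record_notification` is a hypothetical helper):

```python
# Illustrative only: the same setdefault-then-increment pattern with a plain dict.
metric: dict[int, dict] = {}

def record_notification(user_id: int, username: str) -> None:
    # The first call inserts the default entry (counter=0) and bumps it to 1;
    # later calls only increment the existing counter.
    metric.setdefault(user_id, {"user_username": username, "counter": 0})["counter"] += 1

record_notification(42, "alex")
record_notification(42, "alex")
assert metric[42]["counter"] == 2
```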

@ -3,53 +3,77 @@ import typing
from django.core.cache import cache
from prometheus_client import CollectorRegistry
from prometheus_client.metrics_core import GaugeMetricFamily, HistogramMetricFamily
from prometheus_client.metrics_core import CounterMetricFamily, GaugeMetricFamily, HistogramMetricFamily
from apps.alerts.constants import AlertGroupState
from apps.metrics_exporter.constants import (
ALERT_GROUPS_RESPONSE_TIME,
ALERT_GROUPS_TOTAL,
USER_WAS_NOTIFIED_OF_ALERT_GROUPS,
AlertGroupsResponseTimeMetricsDict,
AlertGroupsTotalMetricsDict,
RecalculateOrgMetricsDict,
UserWasNotifiedOfAlertGroupsMetricsDict,
)
from apps.metrics_exporter.helpers import (
get_metric_alert_groups_response_time_key,
get_metric_alert_groups_total_key,
get_metric_calculation_started_key,
get_metric_user_was_notified_of_alert_groups_key,
get_metrics_cache_timer_key,
get_organization_ids,
)
from apps.metrics_exporter.tasks import start_calculate_and_cache_metrics
from apps.metrics_exporter.tasks import start_calculate_and_cache_metrics, start_recalculation_for_new_metric
application_metrics_registry = CollectorRegistry()
RE_ALERT_GROUPS_TOTAL = re.compile(r"{}_(\d+)".format(ALERT_GROUPS_TOTAL))
RE_ALERT_GROUPS_RESPONSE_TIME = re.compile(r"{}_(\d+)".format(ALERT_GROUPS_RESPONSE_TIME))
RE_USER_WAS_NOTIFIED_OF_ALERT_GROUPS = re.compile(r"{}_(\d+)".format(USER_WAS_NOTIFIED_OF_ALERT_GROUPS))
# https://github.com/prometheus/client_python#custom-collectors
class ApplicationMetricsCollector:
def __init__(self):
self._buckets = (60, 300, 600, 3600, "+Inf")
self._labels = [
"integration",
"team",
self._stack_labels = [
"org_id",
"slug",
"id",
]
self._labels_with_state = self._labels + ["state"]
self._integration_labels = [
"integration",
"team",
] + self._stack_labels
self._integration_labels_with_state = self._integration_labels + ["state"]
self._user_labels = ["username"] + self._stack_labels
def collect(self):
alert_groups_total = GaugeMetricFamily(ALERT_GROUPS_TOTAL, "All alert groups", labels=self._labels_with_state)
alert_groups_response_time_seconds = HistogramMetricFamily(
ALERT_GROUPS_RESPONSE_TIME, "Users response time to alert groups in 7 days (seconds)", labels=self._labels
)
org_ids = set(get_organization_ids())
# alert groups total metrics
# alert groups total metric: gauge
alert_groups_total, missing_org_ids_1 = self._get_alert_groups_total_metric(org_ids)
# alert groups response time metrics: histogram
alert_groups_response_time_seconds, missing_org_ids_2 = self._get_response_time_metric(org_ids)
# user was notified of alert groups metrics: counter
user_was_notified, missing_org_ids_3 = self._get_user_was_notified_of_alert_groups_metric(org_ids)
# update new metric gradually
missing_org_ids_3 = self._update_new_metric(USER_WAS_NOTIFIED_OF_ALERT_GROUPS, org_ids, missing_org_ids_3)
# check for orgs missing any of the metrics or needing a refresh, start recalculation task for missing org ids
missing_org_ids = missing_org_ids_1 | missing_org_ids_2 | missing_org_ids_3
self.recalculate_cache_for_missing_org_ids(org_ids, missing_org_ids)
yield alert_groups_total
yield alert_groups_response_time_seconds
yield user_was_notified
def _get_alert_groups_total_metric(self, org_ids):
alert_groups_total = GaugeMetricFamily(
ALERT_GROUPS_TOTAL, "All alert groups", labels=self._integration_labels_with_state
)
processed_org_ids = set()
alert_groups_total_keys = [get_metric_alert_groups_total_key(org_id) for org_id in org_ids]
org_ag_states: typing.Dict[str, typing.Dict[int, AlertGroupsTotalMetricsDict]] = cache.get_many(
@ -57,11 +81,11 @@ class ApplicationMetricsCollector:
)
for org_key, ag_states in org_ag_states.items():
for integration, integration_data in ag_states.items():
# Labels values should have the same order as _labels
# Labels values should have the same order as _integration_labels_with_state
labels_values = [
integration_data["integration_name"], # integration
integration_data["team_name"], # team
integration_data["org_id"], # org_id
integration_data["org_id"], # grafana org_id
integration_data["slug"], # grafana instance slug
integration_data["id"], # grafana instance id
]
@ -70,10 +94,15 @@ class ApplicationMetricsCollector:
alert_groups_total.add_metric(labels_values + [state.value], integration_data[state.value])
org_id_from_key = RE_ALERT_GROUPS_TOTAL.match(org_key).groups()[0]
processed_org_ids.add(int(org_id_from_key))
# get missing orgs
missing_org_ids_1 = org_ids - processed_org_ids
missing_org_ids = org_ids - processed_org_ids
return alert_groups_total, missing_org_ids
# alert groups response time metrics
def _get_response_time_metric(self, org_ids):
alert_groups_response_time_seconds = HistogramMetricFamily(
ALERT_GROUPS_RESPONSE_TIME,
"Users response time to alert groups in 7 days (seconds)",
labels=self._integration_labels,
)
processed_org_ids = set()
alert_groups_response_time_keys = [get_metric_alert_groups_response_time_key(org_id) for org_id in org_ids]
org_ag_response_times: typing.Dict[str, typing.Dict[int, AlertGroupsResponseTimeMetricsDict]] = cache.get_many(
@ -81,11 +110,11 @@ class ApplicationMetricsCollector:
)
for org_key, ag_response_time in org_ag_response_times.items():
for integration, integration_data in ag_response_time.items():
# Labels values should have the same order as _labels
# Labels values should have the same order as _integration_labels
labels_values = [
integration_data["integration_name"], # integration
integration_data["team_name"], # team
integration_data["org_id"], # org_id
integration_data["org_id"], # grafana org_id
integration_data["slug"], # grafana instance slug
integration_data["id"], # grafana instance id
]
@ -99,11 +128,47 @@ class ApplicationMetricsCollector:
alert_groups_response_time_seconds.add_metric(labels_values, buckets=buckets, sum_value=sum_value)
org_id_from_key = RE_ALERT_GROUPS_RESPONSE_TIME.match(org_key).groups()[0]
processed_org_ids.add(int(org_id_from_key))
# get missing orgs
missing_org_ids_2 = org_ids - processed_org_ids
missing_org_ids = org_ids - processed_org_ids
return alert_groups_response_time_seconds, missing_org_ids
# check for orgs missing any of the metrics or needing a refresh
missing_org_ids = missing_org_ids_1 | missing_org_ids_2
def _get_user_was_notified_of_alert_groups_metric(self, org_ids):
user_was_notified = CounterMetricFamily(
USER_WAS_NOTIFIED_OF_ALERT_GROUPS, "Number of alert groups user was notified of", labels=self._user_labels
)
processed_org_ids = set()
user_was_notified_keys = [get_metric_user_was_notified_of_alert_groups_key(org_id) for org_id in org_ids]
org_users: typing.Dict[str, typing.Dict[int, UserWasNotifiedOfAlertGroupsMetricsDict]] = cache.get_many(
user_was_notified_keys
)
for org_key, users in org_users.items():
for user, user_data in users.items():
# Labels values should have the same order as _user_labels
labels_values = [
user_data["user_username"], # username
user_data["org_id"], # grafana org_id
user_data["slug"], # grafana instance slug
user_data["id"], # grafana instance id
]
labels_values = list(map(str, labels_values))
user_was_notified.add_metric(labels_values, user_data["counter"])
org_id_from_key = RE_USER_WAS_NOTIFIED_OF_ALERT_GROUPS.match(org_key).groups()[0]
processed_org_ids.add(int(org_id_from_key))
missing_org_ids = org_ids - processed_org_ids
return user_was_notified, missing_org_ids
def _update_new_metric(self, metric_name, org_ids, missing_org_ids):
"""
This method is used for new metrics to calculate metrics gradually and avoid force recalculation for all orgs
"""
calculation_started_key = get_metric_calculation_started_key(metric_name)
is_calculation_started = cache.get(calculation_started_key)
if len(missing_org_ids) == len(org_ids) or is_calculation_started:
missing_org_ids = set()
if not is_calculation_started:
start_recalculation_for_new_metric.apply_async((metric_name,))
return missing_org_ids
def recalculate_cache_for_missing_org_ids(self, org_ids, missing_org_ids):
cache_timer_for_org_keys = [get_metrics_cache_timer_key(org_id) for org_id in org_ids]
cache_timers_for_org = cache.get_many(cache_timer_for_org_keys)
recalculate_orgs: typing.List[RecalculateOrgMetricsDict] = []
@ -114,9 +179,6 @@ class ApplicationMetricsCollector:
if recalculate_orgs:
start_calculate_and_cache_metrics.apply_async((recalculate_orgs,))
yield alert_groups_total
yield alert_groups_response_time_seconds
def get_buckets_with_sum(self, values):
"""Put values in correct buckets and count values sum"""
buckets_values = {str(key): 0 for key in self._buckets}

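For context, a custom collector like this is exposed by registering an instance on a registry and rendering it with `generate_latest`, which is also how the tests further down exercise it. A minimal sketch, assuming `ApplicationMetricsCollector` and `application_metrics_registry` are importable from the module shown above (the wiring itself is not part of this diff):

```python
# Minimal sketch of exposing the collector (registration wiring assumed, not from this diff).
from prometheus_client import generate_latest

from apps.metrics_exporter.metrics_collectors import (
    ApplicationMetricsCollector,
    application_metrics_registry,
)

application_metrics_registry.register(ApplicationMetricsCollector())

# generate_latest() calls collect() on every registered collector and returns
# the metrics in Prometheus text exposition format (bytes).
print(generate_latest(application_metrics_registry).decode("utf-8"))
```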
@ -3,7 +3,7 @@ import typing
from django.apps import apps
from django.conf import settings
from django.core.cache import cache
from django.db.models import Q
from django.db.models import Count, Q
from apps.alerts.constants import AlertGroupState
from apps.metrics_exporter.constants import (
@ -11,17 +11,22 @@ from apps.metrics_exporter.constants import (
METRICS_ORGANIZATIONS_IDS_CACHE_TIMEOUT,
AlertGroupsResponseTimeMetricsDict,
AlertGroupsTotalMetricsDict,
RecalculateMetricsTimer,
RecalculateOrgMetricsDict,
UserWasNotifiedOfAlertGroupsMetricsDict,
)
from apps.metrics_exporter.helpers import (
get_metric_alert_groups_response_time_key,
get_metric_alert_groups_total_key,
get_metric_calculation_started_key,
get_metric_user_was_notified_of_alert_groups_key,
get_metrics_cache_timer_key,
get_metrics_recalculation_timeout,
get_organization_ids,
get_organization_ids_from_db,
get_response_time_period,
is_allowed_to_start_metrics_calculation,
)
from apps.user_management.models import User
from common.custom_celery_tasks import shared_dedicated_queue_retry_task
from common.database import get_random_readonly_database_key_if_present_otherwise_default
@ -40,9 +45,30 @@ def save_organizations_ids_in_cache():
def start_calculate_and_cache_metrics(metrics_to_recalculate: list[RecalculateOrgMetricsDict]):
"""Start calculation metrics for each object in metrics_to_recalculate"""
for counter, recalculation_data in enumerate(metrics_to_recalculate):
if not is_allowed_to_start_metrics_calculation(**recalculation_data):
continue
# start immediately if recalculation starting has been forced
countdown = 0 if recalculation_data.get("force") else counter
calculate_and_cache_metrics.apply_async(kwargs=recalculation_data, countdown=countdown)
calculate_and_cache_user_was_notified_metric.apply_async(
(recalculation_data["organization_id"],), countdown=countdown
)
@shared_dedicated_queue_retry_task(autoretry_for=(Exception,), retry_backoff=True, max_retries=0)
def start_recalculation_for_new_metric(metric_name):
TEN_MINUTES = 600
calculation_started_key = get_metric_calculation_started_key(metric_name)
is_calculation_started = cache.get(calculation_started_key)
if is_calculation_started:
return
cache.set(calculation_started_key, True, timeout=TEN_MINUTES)
org_ids = set(get_organization_ids())
countdown = 0
for counter, organization_id in enumerate(org_ids):
if counter % 10 == 0:
countdown += 1
calculate_and_cache_user_was_notified_metric.apply_async((organization_id,), countdown=countdown)
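
The `counter % 10` check above spreads the backfill for a new metric: roughly ten organizations share each additional second of Celery countdown instead of all tasks firing at once. A small illustrative sketch of the schedule that loop produces (`countdown_for` is hypothetical, shown only to trace the arithmetic):

```python
# Illustrative only: the countdown an organization at a given position receives
# from the loop above (ten organizations per one-second step).
def countdown_for(org_index: int) -> int:
    countdown = 0
    for counter in range(org_index + 1):
        if counter % 10 == 0:
            countdown += 1
    return countdown

# Organizations 0-9 get countdown 1, 10-19 get countdown 2, and so on.
assert [countdown_for(i) for i in (0, 9, 10, 19, 20)] == [1, 1, 2, 2, 3]
```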
@shared_dedicated_queue_retry_task(
@ -50,41 +76,31 @@ def start_calculate_and_cache_metrics(metrics_to_recalculate: list[RecalculateOr
)
def calculate_and_cache_metrics(organization_id, force=False):
"""
Calculate metrics for organization.
Before calculation checks if calculation has already been started to avoid too frequent launch or parallel launch
Calculate integrations metrics for organization.
"""
AlertGroup = apps.get_model("alerts", "AlertGroup")
AlertReceiveChannel = apps.get_model("alerts", "AlertReceiveChannel")
Organization = apps.get_model("user_management", "Organization")
ONE_HOUR = 3600
TWO_HOURS = 7200
recalculate_timeout = get_metrics_recalculation_timeout()
# check if recalculation has been already started
metrics_cache_timer_key = get_metrics_cache_timer_key(organization_id)
metrics_cache_timer = cache.get(metrics_cache_timer_key)
if metrics_cache_timer:
if not force or metrics_cache_timer.get("forced_started", False):
return
else:
metrics_cache_timer["forced_started"] = True
else:
metrics_cache_timer: RecalculateMetricsTimer = {
"recalculate_timeout": recalculate_timeout,
"forced_started": force,
}
metrics_cache_timer["recalculate_timeout"] = recalculate_timeout
cache.set(metrics_cache_timer_key, metrics_cache_timer, timeout=recalculate_timeout)
organization = Organization.objects.filter(pk=organization_id).first()
if not organization:
return
integrations = (
AlertReceiveChannel.objects.using(get_random_readonly_database_key_if_present_otherwise_default())
.filter(~Q(integration=AlertReceiveChannel.INTEGRATION_MAINTENANCE) & Q(organization_id=organization_id))
.select_related("organization", "team")
.select_related("team")
)
response_time_period = get_response_time_period()
instance_slug = organization.stack_slug
instance_id = organization.stack_id
instance_org_id = organization.org_id
metric_alert_group_total: typing.Dict[int, AlertGroupsTotalMetricsDict] = {}
metric_alert_group_response_time: typing.Dict[int, AlertGroupsResponseTimeMetricsDict] = {}
@ -96,8 +112,6 @@ def calculate_and_cache_metrics(organization_id, force=False):
}
for integration in integrations:
instance_slug = integration.organization.stack_slug
instance_id = integration.organization.stack_id
# calculate states
for state, alert_group_filter in states.items():
metric_alert_group_total.setdefault(
@ -106,7 +120,7 @@ def calculate_and_cache_metrics(organization_id, force=False):
"integration_name": integration.emojized_verbal_name,
"team_name": integration.team_name,
"team_id": integration.team_id_or_no_team,
"org_id": integration.organization.org_id,
"org_id": instance_org_id,
"slug": instance_slug,
"id": instance_id,
},
@ -124,7 +138,7 @@ def calculate_and_cache_metrics(organization_id, force=False):
"integration_name": integration.emojized_verbal_name,
"team_name": integration.team_name,
"team_id": integration.team_id_or_no_team,
"org_id": integration.organization.org_id,
"org_id": instance_org_id,
"slug": instance_slug,
"id": instance_id,
"response_time": all_response_time_seconds,
@ -133,9 +147,68 @@ def calculate_and_cache_metrics(organization_id, force=False):
metric_alert_groups_total_key = get_metric_alert_groups_total_key(organization_id)
metric_alert_groups_response_time_key = get_metric_alert_groups_response_time_key(organization_id)
recalculate_timeout = get_metrics_recalculation_timeout()
metrics_cache_timeout = recalculate_timeout + TWO_HOURS
cache.set(metric_alert_groups_total_key, metric_alert_group_total, timeout=metrics_cache_timeout)
cache.set(metric_alert_groups_response_time_key, metric_alert_group_response_time, timeout=metrics_cache_timeout)
if metrics_cache_timer["forced_started"]:
if force:
metrics_cache_timer_key = get_metrics_cache_timer_key(organization_id)
metrics_cache_timer = cache.get(metrics_cache_timer_key)
metrics_cache_timer["forced_started"] = False
cache.set(metrics_cache_timer_key, metrics_cache_timer, timeout=recalculate_timeout - ONE_HOUR)
@shared_dedicated_queue_retry_task(
autoretry_for=(Exception,), retry_backoff=True, max_retries=1 if settings.DEBUG else None
)
def calculate_and_cache_user_was_notified_metric(organization_id):
"""
Calculate metric "user_was_notified_of_alert_groups" for organization.
"""
UserNotificationPolicyLogRecord = apps.get_model("base", "UserNotificationPolicyLogRecord")
Organization = apps.get_model("user_management", "Organization")
TWO_HOURS = 7200
organization = Organization.objects.filter(pk=organization_id).first()
if not organization:
return
users = (
User.objects.using(get_random_readonly_database_key_if_present_otherwise_default())
.filter(organization_id=organization_id)
.annotate(num_logs=Count("personal_log_records"))
.filter(num_logs__gte=1)
)
instance_slug = organization.stack_slug
instance_id = organization.stack_id
instance_org_id = organization.org_id
metric_user_was_notified: typing.Dict[int, UserWasNotifiedOfAlertGroupsMetricsDict] = {}
for user in users:
counter = (
user.personal_log_records.filter(type=UserNotificationPolicyLogRecord.TYPE_PERSONAL_NOTIFICATION_TRIGGERED)
.values("alert_group")
.distinct()
.count()
)
if counter == 0: # means that user has no successful notifications
continue
metric_user_was_notified[user.id] = {
"user_username": user.username,
"org_id": instance_org_id,
"slug": instance_slug,
"id": instance_id,
"counter": counter,
}
metric_user_was_notified_key = get_metric_user_was_notified_of_alert_groups_key(organization_id)
recalculate_timeout = get_metrics_recalculation_timeout()
metrics_cache_timeout = recalculate_timeout + TWO_HOURS
cache.set(metric_user_was_notified_key, metric_user_was_notified, timeout=metrics_cache_timeout)

@ -1,13 +1,22 @@
import pytest
from django.core.cache import cache
from apps.metrics_exporter.constants import ALERT_GROUPS_RESPONSE_TIME, ALERT_GROUPS_TOTAL
from apps.metrics_exporter.helpers import get_metric_alert_groups_response_time_key, get_metric_alert_groups_total_key
from apps.metrics_exporter.constants import (
ALERT_GROUPS_RESPONSE_TIME,
ALERT_GROUPS_TOTAL,
USER_WAS_NOTIFIED_OF_ALERT_GROUPS,
)
from apps.metrics_exporter.helpers import (
get_metric_alert_groups_response_time_key,
get_metric_alert_groups_total_key,
get_metric_user_was_notified_of_alert_groups_key,
)
METRICS_TEST_INTEGRATION_NAME = "Test integration"
METRICS_TEST_ORG_ID = 123 # random number
METRICS_TEST_INSTANCE_SLUG = "test_instance"
METRICS_TEST_INSTANCE_ID = 292 # random number
METRICS_TEST_USER_USERNAME = "Alex"
@pytest.fixture()
@ -17,6 +26,8 @@ def mock_cache_get_metrics_for_collector(monkeypatch):
key = ALERT_GROUPS_TOTAL
elif key.startswith(ALERT_GROUPS_RESPONSE_TIME):
key = ALERT_GROUPS_RESPONSE_TIME
elif key.startswith(USER_WAS_NOTIFIED_OF_ALERT_GROUPS):
key = USER_WAS_NOTIFIED_OF_ALERT_GROUPS
test_metrics = {
ALERT_GROUPS_TOTAL: {
1: {
@ -43,6 +54,15 @@ def mock_cache_get_metrics_for_collector(monkeypatch):
"response_time": [2, 10, 200, 650],
}
},
USER_WAS_NOTIFIED_OF_ALERT_GROUPS: {
1: {
"org_id": 1,
"slug": "Test stack",
"id": 1,
"user_username": "Alex",
"counter": 4,
}
},
}
return test_metrics.get(key)
@ -106,3 +126,27 @@ def make_metrics_cache_params(monkeypatch):
return cache_get
return _make_cache_params
@pytest.fixture
def make_user_was_notified_metrics_cache_params(monkeypatch):
def _make_cache_params(user_id, organization_id):
metric_user_was_notified_key = get_metric_user_was_notified_of_alert_groups_key(organization_id)
def cache_get(key, *args, **kwargs):
metrics_data = {
metric_user_was_notified_key: {
user_id: {
"org_id": METRICS_TEST_ORG_ID,
"slug": METRICS_TEST_INSTANCE_SLUG,
"id": METRICS_TEST_INSTANCE_ID,
"user_username": METRICS_TEST_USER_USERNAME,
"counter": 1,
}
},
}
return metrics_data.get(key, {})
return cache_get
return _make_cache_params

@ -2,12 +2,13 @@ from unittest.mock import patch
import pytest
from apps.base.models import UserNotificationPolicyLogRecord
from apps.metrics_exporter.helpers import (
get_metric_alert_groups_response_time_key,
get_metric_alert_groups_total_key,
get_metrics_cache_timer_key,
get_metric_user_was_notified_of_alert_groups_key,
)
from apps.metrics_exporter.tasks import calculate_and_cache_metrics
from apps.metrics_exporter.tasks import calculate_and_cache_metrics, calculate_and_cache_user_was_notified_metric
@patch("apps.alerts.models.alert_group.MetricsCacheManager.metrics_update_state_cache_for_alert_group")
@ -44,7 +45,6 @@ def test_calculate_and_cache_metrics_task(
make_alert(alert_group=alert_group_to_sil, raw_request_data={})
alert_group_to_sil.silence()
metrics_cache_timer_key = get_metrics_cache_timer_key(organization.id)
metric_alert_groups_total_key = get_metric_alert_groups_total_key(organization.id)
metric_alert_groups_response_time_key = get_metric_alert_groups_response_time_key(organization.id)
@ -98,18 +98,84 @@ def test_calculate_and_cache_metrics_task(
with patch("apps.metrics_exporter.tasks.cache.set") as mock_cache_set:
calculate_and_cache_metrics(organization.id)
args = mock_cache_set.call_args_list
assert args[0].args[0] == metrics_cache_timer_key
# check alert_groups_total metric cache
metric_alert_groups_total_values = args[1].args
metric_alert_groups_total_values = args[0].args
assert metric_alert_groups_total_values[0] == metric_alert_groups_total_key
assert metric_alert_groups_total_values[1] == expected_result_metric_alert_groups_total
# check alert_groups_response_time metric cache
metric_alert_groups_response_time_values = args[2].args
metric_alert_groups_response_time_values = args[1].args
assert metric_alert_groups_response_time_values[0] == metric_alert_groups_response_time_key
for integration_id, values in metric_alert_groups_response_time_values[1].items():
assert len(values["response_time"]) == METRICS_RESPONSE_TIME_LEN
# set response time to expected result because it is calculated on fly
expected_result_metric_alert_groups_response_time[integration_id]["response_time"] = values["response_time"]
assert metric_alert_groups_response_time_values[1] == expected_result_metric_alert_groups_response_time
@patch("apps.alerts.models.alert_group.MetricsCacheManager.metrics_update_state_cache_for_alert_group")
@pytest.mark.django_db
def test_calculate_and_cache_user_was_notified_metric_task(
mocked_update_state_cache,
make_organization,
make_user_for_organization,
make_team,
make_alert_receive_channel,
make_alert_group,
make_alert,
make_user_notification_policy_log_record,
):
organization = make_organization()
team = make_team(organization)
user_1 = make_user_for_organization(organization)
user_2 = make_user_for_organization(organization)
alert_receive_channel_1 = make_alert_receive_channel(organization)
alert_receive_channel_2 = make_alert_receive_channel(organization, team=team)
alert_group_1 = make_alert_group(alert_receive_channel_1)
alert_group_2 = make_alert_group(alert_receive_channel_2)
for _ in range(2):
make_user_notification_policy_log_record(
type=UserNotificationPolicyLogRecord.TYPE_PERSONAL_NOTIFICATION_TRIGGERED,
author=user_1,
alert_group=alert_group_1,
)
for user in [user_1, user_2]:
make_user_notification_policy_log_record(
type=UserNotificationPolicyLogRecord.TYPE_PERSONAL_NOTIFICATION_TRIGGERED,
author=user,
alert_group=alert_group_2,
)
metric_user_was_notified_key = get_metric_user_was_notified_of_alert_groups_key(organization.id)
expected_result_metric_user_was_notified = {
user_1.id: {
"org_id": organization.org_id,
"slug": organization.stack_slug,
"id": organization.stack_id,
"user_username": user_1.username,
"counter": 2,
},
user_2.id: {
"org_id": organization.org_id,
"slug": organization.stack_slug,
"id": organization.stack_id,
"user_username": user_2.username,
"counter": 1,
},
}
with patch("apps.metrics_exporter.tasks.cache.set") as mock_cache_set:
calculate_and_cache_user_was_notified_metric(organization.id)
args = mock_cache_set.call_args_list
# check user_was_notified_of_alert_groups metric cache
metric_user_was_notified_of_alert_groups_values = args[0].args
assert metric_user_was_notified_of_alert_groups_values[0] == metric_user_was_notified_key
assert metric_user_was_notified_of_alert_groups_values[1] == expected_result_metric_user_was_notified

@ -4,7 +4,11 @@ import pytest
from prometheus_client import CollectorRegistry, generate_latest
from apps.alerts.constants import AlertGroupState
from apps.metrics_exporter.constants import ALERT_GROUPS_RESPONSE_TIME, ALERT_GROUPS_TOTAL
from apps.metrics_exporter.constants import (
ALERT_GROUPS_RESPONSE_TIME,
ALERT_GROUPS_TOTAL,
USER_WAS_NOTIFIED_OF_ALERT_GROUPS,
)
from apps.metrics_exporter.metrics_collectors import ApplicationMetricsCollector
@ -25,6 +29,9 @@ def test_application_metrics_collector(
elif metric.name == ALERT_GROUPS_RESPONSE_TIME:
# integration with labels for each value in collector's bucket + _count and _sum histogram values
assert len(metric.samples) == len(collector._buckets) + 2
elif metric.name == USER_WAS_NOTIFIED_OF_ALERT_GROUPS:
# metric with labels for each notified user
assert len(metric.samples) == 1
result = generate_latest(test_metrics_registry).decode("utf-8")
assert result is not None
assert mocked_org_ids.called

@ -4,9 +4,12 @@ import pytest
from django.core.cache import cache
from django.test import override_settings
from apps.alerts.tasks import notify_user_task
from apps.base.models import UserNotificationPolicy, UserNotificationPolicyLogRecord
from apps.metrics_exporter.helpers import (
get_metric_alert_groups_response_time_key,
get_metric_alert_groups_total_key,
get_metric_user_was_notified_of_alert_groups_key,
metrics_bulk_update_team_label_cache,
)
from apps.metrics_exporter.metrics_cache_manager import MetricsCacheManager
@ -15,6 +18,7 @@ from apps.metrics_exporter.tests.conftest import (
METRICS_TEST_INSTANCE_SLUG,
METRICS_TEST_INTEGRATION_NAME,
METRICS_TEST_ORG_ID,
METRICS_TEST_USER_USERNAME,
)
@ -450,3 +454,84 @@ def test_update_metrics_cache_on_update_team(
]:
expected_result[alert_receive_channel.id].update(expected_result_delete_team)
get_called_arg_index_and_compare_results()
@patch("apps.alerts.tasks.notify_user.perform_notification")
@override_settings(CELERY_TASK_ALWAYS_EAGER=True)
@pytest.mark.django_db
def test_update_metrics_cache_on_user_notification(
mocked_perform_notification_task,
make_organization,
make_user_for_organization,
make_alert_receive_channel,
make_alert_group,
make_user_notification_policy,
make_user_notification_policy_log_record,
make_user_was_notified_metrics_cache_params,
monkeypatch,
mock_get_metrics_cache,
):
organization = make_organization(
org_id=METRICS_TEST_ORG_ID,
stack_slug=METRICS_TEST_INSTANCE_SLUG,
stack_id=METRICS_TEST_INSTANCE_ID,
)
alert_receive_channel = make_alert_receive_channel(
organization,
verbal_name=METRICS_TEST_INTEGRATION_NAME,
)
user = make_user_for_organization(organization, username=METRICS_TEST_USER_USERNAME)
notification_policy_1 = make_user_notification_policy(user, step=UserNotificationPolicy.Step.NOTIFY)
make_user_notification_policy(user, step=UserNotificationPolicy.Step.NOTIFY)
alert_group_1 = make_alert_group(alert_receive_channel)
alert_group_2 = make_alert_group(alert_receive_channel)
make_user_notification_policy_log_record(
type=UserNotificationPolicyLogRecord.TYPE_PERSONAL_NOTIFICATION_TRIGGERED,
author=user,
alert_group=alert_group_1,
)
metrics_cache = make_user_was_notified_metrics_cache_params(user.id, organization.id)
monkeypatch.setattr(cache, "get", metrics_cache)
metric_user_was_notified_key = get_metric_user_was_notified_of_alert_groups_key(organization.id)
expected_result_metric_user_was_notified = {
user.id: {
"org_id": organization.org_id,
"slug": organization.stack_slug,
"id": organization.stack_id,
"user_username": METRICS_TEST_USER_USERNAME,
"counter": 1,
}
}
def get_called_arg_index_and_compare_results(cache_was_updated=False):
"""find index for the metric argument, that was set in cache"""
for idx, called_arg in enumerate(mock_cache_set_called_args):
if idx >= arg_idx and called_arg.args[0] == metric_user_was_notified_key:
assert called_arg.args[1] == expected_result_metric_user_was_notified
return idx + 1
if cache_was_updated:
raise AssertionError
return arg_idx
with patch("apps.metrics_exporter.tasks.cache.set") as mock_cache_set:
arg_idx = 0
notify_user_task(user.id, alert_group_1.id)
# check user_was_notified_of_alert_groups metric cache, get called args
mock_cache_set_called_args = mock_cache_set.call_args_list
arg_idx = get_called_arg_index_and_compare_results()
# counter grows after the first notification of alert group
notify_user_task(user.id, alert_group_2.id)
expected_result_metric_user_was_notified[user.id]["counter"] += 1
arg_idx = get_called_arg_index_and_compare_results(cache_was_updated=True)
# counter doesn't grow after the second notification of alert group
notify_user_task(user.id, alert_group_2.id, previous_notification_policy_pk=notification_policy_1.id)
arg_idx = get_called_arg_index_and_compare_results()

@ -58,6 +58,7 @@ CELERY_TASK_ROUTES = {
"apps.heartbeat.tasks.process_heartbeat_task": {"queue": "default"},
"apps.heartbeat.tasks.restore_heartbeat_tasks": {"queue": "default"},
"apps.metrics_exporter.tasks.start_calculate_and_cache_metrics": {"queue": "default"},
"apps.metrics_exporter.tasks.start_recalculation_for_new_metric": {"queue": "default"},
"apps.metrics_exporter.tasks.save_organizations_ids_in_cache": {"queue": "default"},
"apps.schedules.tasks.refresh_ical_files.refresh_ical_file": {"queue": "default"},
"apps.schedules.tasks.refresh_ical_files.start_refresh_ical_files": {"queue": "default"},
@ -119,6 +120,7 @@ CELERY_TASK_ROUTES = {
"apps.grafana_plugin.tasks.sync.start_sync_organizations": {"queue": "long"},
"apps.grafana_plugin.tasks.sync.sync_organization_async": {"queue": "long"},
"apps.metrics_exporter.tasks.calculate_and_cache_metrics": {"queue": "long"},
"apps.metrics_exporter.tasks.calculate_and_cache_user_was_notified_metric": {"queue": "long"},
# SLACK
"apps.integrations.tasks.notify_about_integration_ratelimit_in_slack": {"queue": "slack"},
"apps.slack.helpers.alert_group_representative.on_alert_group_action_triggered_async": {"queue": "slack"},