From 0c46b4149837b440c51248d098338f25955b4288 Mon Sep 17 00:00:00 2001
From: Yulya Artyukhina
Date: Wed, 21 Jun 2023 13:00:13 +0200
Subject: [PATCH] Metrics doc (#2149)

# What this PR does

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)

---------

Co-authored-by: Matias Bordese
---
 CHANGELOG.md                                |   3 +-
 docs/sources/insights-and-metrics/_index.md | 106 ++++++++++++++++++--
 2 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 561b7a1c..61b10cff 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,7 +15,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Enable schedule related profile settings oncall [1508](https://github.com/grafana/oncall/issues/1508)
 - Highlight user shifts oncall [1509](https://github.com/grafana/oncall/issues/1509)
 - Rename or Description for Schedules Rotations [1460](https://github.com/grafana/oncall/issues/1406)
-- Add dashboard for OnCall metrics
+- Add documentation for OnCall metrics exporter ([#2149](https://github.com/grafana/oncall/pull/2149))
+- Add dashboard for OnCall metrics ([#1973](https://github.com/grafana/oncall/pull/1973))
 
 ## Changed
 
diff --git a/docs/sources/insights-and-metrics/_index.md b/docs/sources/insights-and-metrics/_index.md
index 0fd38f71..143353e6 100644
--- a/docs/sources/insights-and-metrics/_index.md
+++ b/docs/sources/insights-and-metrics/_index.md
@@ -6,11 +6,99 @@ keywords:
   - Metrics
   - Loki
   - Prometheus
-title: Insight Logs
+title: Insight Logs and Metrics
 weight: 1400
 ---
 
-# Insight Logs
+# Insight Logs and Metrics
+
+## Metrics
+
+Grafana OnCall Metrics represents certain parameters, such as:
+
+- A total count of alert groups for each integration in every state (firing, acknowledged, resolved, silenced).
+It is a gauge, and its name has the suffix `alert_groups_total`.
+- Response time on alert groups for each integration (mean time between the start and first action of all alert groups
+for the last 7 days in the selected period). It is a histogram, and its name has the suffix
+`alert_groups_response_time_seconds` followed by the histogram suffixes `_bucket`, `_sum` and `_count`.
+
+You can find more information about metric types in the [Prometheus documentation](https://prometheus.io/docs/concepts/metric_types).
+
+To retrieve Prometheus metrics, use PromQL. If you are not familiar with PromQL, see the [PromQL documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/).
+
+### For Grafana Cloud customers
+
+OnCall application metrics are collected in the preinstalled `grafanacloud_usage` datasource and are available for every
+cloud instance.
+
+Metrics have the prefix `grafanacloud_oncall_instance`, e.g. `grafanacloud_oncall_instance_alert_groups_total` and
+`grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket`.
+
+### For open source customers
+
+To collect OnCall application metrics, you need to set up Prometheus and add it to your Grafana instance as a datasource.
+You can find more information about Prometheus setup in the [OSS documentation](https://github.com/grafana/oncall#readme).
+
+Metrics will have the prefix `oncall`, e.g. `oncall_alert_groups_total` and `oncall_alert_groups_response_time_seconds_bucket`.
+
+Your metrics may also have additional labels, such as `pod`, `instance`, `container`, depending on your Prometheus setup.
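+
+As an illustrative sketch only: assuming the open source metric carries the same `integration` and `state` labels as
+the Cloud metric, a query against a self-hosted Prometheus differs only in the `oncall` prefix. The label values here
+are placeholders:
+
+```promql
+oncall_alert_groups_total{integration="Grafana Alerting", state="firing"}
+```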
+
+### Metric Alert groups total
+
+This metric has the following labels:
+
+| Label Name    | Description                                                                   |
+|---------------|:-----------------------------------------------------------------------------:|
+| `id`          | ID of Grafana instance (stack)                                                |
+| `slug`        | Slug of Grafana instance (stack)                                              |
+| `org_id`      | ID of Grafana organization                                                    |
+| `team`        | Team name                                                                     |
+| `integration` | OnCall Integration name                                                       |
+| `state`       | Alert groups state. May be `firing`, `acknowledged`, `resolved` or `silenced` |
+
+**Query example:**
+
+Get the number of alert groups in "firing" state in integration "Grafana Alerting" in Grafana stack "test_stack":
+
+```promql
+grafanacloud_oncall_instance_alert_groups_total{slug="test_stack", integration="Grafana Alerting", state="firing"}
+```
+
+### Metric Alert groups response time
+
+This metric has the following labels:
+
+| Label Name    | Description                                                                          |
+|---------------|:------------------------------------------------------------------------------------:|
+| `id`          | ID of Grafana instance (stack)                                                       |
+| `slug`        | Slug of Grafana instance (stack)                                                     |
+| `org_id`      | ID of Grafana organization                                                           |
+| `team`        | Team name                                                                            |
+| `integration` | OnCall Integration name                                                              |
+| `le`          | Histogram bucket upper bound in seconds. May be `60`, `300`, `600`, `3600` or `+Inf` |
+
+**Query example:**
+
+Get the number of alert groups with response time of at most 10 minutes (600 seconds) in integration "Grafana Alerting"
+in Grafana stack "test_stack":
+
+```promql
+grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket{slug="test_stack", integration="Grafana Alerting", le="600"}
+```
+
+### Dashboard
+
+To import the OnCall metrics dashboard, go to the `Administration` -> `Plugins` page, find OnCall in the plugins list,
+open the `Dashboards` tab on the OnCall plugin settings page and click "Import" next to "OnCall metrics". After that,
+you can find the "OnCall metrics" dashboard in your dashboards list.
+In the datasource dropdown, select your Prometheus datasource (for Cloud customers it's `grafanacloud_usage`).
+You can filter data by your Grafana instances, teams and integrations.
+
+To update the dashboard to the newest version, go to the `Dashboards` tab on the OnCall plugin settings page and click
+"Re-import". Be aware: if you have made changes to the dashboard, they will be deleted after re-importing.
+To save your changes, go to the dashboard settings, click "Save as" and save a copy of the dashboard.
+
+## Insight Logs
 
 > **Note:** Grafana OnCall insight logs are available in Grafana Cloud only.
 We're in the process of rolling out Insight Logs to all customers,
@@ -29,7 +117,7 @@ You can use this query to retrieve all logs related to your OnCall instance.
 {instance_type="oncall"} | logfmt | __error__=``
 ```
 
-## Resource insight logs
+### Resource insight logs
 
 Logs are created each time a user modifies any resource in Grafana OnCall.
 
@@ -39,7 +127,7 @@ These logs will have `action_type=resource` field and can be retrieved with foll
 {instance_type="oncall"} | logfmt | __error__=`` | action_type = `resource`
 ```
 
-### Format
+#### Format
 
 Logs contain the following fields, where the fields followed by * are always available, and the others depend on the
 logged event:
@@ -67,7 +155,7 @@ resource types are: `integration_heartbeat`, `escalation_chain`, `integration`,
 `escalation_policy`, `public_api_token`, `schedule_export_token`,`user_schedule_export_token`, `oncall_shift`,
 `web_schedule`, `ical_schedule`, `calendar_schedule`, `organization`, `user`, `webhook`.
 
-## Maintenance insight logs
+### Maintenance insight logs
 
 Logs are created every time when a maintenance mode is started or finished for an integration.
 
@@ -77,7 +165,7 @@ These logs will have `action_type=maintenace` field and can be retrieved with fo
 {instance_type="oncall"} | logfmt | __error__=`` | action_type = `maintenance`
 ```
 
-### Format
+#### Format
 
 Logs of maintenance insights contain the following fields, where the fields followed by * are always available,
 and the others depend on the logged event:
@@ -93,7 +181,7 @@ Logs of maintenance insights contain the following fields, where the fields foll
 | `team`* | Name of team to which integration belongs. |
 | `team_id` | ID of team to which integration belongs. |
 
-## ChatOps insight logs
+### ChatOps insight logs
 
 Logs are created when user modifies ChatOps settings.
 
@@ -103,7 +191,7 @@ These log lines will have `action_type=chat_ops` field and can be retrieved with
 {instance_type="oncall"} | logfmt | __error__=`` | action_type = `chat_ops`
 ```
 
-### Format
+#### Format
 
 Logs of chatops insight logs contain the following fields, where the fields followed by * are always available,
 and the others depend on the logged event:
@@ -122,7 +210,7 @@ Logs of chatops insight logs contain the following fields, where the fields foll
 chatops action names: `workspace_connected`, `workspace_disconnected`, `channel_connected`, `channel_disconnected`,
 `user_linked`, `used_unlinked`, `default_channel_changed`.
 
-## Examples
+### Examples
 
 Here is some examples of practical queries to Grafana OnCall insight logs. LogQL is used to retrieve them.
 If you are not familiar with LogQL check this [documentation](https://grafana.com/docs/loki/latest/logql/).