Metrics doc (#2149)

# What this PR does

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)

---------

Co-authored-by: Matias Bordese <mbordese@gmail.com>
2023-06-21 13:00:13 +02:00 · 2023-06-21 13:00:13 +02:00 · 0c46b41498
commit 0c46b41498
parent f48d4f1f25
2 changed files with 99 additions and 10 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,7 +15,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Enable schedule related profile settings oncall [1508](https://github.com/grafana/oncall/issues/1508)
 - Highlight user shifts oncall [1509](https://github.com/grafana/oncall/issues/1509)
 - Rename or Description for Schedules Rotations [1460](https://github.com/grafana/oncall/issues/1406)
-- Add dashboard for OnCall metrics
+- Add documentation for OnCall metrics exporter ([#2149](https://github.com/grafana/oncall/pull/2149))
+- Add dashboard for OnCall metrics ([#1973](https://github.com/grafana/oncall/pull/1973))

 ## Changed

--- a/docs/sources/insights-and-metrics/_index.md
+++ b/docs/sources/insights-and-metrics/_index.md
@@ -6,11 +6,99 @@ keywords:
  - Metrics
  - Loki
  - Prometheus
-title: Insight Logs
+title: Insight Logs and Metrics
 weight: 1400
 ---

-# Insight Logs
+# Insight Logs and Metrics
+
+## Metrics
+
+Grafana OnCall exposes the following metrics:
+
+- The total count of alert groups for each integration in each state (firing, acknowledged, resolved, silenced).
+It is a gauge, and its name ends with the suffix `alert_groups_total`.
+- The response time of alert groups for each integration (the mean time between an alert group's start and the first
+action on it, across all alert groups from the last 7 days in the selected period). It is a histogram, and its name
+ends with `alert_groups_response_time_seconds` followed by one of the histogram suffixes `_bucket`, `_sum`, or `_count`.
+
+You can find more information about metric types in the [Prometheus documentation](https://prometheus.io/docs/concepts/metric_types).
+
+To query Prometheus metrics, use PromQL. If you are not familiar with PromQL, see the [PromQL documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/).
+
+### For Grafana Cloud customers
+
+OnCall application metrics are collected in the preinstalled `grafanacloud_usage` datasource and are available for every
+cloud instance.
+
+Metrics have the prefix `grafanacloud_oncall_instance`, for example `grafanacloud_oncall_instance_alert_groups_total` and
+`grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket`.
+
+### For open source customers
+
+To collect OnCall application metrics, set up Prometheus and add it to your Grafana instance as a datasource.
+You can find more information about Prometheus setup in the [OSS documentation](https://github.com/grafana/oncall#readme).
+
+Metrics have the prefix `oncall`, for example `oncall_alert_groups_total` and `oncall_alert_groups_response_time_seconds_bucket`.
+
+Your metrics may also have additional labels, such as `pod`, `instance`, `container`, depending on your Prometheus setup.
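+
+As a minimal sketch, assuming an open source setup exposes the same `integration` and `state` labels that are described
+for each metric in this document, the number of firing alert groups for one integration could be queried as follows.
+The integration name is only an illustrative value.
+
+```promql
+# Assumes the `oncall` prefix described above; the label values are illustrative.
+oncall_alert_groups_total{integration="Grafana Alerting", state="firing"}
+```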
+
+### Metric Alert groups total
+
+This metric has the following labels:
+
+| Label Name    |                                 Description                                   |
+|---------------|:-----------------------------------------------------------------------------:|
+| `id`          | ID of Grafana instance (stack)                                                |
+| `slug`        | Slug of Grafana instance (stack)                                              |
+| `org_id`      | ID of Grafana organization                                                    |
+| `team`        | Team name                                                                     |
+| `integration` | OnCall Integration name                                                       |
+| `state`       | Alert group state: `firing`, `acknowledged`, `resolved`, or `silenced`        |
+
+**Query example:**
+
+Get the number of alert groups in "firing" state in integration "Grafana Alerting" in Grafana stack "test_stack":
+
+```promql
+grafanacloud_oncall_instance_alert_groups_total{slug="test_stack", integration="Grafana Alerting", state="firing"}
+```
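+
+Because this metric is a gauge with one series per label combination, it can also be aggregated. The following sketch,
+which reuses the illustrative stack name `test_stack`, sums the firing alert groups per integration across all teams:
+
+```promql
+# Total firing alert groups per integration in one stack (stack name is illustrative).
+sum by (integration) (
+  grafanacloud_oncall_instance_alert_groups_total{slug="test_stack", state="firing"}
+)
+```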
+
+### Metric Alert groups response time
+
+This metric has the following labels:
+
+| Label Name    |                                 Description                                    |
+|---------------|:------------------------------------------------------------------------------:|
+| `id`          | ID of Grafana instance (stack)                                                 |
+| `slug`        | Slug of Grafana instance (stack)                                               |
+| `org_id`      | ID of Grafana organization                                                     |
+| `team`        | Team name                                                                      |
+| `integration` | OnCall Integration name                                                        |
+| `le`          | Histogram bucket upper bound in seconds: `60`, `300`, `600`, `3600`, or `+Inf` |
+
+**Query example:**
+
+Get the number of alert groups with a response time of 10 minutes (600 seconds) or less in integration "Grafana Alerting"
+in Grafana stack "test_stack":
+
+```promql
+grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket{slug="test_stack", integration="Grafana Alerting", le="600"}
+```
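+
+Prometheus histogram buckets are cumulative: the `le="600"` bucket counts every alert group with a response time of
+600 seconds or less. As a sketch under that assumption, the number of alert groups that took longer than 10 minutes
+could be derived by subtracting that bucket from the `+Inf` bucket, which counts all alert groups:
+
+```promql
+# Alert groups with a response time above 600 seconds: all observations minus those <= 600s.
+grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket{slug="test_stack", integration="Grafana Alerting", le="+Inf"}
+- ignoring(le)
+grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket{slug="test_stack", integration="Grafana Alerting", le="600"}
+```
+
+Similarly, assuming the `_sum` and `_count` series carry the same labels, an approximate mean response time in seconds
+could be obtained by dividing one by the other:
+
+```promql
+# Mean response time in seconds for one integration (label values are illustrative).
+grafanacloud_oncall_instance_alert_groups_response_time_seconds_sum{slug="test_stack", integration="Grafana Alerting"}
+/
+grafanacloud_oncall_instance_alert_groups_response_time_seconds_count{slug="test_stack", integration="Grafana Alerting"}
+```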
+
+### Dashboard
+
+To import the OnCall metrics dashboard, go to the `Administration` -> `Plugins` page, find OnCall in the plugins list,
+open the `Dashboards` tab on the OnCall plugin settings page, and click "Import" next to "OnCall metrics". You can then
+find the "OnCall metrics" dashboard in your dashboards list. In the datasource dropdown, select your Prometheus
+datasource (for Cloud customers it's `grafanacloud_usage`). You can filter data by Grafana instance, team, and integration.
+
+To update the dashboard to the newest version, go to the `Dashboards` tab on the OnCall plugin settings page and click
+"Re-import".
+Be aware: re-importing discards any changes you have made to the dashboard. To keep your changes, go
+to the dashboard settings, click "Save as", and save a copy of the dashboard.
+
+## Insight Logs

 > **Note:** Grafana OnCall insight logs are available in Grafana Cloud only.
 We're in the process of rolling out Insight Logs to all customers,
@@ -29,7 +117,7 @@ You can use this query to retrieve all logs related to your OnCall instance.
 {instance_type="oncall"} | logfmt | __error__=``
 ```

-## Resource insight logs
+### Resource insight logs

 Logs are created each time a user modifies any resource in Grafana OnCall.

@@ -39,7 +127,7 @@ These logs will have `action_type=resource` field and can be retrieved with foll
 {instance_type="oncall"} | logfmt | __error__=`` | action_type = `resource`
 ```

-### Format
+#### Format

 Logs contain the following fields, where the fields followed by * are always available, and the others depend on the logged event:

@@ -67,7 +155,7 @@ resource types are: `integration_heartbeat`, `escalation_chain`, `integration`,
 `escalation_policy`, `public_api_token`, `schedule_export_token`,`user_schedule_export_token`,
 `oncall_shift`, `web_schedule`, `ical_schedule`, `calendar_schedule`, `organization`, `user`, `webhook`.

-## Maintenance insight logs
+### Maintenance insight logs

 Logs are created each time maintenance mode is started or finished for an integration.

@@ -77,7 +165,7 @@ These logs will have `action_type=maintenace` field and can be retrieved with fo
 {instance_type="oncall"} | logfmt | __error__=`` | action_type = `maintenance`
 ```

-### Format
+#### Format

 Logs of maintenance insights contain the following fields, where the fields followed by * are always available, and the others depend on the logged event:

@@ -93,7 +181,7 @@ Logs of maintenance insights contain the following fields, where the fields foll
 | `team`*             |                Name of team to which integration belongs.                |
 | `team_id`           |                 ID of team to which integration belongs.                 |

-## ChatOps insight logs
+### ChatOps insight logs

 Logs are created when a user modifies ChatOps settings.

@@ -103,7 +191,7 @@ These log lines will have `action_type=chat_ops` field and can be retrieved with
 {instance_type="oncall"} | logfmt | __error__=`` | action_type = `chat_ops`
 ```

-### Format
+#### Format

 ChatOps insight logs contain the following fields, where the fields followed by * are always available, and the others depend on the logged event:

@@ -122,7 +210,7 @@ Logs of chatops insight logs contain the following fields, where the fields foll

 chatops action names: `workspace_connected`, `workspace_disconnected`, `channel_connected`, `channel_disconnected`, `user_linked`, `used_unlinked`, `default_channel_changed`.

-## Examples
+### Examples

 Here are some examples of practical queries for Grafana OnCall insight logs.
 LogQL is used to retrieve them. If you are not familiar with LogQL, see the [LogQL documentation](https://grafana.com/docs/loki/latest/logql/).