Metrics doc (#2149)

# What this PR does

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Matias Bordese <mbordese@gmail.com>
Yulya Artyukhina 2023-06-21 13:00:13 +02:00 committed by GitHub
parent f48d4f1f25
commit 0c46b41498
GPG key ID: 4AEE18F83AFDEB23
2 changed files with 99 additions and 10 deletions


@ -15,7 +15,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Enable schedule related profile settings oncall [1508](https://github.com/grafana/oncall/issues/1508)
- Highlight user shifts oncall [1509](https://github.com/grafana/oncall/issues/1509)
- Rename or Description for Schedules Rotations [1460](https://github.com/grafana/oncall/issues/1406)
- Add dashboard for OnCall metrics
- Add documentation for OnCall metrics exporter ([#2149](https://github.com/grafana/oncall/pull/2149))
- Add dashboard for OnCall metrics ([#1973](https://github.com/grafana/oncall/pull/1973))
## Changed


@ -6,11 +6,99 @@ keywords:
- Metrics
- Loki
- Prometheus
title: Insight Logs
title: Insight Logs and Metrics
weight: 1400
---
# Insight Logs
# Insight Logs and Metrics
## Metrics
Grafana OnCall metrics represent certain parameters, such as:
- The total count of alert groups for each integration in every state (firing, acknowledged, resolved, silenced).
It is a gauge, and its name has the suffix `alert_groups_total`.
- The response time on alert groups for each integration (the mean time between the start and the first action of all alert groups
for the last 7 days in the selected period). It is a histogram, and its name has the suffix `alert_groups_response_time_seconds`
followed by the histogram suffixes `_bucket`, `_sum` and `_count`.
You can find more information about metric types in the [Prometheus documentation](https://prometheus.io/docs/concepts/metric_types).
To retrieve Prometheus metrics, use PromQL. If you are not familiar with PromQL, see the [PromQL documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/).
### For Grafana Cloud customers
OnCall application metrics are collected in the preinstalled `grafanacloud_usage` datasource and are available for every
cloud instance.
Metrics have the prefix `grafanacloud_oncall_instance`, e.g. `grafanacloud_oncall_instance_alert_groups_total` and
`grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket`.
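As a minimal sketch of how the histogram series can be combined, the mean response time per integration is the `_sum` series divided by the `_count` series. The `_sum` and `_count` names below are derived from the prefix and the histogram suffixes described above, and the stack slug `test_stack` is a placeholder:
```promql
# Mean response time in seconds for each integration in the placeholder stack "test_stack".
# Both sides carry the same labels, so the division matches series one-to-one.
grafanacloud_oncall_instance_alert_groups_response_time_seconds_sum{slug="test_stack"}
/
grafanacloud_oncall_instance_alert_groups_response_time_seconds_count{slug="test_stack"}
```
If a series has a count of zero, the division yields `NaN` for it.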
### For open source customers
To collect OnCall application metrics, you need to set up Prometheus and add it to your Grafana instance as a datasource.
You can find more information about Prometheus setup in the [OSS documentation](https://github.com/grafana/oncall#readme).
Metrics will have the prefix `oncall`, e.g. `oncall_alert_groups_total` and `oncall_alert_groups_response_time_seconds_bucket`.
Your metrics may also have additional labels, such as `pod`, `instance`, `container`, depending on your Prometheus setup.
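Because of those additional labels, the same OnCall series may appear once per scraped target. One possible way to collapse them is to aggregate with `max` over the OnCall labels. This assumes that every target reports identical values, which is an assumption about your setup rather than something this document guarantees:
```promql
# Firing alert groups per integration, ignoring setup-specific labels such as pod or instance.
# `max` is used on the assumption that every target exports the same value;
# if your targets report disjoint subsets, use `sum` instead.
max by (slug, team, integration) (oncall_alert_groups_total{state="firing"})
```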
### Metric Alert groups total
This metric has the following labels:
| Label Name | Description |
|---------------|:-----------------------------------------------------------------------------:|
| `id` | ID of Grafana instance (stack) |
| `slug` | Slug of Grafana instance (stack) |
| `org_id` | ID of Grafana organization |
| `team` | Team name |
| `integration` | OnCall Integration name |
| `state` | Alert group state. May be `firing`, `acknowledged`, `resolved` or `silenced` |
**Query example:**
Get the number of alert groups in "firing" state in integration "Grafana Alerting" in Grafana stack "test_stack":
```promql
grafanacloud_oncall_instance_alert_groups_total{slug="test_stack", integration="Grafana Alerting", state="firing"}
```
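Aggregation operators can summarize this gauge across integrations. A sketch, using the same placeholder stack slug as the query above, that returns one value per state:
```promql
# Number of alert groups in each state across all integrations and teams of "test_stack".
sum by (state) (grafanacloud_oncall_instance_alert_groups_total{slug="test_stack"})
```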
### Metric Alert groups response time
This metric has the following labels:
| Label Name | Description |
|---------------|:------------------------------------------------------------------------------:|
| `id` | ID of Grafana instance (stack) |
| `slug` | Slug of Grafana instance (stack) |
| `org_id` | ID of Grafana organization |
| `team` | Team name |
| `integration` | OnCall Integration name |
| `le` | Histogram bucket value in seconds. May be `60`, `300`, `600`, `3600` or `+Inf` |
**Query example:**
Get the number of alert groups with a response time of at most 10 minutes (600 seconds) in integration "Grafana Alerting"
in Grafana stack "test_stack":
```promql
grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket{slug="test_stack", integration="Grafana Alerting", le="600"}
```
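Histogram buckets are cumulative, so a bucket with `le="600"` counts alert groups whose response time was at most 600 seconds. One way to count the alert groups that took longer, sketched here with the `_count` series name derived from the histogram suffixes listed earlier, is to subtract that bucket from the total count:
```promql
# Alert groups with a response time above 600 seconds.
# `ignoring(le)` is needed because only the bucket series carries the `le` label.
grafanacloud_oncall_instance_alert_groups_response_time_seconds_count{slug="test_stack", integration="Grafana Alerting"}
- ignoring(le)
grafanacloud_oncall_instance_alert_groups_response_time_seconds_bucket{slug="test_stack", integration="Grafana Alerting", le="600"}
```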
### Dashboard
To import the OnCall metrics dashboard, go to the `Administration` -> `Plugins` page, find OnCall in the plugins list, open
the `Dashboards` tab on the OnCall plugin settings page and click "Import" next to "OnCall metrics". After that, you can find
the "OnCall metrics" dashboard in your dashboards list. In the datasource dropdown, select your Prometheus datasource
(for Cloud customers it's `grafanacloud_usage`). You can filter data by your Grafana instances, teams and integrations.
To update the dashboard to the newest version, go to the `Dashboards` tab on the OnCall plugin settings page and click
"Re-import".
Be aware: if you have made changes to the dashboard, they will be lost after re-importing. To keep your changes, go
to the dashboard settings, click "Save as" and save a copy of the dashboard.
## Insight Logs
> **Note:** Grafana OnCall insight logs are available in Grafana Cloud only.
We're in the process of rolling out Insight Logs to all customers,
@ -29,7 +117,7 @@ You can use this query to retrieve all logs related to your OnCall instance.
{instance_type="oncall"} | logfmt | __error__=``
```
## Resource insight logs
### Resource insight logs
Logs are created each time a user modifies any resource in Grafana OnCall.
@ -39,7 +127,7 @@ These logs will have `action_type=resource` field and can be retrieved with foll
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `resource`
```
### Format
#### Format
Logs contain the following fields, where the fields followed by * are always available, and the others depend on the logged event:
@ -67,7 +155,7 @@ resource types are: `integration_heartbeat`, `escalation_chain`, `integration`,
`escalation_policy`, `public_api_token`, `schedule_export_token`,`user_schedule_export_token`,
`oncall_shift`, `web_schedule`, `ical_schedule`, `calendar_schedule`, `organization`, `user`, `webhook`.
## Maintenance insight logs
### Maintenance insight logs
Logs are created each time maintenance mode is started or finished for an integration.
@ -77,7 +165,7 @@ These logs will have `action_type=maintenace` field and can be retrieved with fo
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `maintenance`
```
### Format
#### Format
Logs of maintenance insights contain the following fields, where the fields followed by * are always available, and the others depend on the logged event:
@ -93,7 +181,7 @@ Logs of maintenance insights contain the following fields, where the fields foll
| `team`* | Name of team to which integration belongs. |
| `team_id` | ID of team to which integration belongs. |
## ChatOps insight logs
### ChatOps insight logs
Logs are created when a user modifies ChatOps settings.
@ -103,7 +191,7 @@ These log lines will have `action_type=chat_ops` field and can be retrieved with
{instance_type="oncall"} | logfmt | __error__=`` | action_type = `chat_ops`
```
### Format
#### Format
Logs of chatops insight logs contain the following fields, where the fields followed by * are always available, and the others depend on the logged event:
@ -122,7 +210,7 @@ Logs of chatops insight logs contain the following fields, where the fields foll
chatops action names: `workspace_connected`, `workspace_disconnected`, `channel_connected`, `channel_disconnected`, `user_linked`, `used_unlinked`, `default_channel_changed`.
## Examples
### Examples
Here are some examples of practical queries to Grafana OnCall insight logs.
LogQL is used to retrieve them. If you are not familiar with LogQL, check this [documentation](https://grafana.com/docs/loki/latest/logql/).