Commit graph

29 commits

Author SHA1 Message Date
Michael Derynck
2545bf8336
Split up organizations across metrics exporters (#5127)
# What this PR does
Limits organizations that a metrics exporter is responsible for. As more
organizations are added it becomes more difficult for the exporter to
deliver metrics within the scrape timeout. This would let us use the
settings to divide up the organizations between multiple exporters.

## Which issue(s) this PR closes

Related to [issue link here]

<!--
*Note*: If you want the issue to be auto-closed once the PR is merged,
change "Related to" to "Closes" in the line above.
If you have more than one GitHub issue that this PR closes, be sure to
preface
each issue link with a [closing
keyword](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue).
This ensures that the issue(s) are auto-closed once the PR has been
merged.
-->

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-10-08 17:29:36 +00:00
Yulya Artyukhina
29bd42c0b1
Fix collecting metrics (#4822)
# What this PR does
Reverts the accidental removal of the ApplicationMetricsCollector from
the metric register

## Which issue(s) this PR closes

Related to [issue link here]

<!--
*Note*: If you want the issue to be auto-closed once the PR is merged,
change "Related to" to "Closes" in the line above.
If you have more than one GitHub issue that this PR closes, be sure to
preface
each issue link with a [closing
keyword](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue).
This ensures that the issue(s) are auto-closed once the PR has been
merged.
-->

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-08-14 13:53:43 +00:00
Yulya Artyukhina
503939783f
Add settings var to choose application metrics to collect (#4781)
# What this PR does
Adds settings var `METRICS_TO_COLLECT` to choose what metrics should be
collected by `ApplicationMetricsCollector`.
It allows to collect different application metrics using different
exporters.

## Which issue(s) this PR closes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-08-12 10:37:48 +00:00
Yulya Artyukhina
4e03732fb1
Drop stale metrics cache (#4695)
# What this PR does
based on this PR - https://github.com/grafana/oncall/pull/4517

## Which issue(s) this PR closes

Related to https://github.com/grafana/oncall/issues/4455

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-07-17 17:01:04 +00:00
Yulya Artyukhina
191814b25e
User notifications bundle (#4457)
# What this PR does
This PR adds two new models: UserNotificationBundle and
BundledNotification (proposals for naming are welcome).

`UserNotificationBundle` manages the information about last notification
time and scheduled notification task for bundled notifications. It is
unique per user + notification_channel + notification importance.

`BundledNotification` contains notification policy and alert group, that
triggered the notification. The BundledNotification instance is created
in `notify_user_task` for every notification, that should be bundled,
and is attached to UserNotificationBundle by ForeignKey connection.

How it works:
If the user was notified recently (within the last two minutes) by the
current notification channel, and this channel is bundlable,
BundledNotification instance will be created and attached to the
UserNotificationBundle instance, and `send_bundled_notification` task
will be scheduled to execute in 2 min.
In `send_bundled_notification` task we get all BundledNotification
attached to the current UserNotificationBundle instance, check if alert
groups are still active and if there is only one notification - perform
regular notification by calling `perform_notification` task, otherwise
call "notify_by_<channel>_bundle" method for the current notification
channel.

PR with method to send notification bundle by SMS -
https://github.com/grafana/oncall/pull/4624

**This feature is disabled by default by feature flag. Public docs will
be added in a separate PR with enabling this feature.**
## Which issue(s) this PR closes
related to https://github.com/grafana/oncall-private/issues/2712

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-07-16 11:24:08 +00:00
Yulya Artyukhina
f583da5b56
Add service_name label to insight metrics (#4300)
# What this PR does
Adds `service_name` label to insight metrics
NOTE: It is related to [this
PR](https://github.com/grafana/oncall/pull/4227) and should be merged no
sooner than two days after the next release (current release version is
1.4.4), because we need to wait for the metrics cache to be updated for
all organizations (uses the new cache structure with `services`)

## Which issue(s) this PR closes
Related to https://github.com/grafana/oncall-private/issues/2610

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-05-22 14:17:42 +00:00
Matias Bordese
b07b140383
Avoid generating response time value metrics for empty integrations (#4339)
This should help with ongoing issue generating too big metrics payloads
(issue introduced when including service name labels).
2024-05-13 17:23:22 +00:00
Yulya Artyukhina
0790d45ab5
Fix calculating metrics from different services in metrics collector (#4297)
# What this PR does
Fix calculating metric values per integration from different services

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-04-29 12:08:53 +00:00
Yulya Artyukhina
d1085b718c
Prepare insight metrics structure for adding service_name label (#4227)
# What this PR does
Prepare insight metrics for adding `service_name` label.
This PR updates metrics cache structure, supporting both old and new
version of cache.
`service_name` label can be added with additional PR when all metric
cache is updated.

## Which issue(s) this PR closes
https://github.com/grafana/oncall-private/issues/2610

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-04-29 09:45:23 +00:00
Yulya Artyukhina
3c93375244
Update alert group state by backsync (#4089)
# What this PR does
Adds method to update alert group state by backsync
Related to https://github.com/grafana/oncall-private/issues/2542
Should be merged with
https://github.com/grafana/oncall-private/pull/2606

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-03-27 12:37:01 +00:00
Michael Derynck
2a466a0c4f
Add transaction on_commit before signals for alert group actions (#3731)
# What this PR does
Add transactions around log record creation and check transaction
on_commit before sending signals passing DB id of alert group log
records. In cases for delete we can then assume any missing IDs on tasks
are from intentionally deleted alert groups and we can stop tasks from
retrying endlessly.

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2024-01-31 15:54:50 -07:00
Joey Orlando
2bb80b487e
address issue with metrics calculations when redis cluster is used (#3510)
## Which issue(s) this PR fixes

Fixes this issue we started seeing popping up because of a change
introduced in #3496:
```python3
File "/etc/app/apps/metrics_exporter/views.py", line 22, in get
    result = generate_latest(application_metrics_registry).decode("utf-8")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prometheus_client/exposition.py", line 198, in generate_latest
    for metric in registry.collect():
  File "/usr/local/lib/python3.11/site-packages/prometheus_client/registry.py", line 97, in collect
    yield from collector.collect()
  File "/etc/app/apps/metrics_exporter/metrics_collectors.py", line 56, in collect
    alert_groups_total, missing_org_ids_1 = self._get_alert_groups_total_metric(org_ids)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/etc/app/apps/metrics_exporter/metrics_collectors.py", line 97, in _get_alert_groups_total_metric
    org_id_from_key = RE_ALERT_GROUPS_TOTAL.match(org_key).groups()[0]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'groups'
```



```python3
>>> import re
>>> ALERT_GROUPS_TOTAL = "oncall_alert_groups_total"
>>> _RE_BASE_PATTERN = r"{{?{}}}?_(\d+)"
>>> RE_ALERT_GROUPS_TOTAL = re.compile(_RE_BASE_PATTERN.format(ALERT_GROUPS_TOTAL))
>>> org_key = "{oncall_alert_groups_total}_1"
>>> RE_ALERT_GROUPS_TOTAL.match(org_key).groups()[0]
'1'
>>> org_key = "oncall_alert_groups_total_1"
>>> RE_ALERT_GROUPS_TOTAL.match(org_key).groups()[0]
'1'
```

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-05 12:12:08 -05:00
Joey Orlando
1df1b1eaa0
patch redis cluster multi-key operations (#3496)
# Which issue(s) this PR fixes

Related to https://github.com/grafana/oncall-private/issues/2363

Addresses this issue that arises when using
`cache.get_many`/`cache.set_many` operations with a Redis Cluster:
```python3
File "/usr/local/lib/python3.11/site-packages/redis/cluster.py", line 1006, in determine_slot
    raise RedisClusterException(
redis.exceptions.RedisClusterException: MGET - all keys must map to the same key slot
```

From the Redis Cluster
[docs](https://redis.io/docs/reference/cluster-spec/#hash-tags), this
can be addressed with this 👇 . Basically this will ensure that keys in
multi-key operations will resolve to the same hash slot (read: node):

> Hash tags
> There is an exception for the computation of the hash slot that is
used in order to implement hash tags. Hash tags are a way to ensure that
multiple keys are allocated in the same hash slot. This is used in order
to implement multi-key operations in Redis Cluster.
> 
> To implement hash tags, the hash slot for a key is computed in a
slightly different way in certain conditions. If the key contains a
"{...}" pattern only the substring between { and } is hashed in order to
obtain the hash slot. However since it is possible that there are
multiple occurrences of { or } the algorithm is well specified by the
following rules:
> 
> IF the key contains a { character.
> AND IF there is a } character to the right of {.
> AND IF there are one or more characters between the first occurrence
of { and the first occurrence of }.
> Then instead of hashing the key, only what is between the first
occurrence of { and the following first occurrence of } is hashed.

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-04 13:08:57 -05:00
Vadim Stepanov
8188dd5dd2
Create missing direct paging integrations (#3468)
# What this PR does

Makes organization sync create direct paging integrations for Grafana
teams that don't have one.

## Which issue(s) this PR fixes

Related to https://github.com/grafana/oncall-private/issues/2302

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-30 17:18:18 +00:00
Matias Bordese
7aa78f5f73
Enable flake8-bugbear, fix issues (#3454)
Enables [flake8-bugbear](https://github.com/PyCQA/flake8-bugbear),
checking for bugs/design problems, and [fixes the issues
found](https://pastebin.com/fEDBz6Ta) (some interesting ones,
particularly with mutable args).

Related to https://github.com/grafana/oncall/pull/3448
2023-11-29 15:04:48 +00:00
Ildar Iskhakov
2fdd885abe
Move new alert group metric creation into async task (#3451)
# What this PR does

Moving metrics creation into separate task to make alert ingestion more
robust

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Julia <ferril.darkdiver@gmail.com>
2023-11-29 12:45:36 +00:00
Yulya Artyukhina
4a8d0f3ad3
Fix metrics and dashboard (#2895)
# What this PR does

## Which issue(s) this PR fixes
https://github.com/grafana/oncall/issues/2897

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-08-29 13:52:24 +00:00
Joey Orlando
d6140cbe8d
Re-enable a few mypy rules + fix existing errors (#2725)
# What this PR does

Related to https://github.com/grafana/oncall/issues/2392

- Re-enable the following `mypy` rules + fix their pre-existing errors:
  - `no-redef`
  - `valid-type`
  - `var-annotated`
- Add stronger return typing to the `GrafanaAPIClient` by use of
generics + add some links to documentation in the method docstrings

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-08-03 09:43:03 +00:00
Vadim Stepanov
f977f9faee
Minor formatting changes (#2641)
# What this PR does

- Updates `black` and `flake8` to latest
- Removes `F541` from flake8 ignore (`F541 f-string is missing
placeholders`)
- Enables ["float to top"
option](https://pycqa.github.io/isort/docs/configuration/options.html#float-to-top)
for `isort`
2023-07-26 14:45:44 +01:00
Vadim Stepanov
b2f4ffb98a
apps.get_model -> import (#2619)
# What this PR does

Remove
[`apps.get_model`](https://docs.djangoproject.com/en/3.2/ref/applications/#django.apps.apps.get_model)
invocations and use inline `import` statements in places where models
are imported within functions/methods to avoid circular imports.

I believe `import` statements are more appropriate for most use cases as
they allow for better static code analysis & formatting, and solve the
issue of circular imports without being unnecessarily dynamic as
`apps.get_model`. With `import` statements, it's possible to:

- Jump to model definitions in most IDEs
- Automatically sort inline imports with `isort`
- Find import errors faster/easier (most IDEs highlight broken imports)
- Have more consistency across regular & inline imports when importing
models

This PR also adds a flake8 rule to ban imports of `django.apps.apps`, so
it's harder to use `apps.get_model` by mistake (it's possible to ignore
this rule by using `# noqa: I251`). The rule is not enforced on
directories with migration files, because `apps.get_model` is often used
to get a historical state of a model, which is useful when writing
migrations ([see this SO answer for more
details](https://stackoverflow.com/a/37769213)). So `apps.get_model` is
considered OK in migrations (even necessary in some cases).

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-07-25 09:43:23 +00:00
Yulya Artyukhina
f1fcb41fb4
Add "user_was_notified_of_alert_groups" metric (#2334)
This PR adds new metric for Prometheus exporter
"user_was_notified_of_alert_groups" which counts how many alert groups
user was notified of.

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
2023-06-28 08:15:19 +00:00
Joey Orlando
75028d0427
continue addressing mypy violations (#2170)
# What this PR does

See #2173 

Also, closes #2187 . All of the new files under `type_stubs/icalendar`
were autogenerated by running:

```bash
stubgen -p icalendar -o type_stubs
```

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-06-27 10:23:08 +00:00
Matias Bordese
f4eb2bbf2e
Remove unused prefetch_related in metrics task (#2343)
The `prefecth_related` data here is not being used later (since we are
applying filters not getting `.all`) and it is preloading all alert
groups for integrations into memory (which can be a lot).
2023-06-26 14:12:10 +00:00
Yulya Artyukhina
af939dc83a
Optimize getting response time for alert groups (#2296)
Optimize getting response time for alert groups in calculation metrics
task

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-06-21 08:16:35 +00:00
Matias Bordese
f21bd8d0b5
Refactor metrics exporter to use cache.get_many (#2212)
Eventually we could also process orgs by chunks.
2023-06-14 12:14:29 +00:00
Yulya Artyukhina
0271cdb8a6
Save only organizations with integrations in cache (#2190)
Save only organizations with integrations in cache for collecting
metrics
2023-06-13 13:31:14 +00:00
Joey Orlando
8c5f6238dc
get rid of need for FEATURE_PROMETHEUS_EXPORTER_ENABLED to be present in ci-test.py settings file (#2161) 2023-06-12 09:43:05 -04:00
Matias Bordese
cc3c18c89c
Add instructions for prometheus exporter setup (#2103) 2023-06-12 13:04:07 +00:00
Yulya Artyukhina
15ef692009
OnCall prometheus metrics exporter (#1605)
# What this PR does
Add OnCall prometheus metrics exporter

## Which issue(s) this PR fixes

## Checklist

- [x] Tests updated
- [ ] Documentation added
- [ ] `CHANGELOG.md` updated

---------

Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
Co-authored-by: Matias Bordese <mbordese@gmail.com>
2023-05-25 18:26:13 +00:00