Commit graph

1276 commits

Author SHA1 Message Date
Joey Orlando
2bb80b487e
address issue with metrics calculations when redis cluster is used (#3510)
## Which issue(s) this PR fixes

Fixes this issue we started seeing popping up because of a change
introduced in #3496:
```python3
File "/etc/app/apps/metrics_exporter/views.py", line 22, in get
    result = generate_latest(application_metrics_registry).decode("utf-8")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prometheus_client/exposition.py", line 198, in generate_latest
    for metric in registry.collect():
  File "/usr/local/lib/python3.11/site-packages/prometheus_client/registry.py", line 97, in collect
    yield from collector.collect()
  File "/etc/app/apps/metrics_exporter/metrics_collectors.py", line 56, in collect
    alert_groups_total, missing_org_ids_1 = self._get_alert_groups_total_metric(org_ids)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/etc/app/apps/metrics_exporter/metrics_collectors.py", line 97, in _get_alert_groups_total_metric
    org_id_from_key = RE_ALERT_GROUPS_TOTAL.match(org_key).groups()[0]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'groups'
```



```python3
>>> import re
>>> ALERT_GROUPS_TOTAL = "oncall_alert_groups_total"
>>> _RE_BASE_PATTERN = r"{{?{}}}?_(\d+)"
>>> RE_ALERT_GROUPS_TOTAL = re.compile(_RE_BASE_PATTERN.format(ALERT_GROUPS_TOTAL))
>>> org_key = "{oncall_alert_groups_total}_1"
>>> RE_ALERT_GROUPS_TOTAL.match(org_key).groups()[0]
'1'
>>> org_key = "oncall_alert_groups_total_1"
>>> RE_ALERT_GROUPS_TOTAL.match(org_key).groups()[0]
'1'
```

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-05 12:12:08 -05:00
Matias Bordese
45200c33a1
Update beat schedule to use crontab schedule types (#3497)
Update celery beat schedule to use crontab schedule types, since
otherwise the timedelta is relative to the celery start and when we have
a restart we have some bigger than expected gaps between task runs
(alternatively it seems we could also use the `relative` option
described
[here](https://docs.celeryq.dev/en/main/userguide/periodic-tasks.html#available-fields))

Related to https://github.com/grafana/oncall-private/issues/2347
2023-12-04 18:42:12 +00:00
jorgeav
4df8985283
Jinja2 template helper filter datetimeformat_as_timezone (#3426)
# What this PR does
Add an additional jinja2 template helper filter to convert a timezone
aware datetime to a different timezone.

## Which issue(s) this PR fixes
Alert payloads that originate from different time zones may include
timestamps having a local time offset. This filter enables
standardization of timestamp timezones.

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
2023-12-04 13:39:04 -05:00
Matias Bordese
e39baa6bbe
Revert "Refactor gcom api calls when syncing org" (#3498)
Reverts grafana/oncall#3489

Reviewing logs, it seems something broke related to [token
auth](https://ops.grafana-ops.net/explore?schemaVersion=1&panes=%7B%22ffS%22:%7B%22datasource%22:%22OP27Xzxnk%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D~%5C%22dev-.%2A%5C%22,%20namespace%3D%5C%22grafana-com%5C%22,%20job%3D%5C%22grafana-com%2Fgrafana-com-api%5C%22%7D%20%7C~%20%5C%22%2Finstances%2F%5Ba-z0-9%5D%2B.config%3Dtrue%5C%22%20%7C%3D%20%5C%22Grafana%20OnCall%5C%22%22,%22editorMode%22:%22code%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22OP27Xzxnk%22%7D%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1).
Reverting for now, will revisit in a later PR.
2023-12-04 18:02:58 +00:00
Joey Orlando
1df1b1eaa0
patch redis cluster multi-key operations (#3496)
# Which issue(s) this PR fixes

Related to https://github.com/grafana/oncall-private/issues/2363

Addresses this issue that arises when using
`cache.get_many`/`cache.set_many` operations with a Redis Cluster:
```python3
File "/usr/local/lib/python3.11/site-packages/redis/cluster.py", line 1006, in determine_slot
    raise RedisClusterException(
redis.exceptions.RedisClusterException: MGET - all keys must map to the same key slot
```

From the Redis Cluster
[docs](https://redis.io/docs/reference/cluster-spec/#hash-tags), this
can be addressed with this 👇 . Basically this will ensure that keys in
multi-key operations will resolve to the same hash slot (read: node):

> Hash tags
> There is an exception for the computation of the hash slot that is
used in order to implement hash tags. Hash tags are a way to ensure that
multiple keys are allocated in the same hash slot. This is used in order
to implement multi-key operations in Redis Cluster.
> 
> To implement hash tags, the hash slot for a key is computed in a
slightly different way in certain conditions. If the key contains a
"{...}" pattern only the substring between { and } is hashed in order to
obtain the hash slot. However since it is possible that there are
multiple occurrences of { or } the algorithm is well specified by the
following rules:
> 
> IF the key contains a { character.
> AND IF there is a } character to the right of {.
> AND IF there are one or more characters between the first occurrence
of { and the first occurrence of }.
> Then instead of hashing the key, only what is between the first
occurrence of { and the following first occurrence of } is hashed.

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-04 13:08:57 -05:00
Vadim Stepanov
9796489b8e
Skip empty alert group labels (#3495)
# What this PR does

Makes sure there are no empty alert group label values.

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-04 13:46:03 +00:00
Vadim Stepanov
4ccfda58e5
Disallow creating and deleting direct paging integrations (#3475)
# What this PR does

Disallows creating and deleting direct paging integrations via both
internal and public APIs. It also hides the direct paging option in the
UI when creating a new integration.

## Which issue(s) this PR fixes

Related to https://github.com/grafana/oncall-private/issues/2302

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Dominik <dominik.broj@grafana.com>
2023-12-04 13:13:53 +00:00
Matias Bordese
9eb09c0272
Refactor gcom api calls when syncing org (#3489)
Make one API call instead of two.
2023-12-04 13:08:59 +00:00
Innokentii Konstantinov
a3e3d8bc9d
Change labels feature flag to work per oncall org (#3493)
It's needed because anyway labels plugin provisioned per stack, not per
org

---------

Co-authored-by: Yulya Artyukhina <Ferril.darkdiver@gmail.com>
2023-12-04 12:45:07 +00:00
Vadim Stepanov
ea4f692646
Pin markdown2 to 2.4.10 (#3494)
# What this PR does

Pins markdown to 2.4.10 to fix [failing
test](https://github.com/grafana/oncall/actions/runs/7082829248/job/19276914906?pr=3490).

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-04 11:54:11 +00:00
Matias Bordese
6e35aadc0c
Add test ensuring integration endpoints work if redis cache is down (#3445) 2023-12-01 17:45:18 +00:00
Joey Orlando
e63151f724
revert uwsgi to 2.0.21 2023-12-01 11:59:22 -05:00
Joey Orlando
76a88bc0c1
Revert "upgrade to Python 3.12 (#3456)" and "bump uwsgi version to latest #3466" (#3483)
# What this PR does

This reverts commits 7c4b40a046 and
cdb22285db.

See https://github.com/grafana/oncall-private/pull/2361 for more
details.
2023-12-01 09:56:26 -05:00
Ildar Iskhakov
30caa18f9d
Make telegram on_alert_group_action_triggered asynchronous (#3471)
# What this PR does


[send_alert_group_signal](https://github.com/grafana/oncall/blob/dev/engine/apps/alerts/tasks/send_alert_group_signal.py#L12)
task is not idempotent. It launches
[on_alert_group_action_triggered_async](a2851d3f81/engine/apps/slack/representatives/alert_group_representative.py (L158))
for slack and then might fail on
[on_alert_group_action_triggered](b2f4ffb98a/engine/apps/telegram/alert_group_representative.py (L79))
(not async) due to database DoesNotExist exception.

This PR makes telegram representative asyncronous

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-01 10:49:00 +00:00
Vadim Stepanov
8188dd5dd2
Create missing direct paging integrations (#3468)
# What this PR does

Makes organization sync create direct paging integrations for Grafana
teams that don't have one.

## Which issue(s) this PR fixes

Related to https://github.com/grafana/oncall-private/issues/2302

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-30 17:18:18 +00:00
Joey Orlando
7c4b40a046
upgrade to Python 3.12 (#3456)
# What this PR does

Upgrade to Python 3.12 + fix several invalid test assertions that lead
to test failures in the latest version of `pytest`:
```
AttributeError: 'called_once_with' is not a valid assertion. Use a spec for the mock if 'called_once_with' is meant to be an attribute.. Did you mean: 'assert_called_once_with'?
```

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-30 13:47:41 +00:00
Joey Orlando
cdb22285db
bump uwsgi version to latest (#3466)
# What this PR does

[Related
conversation](https://raintank-corp.slack.com/archives/C04JCU51NF8/p1701345765350519?thread_ts=1701291818.214999&cid=C04JCU51NF8)
2023-11-30 07:58:05 -05:00
Matias Bordese
aa8a904a8d
Update when slack client ratelimit retry handler is enabled (#3447) 2023-11-30 12:35:46 +00:00
Vadim Stepanov
381a9ecf54
Delete duplicate direct paging integrations (#3412)
# What this PR does

Deletes duplicate direct paging integrations (i.e. keeps only the first
direct paging integration per team).
Also adds a unique constraint that will make such duplicates impossible
at the DB level.

## Which issue(s) this PR fixes

Related to https://github.com/grafana/oncall-private/issues/2302

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-30 11:19:12 +00:00
Matias Bordese
7aa78f5f73
Enable flake8-bugbear, fix issues (#3454)
Enables [flake8-bugbear](https://github.com/PyCQA/flake8-bugbear),
checking for bugs/design problems, and [fixes the issues
found](https://pastebin.com/fEDBz6Ta) (some interesting ones,
particularly with mutable args).

Related to https://github.com/grafana/oncall/pull/3448
2023-11-29 15:04:48 +00:00
Ildar Iskhakov
2fdd885abe
Move new alert group metric creation into async task (#3451)
# What this PR does

Moving metrics creation into separate task to make alert ingestion more
robust

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Julia <ferril.darkdiver@gmail.com>
2023-11-29 12:45:36 +00:00
Rares Mardare
455f74560c
Alert group column/label selector (#3281)
# What this PR does

Adds new functionality to enable which columns should show on the alert
group page


![image](https://github.com/grafana/oncall/assets/40542072/952d4004-9cd6-478c-a104-cd5d270cfd58)

---------

Co-authored-by: Julia <ferril.darkdiver@gmail.com>
2023-11-29 12:11:31 +00:00
Matias Bordese
ec1f120d9c
Update transaction.on_commit to use partial instead of lambda (#3448)
For reference,
https://adamj.eu/tech/2022/08/22/use-partial-with-djangos-transaction-on-commit/.

I think this may help with this one too:
https://github.com/grafana/oncall-private/issues/2318
2023-11-29 12:01:30 +00:00
Innokentii Konstantinov
8c82dac6db
Rewrite LabelsAPIClient (#3422)
Rewrite LabelAPIClient to be able to return error messages from Label
Repo API.
Main features:
1. Raises LabelRepoAPIException when response code is 400 or 500 level.
2. Always return response as a second argument to further inspect it, if
necessary.
2023-11-29 08:56:42 +00:00
Dominik Broj
2c4b34d3af
generate types, create http client and add exemplary usage (#3384)
# What this PR does

https://github.com/grafana/oncall/issues/3330

- add a script that generates TS type definitions based on OnCall API
OpenAPI schemas
- support adding custom properties on the frontend if needed
- add simple example of usage (for `'/labels/keys/'` endpoint)

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-29 05:16:13 +00:00
Michael Derynck
e9f2178da1
Change service account auth to use instance id instead (#3435)
# What this PR does
Change GrafanaServiceAccountAuth to use instance ID header in cloud
instead of slugs.

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-28 15:56:29 +00:00
Innokentii Konstantinov
fb4bad21d2
Document tojson jinja filter (#3432)
# What this PR does
Document tojson jinja filter
2023-11-28 20:47:57 +08:00
Innokentii Konstantinov
ccc64e6b90 Fix 2023-11-28 13:22:18 +08:00
Innokentii Konstantinov
8f17784c49 Fix 2023-11-28 13:16:55 +08:00
Innokentii Konstantinov
22cfba0163 Remove call to bots_info in the slack message handler 2023-11-28 13:14:33 +08:00
Ildar Iskhakov
393c8e06a7
Merge branch 'main' into dev 2023-11-28 09:59:07 +08:00
Vadim Stepanov
9e889403f2
Alert group payload labels (#3434)
https://github.com/grafana/oncall/pull/3385 + handle null values
2023-11-27 17:53:54 +00:00
Vadim Stepanov
e09422a07d
Revert "Alert group payload labels" (#3433)
Reverts grafana/oncall#3385
2023-11-27 17:28:34 +00:00
Vadim Stepanov
5fac6aeac5
Alert group payload labels (#3385)
# What this PR does

Adds an ability to extract labels from alert group payload. See
[demo](https://www.loom.com/share/cf2b746eea974547b76f44298e32a54f?sid=67ed1e58-40ed-4136-a201-6482fb7773d3).

## Which issue(s) this PR fixes

https://github.com/grafana/oncall-private/issues/2304

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Maxim Mordasov <maxim.mordasov@grafana.com>
Co-authored-by: Rares Mardare <rares.mardare@grafana.com>
2023-11-27 16:55:31 +00:00
Ildar Iskhakov
2acd85a3e6 Do not check redis during startupprobe 2023-11-27 21:43:26 +08:00
Ildar Iskhakov
2ddb47ea29
Revert "Make alert ingestion cache independent (#3414)" (#3430)
This reverts commit acfba47a81.

# What this PR does

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-27 20:23:57 +08:00
Ildar Iskhakov
aab49f0594
Revert "Add test to ensure integrations work when cache is down" (#3431)
Reverts grafana/oncall#3418
2023-11-27 20:10:26 +08:00
Innokentii Konstantinov
85f9b0f168
Log slack bot_id and bot_user_it (#3429)
Log slack bot id and bot user id to check if we can avoid request to
slack api
2023-11-27 18:36:53 +08:00
Yulya Artyukhina
863af25994
Fix alert group rendering (#3424)
# What this PR does
Fix alert group rendering when some links were broken because of
replacing `-` to `_`.

## Which issue(s) this PR fixes
https://github.com/grafana/support-escalations/issues/8119
https://github.com/grafana/support-escalations/issues/8468

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-24 15:39:37 +00:00
Matias Bordese
0816825c5d
Add test to ensure integrations work when cache is down (#3418)
Complements https://github.com/grafana/oncall/pull/3414/
2023-11-24 12:02:36 +00:00
Matias Bordese
d730f6b2bf
Trigger distribute task after alert is committed (#3420)
Fix issue triggering task retries because alert is not yet committed to
the DB.
Similar to https://github.com/grafana/oncall/pull/3001.
2023-11-24 12:02:32 +00:00
Michael Derynck
3436344f1d
Allow users with user settings read to list users (#3419)
# What this PR does
Fixed issue where `User Settings Reader` was missing permission to list
users.

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-23 20:41:27 +00:00
Matias Bordese
55fedb25d6
Rework alert group internal API team filter (#3413)
Related to https://github.com/grafana/oncall-private/issues/2177
2023-11-23 17:28:00 +00:00
Michael Derynck
60ef4348f5
Allow OnCall API to use Grafana Service Accounts (#3189)
# What this PR does
Allows public OnCall API to use Grafana service accounts for
authorization. In cloud requests using a Grafana service account token
also needs to provide headers for `X-Grafana-Org-Slug` and
`X-Grafana-Instance-Slug`

This is **alpha** functionality, it may break or be removed in the
future. Going to use this on one endpoint (resolution notes) before we
consider the implications across all of public API.

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-23 16:42:27 +00:00
Ildar Iskhakov
95a3ab3b75
Revert "Cache independent ingestion" (#3417)
Reverts grafana/oncall#3415
2023-11-23 21:38:06 +08:00
Ildar Iskhakov
acfba47a81
Make alert ingestion cache independent (#3414)
# What this PR does

This PR catches redis unavailability exceptions to prevent errors during
alert ingestion in the following places:
StartupProbeView
AlertChannelDefiningMixin
RateLimitMixin

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-23 16:32:02 +08:00
Ildar Iskhakov
a6912c96af
Merge pull request #3415 from grafana/iskhakov/cache-independent-ingestion
Cache independent ingestion
2023-11-23 16:22:18 +08:00
Ildar Iskhakov
566e8c53ba Ignore typing checks for imported library (https://mypy.readthedocs.io/en/stable/running_mypy.html\#missing-library-stubs-or-py-typed-marker) 2023-11-23 16:14:30 +08:00
Ildar Iskhakov
0d5ef785bf Make alert ingestion cache independent 2023-11-23 11:27:47 +08:00
Matias Bordese
c0a6e69d9f
Remove unused prefetch; prefer indexed query (#3410)
This should help with slowness in webhooks listing page (do not fetch
*all* responses for webhooks, which besides were not used in the
serializer).
2023-11-23 02:08:04 +00:00