Commit graph

117 commits

Author SHA1 Message Date
Matias Bordese
2aa8639e2a
Update escalation auditor logs to expose succeeding count (#4431)
Related to https://github.com/grafana/oncall-private/issues/2619
(we need the succeeding number to make the SLO query happy with
cluster/namespace filtering)
2024-05-31 18:29:31 +00:00
Matias Bordese
08d1e00430
Update escalation auditor to log total and failed escalations info (#4425)
Related to https://github.com/grafana/oncall-private/issues/2619
2024-05-30 18:53:53 +00:00
Matias Bordese
d316c9121e
Fix order filtering when executing notify all/group steps from snapshot (#4381)
Fixes https://github.com/grafana/oncall-private/issues/2708
2024-05-23 12:36:28 +00:00
Matias Bordese
65ee57f563
Ignore uncompleted notifications if policy is deleted (#4260)
Related to https://github.com/grafana/oncall-private/issues/2637
2024-04-23 11:40:24 +00:00
Michael Derynck
d75590b943
Handle alert group deleted when task is already queued (#4230)
# What this PR does
- Since send_alert_create_signal is inside transaction on_commit we can
conclude that if it does not exist it was intentionally deleted before
the task could run and the task can exit instead of retrying
- Improve logging when send_alert_create_signal is called so both alert
and alert group are in the same line so you don't need to search the
logs as much
- Improve logging on public api delete alert group so we can know what
the alert group belonged to and the responsible user/org
- Remove distribute_alerts (Stopped using a while back, code should be
safe to remove now, no tasks running in system)

## Which issue(s) this PR closes

Closes https://github.com/grafana/oncall-private/issues/2640

<!--
*Note*: if you have more than one GitHub issue that this PR closes, be
sure to preface
each issue link with a [closing
keyword](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue).
This ensures that the issue(s) are auto-closed once the PR has been
merged.
-->

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-04-16 14:39:00 +00:00
Joey Orlando
c5cd675738
cleanup CustomButton backend code + add ngrok/express outgoing webhook e2e test (#2544)
# What this PR does

- removes unused "custom button" backend code now that we've migrated to
outgoing webhooks
- adds new e2e test for webhooks asserting that an `ngrok`/`express`
webhook handler receives the call as expected + payload is as expected
(related to https://github.com/grafana/oncall/issues/2691) - skipped for
now, the test passes locally but fails on GitHub Actions CI, seems to be
networking related
 
## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)

---------

Co-authored-by: Michael Derynck <michael.derynck@grafana.com>
2024-03-28 15:37:22 +00:00
Yulya Artyukhina
ba122ec6ef
Update notification checker (#3818)
# What this PR does
Count sms with status "accepted" as delivered in notification checker
## Which issue(s) this PR fixes

https://raintank-corp.slack.com/archives/C025VMT6SPK/p1706799009342889?thread_ts=1706786822.083149&cid=C025VMT6SPK
## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2024-02-01 15:42:43 +00:00
Michael Derynck
2a466a0c4f
Add transaction on_commit before signals for alert group actions (#3731)
# What this PR does
Add transactions around log record creation and check transaction
on_commit before sending signals passing DB id of alert group log
records. In cases for delete we can then assume any missing IDs on tasks
are from intentionally deleted alert groups and we can stop tasks from
retrying endlessly.

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2024-01-31 15:54:50 -07:00
Matias Bordese
3795c836d1
Add transaction block and callbacks when triggering tasks (#3779)
Related to https://github.com/grafana/oncall/issues/3729
2024-01-31 09:26:14 -05:00
Ildar Iskhakov
401d279d54
Refactor create_alert task (#3759)
# What this PR does

This PR simplifies alert group/alert creation, so the alert created and
escalation started in the same task.

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2024-01-30 08:39:04 +00:00
Yulya Artyukhina
19cae8086e
Retry perform_notification with Telegram ratelimit countdown on RetryAfter error (#3744)
# What this PR does
Use Telegram ratelimit countdown when retry `perform_notification` task
on `RetryAfter` error
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/2451

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2024-01-24 15:31:56 +00:00
Joey Orlando
f27aa48dcb
address typo in partial passed to transaction.on_commit 2024-01-18 09:09:48 -05:00
Joey Orlando
909aacd8b8
change perform_notification.apply_async
transaction.on_commit back to using partial
2024-01-18 07:58:21 -05:00
Joey Orlando
16b648bd15
fix infinitely retrying apps.alerts.tasks.notify_user.perform_notification task (#3708)
# Which issue(s) this PR fixes

Closes https://github.com/grafana/oncall-private/issues/2318

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2024-01-18 07:07:01 -05:00
Matias Bordese
2fd456fc77
Update alert group personal notifications checker to check sent SMS (#3698)
Sent SMS messages are considered completed for our purpose here (ie. do
not wait for Twilio delivered confirmation).
2024-01-17 17:46:18 +00:00
Joey Orlando
4036ced9b9
add LogExceptionOnFailureTask celery task class (#3677)
# What this PR does

Closes https://github.com/grafana/oncall-private/issues/2449

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2024-01-12 21:31:01 +00:00
Matias Bordese
4e2e7e0a15
Add task logging personal notifications triggered/completed counts (#3638)
Related to https://github.com/grafana/oncall-private/issues/2347
2024-01-10 18:54:27 +00:00
Matias Bordese
f68b9dd004
Update auditor to check personal notifications (#3563)
Requires https://github.com/grafana/oncall/pull/3557

Related to https://github.com/grafana/oncall-private/issues/2347
2023-12-18 16:13:18 +00:00
Yulya Artyukhina
36227418ed
Speed up escalation auditor (#3578)
# What this PR does
Speed up escalation auditor
- use raw escalation snapshot instead of serialized one

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-18 12:28:55 +00:00
Michael Derynck
e7f3eff72c
Limit how long acknowledge reminders can run for (#3571)
# What this PR does
Stops rescheduling of `acknowledge_reminder_task` after 2 weeks.
Assumption being if it has been sitting for that long in acknowledged
state it is likely to not need more reminders that it is still
acknowledged. Notifications for thread were probably muted a long time
ago.

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-15 16:50:01 +00:00
Yulya Artyukhina
2b62da77b7
Check if escalation was skipped in Slack before trying to notify user (#3562)
# What this PR does
Updates check if escalation was skipped in Slack before trying to notify
user by Slack.

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-15 09:33:01 +00:00
Matias Bordese
e260e23715
Add missing success log entries for personal notifications (#3557) 2023-12-14 18:32:26 +00:00
Matias Bordese
6dada51133
Remove unneeded filter making query slower (#3570)
There is no index for the `received_at` column, and the filter isn't
really needed (aggregation will work in any case, considering only the
entries for which we have data).
2023-12-14 18:25:34 +00:00
Yulya Artyukhina
8a6510badd
Fix task retries for deleted alert groups (#3553)
# What this PR does

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-12 12:01:47 +00:00
Joey Orlando
16ba87bff6
Don't update alert group metrics when deleting an alert group (#3544)
# Which issue(s) this PR fixes

Fixes https://github.com/grafana/oncall-private/issues/2376

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-11 12:16:00 -05:00
Matias Bordese
3feba3675b
Log average/max delta between alert ingestion and alert group creation (#3526)
Related to https://github.com/grafana/oncall-private/issues/2347
2023-12-07 16:03:41 +00:00
Matias Bordese
aa8a904a8d
Update when slack client ratelimit retry handler is enabled (#3447) 2023-11-30 12:35:46 +00:00
Ildar Iskhakov
2fdd885abe
Move new alert group metric creation into async task (#3451)
# What this PR does

Moving metrics creation into separate task to make alert ingestion more
robust

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Julia <ferril.darkdiver@gmail.com>
2023-11-29 12:45:36 +00:00
Matias Bordese
ec1f120d9c
Update transaction.on_commit to use partial instead of lambda (#3448)
For reference,
https://adamj.eu/tech/2022/08/22/use-partial-with-djangos-transaction-on-commit/.

I think this may help with this one too:
https://github.com/grafana/oncall-private/issues/2318
2023-11-29 12:01:30 +00:00
Joey Orlando
f70c439334
add more logging in apps.alerts.tasks.notify_user.perform_notification task (#3403)
# What this PR does

add more logging in `apps.alerts.tasks.notify_user.perform_notification`
task (related to https://github.com/grafana/oncall-private/issues/2318;
won't solve that issue but will help with the investigation)
2023-11-21 13:35:23 -05:00
Yulya Artyukhina
d7d5c3aa28
Fix acknowledge reminder (#3345)
# What this PR does
Fix acknowledge reminder:
- check if organization was deleted
- improve logging

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-14 13:39:27 +00:00
Ildar Iskhakov
784c5ee7c1
Add notifications success ratio log to auditor (#3312)
# What this PR does

This PR adds alert groups success ratio over last 48 hours

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-10 16:39:13 +08:00
Yulya Artyukhina
7552de13e5
Add a command to continue escalations for alert groups (#3283)
# What this PR does
Add an ability to continue escalations for alert groups from the point
it was in case if it was stopped

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-07 13:44:23 +00:00
Matias Bordese
cc9dc66437
Move cache clear to fixtures, fix some deprecation notices (#3269) 2023-11-06 16:52:50 +00:00
Matias Bordese
88f9f118a3
Use partial instead of lambda when queuing ack task (#3134)
Prefer partial since it is the what the [docs
suggests](https://docs.djangoproject.com/en/4.2/topics/db/transactions/#performing-actions-after-commit).
Also because partial is evaluated immediately while lambda is evaluated
at runtime (which may be causing some issues):

```
>>> from functools import partial
>>> def foo(a, b, c):
...   print(a, b, c)
... 
>>> x = 10
>>> bar_partial = partial(foo, 1, 2, x)
>>> bar_lambda = lambda: foo(1, 2, x)
>>> x = 20
>>> bar_partial()
1 2 10
>>> bar_lambda()
1 2 20
```
2023-10-06 18:31:11 +00:00
Vadim Stepanov
1acb7018d0
Improve alert group deletion API (#3124)
# What this PR does

- Invalidate alert group cache on wipe
- Improve public API docs on alert group deletion
- Add / improve tests

## Which issue(s) this PR fixes

Related to https://github.com/grafana/oncall/issues/3051

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-10-05 14:32:40 +01:00
Matias Bordese
29eae9b2c6
Fix slack notification for shift end affected by taken swap (#3092)
Related to https://github.com/grafana/oncall/issues/3096
2023-10-02 12:56:07 +00:00
Ildar Iskhakov
51014735aa
WIP: Direct paging improvements (#3064)
# What this PR does
* Create Direct Paging integration (with default route) when team is
created with bulk_update
* Create notification policies when user is created with bulk_update
* If user notification policies are empty change it to Email
* Minor markup and wording improvements
* Add grafana queue to helm chart
* Remove disabled commands for redis helm chart
* Improve Dockerfile caching

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-09-28 03:57:49 +00:00
Vadim Stepanov
6caacf4048
Handle Slack ratelimit on alert group deletion (#3038)
# What this PR does

- gracefully retry
`apps.alerts.tasks.delete_alert_group.delete_alert_group` when hitting
Slack ratelimits
- remove Slack messages from the DB as soon as they are deleted from
Slack, so the tasks are not retrying perpetually

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-09-19 08:41:47 +00:00
Vadim Stepanov
8b2212c7dc
Improve Slack error handling (#3000)
# What this PR does

- Rename `SlackClientWithErrorHandling` to just `SlackClient`
- Add more error classes + improve the way errors are raised based on
the Slack error code
- Add API call retries on Slack server errors (e.g. when Slack returns
`5xx` errors)
- Refactor some methods working with Slack API + add tests

## Which issue(s) this PR fixes

- https://github.com/grafana/oncall-private/issues/1837
- https://github.com/grafana/oncall-private/issues/1840
- https://github.com/grafana/oncall-private/issues/1842

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-09-12 09:49:16 +00:00
Joey Orlando
a9155130df
update slack_sdk dependency to latest version (#2947)
# What this PR does

- update `slackclient` dependency to latest version. The version we were
using was 5 years old 😲
- first followed the v2 migration guide
[here](https://github.com/slackapi/python-slack-sdk/wiki/Migrating-to-2.x)
followed by the v3 migration guide
[here](https://slack.dev/python-slack-sdk/v3-migration/). The main
changes were:
    - The PyPI project was renamed from `slackclient` to `slack_sdk`
- it is discouraged/harder to call `api_call` and encouraged to call the
helper methods (ex. `chat_postMessage`;
[note](https://github.com/slackapi/python-slack-sdk/wiki/Migrating-to-2.x#web-client-api-changes)
in migration guide docs)
- In 1.x, a failed api call would return the error payload to you and
have you handle the error. In 2.x, a failed api call will throw an
exception. To handle this in your code, you will have to wrap api calls
with a try except block. Since we overload `WebClient.api_call` this was
an easy change and only required a one line change
- remove `apps.slack.slack_client.slack_server.SlackClientServer` class.
The new version of `slack_sdk` handles the case that we needed to
overload for in the first place.
- merged `apps/slack/slack_client/slack_client.py` and
`apps/slack/slack_client/exceptions.py` into `apps/slack/client.py`

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-09-05 11:31:59 +02:00
Vadim Stepanov
a2851d3f81
Refactor AlertGroup.slack_message (#2957)
# What this PR does

Refactors the foreign key relationship between `AlertGroup` and
`SlackMessage` models (see
https://github.com/grafana/oncall-private/issues/1867 for more info).

## Which issue(s) this PR fixes

https://github.com/grafana/oncall-private/issues/1867

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-09-04 18:03:18 +00:00
Yulya Artyukhina
e472b03e1d
Remove deprecated alerting integration tasks (#2944)
# What this PR does
These celery tasks have not been used for more than one week (since
[v1.3.25](https://github.com/grafana/oncall/releases/tag/v1.3.25) which
released an improvement for Grafana Alerting integration)

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-09-01 13:11:02 +00:00
Yulya Artyukhina
361d45dd02
Clean up check escalation finished task (#2943)
# What this PR does
Clean up check escalation finished task, update description

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-09-01 10:48:47 +00:00
Yulya Artyukhina
4cff4f2fa9
Exclude the latest alert groups from escalation finished check (#2913)
# What this PR does
Exclude the latest alert groups from escalation finished check to give
them time for building escalation snapshot

## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/2028

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-08-30 09:28:09 +00:00
Yulya Artyukhina
5bc7351671
Fix next step eta for silenced alert groups (#2887)
# What this PR does
Update `next_step_eta` in alert group escalation snapshot when alert
group is silenced for period

## Which issue(s) this PR fixes
Fixes the issue related to [this
one](https://github.com/grafana/oncall-private/issues/2028)

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
Co-authored-by: Joey Orlando <joseph.t.orlando@gmail.com>
2023-08-28 12:13:01 +00:00
Yulya Artyukhina
58a9a39efe
Improve getting/updating contact points for Grafana Alerting integration (#2742)
This PR improves Grafana Alerting integration:
- get alerting contact points "on fly" instead of keeping them in db
- add ability to connect more than one contact point
- add ability to create new contact point on create Grafana Alerting
integration
- show warnings in integration settings for non-active contact points
- remove creation alerting notification policies on create Grafana
Alerting integration

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Rares Mardare <rares.mardare@grafana.com>
2023-08-18 12:12:29 +02:00
Vadim Stepanov
6f0921a3e4
Fix Slack acknowledgment reminders (#2769)
# What this PR does

Fixes a bug with Slack acknowledgment reminders not being sent (+ some
refactoring and unit tests).

## Which issue(s) this PR fixes

https://github.com/grafana/oncall/issues/2756

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-08-11 09:41:56 +00:00
Matias Bordese
bb9f647608
Filter out untaken swaps from final schedule and shift notifications (#2748)
Avoid creating (or notifying) about potential event splits resulting
from untaken swap requests.
2023-08-04 17:43:54 +00:00
Yulya Artyukhina
0494afac85
Update schedule slack notifications (#2710)
# What this PR does

Update schedule slack notifications to use schedule final events instead
of getting events from iCal

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-08-03 12:38:01 +00:00