# What this PR does
- Since send_alert_create_signal is inside transaction on_commit we can
conclude that if it does not exist it was intentionally deleted before
the task could run and the task can exit instead of retrying
- Improve logging when send_alert_create_signal is called so both alert
and alert group are in the same line so you don't need to search the
logs as much
- Improve logging on public api delete alert group so we can know what
the alert group belonged to and the responsible user/org
- Remove distribute_alerts (Stopped using a while back, code should be
safe to remove now, no tasks running in system)
## Which issue(s) this PR closes
Closes https://github.com/grafana/oncall-private/issues/2640
<!--
*Note*: if you have more than one GitHub issue that this PR closes, be
sure to preface
each issue link with a [closing
keyword](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue).
This ensures that the issue(s) are auto-closed once the PR has been
merged.
-->
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
show up in the autogenerated release notes.
# What this PR does
- removes unused "custom button" backend code now that we've migrated to
outgoing webhooks
- adds new e2e test for webhooks asserting that an `ngrok`/`express`
webhook handler receives the call as expected + payload is as expected
(related to https://github.com/grafana/oncall/issues/2691) - skipped for
now, the test passes locally but fails on GitHub Actions CI, seems to be
networking related
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
---------
Co-authored-by: Michael Derynck <michael.derynck@grafana.com>
# What this PR does
Count sms with status "accepted" as delivered in notification checker
## Which issue(s) this PR fixes
https://raintank-corp.slack.com/archives/C025VMT6SPK/p1706799009342889?thread_ts=1706786822.083149&cid=C025VMT6SPK
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Add transactions around log record creation and check transaction
on_commit before sending signals passing DB id of alert group log
records. In cases for delete we can then assume any missing IDs on tasks
are from intentionally deleted alert groups and we can stop tasks from
retrying endlessly.
## Which issue(s) this PR fixes
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
This PR simplifies alert group/alert creation, so the alert created and
escalation started in the same task.
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Use Telegram ratelimit countdown when retry `perform_notification` task
on `RetryAfter` error
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/2451
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Closes https://github.com/grafana/oncall-private/issues/2449
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Speed up escalation auditor
- use raw escalation snapshot instead of serialized one
## Which issue(s) this PR fixes
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Stops rescheduling of `acknowledge_reminder_task` after 2 weeks.
Assumption being if it has been sitting for that long in acknowledged
state it is likely to not need more reminders that it is still
acknowledged. Notifications for thread were probably muted a long time
ago.
## Which issue(s) this PR fixes
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Updates check if escalation was skipped in Slack before trying to notify
user by Slack.
## Which issue(s) this PR fixes
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
There is no index for the `received_at` column, and the filter isn't
really needed (aggregation will work in any case, considering only the
entries for which we have data).
# What this PR does
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Moving metrics creation into separate task to make alert ingestion more
robust
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
---------
Co-authored-by: Julia <ferril.darkdiver@gmail.com>
# What this PR does
add more logging in `apps.alerts.tasks.notify_user.perform_notification`
task (related to https://github.com/grafana/oncall-private/issues/2318;
won't solve that issue but will help with the investigation)
# What this PR does
Fix acknowledge reminder:
- check if organization was deleted
- improve logging
## Which issue(s) this PR fixes
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
This PR adds alert groups success ratio over last 48 hours
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Add an ability to continue escalations for alert groups from the point
it was in case if it was stopped
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
Prefer partial since it is the what the [docs
suggests](https://docs.djangoproject.com/en/4.2/topics/db/transactions/#performing-actions-after-commit).
Also because partial is evaluated immediately while lambda is evaluated
at runtime (which may be causing some issues):
```
>>> from functools import partial
>>> def foo(a, b, c):
... print(a, b, c)
...
>>> x = 10
>>> bar_partial = partial(foo, 1, 2, x)
>>> bar_lambda = lambda: foo(1, 2, x)
>>> x = 20
>>> bar_partial()
1 2 10
>>> bar_lambda()
1 2 20
```
# What this PR does
- Invalidate alert group cache on wipe
- Improve public API docs on alert group deletion
- Add / improve tests
## Which issue(s) this PR fixes
Related to https://github.com/grafana/oncall/issues/3051
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
* Create Direct Paging integration (with default route) when team is
created with bulk_update
* Create notification policies when user is created with bulk_update
* If user notification policies are empty change it to Email
* Minor markup and wording improvements
* Add grafana queue to helm chart
* Remove disabled commands for redis helm chart
* Improve Dockerfile caching
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
- gracefully retry
`apps.alerts.tasks.delete_alert_group.delete_alert_group` when hitting
Slack ratelimits
- remove Slack messages from the DB as soon as they are deleted from
Slack, so the tasks are not retrying perpetually
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
- Rename `SlackClientWithErrorHandling` to just `SlackClient`
- Add more error classes + improve the way errors are raised based on
the Slack error code
- Add API call retries on Slack server errors (e.g. when Slack returns
`5xx` errors)
- Refactor some methods working with Slack API + add tests
## Which issue(s) this PR fixes
- https://github.com/grafana/oncall-private/issues/1837
- https://github.com/grafana/oncall-private/issues/1840
- https://github.com/grafana/oncall-private/issues/1842
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
- update `slackclient` dependency to latest version. The version we were
using was 5 years old 😲
- first followed the v2 migration guide
[here](https://github.com/slackapi/python-slack-sdk/wiki/Migrating-to-2.x)
followed by the v3 migration guide
[here](https://slack.dev/python-slack-sdk/v3-migration/). The main
changes were:
- The PyPI project was renamed from `slackclient` to `slack_sdk`
- it is discouraged/harder to call `api_call` and encouraged to call the
helper methods (ex. `chat_postMessage`;
[note](https://github.com/slackapi/python-slack-sdk/wiki/Migrating-to-2.x#web-client-api-changes)
in migration guide docs)
- In 1.x, a failed api call would return the error payload to you and
have you handle the error. In 2.x, a failed api call will throw an
exception. To handle this in your code, you will have to wrap api calls
with a try except block. Since we overload `WebClient.api_call` this was
an easy change and only required a one line change
- remove `apps.slack.slack_client.slack_server.SlackClientServer` class.
The new version of `slack_sdk` handles the case that we needed to
overload for in the first place.
- merged `apps/slack/slack_client/slack_client.py` and
`apps/slack/slack_client/exceptions.py` into `apps/slack/client.py`
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Refactors the foreign key relationship between `AlertGroup` and
`SlackMessage` models (see
https://github.com/grafana/oncall-private/issues/1867 for more info).
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/1867
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
These celery tasks have not been used for more than one week (since
[v1.3.25](https://github.com/grafana/oncall/releases/tag/v1.3.25) which
released an improvement for Grafana Alerting integration)
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Clean up check escalation finished task, update description
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Exclude the latest alert groups from escalation finished check to give
them time for building escalation snapshot
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/2028
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Update `next_step_eta` in alert group escalation snapshot when alert
group is silenced for period
## Which issue(s) this PR fixes
Fixes the issue related to [this
one](https://github.com/grafana/oncall-private/issues/2028)
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
---------
Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
Co-authored-by: Joey Orlando <joseph.t.orlando@gmail.com>
This PR improves Grafana Alerting integration:
- get alerting contact points "on fly" instead of keeping them in db
- add ability to connect more than one contact point
- add ability to create new contact point on create Grafana Alerting
integration
- show warnings in integration settings for non-active contact points
- remove creation alerting notification policies on create Grafana
Alerting integration
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
---------
Co-authored-by: Rares Mardare <rares.mardare@grafana.com>
# What this PR does
Fixes a bug with Slack acknowledgment reminders not being sent (+ some
refactoring and unit tests).
## Which issue(s) this PR fixes
https://github.com/grafana/oncall/issues/2756
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Update schedule slack notifications to use schedule final events instead
of getting events from iCal
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)