# What this PR does
Automatically retries escalation when alert groups fail auditing. This
has the same effect as running the `continue_escalation` command without
any of the extra arguments.
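A minimal sketch of the retry flow, assuming hypothetical `audit` and
`retry_escalation` callables (neither name comes from the codebase, and
the real task may be structured differently):

```python
from typing import Callable, Iterable, List


class AuditFailed(Exception):
    """Hypothetical error raised when an alert group fails auditing."""


def audit_and_retry(
    alert_group_ids: Iterable[int],
    audit: Callable[[int], None],
    retry_escalation: Callable[[int], None],
) -> List[int]:
    """Audit each alert group and retry escalation for the ones that fail.

    Returns the ids of the alert groups that failed auditing.
    """
    failed = []
    for pk in alert_group_ids:
        try:
            audit(pk)
        except AuditFailed:
            failed.append(pk)
            # Assumption: retrying is a plain "continue escalation" call
            # with no extra arguments.
            retry_escalation(pk)
    return failed
```

Passing the two callables in keeps the sketch independent of how the task actually audits or resumes an alert group.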
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
show up in the autogenerated release notes.
# What this PR does
Count SMS messages with status "accepted" as delivered in the notification checker
## Which issue(s) this PR fixes
https://raintank-corp.slack.com/archives/C025VMT6SPK/p1706799009342889?thread_ts=1706786822.083149&cid=C025VMT6SPK
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Speed up the escalation auditor
- use the raw escalation snapshot instead of the serialized one
## Which issue(s) this PR fixes
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
There is no index for the `received_at` column, and the filter isn't
really needed (aggregation will work in any case, considering only the
entries for which we have data).
# What this PR does
This PR adds the alert group success ratio over the last 48 hours
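As an illustration only, the ratio could be computed as below. The
`(started_at, passed)` record shape and the `success_ratio` name are
assumptions; the PR does not describe how the metric is stored or
queried.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple


def success_ratio(
    results: Iterable[Tuple[datetime, bool]],
    now: datetime,
    window: timedelta = timedelta(hours=48),
) -> Optional[float]:
    """Return the share of audited alert groups that passed within `window`.

    `results` holds hypothetical (started_at, passed) pairs. Returns None
    when no alert group falls inside the window.
    """
    recent = [passed for started_at, passed in results if now - window <= started_at <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)
```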
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Update `next_step_eta` in the alert group's escalation snapshot when the
alert group is silenced for a period
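A rough sketch of the idea, assuming the raw snapshot is a dict with a
`next_step_eta` key and that the new ETA is the moment the silence ends
(both are assumptions; `with_silence_eta` is a hypothetical helper):

```python
from datetime import datetime, timedelta, timezone


def with_silence_eta(raw_snapshot: dict, silenced_at: datetime, silence_delay: timedelta) -> dict:
    """Return a copy of the snapshot whose next step is due when the silence ends."""
    updated = dict(raw_snapshot)  # shallow copy, the input stays untouched
    # Assumption: the ETA is stored as an ISO 8601 string.
    updated["next_step_eta"] = (silenced_at + silence_delay).isoformat()
    return updated
```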
## Which issue(s) this PR fixes
Fixes the issue related to [this
one](https://github.com/grafana/oncall-private/issues/2028)
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
---------
Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
Co-authored-by: Joey Orlando <joseph.t.orlando@gmail.com>
# What this PR does
This PR adds some enhancements to the `check_escalation_finished` celery
task. It short-circuits auditing of an alert group that does not have
an escalation chain associated with it, because in that case
`EscalationSnapshotMixin.start_escalation_if_needed`
does not set `raw_escalation_snapshot`
([here](https://github.com/grafana/oncall/blob/dev/engine/apps/alerts/escalation_snapshot/escalation_snapshot_mixin.py#L262)):
```python3
def start_escalation_if_needed(self, countdown=START_ESCALATION_DELAY, eta=None):
    """
    :type self:AlertGroup
    """
    AlertGroup = apps.get_model("alerts", "AlertGroup")
    is_on_maintenace_or_debug_mode = self.channel.maintenance_mode is not None
    if (
        self.is_restricted
        or is_on_maintenace_or_debug_mode
        or self.pause_escalation
        or not self.escalation_chain_exists  # <-- here
    ):
        logger.debug(
            f"Not escalating alert group w/ pk: {self.pk}\n"
            f"is_restricted: {self.is_restricted}\n"
            f"is_on_maintenace_or_debug_mode: {is_on_maintenace_or_debug_mode}\n"
            f"pause_escalation: {self.pause_escalation}\n"
            f"escalation_chain_exists: {self.escalation_chain_exists}"
        )
        return
    logger.debug(f"Start escalation for alert group with pk: {self.pk}")
    # take raw escalation snapshot from db if escalation is paused
    raw_escalation_snapshot = (
        self.build_raw_escalation_snapshot() if not self.pause_escalation else self.raw_escalation_snapshot
    )
    task_id = celery_uuid()
    AlertGroup.all_objects.filter(pk=self.pk,).update(
        active_escalation_id=task_id,
        is_escalation_finished=False,
        raw_escalation_snapshot=raw_escalation_snapshot,
    )
```
`EscalationSnapshotMixin.escalation_chain_exists` is defined as follows:
```python3
@property
def escalation_chain_exists(self) -> bool:
    if self.pause_escalation:
        return False
    elif not self.channel_filter:
        return False
    return self.channel_filter.escalation_chain is not None
```
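The short-circuit itself could look roughly like this. It is a
standalone sketch: `audit_alert_group`, `AuditError`, and the string
return values are hypothetical and are not the task's actual API.

```python
from typing import Optional


class AuditError(Exception):
    """Hypothetical error for an alert group that fails auditing."""


def audit_alert_group(escalation_chain_exists: bool, raw_escalation_snapshot: Optional[dict]) -> str:
    """Skip alert groups without an escalation chain, audit the rest."""
    if not escalation_chain_exists:
        # No chain means no snapshot was ever stored, so there is nothing to validate.
        return "skipped"
    if raw_escalation_snapshot is None:
        raise AuditError("alert group has an escalation chain but no snapshot")
    # The real validation logic would inspect the snapshot here.
    return "audited"
```

Checking the chain first avoids reporting a missing snapshot as a failure for alert groups that were never meant to escalate.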
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required) (N/A)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required) (N/A)
# What this PR does
Adds an index on the `started_at` column in the `alerts_alertgroup`
table. For the alert groups query used by the
`check_escalation_finished_task`, this cut the query time from
89 minutes to 4 seconds (on our largest production dataset).
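To illustrate why the index helps, the sketch below uses an in-memory
SQLite database as a stand-in for the production database. The index
name `idx_alertgroup_started_at`, the simplified two-column table, and
the query are assumptions made for the example, not the actual
migration.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


def demo() -> tuple:
    """Compare the plan of a `started_at` range query before and after indexing."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE alerts_alertgroup (id INTEGER PRIMARY KEY, started_at TEXT)")
    conn.executemany(
        "INSERT INTO alerts_alertgroup (started_at) VALUES (?)",
        [(f"2023-01-{day:02d}T00:00:00",) for day in range(1, 29)],
    )
    sql = "SELECT id FROM alerts_alertgroup WHERE started_at >= ?"
    params = ("2023-01-20T00:00:00",)
    before = query_plan(conn, sql, params)  # full table scan
    conn.execute("CREATE INDEX idx_alertgroup_started_at ON alerts_alertgroup (started_at)")
    after = query_plan(conn, sql, params)  # search using the new index
    conn.close()
    return before, after
```

Without the index the planner has to scan every row; with it, the range filter on `started_at` becomes an index search.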
## Which issue(s) this PR fixes
closes #724
closes https://github.com/grafana/oncall-private/issues/1713
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
This PR:
- modifies the `check_escalation_finished_task` celery task to:
  - do stricter escalation validation based on the alert group's
    escalation snapshot (see the `audit_alert_group_escalation` method in
    `engine/apps/alerts/tasks/check_escalation_finished.py` for the
    validation logic)
  - use a read-only database for querying alert groups if one is
    configured, otherwise use the "default" one
  - ping a configurable heartbeat (new env var
    `ALERT_GROUP_ESCALATION_AUDITOR_CELERY_TASK_HEARTBEAT_URL` added)
  - run every 13 minutes instead of every 10 (this can be configured
    via an env variable)
- adds public documentation on how to configure this auditor task
- modifies the local celery startup command to properly take into
  consideration all celery related env vars (similar to the ones we use in
  `engine/celery_with_exporter.sh`; this made it easier to enable
  `celery beat` locally for testing)
- removes the following code:
  - references to `AlertGroup.estimate_escalation_finish_time`; the model
    field is now marked as deprecated using the [`django-deprecate-fields`
    library](https://pypi.org/project/django-deprecate-fields/). This field
    was only used for the previous version of this validation task
  - `EscalationSnapshotMixin.calculate_eta_for_finish_escalation`, which
    was only used to calculate the value for
    `AlertGroup.estimate_escalation_finish_time`
  - the `calculate_escalation_finish_time` celery task
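
The database selection and heartbeat behavior described above can be
sketched as follows. Only the env var name comes from this PR; the
`readonly` alias, the helper names, and the injected `http_get` callable
are assumptions for illustration.

```python
from typing import Callable, Mapping

HEARTBEAT_ENV_VAR = "ALERT_GROUP_ESCALATION_AUDITOR_CELERY_TASK_HEARTBEAT_URL"


def pick_database_alias(databases: Mapping[str, object], read_only_alias: str = "readonly") -> str:
    """Prefer the read-only database when one is configured, else "default"."""
    return read_only_alias if read_only_alias in databases else "default"


def send_heartbeat(environ: Mapping[str, str], http_get: Callable[[str], object]) -> bool:
    """Ping the heartbeat URL if the env var is set; report whether a ping was sent."""
    url = environ.get(HEARTBEAT_ENV_VAR)
    if not url:
        return False  # heartbeat not configured, nothing to do
    http_get(url)
    return True
```

In a real task, `environ` would typically be `os.environ` and `http_get` an HTTP client call; injecting both keeps the sketch testable without network access.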
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/1558
## Checklist
- [x] Tests updated
- [x] Documentation added
- [x] `CHANGELOG.md` updated