Commit graph

15 commits

Author SHA1 Message Date
Michael Derynck
2024ee7f78
feat: Auto retry escalation on failed audit (#5265)
# What this PR does
Automatically retries escalation when alert groups fail auditing. This
is the same effect as the continue_escalation command without any of the
extra arguments.

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
    show up in the autogenerated release notes.
2024-11-19 22:23:15 +00:00
Matias Bordese
65ee57f563
Ignore uncompleted notifications if policy is deleted (#4260)
Related to https://github.com/grafana/oncall-private/issues/2637
2024-04-23 11:40:24 +00:00
Yulya Artyukhina
ba122ec6ef
Update notification checker (#3818)
# What this PR does
Count sms with status "accepted" as delivered in notification checker
## Which issue(s) this PR fixes

https://raintank-corp.slack.com/archives/C025VMT6SPK/p1706799009342889?thread_ts=1706786822.083149&cid=C025VMT6SPK
## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2024-02-01 15:42:43 +00:00
Matias Bordese
2fd456fc77
Update alert group personal notifications checker to check sent SMS (#3698)
Sent SMS messages are considered completed for our purpose here (ie. do
not wait for Twilio delivered confirmation).
2024-01-17 17:46:18 +00:00
Matias Bordese
4e2e7e0a15
Add task logging personal notifications triggered/completed counts (#3638)
Related to https://github.com/grafana/oncall-private/issues/2347
2024-01-10 18:54:27 +00:00
Matias Bordese
f68b9dd004
Update auditor to check personal notifications (#3563)
Requires https://github.com/grafana/oncall/pull/3557

Related to https://github.com/grafana/oncall-private/issues/2347
2023-12-18 16:13:18 +00:00
Yulya Artyukhina
36227418ed
Speed up escalation auditor (#3578)
# What this PR does
Speed up escalation auditor
- use raw escalation snapshot instead of serialized one

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-12-18 12:28:55 +00:00
Matias Bordese
6dada51133
Remove unneeded filter making query slower (#3570)
There is no index for the `received_at` column, and the filter isn't
really needed (aggregation will work in any case, considering only the
entries for which we have data).
2023-12-14 18:25:34 +00:00
Matias Bordese
3feba3675b
Log average/max delta between alert ingestion and alert group creation (#3526)
Related to https://github.com/grafana/oncall-private/issues/2347
2023-12-07 16:03:41 +00:00
Ildar Iskhakov
784c5ee7c1
Add notifications success ratio log to auditor (#3312)
# What this PR does

This PR adds alert groups success ratio over last 48 hours

## Which issue(s) this PR fixes

## Checklist

- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-11-10 16:39:13 +08:00
Yulya Artyukhina
5bc7351671
Fix next step eta for silenced alert groups (#2887)
# What this PR does
Update `next_step_eta` in alert group escalation snapshot when alert
group is silenced for period

## Which issue(s) this PR fixes
Fixes the issue related to [this
one](https://github.com/grafana/oncall-private/issues/2028)

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)

---------

Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
Co-authored-by: Joey Orlando <joseph.t.orlando@gmail.com>
2023-08-28 12:13:01 +00:00
Joey Orlando
d5b43b0439
minor improvements for check_escalation_finished celery task (#2554)
# What this PR does

This PR adds some enhancements to the `check_escalation_finished` celery
task. It short-circuits auditing of an alert group if it does not have
an escalation chain associated with it. In
`EscalationSnapshotMixin.start_escalation_if_needed`
we will not set `raw_escalation_snapshot`
([here](https://github.com/grafana/oncall/blob/dev/engine/apps/alerts/escalation_snapshot/escalation_snapshot_mixin.py#L262))
in this case:
```python3
def start_escalation_if_needed(self, countdown=START_ESCALATION_DELAY, eta=None):
        """
        :type self:AlertGroup
        """
        AlertGroup = apps.get_model("alerts", "AlertGroup")

        is_on_maintenace_or_debug_mode = self.channel.maintenance_mode is not None

        if (
            self.is_restricted
            or is_on_maintenace_or_debug_mode
            or self.pause_escalation
            or not self.escalation_chain_exists <-- here
        ):
            logger.debug(
                f"Not escalating alert group w/ pk: {self.pk}\n"
                f"is_restricted: {self.is_restricted}\n"
                f"is_on_maintenace_or_debug_mode: {is_on_maintenace_or_debug_mode}\n"
                f"pause_escalation: {self.pause_escalation}\n"
                f"escalation_chain_exists: {self.escalation_chain_exists}"
            )
            return

        logger.debug(f"Start escalation for alert group with pk: {self.pk}")

        # take raw escalation snapshot from db if escalation is paused
        raw_escalation_snapshot = (
            self.build_raw_escalation_snapshot() if not self.pause_escalation else self.raw_escalation_snapshot
        )
        task_id = celery_uuid()

        AlertGroup.all_objects.filter(pk=self.pk,).update(
            active_escalation_id=task_id,
            is_escalation_finished=False,
            raw_escalation_snapshot=raw_escalation_snapshot,
        )
```

`EscalationSnapshotMixin.escalation_chain_exists` is as such:
```python3
@property
    def escalation_chain_exists(self) -> bool:
        if self.pause_escalation:
            return False
        elif not self.channel_filter:
            return False
        return self.channel_filter.escalation_chain is not None
```

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required) (N/A)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required) (N/A)
2023-07-17 14:04:53 +00:00
Joey Orlando
77f6dedce5
add index on started_at column in alert groups (#2516)
# What this PR does

Adds an index on the `started_at` column in the `alerts_alertgroup`
table. For the alert groups query used by the
`check_escalation_finished_task`, this resulted in a huge performance
boost, taking the query time from 89mins to 4secs (on our largest
production dataset).

## Which issue(s) this PR fixes

closes #724
closes https://github.com/grafana/oncall-private/issues/1713

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
2023-07-13 05:22:59 -04:00
Joey Orlando
4d655dff60
modify check_escalation_finished_task task (#1266)
# What this PR does

This PR:
- modifies the `check_escalation_finished_task` celery task to:
  - do stricter escalation validation based on the alert group's
escalation snapshot (see the `audit_alert_group_escalation` method in
`engine/apps/alerts/tasks/check_escalation_finished.py` for the
validation logic)
- use a read-only database for querying alert-groups if one is
configured, otherwise use the "default" one
- ping a configurable heartbeat (new env var
`ALERT_GROUP_ESCALATION_AUDITOR_CELERY_TASK_HEARTBEAT_URL` added)
- increase the task frequency from every 10 to every 13 minutes (this
can be configured via an env variable)
  - adds public documentation on how to configure this auditor task
- modifies the local celery startup command to properly take into
consideration all celery related env vars (similar to the ones we use in
`engine/celery_with_exporter.sh`; this made it easier to enable `celery
beat` locally for testing)
- removes the following code:
- removes references to `AlertGroup.estimate_escalation_finish_time` and
marks the model field as deprecated using the [`django-deprecate-fields`
library](https://pypi.org/project/django-deprecate-fields/). This field
was only used for the previous version of this validation task
- `EscalationSnapshotMixin.calculate_eta_for_finish_escalation` was only
used to calculate the value for
`AlertGroup.estimate_escalation_finish_time`
  - `calculate_escalation_finish_time` celery task
  

## Which issue(s) this PR fixes

https://github.com/grafana/oncall-private/issues/1558

## Checklist

- [x] Tests updated
- [x] Documentation added
- [x] `CHANGELOG.md` updated
2023-03-17 10:14:08 +00:00
Michael Derynck
6b40f95033 World, meet OnCall!
Co-authored-by: Eve832 <eve.meelan@grafana.com>
    Co-authored-by: Francisco Montes de Oca <nevermind89x@gmail.com>
    Co-authored-by: Ildar Iskhakov <ildar.iskhakov@grafana.com>
    Co-authored-by: Innokentii Konstantinov <innokenty.konstantinov@grafana.com>
    Co-authored-by: Julia <ferril.darkdiver@gmail.com>
    Co-authored-by: maskin25 <kengurek@gmail.com>
    Co-authored-by: Matias Bordese <mbordese@gmail.com>
    Co-authored-by: Matvey Kukuy <motakuk@gmail.com>
    Co-authored-by: Michael Derynck <michael.derynck@grafana.com>
    Co-authored-by: Richard Hartmann <richih@richih.org>
    Co-authored-by: Robby Milo <robbymilo@fastmail.com>
    Co-authored-by: Timur Olzhabayev <timur.olzhabayev@grafana.com>
    Co-authored-by: Vadim Stepanov <vadimkerr@gmail.com>
    Co-authored-by: Yulia Shanyrova <yulia.shanyrova@grafana.com>
2022-06-03 08:09:47 -06:00