# What this PR does
Fixes a faulty escalation step "Continue escalation if >X alerts per Y
minutes".
## Which issue(s) this PR fixes
https://github.com/grafana/oncall/issues/895
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
The current implementation of the direct paging feature doesn't page
additional responders if the alert group is acknowledged, silenced, or
resolved, and doesn't show any warnings for such cases.
This PR makes so that adding responders for silenced & acknowledged
alert groups actually pages the selected user / schedule. For resolved
alert groups, a warning message will be shown both in web UI and Slack.
## Which issue(s) this PR fixes
Related to https://github.com/grafana/oncall/issues/2442
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Remove
[`apps.get_model`](https://docs.djangoproject.com/en/3.2/ref/applications/#django.apps.apps.get_model)
invocations and use inline `import` statements in places where models
are imported within functions/methods to avoid circular imports.
I believe `import` statements are more appropriate for most use cases as
they allow for better static code analysis & formatting, and solve the
issue of circular imports without being unnecessarily dynamic as
`apps.get_model`. With `import` statements, it's possible to:
- Jump to model definitions in most IDEs
- Automatically sort inline imports with `isort`
- Find import errors faster/easier (most IDEs highlight broken imports)
- Have more consistency across regular & inline imports when importing
models
This PR also adds a flake8 rule to ban imports of `django.apps.apps`, so
it's harder to use `apps.get_model` by mistake (it's possible to ignore
this rule by using `# noqa: I251`). The rule is not enforced on
directories with migration files, because `apps.get_model` is often used
to get a historical state of a model, which is useful when writing
migrations ([see this SO answer for more
details](https://stackoverflow.com/a/37769213)). So `apps.get_model` is
considered OK in migrations (even necessary in some cases).
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
This task does not appear to have been invoked/have any logs associated
within the past month in any of our cloud environments. I'm fairly
certain it is deprecated and can be removed
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
This is a follow up to #2502 which started to remove logic to
"archiving" alert groups. This PR:
- removes all references to `AlertGroup.is_archived` and marks the
column as deprecated. We will remove it in the next release
- removes the `AlertGroup.unarchived_objects` `Manager`
- renames the `AlertGroup.all_objects` `Manager` to `AlertGroup.objects`
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
This PR adds some enhancements to the `check_escalation_finished` celery
task. It short-circuits auditing of an alert group if it does not have
an escalation chain associated with it. In
`EscalationSnapshotMixin.start_escalation_if_needed`
we will not set `raw_escalation_snapshot`
([here](https://github.com/grafana/oncall/blob/dev/engine/apps/alerts/escalation_snapshot/escalation_snapshot_mixin.py#L262))
in this case:
```python3
def start_escalation_if_needed(self, countdown=START_ESCALATION_DELAY, eta=None):
"""
:type self:AlertGroup
"""
AlertGroup = apps.get_model("alerts", "AlertGroup")
is_on_maintenace_or_debug_mode = self.channel.maintenance_mode is not None
if (
self.is_restricted
or is_on_maintenace_or_debug_mode
or self.pause_escalation
or not self.escalation_chain_exists <-- here
):
logger.debug(
f"Not escalating alert group w/ pk: {self.pk}\n"
f"is_restricted: {self.is_restricted}\n"
f"is_on_maintenace_or_debug_mode: {is_on_maintenace_or_debug_mode}\n"
f"pause_escalation: {self.pause_escalation}\n"
f"escalation_chain_exists: {self.escalation_chain_exists}"
)
return
logger.debug(f"Start escalation for alert group with pk: {self.pk}")
# take raw escalation snapshot from db if escalation is paused
raw_escalation_snapshot = (
self.build_raw_escalation_snapshot() if not self.pause_escalation else self.raw_escalation_snapshot
)
task_id = celery_uuid()
AlertGroup.all_objects.filter(pk=self.pk,).update(
active_escalation_id=task_id,
is_escalation_finished=False,
raw_escalation_snapshot=raw_escalation_snapshot,
)
```
`EscalationSnapshotMixin.escalation_chain_exists` is as such:
```python3
@property
def escalation_chain_exists(self) -> bool:
if self.pause_escalation:
return False
elif not self.channel_filter:
return False
return self.channel_filter.escalation_chain is not None
```
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required) (N/A)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required) (N/A)
# What this PR does
Adds an index on the `started_at` column in the `alerts_alertgroup`
table. For the alert groups query used by the
`check_escalation_finished_task`, this resulted in a huge performance
boost, taking the query time from 89mins to 4secs (on our largest
production dataset).
## Which issue(s) this PR fixes
closes#724
closes https://github.com/grafana/oncall-private/issues/1713
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
See more details comments alongside the code.
Regarding frontend changes, the main changes in this PR are to remove
unused fields on the `Team` interface + unused methods on the `Team`
model.
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required) (N/A)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required) (N/A)
# What this PR does
On notifying about shifts change in slack we don't check whether
organization was deleted. This PR excludes schedules from deleted
organizations from shift notification list
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
This PR adds new metric for Prometheus exporter
"user_was_notified_of_alert_groups" which counts how many alert groups
user was notified of.
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
---------
Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
# What this PR does
See #2173
Also, closes#2187 . All of the new files under `type_stubs/icalendar`
were autogenerated by running:
```bash
stubgen -p icalendar -o type_stubs
```
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
---------
Co-authored-by: Innokentii Konstantinov <innokenty.konstantinov@grafana.com>
# What this PR does
- Adds [`mypy` static type checking](https://mypy-lang.org/) to our CI
pipeline. Currently there is still a **ton** of errors being returned by
the tool, as we'll need to fix pre-existing errors. I think we can
slowly chip away at these errors in small PRs, doing them all in one
large PR is likely very risky.
- Also, this PR starts chipping away at one of the main type errors that
we have which is accessing the `datetime` class (from the `datetime`
library) or `timedelta` function on the `django.utils.timezone` module.
Basically we should be instead accessing these two objects from the
native `datetime` module. This makes sense because the [`__all__`
attribute](https://github.com/django/django/blob/main/django/utils/timezone.py#L14-L30)
in `django.utils.timezone` does not re-export `datetime` or `timedelta`.
- splits `engine` dependencies out into `requirements.txt` and
`requirements-dev.txt`
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated (N/A)
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required) (N/A)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required) (N/A)
# What this PR does
- update `make test` to always use `settings.ci-test`. Right now it will
use whatever the value of `DJANGO_SETTINGS_MODULE` is in
`./dev/.env.dev`, which causes ~45 tests to fail
- Fix several Python warnings that we see when running the tests
```bash
RemovedInDjango40Warning: The providing_args argument is deprecated. As it is purely documentational, it has no replacement. If you rely on this argument as documentation, you can move the text to a code comment or docstring.
alert_create_signal = django.dispatch.Signal(
```
```bash
PytestCollectionWarning: cannot collect test class 'TestOnlyBackend' because it has a __init__ constructor (from: apps/api/tests/test_alert_receive_channel_template.py)
class TestOnlyBackend(BaseMessagingBackend):
```
```bash
DeprecationWarning: The parameter 'use_aliases' in emoji.emojize() is deprecated and will be removed in version 2.0.0. Use language='alias' instead.
To hide this warning, pin/downgrade the package to 'emoji~=1.6.3'
return emoji.emojize(self.verbal_name, use_aliases=True)
```
```bash
DateTimeField CustomOnCallShift.start received a naive datetime (2023-06-01 12:53:12) while time zone support is active.
warnings.warn("DateTimeField %s received a naive datetime (%s)"
```
```bash
apps/twilioapp/tests/test_phone_calls.py::test_resolve_by_phone
/etc/app/apps/twilioapp/tests/test_phone_calls.py:173: DeprecationWarning: The 'text' argument to find()-type methods is deprecated. Use 'string' instead.
content = BeautifulSoup(content, features="html.parser").findAll(text=True)
```
```bash
apps/twilioapp/tests/test_phone_calls.py::test_resolve_by_phone
apps/twilioapp/tests/test_phone_calls.py::test_wrong_pressed_digit
/usr/local/lib/python3.11/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
```
```bash
apps/twilioapp/tests/test_phone_calls.py::test_forbidden_requests
/usr/local/lib/python3.11/site-packages/social_django/urls.py:15: RemovedInDjango40Warning: django.conf.urls.url() is deprecated in favor of django.urls.re_path().
url(r'^login/(?P<backend>[^/]+){0}$'.format(extra), views.auth,
```
```bash
apps/twilioapp/tests/test_phone_calls.py: 66 warnings
/usr/local/lib/python3.11/site-packages/debug_toolbar/utils.py:255: DeprecationWarning: currentThread() is deprecated, use current_thread() instead
thread = threading.currentThread()
```
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
Before this change, a diff ical check (which happens with frequency with
imported ical), particularly with overrides in an API/terraform schedule
would trigger unexpected slack notifications because the prev vs current
ical comparison will flag a diff, but when comparing current and
previous shifts, `current_shifts` will have the shift in progress while
the `prev_shifts` calculated from the overrides-only diff will most of
the time be empty (unless you set/change an override at current time).
Simplified the checks to always compare previous current shifts (ie. the
ones in the schedule from the DB) vs the recalculated ones using the
(refreshed) ical data from the schedule.
# What this PR does
This PR moves phone notification logic into separate object PhoneBackend
and introduces PhoneProvider interface to hide actual implementation of
external phone services provider. It should allow add new phone
providers just by implementing one class (See SimplePhoneProvider for
example).
# Why
[Asterisk PR](https://github.com/grafana/oncall/pull/1282) showed that
our phone notification system is not flexible. However this is one of
the most frequent community questions - how to add "X" phone provider.
Also, this refactoring move us one step closer to unifying all
notification backends, since with PhoneBackend all phone notification
logic is collected in one place and independent from concrete
realisation.
# Highligts
1. PhoneBackend object - contains all phone notifications business
logic.
2. PhoneProvider - interface to external phone services provider.
3. TwilioPhoneProvider and SimplePhoneProvider - two examples of
PhoneProvider implementation.
4. PhoneCallRecord and SMSRecord models. I introduced these models to
keep phone notification limits logic decoupled from external providers.
Existing TwilioPhoneCall and TwilioSMS objects will be migrated to the
new table to not to reset limits counter. To be able to receive status
callbacks and gather from Twilio TwilioPhoneCall and TwilioSMS still
exists, but they are linked to PhoneCallRecord and SMSRecord via fk, to
not to leat twilio logic into core code.
---------
Co-authored-by: Yulia Shanyrova <yulia.shanyrova@grafana.com>
# What this PR does
This PR set the limit so that workers won't attempt to autoresolve too
big alertmanager alert groups.
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
- Change FIRING trigger for webhooks to be sent after escalation
snapshot has been computed
- Extract users from `notify_to_users_queue` and `notify_schedule` from
escalation snapshot to populate `users_to_be_notified` in webhook
payload
# What this PR does
This PR:
- modifies the `check_escalation_finished_task` celery task to:
- do stricter escalation validation based on the alert group's
escalation snapshot (see the `audit_alert_group_escalation` method in
`engine/apps/alerts/tasks/check_escalation_finished.py` for the
validation logic)
- use a read-only database for querying alert-groups if one is
configured, otherwise use the "default" one
- ping a configurable heartbeat (new env var
`ALERT_GROUP_ESCALATION_AUDITOR_CELERY_TASK_HEARTBEAT_URL` added)
- increase the task frequency from every 10 to every 13 minutes (this
can be configured via an env variable)
- adds public documentation on how to configure this auditor task
- modifies the local celery startup command to properly take into
consideration all celery related env vars (similar to the ones we use in
`engine/celery_with_exporter.sh`; this made it easier to enable `celery
beat` locally for testing)
- removes the following code:
- removes references to `AlertGroup.estimate_escalation_finish_time` and
marks the model field as deprecated using the [`django-deprecate-fields`
library](https://pypi.org/project/django-deprecate-fields/). This field
was only used for the previous version of this validation task
- `EscalationSnapshotMixin.calculate_eta_for_finish_escalation` was only
used to calculate the value for
`AlertGroup.estimate_escalation_finish_time`
- `calculate_escalation_finish_time` celery task
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/1558
## Checklist
- [x] Tests updated
- [x] Documentation added
- [x] `CHANGELOG.md` updated
This should fix task error as seen in logs, trying to parse an empty
string as ical value:
```
Task apps.schedules.tasks.refresh_ical_files.refresh_ical_file[] raised unexpected: ValueError("Found no components where exactly one is required: ''")
```
# What this PR does
This PR simplifies code of maintenance mode.
1. Perform distribution/escalation maintenance checks in send_signal...
tasks.
2. Use usual alert distribution flow for the maintenance incident.
3. Decouple maintenance mode from slack (all, except
**notify_about_maintenance_action** methods, I don't want to make this
PR too big)
As a bonus from these changes, maintenance mode now mute alert group
delivery in all chatops integrations, not only in slack. (Before,
incidents happened while maintenance were posted to telegram and msteams
anyway)
## Checklist
- [ ] Tests updated
- [ ] Documentation added
- [ ] `CHANGELOG.md` updated
**What this PR does**:
- Keep grafana version on create/update contact points to avoid multiple
requests to alerting
- Add retry limit on create contact point async
- Fix bugs related on create contact point
- Update logs on create/update contact point, make them more clear
- Avoid unnecessary requests to Grafana Alerting
* Modify plugin.json to support RBAC role registration
* defines 26 new custom roles in plugin.json. The main roles are:
- Admin: read/write access to everything in OnCall
- Reader: read access to everything in OnCall
- OnCaller : read access to everything in OnCall + edit access to Alert Groups and Schedules
- <object-type> Editor: read/write access to everything related to <object-type>
- <object-type> Reader: read access for <object-type>
- User Settings Admin: read/write access to all user's settings, not just own settings. This is in comparison to User Settings Editor which can only read/write own settings
* update changelog and documentation (#686)
* implement RBAC for OnCall backend
This commit refactors backend authorization. It trys to use RBAC authorization if the org's grafana instance supports it, otherwise it falls back to basic role authorization.
* update RBAC backend tests
* add tests for RBAC changes
- run backend tests as matrix where RBAC is enabled/disabled. When RBAC is enabled, the permissions granted are read from the role grants in the frontend's plugin.json file (instead of relying what we specify in RBACPermission.Permissions)
- remove --reuse-db --nomigrations flags from engine/tox.ini
- minor autoformatting changes to docker-compose-developer.yml
* remove --ds=settings.ci-test from pytest CI command
DJANGO_SETTINGS_MODULE is already specified as an env var so this is just unecessary duplication
* update gitignore
* update github action job name for "test"
* RBAC frontend changes
* refactors the use of basic roles (ex. Viewer, Editor, Admin) use RBAC permissions (when supported), or falling back to basic roles when RBAC is not supported.
- updates the UserAction enum in grafana-plugin/src/state/userAction.ts. Previously this was hardcoded to a list of strings that were being returned by the OnCall API. Now the values here correspond to the permissions in plugin.json (plus a fallback role)
* changes per Gabriel's comments:
- get rid of group attribute in rbac roles
- remove displayName role attribute
- remove hidden role attribute
- add back role to includes section
* don't try to update user timezone if they don't have permission
* Improve feedback so template errors are given to user
* Add security error logging
* Add limits for templates, payloads, results
* Show popup error notification for webhook errors and template errors that don't have a result
* Update tests
* Split exceptions into warnings/errors to give more control when previewing, rendering, saving templates
* Limit title lengths
* Make TypeError a warning
* Adjust title length limit
* Remove length limiting on urlize since it is being done on template render
* Fix tests
* Add KeyError and ValueError to warnings
* No longer enforcing json result when saving webhook in case it is dependent on payload
* Add tests for expected exceptions coming from apply_jinja_template
* Update changelog
* Send raw post if template result is not JSON
* move mobile notifications to a separate backend, remove critical notification
* remove outdated mobile app code
* MOBILE_APP_PUSH_NOTIFICATIONS_ENABLED -> FEATURE_MOBILE_APP_INTEGRATION_ENABLED
* create error log if no devices are set up
* move mobile auth related code to the mobile_app Django app
* move mobile auth related code to the mobile_app Django app
* move mobile auth related code to the mobile_app Django app
* fix typing
* add GCMDevice todos
* add user connection capabilities
* add user connect/disconnect to the messaging backend
* move APNS endpoint to mobile_app Django app
* restore critical notifications
* support hackathon app
* tweak migrations so mobile app auth tokens are preserved
* reuse notify_by IDs
* use mobile app template to render push notification
* add GCM/FCM (Android) support
* fix unlink user
* logger.error -> logger.info
* remove email verification related code
* remove email verification related code
* remove sendgrid callback
* remove sendgrid related code
* remove sendgrid related code
* rename sendgrid app to email
* remove email from built-in channels
* remove email from built-in channels
* remove email from built-in channels
* add email backend: https://github.com/grafana/oncall/pull/50
* add email templater
* add email templater
* convert md to html
* add email settings to live settings
* use task to send email, handle some exceptions to create logs
* remove ERROR_NOTIFICATION_MAIL_DELIVERY_FAILED usage
* add email limit logic
* fix tests
* add docs
* remove old email templates
* remove old email templates
* add template_fields to messaging backend
* add messaging backends templates to public api
* add comment for deprecated fields
* fix test
* fix tests
* disable email by default
* don't retry on SMTPException and TimeoutError
* add tests
* bring email back to public api docs
* return ERROR_NOTIFICATION_MAIL_LIMIT_EXCEEDED
* make template_fields tuple
* build_subject_and_title -> build_subject_and_message
* add one more comment about template deprecation
* use 8 as backend id
* add comment about gaierror and BadHeaderError
* add comment on importing in notify_user_async
* edit oss docs
* New ical comparision func
* Add support for field `sequence` for custom on-call shifts
* Fix ical comparison
* New ical comparision func 2
* fix
* Revert "Add support for field `sequence` for custom on-call shifts"
This reverts commit b7b18d5a
Co-authored-by: Julia <ferril.darkdiver@gmail.com>
* use web title template to render alert group verbose name
* remove group_verbose_name from tests
* clean up group_verbose_name
* remove verbose_name from API & plugin
* verbose_name migration
* update verbose name on web title template change
* use long queue for updating verbose name
* use first alert for updating verbose name
* improve batch_ids
* fix update_verbose_name
* post-review fixes
* post-review fixes
* use web title template to render alert group verbose name
* remove group_verbose_name from tests
* clean up group_verbose_name
* remove verbose_name from API & plugin
* verbose_name migration
* update verbose name on web title template change
* use long queue for updating verbose name
* use first alert for updating verbose name
* improve batch_ids
* Remove auto-recreating logic for UserNotificationPolicy
It's removed to get rid of select_for_update on User on each notify_user_task
* Fix and add tests
* remove get_user_policies method