# What this PR does
Fixes a bug in the ratelimit logic when integration-specific ratelimit
429s are still counted towards the organization-wide ratelimit.
## Which issue(s) this PR closes
Related to https://github.com/grafana/support-escalations/issues/9579
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
show up in the autogenerated release notes.
# What this PR does
Adds endpoint for integration backsync
Related to https://github.com/grafana/oncall-private/issues/2542
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] Added the relevant release notes label (see labels prefixed w/
`release:`). These labels dictate how your PR will
show up in the autogenerated release notes.
# What this PR does
Remove explicit request size limits both for uwsgi & Django. After this
change, the effective request size limit will be 2.5MB as the default
Django value for
[DATA_UPLOAD_MAX_MEMORY_SIZE](https://docs.djangoproject.com/en/4.2/ref/settings/#data-upload-max-memory-size).
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Related to [this
discussion](https://raintank-corp.slack.com/archives/C04JCU51NF8/p1706550226831949)
Removes the `/oncall` Slack slash command + the concept of
`force_route_id` (as this Slack slash command was the last piece of code
to use this concept
[here](https://github.com/grafana/oncall/blob/dev/engine/apps/slack/scenarios/manual_incident.py#L146))
## TODO before merging
- [x] update the various env's Slack apps to remove the slash command
from the app manifests
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
This PR adds support for routing alerts based on labels.
https://www.loom.com/share/4401de6e3c4945d5b8961fe43ee373c9
Additionally:
- improve the typing around the `get_object` method that is inherited by
[`PublicPrimaryKeyMixin.get_object`](https://github.com/grafana/oncall/blob/dev/engine/common/api_helpers/mixins.py#L153)
in most of our models. `PublicPrimaryKeyMixin` is generic, so it can be
more strongly typed when it is being subclassed, which results in better
typing of the `get_object` method in child classes
- I decided to do this because I started looking into this task via the
[`AlertReceiveChannelView.send_demo_alert`
method/endpoint](https://github.com/grafana/oncall/blob/dev/engine/apps/api/views/alert_receive_channel.py#L242).
Within that method, `instance` is not typed because the inherited
`get_object` method is not typed.. I digress 😄
- improve typing around `Alert.create` and
`apps.integrations.tasks.create_alert` functions
- make `Alert.render_group_data` more DRY by extracting some logic out
into `Alert._apply_jinja_template_to_alert_payload_and_labels`
- deduplicate the logic of `value.strip().lower() in ["1", "true",
"ok"]` into a shared function,
`common.jinja_templater.apply_jinja_template.templated_value_is_truthy`
Closes https://github.com/grafana/oncall-private/issues/2490
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
- [x] Documentation added (or `pr:no public docs` PR label added if not
required) (will be done in #3762)
# What this PR does
This PR simplifies alert group/alert creation, so the alert created and
escalation started in the same task.
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/2374
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
Keep record of the timestamp when the alert group creation task is
triggered, allowing to track the delta time between alert received
datetime and alert group creation timestamp.
Related to https://github.com/grafana/oncall-private/issues/2347
# What this PR does
Upgrade to Python 3.12 + fix several invalid test assertions that lead
to test failures in the latest version of `pytest`:
```
AttributeError: 'called_once_with' is not a valid assertion. Use a spec for the mock if 'called_once_with' is meant to be an attribute.. Did you mean: 'assert_called_once_with'?
```
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
Add alert receive channel id when logging to make it easier to trace
grouping
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
---------
Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
# What this PR does
Handle OrganizationMoved, OrganizationDeleted and PermissionDenied
exceptions same as other integration API views instead of converting to
BadRequest.
## Which issue(s) this PR fixes
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
- Rename `SlackClientWithErrorHandling` to just `SlackClient`
- Add more error classes + improve the way errors are raised based on
the Slack error code
- Add API call retries on Slack server errors (e.g. when Slack returns
`5xx` errors)
- Refactor some methods working with Slack API + add tests
## Which issue(s) this PR fixes
- https://github.com/grafana/oncall-private/issues/1837
- https://github.com/grafana/oncall-private/issues/1840
- https://github.com/grafana/oncall-private/issues/1842
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
- update `slackclient` dependency to latest version. The version we were
using was 5 years old 😲
- first followed the v2 migration guide
[here](https://github.com/slackapi/python-slack-sdk/wiki/Migrating-to-2.x)
followed by the v3 migration guide
[here](https://slack.dev/python-slack-sdk/v3-migration/). The main
changes were:
- The PyPI project was renamed from `slackclient` to `slack_sdk`
- it is discouraged/harder to call `api_call` and encouraged to call the
helper methods (ex. `chat_postMessage`;
[note](https://github.com/slackapi/python-slack-sdk/wiki/Migrating-to-2.x#web-client-api-changes)
in migration guide docs)
- In 1.x, a failed api call would return the error payload to you and
have you handle the error. In 2.x, a failed api call will throw an
exception. To handle this in your code, you will have to wrap api calls
with a try except block. Since we overload `WebClient.api_call` this was
an easy change and only required a one line change
- remove `apps.slack.slack_client.slack_server.SlackClientServer` class.
The new version of `slack_sdk` handles the case that we needed to
overload for in the first place.
- merged `apps/slack/slack_client/slack_client.py` and
`apps/slack/slack_client/exceptions.py` into `apps/slack/client.py`
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
If the same heartbeat is requested at a high rate it can create lock
contention when updating the timestamp in the DB. Moving to always run
update in task should free up the connection on the API server faster,
although the task might still see some lock wait time.
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
---------
Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
Co-authored-by: Michael Derynck <michael.derynck@grafana.com>
Introduce AlertManager v2 integration with improved internal behaviour
it's using grouping from AlertManager, not trying to re-group alerts on
OnCall side.
Existing AlertManager and Grafana Alerting integrations are marked as
Legacy with options to migrate them manually now or be migrated
automatically after DEPRECATION DATE(TBD).
Integration urls and public api responses stay the same both for legacy
and new integrations.
---------
Co-authored-by: Rares Mardare <rares.mardare@grafana.com>
Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
# What this PR does
Remove
[`apps.get_model`](https://docs.djangoproject.com/en/3.2/ref/applications/#django.apps.apps.get_model)
invocations and use inline `import` statements in places where models
are imported within functions/methods to avoid circular imports.
I believe `import` statements are more appropriate for most use cases as
they allow for better static code analysis & formatting, and solve the
issue of circular imports without being unnecessarily dynamic as
`apps.get_model`. With `import` statements, it's possible to:
- Jump to model definitions in most IDEs
- Automatically sort inline imports with `isort`
- Find import errors faster/easier (most IDEs highlight broken imports)
- Have more consistency across regular & inline imports when importing
models
This PR also adds a flake8 rule to ban imports of `django.apps.apps`, so
it's harder to use `apps.get_model` by mistake (it's possible to ignore
this rule by using `# noqa: I251`). The rule is not enforced on
directories with migration files, because `apps.get_model` is often used
to get a historical state of a model, which is useful when writing
migrations ([see this SO answer for more
details](https://stackoverflow.com/a/37769213)). So `apps.get_model` is
considered OK in migrations (even necessary in some cases).
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
# What this PR does
- Remove `heartbeat_heartbeat` table. This model/table does not seems to
be deprecated/used anywhere (no data in this in production/staging; see
more comments in the code about this).
# What this PR does
Introduces AlertManagerV2 integration with better grouping and
autoresolving, not intended for production use yet.
---------
Co-authored-by: Ildar Iskhakov <Ildar.iskhakov@grafana.com>
Adds a ratelimit for AmazonSNS.
AlertChannelDefining mixin is now injecting alert_receive_channel only
in request, not in kwargs to not to break AmazonSNS.
# What this PR does
Adds docs & logo for AppDynamics integration.
Main PR in private repo:
https://github.com/grafana/oncall-private/pull/1790.
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/1621
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- No changelog (AppDynamics integration will be only available in cloud)
# What this PR does
https://www.loom.com/share/18cc445117de4895a10892d56c7d3699
In preparation to upgrade our cloud databases, this PR makes some minor
changes which, after testing locally, allowed the `POST
/<integration_type>/<alert_channel_key>` endpoints to successfully
receive incoming alerts and queue the celery tasks.
I've tested all of the defined `POST
/integrations/v1/<integration_type>/<alert_channel_key>` endpoints by
sending `POST` requests to an integrations' URL while the MySQL database
was down, bringing the database back up, and ensuring the alerts were
created.
## Some other findings
- the integration heartbeat endpoints will not work as we interact w/
the database to persist the incoming heartbeat instance
- if the integration was created in the last 180 seconds, incoming
alerts will fail due to the way we cache the integration IDs
([code](https://github.com/grafana/oncall/blob/dev/engine/apps/integrations/mixins/alert_channel_defining_mixin.py#L47-L50))
- The `create_alert` celery task is set to `max_retries=None` and
`retry_backoff=True`. This means that the queued tasks will continue
retrying forever w/ an exponential backoff, until the alerts can be
created in the database (ie. when the database is back online).
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated (N/A)
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required) (N/A)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required) (N/A)
# What this PR does
There was an unnecessary indentation in the `rules:` key which made it
invalid YAML.
Also replaced the mentions to Amixr with Grafana OnCall, used some
`<code>` tags and reworded some sentences.
Also removed the anchor tag from the webhook link: we don't want people
to follow that in their browser, we want them to copy it
## Result screenshot

## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
---------
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
# What this PR does
Add docs & logo for Zendesk integration. Main PR in private repo:
https://github.com/grafana/oncall-private/pull/1772
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/1627
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] No changelog (Zendesk integration will be only available in cloud)
# What this PR does
Add docs & logo for Jira integration. Main PR in private repo:
https://github.com/grafana/oncall-private/pull/1769
## Which issue(s) this PR fixes
https://github.com/grafana/oncall-private/issues/1620
## Checklist
- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] No changelog (Jira integration will be only available in cloud)
# What this PR does
This PR set the limit so that workers won't attempt to autoresolve too
big alertmanager alert groups.
## Which issue(s) this PR fixes
## Checklist
- [ ] Unit, integration, and e2e (if applicable) tests updated
- [ ] Documentation added (or `pr:no public docs` PR label added if not
required)
- [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
This PR add Inbound Email integration.
It designed to support some variety of ESPs, but in prod we will use
Mailgun, so locally I tested it only with mailgun ESP.
**Important:**
To make it work on different clusters I'm planning to provide different
email domains for different regions, like ....@us.oncall.grafana.net,
...@eu.oncall.grafana.net
---------
Co-authored-by: Innokentii Konstantinov <innokenty.konstantinov@grafana.com>