centralcloud/oncall-engine

Author	SHA1	Message	Date
Vadim Stepanov	b7e2dc14f8	Fix ratelimit bug (#4108 ) # What this PR does Fixes a bug in the ratelimit logic when integration-specific ratelimit 429s are still counted towards the organization-wide ratelimit. ## Which issue(s) this PR closes Related to https://github.com/grafana/support-escalations/issues/9579 ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] Added the relevant release notes label (see labels prefixed w/ `release:`). These labels dictate how your PR will show up in the autogenerated release notes.	2024-03-26 17:20:05 +00:00
Yulya Artyukhina	ed60b67884	Integration backsync endpoint (#4082 ) # What this PR does Adds an endpoint for integration backsync. Related to https://github.com/grafana/oncall-private/issues/2542 ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] Added the relevant release notes label (see labels prefixed w/ `release:`). These labels dictate how your PR will show up in the autogenerated release notes.	2024-03-20 11:26:33 +00:00
Matias Bordese	eb1228e782	Update universal integrations to reject requests without payload (#4053 ) Reject integration requests with a null payload	2024-03-14 15:51:46 +00:00
Vadim Stepanov	2b5554c079	Remove explicit request size limits (#3878 ) # What this PR does Remove explicit request size limits both for uwsgi & Django. After this change, the effective request size limit will be 2.5MB as the default Django value for [DATA_UPLOAD_MAX_MEMORY_SIZE](https://docs.djangoproject.com/en/4.2/ref/settings/#data-upload-max-memory-size). ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2024-02-22 15:00:33 +00:00
Joey Orlando	aca2804502	add `pytest-xdist` to speed up backend tests (#3839 ) # What this PR does Speeds up `pytest` test execution by ~30%. More specifically, adds [`pytest-xdist`](https://pytest-xdist.readthedocs.io/en/stable/), which according to their docs: > plugin extends pytest with new test execution modes, the most used being distributing tests across multiple CPUs to speed up test execution Before <img width="270" alt="Screenshot 2024-02-05 at 15 53 13" src="https://github.com/grafana/oncall/assets/9406895/4da33299-5bd0-4dc3-86e1-32cfdf9106f7"> After <img width="254" alt="Screenshot 2024-02-05 at 15 53 04" src="https://github.com/grafana/oncall/assets/9406895/a59eeb52-291d-4cdc-82b2-55fd31e1c1c5">	2024-02-05 16:04:15 -05:00
Joey Orlando	3833d8de56	remove manual alert group (`/oncall`) slack slash command + `force_route_id` (#3790 ) # What this PR does Related to [this discussion](https://raintank-corp.slack.com/archives/C04JCU51NF8/p1706550226831949) Removes the `/oncall` Slack slash command + the concept of `force_route_id` (as this Slack slash command was the last piece of code to use this concept [here](https://github.com/grafana/oncall/blob/dev/engine/apps/slack/scenarios/manual_incident.py#L146)) ## TODO before merging - [x] update the various env's Slack apps to remove the slash command from the app manifests ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2024-01-30 17:28:23 -05:00
Joey Orlando	06933a696a	Support alert routing based on labels (#3778 ) # What this PR does This PR adds support for routing alerts based on labels. https://www.loom.com/share/4401de6e3c4945d5b8961fe43ee373c9 Additionally: - improve the typing around the `get_object` method that is inherited by [`PublicPrimaryKeyMixin.get_object`](https://github.com/grafana/oncall/blob/dev/engine/common/api_helpers/mixins.py#L153) in most of our models. `PublicPrimaryKeyMixin` is generic, so it can be more strongly typed when it is being subclassed, which results in better typing of the `get_object` method in child classes - I decided to do this because I started looking into this task via the [`AlertReceiveChannelView.send_demo_alert` method/endpoint](https://github.com/grafana/oncall/blob/dev/engine/apps/api/views/alert_receive_channel.py#L242). Within that method, `instance` is not typed because the inherited `get_object` method is not typed.. I digress 😄 - improve typing around `Alert.create` and `apps.integrations.tasks.create_alert` functions - make `Alert.render_group_data` more DRY by extracting some logic out into `Alert._apply_jinja_template_to_alert_payload_and_labels` - deduplicate the logic of `value.strip().lower() in ["1", "true", "ok"]` into a shared function, `common.jinja_templater.apply_jinja_template.templated_value_is_truthy` Closes https://github.com/grafana/oncall-private/issues/2490 ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required) - [x] Documentation added (or `pr:no public docs` PR label added if not required) (will be done in #3762)	2024-01-30 13:07:19 -05:00
Ildar Iskhakov	401d279d54	Refactor create_alert task (#3759 ) # What this PR does This PR simplifies alert group/alert creation, so that the alert is created and escalation is started in the same task. ## Which issue(s) this PR fixes ## Checklist - [ ] Unit, integration, and e2e (if applicable) tests updated - [ ] Documentation added (or `pr:no public docs` PR label added if not required) - [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2024-01-30 08:39:04 +00:00
Matias Bordese	dbd5452a0b	Handle a possible outdated cached integration error (#3741 ) Related to [logs](https://ops.grafana-ops.net/explore?schemaVersion=1&panes=%7B%22hum%22:%7B%22datasource%22:%22c-R8UWvVk%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22amixr-prod%5C%22,%20job%3D%5C%22amixr-prod%2Famixr-integrations%5C%22%7D%20%7C%3D%20%5C%22django.core.serializers.base.DeserializationError%5C%22%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22c-R8UWvVk%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22from%22:%221706023840486%22,%22to%22:%221706024722486%22%7D%7D%7D&orgId=1)	2024-01-23 20:46:12 +00:00
Yulya Artyukhina	0421bc472a	Fix posting slack message about ratelimits (#3582 ) # What this PR does ## Which issue(s) this PR fixes https://github.com/grafana/oncall-private/issues/2374 ## Checklist - [ ] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2023-12-19 06:05:57 +00:00
Matias Bordese	054401a214	Fix missing timestamp value, add test (#3522 )	2023-12-06 16:02:54 +00:00
Matias Bordese	e053eb084d	Track alert received timestamp on alert group creation (#3513 ) Keep record of the timestamp when the alert group creation task is triggered, allowing to track the delta time between alert received datetime and alert group creation timestamp. Related to https://github.com/grafana/oncall-private/issues/2347	2023-12-06 12:20:03 +00:00
Matias Bordese	6e35aadc0c	Add test ensuring integration endpoints work if redis cache is down (#3445 )	2023-12-01 17:45:18 +00:00
Joey Orlando	76a88bc0c1	Revert "upgrade to Python 3.12 (#3456 )" and "bump uwsgi version to latest #3466 " (#3483 ) # What this PR does This reverts commits `7c4b40a046` and `cdb22285db`. See https://github.com/grafana/oncall-private/pull/2361 for more details.	2023-12-01 09:56:26 -05:00
Joey Orlando	7c4b40a046	upgrade to Python 3.12 (#3456 ) # What this PR does Upgrade to Python 3.12 + fix several invalid test assertions that lead to test failures in the latest version of `pytest`: ``` AttributeError: 'called_once_with' is not a valid assertion. Use a spec for the mock if 'called_once_with' is meant to be an attribute.. Did you mean: 'assert_called_once_with'? ``` ## Checklist - [ ] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2023-11-30 13:47:41 +00:00
Matias Bordese	aa8a904a8d	Update when slack client ratelimit retry handler is enabled (#3447 )	2023-11-30 12:35:46 +00:00
Ildar Iskhakov	95a3ab3b75	Revert "Cache independent ingestion" (#3417 ) Reverts grafana/oncall#3415	2023-11-23 21:38:06 +08:00
Ildar Iskhakov	a6912c96af	Merge pull request #3415 from grafana/iskhakov/cache-independent-ingestion Cache independent ingestion	2023-11-23 16:22:18 +08:00
Ildar Iskhakov	566e8c53ba	Ignore typing checks for imported library (https://mypy.readthedocs.io/en/stable/running_mypy.html\#missing-library-stubs-or-py-typed-marker )	2023-11-23 16:14:30 +08:00
Ildar Iskhakov	0d5ef785bf	Make alert ingestion cache independent	2023-11-23 11:27:47 +08:00
Michael Derynck	b3583cd1a0	Add more logging info on alert creation (#3392 ) # What this PR does Add alert receive channel id when logging to make it easier to trace grouping ## Which issue(s) this PR fixes ## Checklist - [ ] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required) --------- Co-authored-by: Joey Orlando <joey.orlando@grafana.com>	2023-11-21 16:16:15 +00:00
Matias Bordese	8ddea0576e	Add test ensuring ingestion works without db access (#3322 ) Handling alert payloads should work without db access (but still requires cached integrations information)	2023-11-10 17:44:37 +00:00
Michael Derynck	ad1f63dbe9	AmazonSNS integration exception handling (#3315 ) # What this PR does Handle OrganizationMoved, OrganizationDeleted and PermissionDenied exceptions same as other integration API views instead of converting to BadRequest. ## Which issue(s) this PR fixes ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2023-11-10 13:45:12 +00:00
Matvey Kukuy	a898835eb4	Fixing ratelimit texts	2023-09-27 14:18:00 +03:00
Vadim Stepanov	8b2212c7dc	Improve Slack error handling (#3000 ) # What this PR does - Rename `SlackClientWithErrorHandling` to just `SlackClient` - Add more error classes + improve the way errors are raised based on the Slack error code - Add API call retries on Slack server errors (e.g. when Slack returns `5xx` errors) - Refactor some methods working with Slack API + add tests ## Which issue(s) this PR fixes - https://github.com/grafana/oncall-private/issues/1837 - https://github.com/grafana/oncall-private/issues/1840 - https://github.com/grafana/oncall-private/issues/1842 ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2023-09-12 09:49:16 +00:00
Joey Orlando	a9155130df	update slack_sdk dependency to latest version (#2947 ) # What this PR does - update `slackclient` dependency to latest version. The version we were using was 5 years old 😲 - first followed the v2 migration guide [here](https://github.com/slackapi/python-slack-sdk/wiki/Migrating-to-2.x) followed by the v3 migration guide [here](https://slack.dev/python-slack-sdk/v3-migration/). The main changes were: - The PyPI project was renamed from `slackclient` to `slack_sdk` - it is discouraged/harder to call `api_call` and encouraged to call the helper methods (ex. `chat_postMessage`; [note](https://github.com/slackapi/python-slack-sdk/wiki/Migrating-to-2.x#web-client-api-changes) in migration guide docs) - In 1.x, a failed api call would return the error payload to you and have you handle the error. In 2.x, a failed api call will throw an exception. To handle this in your code, you will have to wrap api calls with a try except block. Since we overload `WebClient.api_call` this was an easy change and only required a one line change - remove `apps.slack.slack_client.slack_server.SlackClientServer` class. The new version of `slack_sdk` handles the case that we needed to overload for in the first place. - merged `apps/slack/slack_client/slack_client.py` and `apps/slack/slack_client/exceptions.py` into `apps/slack/client.py` ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2023-09-05 11:31:59 +02:00
Matias Bordese	0dea5661c4	Reject file uploads when posting to an integration endpoint (#2958 ) Related to https://github.com/grafana/oncall-private/issues/2145 --------- Co-authored-by: Joey Orlando <joey.orlando@grafana.com>	2023-09-05 10:01:50 +02:00
Michael Derynck	e0e1f4b021	Always update last_heartbeat_time async (#2892 ) # What this PR does If the same heartbeat is requested at a high rate it can create lock contention when updating the timestamp in the DB. Moving to always run update in task should free up the connection on the API server faster, although the task might still see some lock wait time. ## Which issue(s) this PR fixes ## Checklist - [ ] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2023-08-29 02:19:28 +00:00
Matias Bordese	cec5e6a284	Skip amazon_sns integration view test (#2849 )	2023-08-21 17:06:31 -03:00
Joey Orlando	df21be3a50	add more integration tests for integrations api (#2845 )	2023-08-21 15:40:29 +02:00
Matias Bordese	179a1db471	Add alertmanager integration for heartbeat support (#2807 ) Related to https://github.com/grafana/oncall/issues/2801 and https://github.com/grafana/support-escalations/issues/7081. --------- Co-authored-by: Innokentii Konstantinov <innokenty.konstantinov@grafana.com>	2023-08-17 13:22:37 +00:00
Ildar Iskhakov	fd19dd422a	Use periodic task for heartbeats (#2723 ) # What this PR does ## Which issue(s) this PR fixes ## Checklist - [ ] Unit, integration, and e2e (if applicable) tests updated - [ ] Documentation added (or `pr:no public docs` PR label added if not required) - [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required) --------- Co-authored-by: Joey Orlando <joey.orlando@grafana.com> Co-authored-by: Michael Derynck <michael.derynck@grafana.com>	2023-08-10 02:25:00 +00:00
Innokentii Konstantinov	5538a80ba1	Fix GrafanaAlertingAPIView	2023-08-01 16:00:50 +08:00
Innokentii Konstantinov	7c4f72c348	Fix num_firing/resolved calculation	2023-08-01 13:37:31 +08:00
Innokentii Konstantinov	abca37e621	Polish amv2 (#2701 )	2023-08-01 13:13:58 +08:00
Innokentii Konstantinov	1ccb9d6979	AlertManager v2 (#2643 ) Introduce the AlertManager v2 integration with improved internal behaviour: it uses the grouping from AlertManager instead of trying to re-group alerts on the OnCall side. Existing AlertManager and Grafana Alerting integrations are marked as Legacy, with options to migrate them manually now or have them migrated automatically after DEPRECATION DATE(TBD). Integration URLs and public API responses stay the same for both legacy and new integrations. --------- Co-authored-by: Rares Mardare <rares.mardare@grafana.com> Co-authored-by: Joey Orlando <joey.orlando@grafana.com>	2023-08-01 12:18:52 +08:00
Vadim Stepanov	f977f9faee	Minor formatting changes (#2641 ) # What this PR does - Updates `black` and `flake8` to latest - Removes `F541` from flake8 ignore (`F541 f-string is missing placeholders`) - Enables ["float to top" option](https://pycqa.github.io/isort/docs/configuration/options.html#float-to-top) for `isort`	2023-07-26 14:45:44 +01:00
Vadim Stepanov	b2f4ffb98a	`apps.get_model` -> `import` (#2619 ) # What this PR does Remove [`apps.get_model`](https://docs.djangoproject.com/en/3.2/ref/applications/#django.apps.apps.get_model) invocations and use inline `import` statements in places where models are imported within functions/methods to avoid circular imports. I believe `import` statements are more appropriate for most use cases as they allow for better static code analysis & formatting, and solve the issue of circular imports without being unnecessarily dynamic as `apps.get_model`. With `import` statements, it's possible to: - Jump to model definitions in most IDEs - Automatically sort inline imports with `isort` - Find import errors faster/easier (most IDEs highlight broken imports) - Have more consistency across regular & inline imports when importing models This PR also adds a flake8 rule to ban imports of `django.apps.apps`, so it's harder to use `apps.get_model` by mistake (it's possible to ignore this rule by using `# noqa: I251`). The rule is not enforced on directories with migration files, because `apps.get_model` is often used to get a historical state of a model, which is useful when writing migrations ([see this SO answer for more details](https://stackoverflow.com/a/37769213)). So `apps.get_model` is considered OK in migrations (even necessary in some cases). ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2023-07-25 09:43:23 +00:00
Innokentii Konstantinov	6dcaf52efb	Remove integration html instructions (#2627 ) Remove integration html instructions. They were migrated to the docs.	2023-07-25 04:31:28 +00:00
Joey Orlando	63ac0972c5	remove deprecated heartbeat_heartbeat table/model (#2534 ) # What this PR does - Remove `heartbeat_heartbeat` table. This deprecated model/table does not seem to be used anywhere (no data in this in production/staging; see more comments in the code about this).	2023-07-17 01:38:04 -04:00
Innokentii Konstantinov	ab7cd0aec2	Fix amv2	2023-06-14 14:43:00 +08:00
Innokentii Konstantinov	f0f2e7c8c6	Draft AlertManager integration v2 (#2167 ) # What this PR does Introduces AlertManagerV2 integration with better grouping and autoresolving, not intended for production use yet. --------- Co-authored-by: Ildar Iskhakov <Ildar.iskhakov@grafana.com>	2023-06-13 07:10:38 +00:00
Innokentii Konstantinov	056b0ddc7e	Add ratelimit for AmazonSNS (#2032 ) Adds a ratelimit for AmazonSNS. The AlertChannelDefining mixin now injects alert_receive_channel only into the request, not into kwargs, so as not to break AmazonSNS.	2023-05-26 09:57:26 +00:00
Vadim Stepanov	b8f54f1c53	Add docs & logo for AppDynamics integration (#1916 ) # What this PR does Adds docs & logo for AppDynamics integration. Main PR in private repo: https://github.com/grafana/oncall-private/pull/1790. ## Which issue(s) this PR fixes https://github.com/grafana/oncall-private/issues/1621 ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - No changelog (AppDynamics integration will be only available in cloud)	2023-05-11 16:41:51 +00:00
Joey Orlando	014a9c2ec2	allow the POST incoming alert endpoints to queue create_alert tasks independent of the database status (#1896 ) # What this PR does https://www.loom.com/share/18cc445117de4895a10892d56c7d3699 In preparation to upgrade our cloud databases, this PR makes some minor changes which, after testing locally, allowed the `POST /<integration_type>/<alert_channel_key>` endpoints to successfully receive incoming alerts and queue the celery tasks. I've tested all of the defined `POST /integrations/v1/<integration_type>/<alert_channel_key>` endpoints by sending `POST` requests to an integration's URL while the MySQL database was down, bringing the database back up, and ensuring the alerts were created. ## Some other findings - the integration heartbeat endpoints will not work as we interact w/ the database to persist the incoming heartbeat instance - if the integration was created in the last 180 seconds, incoming alerts will fail due to the way we cache the integration IDs ([code](https://github.com/grafana/oncall/blob/dev/engine/apps/integrations/mixins/alert_channel_defining_mixin.py#L47-L50)) - The `create_alert` celery task is set to `max_retries=None` and `retry_backoff=True`. This means that the queued tasks will continue retrying forever w/ an exponential backoff, until the alerts can be created in the database (ie. when the database is back online). ## Checklist - [ ] Unit, integration, and e2e (if applicable) tests updated (N/A) - [ ] Documentation added (or `pr:no public docs` PR label added if not required) (N/A) - [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required) (N/A)	2023-05-10 12:36:23 +00:00
Oleg Zaytsev	41f7c23c65	Fix and tidy alertmanager heartbeat template (#1865 ) # What this PR does There was an unnecessary indentation in the `rules:` key which made it invalid YAML. Also replaced the mentions of Amixr with Grafana OnCall, used some `<code>` tags and reworded some sentences. Also removed the anchor tag from the webhook link: we don't want people to follow that in their browser, we want them to copy it ## Result screenshot ![image](https://user-images.githubusercontent.com/1511481/236173565-b5201b81-4d69-4d0b-944a-a2106f8fbab3.png) ## Which issue(s) this PR fixes ## Checklist - [ ] Unit, integration, and e2e (if applicable) tests updated - [ ] Documentation added (or `pr:no public docs` PR label added if not required) - [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required) --------- Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com> Co-authored-by: Joey Orlando <joey.orlando@grafana.com>	2023-05-05 00:25:05 +00:00
Vadim Stepanov	d198b932c1	Zendesk inbound integration docs (#1860 ) # What this PR does Add docs & logo for Zendesk integration. Main PR in private repo: https://github.com/grafana/oncall-private/pull/1772 ## Which issue(s) this PR fixes https://github.com/grafana/oncall-private/issues/1627 ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] No changelog (Zendesk integration will be only available in cloud)	2023-05-03 11:38:07 +01:00
Vadim Stepanov	50eb1fed5d	Jira inbound integration docs (#1842 ) # What this PR does Add docs & logo for Jira integration. Main PR in private repo: https://github.com/grafana/oncall-private/pull/1769 ## Which issue(s) this PR fixes https://github.com/grafana/oncall-private/issues/1620 ## Checklist - [x] Unit, integration, and e2e (if applicable) tests updated - [x] Documentation added (or `pr:no public docs` PR label added if not required) - [x] No changelog (Jira integration will be only available in cloud)	2023-05-02 09:37:49 +00:00
Ildar Iskhakov	6e61643750	Limit number of alertmanager alerts in alert group to autoresolve (#1779 ) # What this PR does This PR sets a limit so that workers won't attempt to autoresolve alertmanager alert groups that are too big. ## Which issue(s) this PR fixes ## Checklist - [ ] Unit, integration, and e2e (if applicable) tests updated - [ ] Documentation added (or `pr:no public docs` PR label added if not required) - [ ] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not required)	2023-04-24 05:38:21 +00:00
Vadim Stepanov	ea60c0d247	Inbound email integration (#837 ) This PR adds the Inbound Email integration. It is designed to support a variety of ESPs, but in prod we will use Mailgun, so locally I tested it only with the Mailgun ESP. Important: To make it work on different clusters I'm planning to provide different email domains for different regions, like ....@us.oncall.grafana.net, ...@eu.oncall.grafana.net --------- Co-authored-by: Innokentii Konstantinov <innokenty.konstantinov@grafana.com>	2023-03-16 13:59:21 +08:00
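
Note on commit 06933a696a: its message says the check `value.strip().lower() in ["1", "true", "ok"]` was deduplicated into a shared function named `templated_value_is_truthy`. The log does not show that function's body or signature, so the following is only a minimal sketch of what such a helper might look like. The `Optional[str]` parameter and the handling of `None` are assumptions made for illustration.

```python
from typing import Optional


def templated_value_is_truthy(value: Optional[str]) -> bool:
    """Return True when a rendered template value reads as "true".

    Assumption: a missing value (None) is treated as falsy. The commit
    message only documents the string comparison itself.
    """
    if value is None:
        return False
    # This is the expression quoted in the commit message.
    return value.strip().lower() in ["1", "true", "ok"]


# Usage: surrounding whitespace and letter case are ignored.
print(templated_value_is_truthy("  True \n"))
print(templated_value_is_truthy("false"))
```

Keeping the comparison in one place means every caller that evaluates a rendered template agrees on which strings count as true.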


66 commits
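
Note on commit 7c4b40a046: its message quotes an `AttributeError` raised for `called_once_with`, which is not an assertion method on a mock. The sketch below shows the valid `unittest.mock` spelling, `assert_called_once_with`, which does verify the call. The `notify` function and the `client` mock are hypothetical names used only for this illustration and are not taken from the repository.

```python
from unittest import mock


def notify(client, text):
    """Hypothetical function under test: forwards text to a client."""
    client.send(text)


client = mock.Mock()
notify(client, "hello")

# Valid assertion: raises AssertionError unless send() was called
# exactly once with exactly these arguments.
client.send.assert_called_once_with("hello")


def assertion_catches_mismatch() -> bool:
    """Show that the real assertion fails when the arguments differ."""
    try:
        client.send.assert_called_once_with("goodbye")
    except AssertionError:
        return True
    return False


print(assertion_catches_mismatch())
```

A misspelled assertion name never checks anything, so tests that relied on it could pass while the behaviour they were meant to cover was broken.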