Resolved -
On February 26, 2024, between 18:34 UTC and 19:37 UTC, our background job service was degraded, delaying job starts by up to 15 minutes. Users experienced delays in Webhooks, Actions, and some UI updates (for example, on pull requests). This was caused by capacity problems in our job queueing service and a failure of our automated failover system.
We mitigated the incident by manually failing over to our secondary cluster. No data was lost. Recovery began at 18:55 UTC, when the backlog of enqueued jobs began processing.
We are actively working to repair our failover automation and expand the capacity of our background job queueing service to prevent issues like this in the future.
Feb 26, 19:37 UTC
Update -
Actions and Pull Requests are operating normally.
Feb 26, 19:37 UTC
Update -
Webhooks and Issues are operating normally.
Feb 26, 19:37 UTC
Update -
Issues is experiencing degraded performance. We are continuing to investigate.
Feb 26, 19:05 UTC
Update -
Pull Requests is experiencing degraded performance. We are continuing to investigate.
Feb 26, 18:57 UTC
Update -
We have deployed a fix for issues affecting Webhooks, Actions, and some other services. We are beginning to see recovery and will continue to monitor and fix as needed.
Feb 26, 18:55 UTC
Update -
Webhooks is experiencing degraded performance. We are continuing to investigate.
Feb 26, 18:55 UTC
Update -
Actions is experiencing degraded performance. We are continuing to investigate.
Feb 26, 18:48 UTC
Investigating -
We are investigating reports of degraded performance for Webhooks.
Feb 26, 18:47 UTC