Resolved -
This incident has been resolved. All platform and API operations are working normally.
Feb 27, 20:21 UTC
Monitoring -
API and platform operations have normalized. We are continuing to monitor to ensure full and stable recovery.
Background jobs are almost fully caught up. Users may still see slightly slower requests when creating new apps / orgs, but these should complete successfully.
Sprite and MPG cluster creations are processing as normal.
Feb 27, 20:05 UTC
Update -
A second fix has been deployed and database load has returned to normal, so API response times are beginning to normalize. Most Machines API requests should succeed as normal, and deploys to existing apps should also work.
We are working through a backlog of background jobs. New app / organization creations and other operations that rely on these jobs will continue to see increased latency or failures while we work through them. New MPG cluster and new Sprite creations continue to be impacted.
Feb 27, 19:41 UTC
Update -
An initial fix has been deployed and we are seeing improvements in load and API performance. Some operations that rely on the GraphQL API, such as new app creations and some deployments, will continue to fail at this time.
We are continuing to work on restoring full availability.
Feb 27, 19:23 UTC
Update -
We are currently seeing full API failures for requests to our GraphQL API and elevated failures for the Machines API. Direct calls to these APIs may fail, along with many flyctl commands.
We have identified the cause of the issue and are continuing to work on a fix.
Existing running machines and apps should continue to be reachable, but creates, deploys, or other features relying on platform API calls will fail at this time.
Feb 27, 19:05 UTC
Update -
New Sprite creations are also timing out or failing at this time. We are continuing to work on a fix for this issue.
Feb 27, 18:59 UTC
Update -
We are continuing to work on a fix for this issue.
Feb 27, 18:53 UTC
Identified -
We have identified the cause of the increased latency and are working on a fix.
The most common errors we are seeing are timeouts when users attempt to perform an action against a newly created app / machine resource. Those requests may time out or fail with an `app|machine not found` error.
Feb 27, 18:52 UTC
Investigating -
We are investigating increased API request latency and timeouts with the main platform API.
This is impacting multiple operations, including creating, querying, or performing actions against machines, as well as platform-level operations like adding payment methods.
Feb 27, 18:50 UTC