ML Pod Startup Failures
Incident Report for Scale AI
Resolved
A service that manages model metadata was down from 7:06am PT to 7:50am PT (44 minutes). During the outage, model services could not start new pods. ML services that received security updates, deployed a new version, or attempted to scale up were impacted. After the impacted ML services were restored, service remained degraded for an additional 8 to 12 minutes while they caught up on missed requests.
Posted Mar 30, 2023 - 14:30 UTC