OpenAI blamed the longest outage in its history on a failure of its “new telemetry service.”
On Wednesday, ChatGPT, OpenAI's AI-powered chatbot platform, its video generator Sora, and its developer API experienced a major disruption starting around 3:00 PM Pacific Time. OpenAI acknowledged the problem soon after and began working on a fix, but it took the company roughly three hours to restore all services.
In a post-mortem published late Thursday, OpenAI wrote that the outage was not caused by a security incident or a recent product launch, but by a telemetry service it deployed on Wednesday to collect Kubernetes metrics. Kubernetes is an open source system that helps manage containers, the packages of apps and related files used to run software in isolated environments.
“Because the telemetry service has such a wide footprint, this new service's configuration inadvertently caused … resource-intensive Kubernetes API operations,” OpenAI wrote in the post-mortem. “[Our] Kubernetes API servers became overloaded, taking down the Kubernetes control plane in most of our large [Kubernetes] clusters.”
This is jargon-heavy, but essentially the new telemetry service disrupted OpenAI's Kubernetes operations, including a resource that many of the company's services depend on for DNS resolution. DNS resolution converts domain names into IP addresses; it's the reason you can type “Google.com” instead of “142.250.191.78.”
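For the curious, a DNS lookup is easy to see in action. This is just an illustration using Python's standard library resolver, not anything from OpenAI's stack:

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Resolve a domain name to its IPv4 addresses via the system's DNS resolver."""
    results = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each result tuple is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (ip_address, port).
    return sorted({sockaddr[0] for *_, sockaddr in results})

# Example: resolve("google.com") returns that domain's current IPv4
# addresses, which vary by region and over time.
```

When that lookup fails cluster-wide, services can no longer find each other by name, which is why a DNS outage cascades so quickly.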
Complicating matters, OpenAI's DNS caching, which maintains information about previously looked-up domain names (such as website addresses) and their corresponding IP addresses, “delayed visibility,” OpenAI wrote, “and allowed the rollout [of the telemetry service] to continue before the full extent of the problem was understood.”
OpenAI said it was able to detect the issue “minutes before” customers ultimately felt the impact, but was unable to implement a fix quickly because it had to work around the overwhelmed Kubernetes servers.
“This was a case of multiple systems and processes failing simultaneously and interacting in unexpected ways,” the company wrote. “Our tests didn't catch the impact the change was having on the Kubernetes control plane, [and] remediation was very slow because of the lockout effect.”
To prevent similar incidents in the future, OpenAI said it will adopt a number of measures, including improved phased rollouts with better monitoring of infrastructure changes and new mechanisms to ensure its engineers can access the company's Kubernetes API servers under any circumstances.
“We apologize for the impact this incident has had on all of our customers, from ChatGPT users to developers to businesses that rely on OpenAI products,” OpenAI said in a statement. “We fell short of our expectations.”