OpenAI began rolling out ChatGPT's advanced voice mode on Tuesday, giving users their first access to GPT-4o's ultra-realistic voice responses. The alpha version is available today to a small group of ChatGPT Plus users, and OpenAI says the feature will gradually roll out to all Plus users in fall 2024.
When OpenAI first showed off GPT-4o's voices in May, the feature wowed audiences with its fast responses and striking resemblance to real human voices, one in particular. That voice, named Sky, resembled that of actress Scarlett Johansson, who voiced the artificial assistant in the film Her. Shortly after OpenAI's demo, Johansson said she had declined multiple inquiries from CEO Sam Altman about using her voice, and that after seeing the GPT-4o demo she had hired legal counsel to defend her likeness. OpenAI denied using Johansson's voice but later removed the voice shown in the demo. In June, OpenAI announced it would delay the release of advanced voice mode to improve its safety measures.
A month later, the wait is over (sort of). OpenAI says that the video and screen sharing features introduced in the Spring Update are not included in this alpha and will be released “at a later date.” For now, the GPT-4o demo that wowed everyone is still just a demo, but some premium users will have access to the ChatGPT voice features shown there.
ChatGPT can now speak and listen
You may have already tried the voice modes currently available in ChatGPT, but according to OpenAI, advanced voice mode is different. ChatGPT's older audio solution chained three separate models: one to convert speech to text, GPT-4 to process the prompt, and a third to convert ChatGPT's text back to speech. GPT-4o, by contrast, is multimodal and handles all of these tasks itself, without auxiliary models, which makes conversations noticeably lower latency. OpenAI says GPT-4o can also sense emotional intonation in a user's voice, such as sadness, excitement, or singing.
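The pipeline-versus-multimodal distinction is easier to see in code. The sketch below is purely illustrative: the stub functions (transcribe, generate_reply, synthesize_speech, voice_reply) are hypothetical placeholders standing in for the real models, not OpenAI's actual API. The point is simply that the older design requires three model round-trips and discards the user's tone at the transcription step, while a single multimodal call does not.

```python
# Illustrative sketch only. These stubs stand in for real models so the
# control flow is runnable; none of these names are OpenAI's actual API.

def transcribe(audio_in: bytes) -> str:
    return "placeholder transcript"        # stands in for a speech-to-text model

def generate_reply(prompt: str) -> str:
    return f"reply to: {prompt}"           # stands in for GPT-4 processing the prompt

def synthesize_speech(text: str) -> bytes:
    return text.encode()                   # stands in for a text-to-speech model

def voice_reply(audio_in: bytes) -> bytes:
    return b"audio reply"                  # stands in for a single multimodal audio call

def old_voice_mode(audio_in: bytes) -> bytes:
    """Pipeline approach: three separate models, three round-trips.
    Intonation in the user's voice is lost once audio becomes plain text."""
    text_prompt = transcribe(audio_in)
    text_reply = generate_reply(text_prompt)
    return synthesize_speech(text_reply)

def advanced_voice_mode(audio_in: bytes) -> bytes:
    """Multimodal approach: one model consumes and produces audio directly,
    so cues like sadness or excitement are available to the model."""
    return voice_reply(audio_in)
```

Under this (assumed) framing, the latency of the older mode is roughly the sum of three model calls, whereas the multimodal path is a single round-trip.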
The pilot gives ChatGPT Plus users a first-hand look at just how realistic OpenAI's advanced voice mode is. TechCrunch wasn't able to test the feature before publishing this article, but will review it once it has access.
OpenAI says it is releasing ChatGPT's new voices gradually and closely monitoring their usage. Users in the alpha group will receive an alert in the ChatGPT app, followed by an email with instructions on how to use the feature.
In the months since OpenAI's demo, the company says it has tested GPT-4o's voice capabilities with more than 100 external red team members speaking 45 languages. OpenAI says it will publish a report on those safeguards in early August.
The company said the advanced voice mode will be limited to four of ChatGPT's preset voices (Juniper, Breeze, Cove and Ember), which were created in collaboration with paid voice actors. The Sky voice featured in OpenAI's May demo is no longer available in ChatGPT. “ChatGPT cannot impersonate the voices of other people, including individuals or public figures, and we block output that differs from these preset voices,” OpenAI spokesperson Lindsay McCallum said.
OpenAI has sought to steer clear of deepfake controversies like the one in January, when voice-cloning technology from AI startup ElevenLabs was used to impersonate President Biden and deceive voters in the New Hampshire primary.
OpenAI also said it has introduced new filters that block certain requests to generate music and other copyrighted audio. AI companies have faced a wave of copyright-infringement lawsuits over the past year, and audio models like GPT-4o open the door to a whole new category of potential plaintiffs. Record labels, in particular, have a history of litigation and have already sued the AI music generators Suno and Udio.