OpenAI's GPT-4o, the generative AI model behind the recently released alpha version of ChatGPT's Advanced Voice Mode, is the company's first model trained on voice as well as text and image data. This sometimes leads GPT-4o to do odd things, like mimicking the voice of the person talking to it or suddenly shouting in the middle of a conversation.
In a new “red team” report documenting its examination of the model's strengths and risks, OpenAI details some of GPT-4o's odd quirks, including the aforementioned voice cloning. In rare cases, OpenAI says, GPT-4o “emulates the user's voice,” especially when a person speaks to it in “environments with a lot of background noise,” such as a car on the road. Why? OpenAI says it's because the model struggles to understand incomplete speech. Well, duh!
Listen to what it sounds like in the sample below (from the report): Weird, right?
To be clear, GPT-4o doesn't currently do this, at least not in Advanced Voice Mode: an OpenAI spokesperson told TechCrunch that the company has added “system-level mitigations” for the behavior.
GPT-4o also has a tendency, when prompted in certain ways, to generate disturbing or inappropriate “non-verbal vocalizations” and sound effects, such as erotic moans, intense screams, and gunshots. OpenAI says it has evidence suggesting that the model generally rejects requests to generate sound effects, but acknowledges that some requests do go through.
GPT-4o could also potentially infringe on music copyrights, or rather it could, had OpenAI not implemented a filter to prevent this. In the report, OpenAI said it had instructed GPT-4o not to sing in the limited alpha of Advanced Voice Mode, presumably to avoid copying the style, tone, or timbre of famous artists.
This implies, but doesn't outright confirm, that OpenAI trained GPT-4o on copyrighted material, and it's unclear whether OpenAI intends to lift the restrictions when Advanced Voice Mode rolls out to more users in the fall, as previously announced.
“Taking into account the audio modality of GPT-4o, we have updated certain text-based filters to work with audio conversations [and] built a filter to detect and block output containing music,” OpenAI wrote in the report. “We trained GPT-4o to reject requests for copyrighted content, including audio, consistent with our broader practice.”
Notably, OpenAI recently said that it's “impossible” to train today's leading models without using copyrighted material, and while the company has numerous licensing agreements with data providers, it argues that fair use is a reasonable defense against accusations that it trains on IP-protected data, such as musical compositions, without permission.
The red team report paints a picture of an AI model that, through a variety of mitigations and safeguards, has been made safer: for example, GPT-4o refuses to identify people based on how they speak, declines to answer leading questions like “How smart is this speaker?”, blocks prompts for violent and sexually explicit language, and outright bans certain categories of content, such as discussions of extremism and self-harm.