As deepfakes proliferate, OpenAI is improving the technology used to clone voices, but the company insists it's doing so responsibly.
Today marks the preview debut of OpenAI's Voice Engine, which extends the company's existing text-to-speech API. Voice Engine, which has been in development for about two years, lets users upload a 15-second sample of a person's voice and generate a synthetic copy of that voice. There is no public release date yet, however, giving the company time to respond to how the model is used and abused.
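For context, the existing text-to-speech API that Voice Engine extends already synthesizes speech from text using preset voices. Below is a minimal sketch using OpenAI's published Python client; note that Voice Engine has no public endpoint, so the idea that a cloned voice would slot in where the preset voice name goes is an assumption, not documented behavior.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Existing text-to-speech API with a preset voice.
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # preset voice; a Voice Engine clone would presumably go here (assumption)
    input="Today marks the preview debut of OpenAI's Voice Engine.",
)
response.stream_to_file("speech.mp3")  # save the generated audio
```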
“We want to make sure everyone has peace of mind about how this technology is deployed,” Jeff Harris, a member of the product staff at OpenAI, said in an interview with TechCrunch. “We are taking mitigation measures.”
Training the model
The generative AI model powering Voice Engine has been hiding in plain sight for some time, Harris says.
The same model underpins the voice and “speech” capabilities of ChatGPT, OpenAI's AI-powered chatbot, as well as the preset voices available in OpenAI's text-to-speech API. And since early September, Spotify has been using it to dub podcasts from high-profile hosts like Lex Fridman into different languages.
Asked where the model's training data came from, a somewhat sensitive topic, Harris would only say that the Voice Engine model was trained on a mix of licensed and publicly available data.
Models like the one powering Voice Engine are typically trained on enormous numbers of examples, in this case audio recordings, usually sourced from public sites and data sets around the web. Many generative AI vendors see training data as a competitive advantage and so keep it and the associated information close to the chest. But training data details are also a potential source of IP-related lawsuits, another disincentive to reveal much.
OpenAI is already being sued over allegations that the company violated intellectual property law by training its AI on copyrighted content, including photos, artwork, code, articles, and e-books, without providing credit or payment to the creators or owners.
OpenAI has licensing agreements in place with some content providers, like Shutterstock and the news publisher Axel Springer, and allows webmasters to block its web crawler from scraping their sites for training data. OpenAI also lets artists “opt out” and have their work removed from the data sets the company uses to train its image-generating models, including its latest DALL-E 3.
However, OpenAI offers no such opt-out scheme for its other products. And in a recent statement to the U.K. House of Lords, OpenAI suggested that it is “impossible” to create useful AI models without copyrighted material, arguing that fair use, the legal doctrine that allows copyrighted works to be used in secondary creations so long as they are transformative, shields the company where model training is concerned.
Synthesizing audio
Somewhat surprisingly, Voice Engine is not trained or fine-tuned on user data. That is owing in part to the ephemeral way in which the model, a combination of a diffusion process and a transformer, generates speech.
“We take a small audio sample and text and generate realistic speech that matches the original speaker,” Harris said. “The audio that's used is dropped after the request is complete.”
As he explained, the model simultaneously analyzes the voice data it pulls from and the text data meant to be read aloud, generating a matching voice without having to build a custom model per speaker.
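OpenAI hasn't published the architecture beyond those broad strokes, but the shape of the flow Harris describes, one reference clip plus target text in, speech out, with no per-speaker training, can be sketched at a high level. The following is a toy illustration with stubbed-out model stages, not OpenAI's code; every function here is a placeholder.

```python
import numpy as np

def encode_speaker(reference_audio: np.ndarray) -> np.ndarray:
    """Stub: map a ~15-second sample to a fixed-size speaker embedding."""
    return np.resize(reference_audio, 256)  # placeholder for a learned encoder

def diffusion_decode(embedding: np.ndarray, text: str, steps: int = 30) -> np.ndarray:
    """Stub: start from noise and iteratively denoise toward audio
    conditioned on the speaker embedding and the text to be read."""
    audio = np.random.default_rng(len(text)).standard_normal(16_000)
    for _ in range(steps):
        audio = 0.9 * audio + 0.1 * embedding.mean()  # placeholder update step
    return audio

def synthesize(reference_audio: np.ndarray, text: str) -> np.ndarray:
    """Zero-shot synthesis: the sample conditions generation once and,
    per Harris, is discarded after the request completes."""
    embedding = encode_speaker(reference_audio)
    return diffusion_decode(embedding, text)

sample = np.zeros(15 * 16_000)  # stand-in for a 15-second clip at 16 kHz
speech = synthesize(sample, "Hello from a cloned voice.")
```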
The technology isn't new. A number of startups have offered voice cloning products for years, from ElevenLabs to Replica Studios to Papercup to Deepdub to Respeecher, as have Big Tech incumbents such as Amazon, Google, and Microsoft. That last company, incidentally, is a major investor in OpenAI.
Harris claimed that OpenAI's approach provides higher quality audio across the board.
We also know it will be aggressively priced. Although OpenAI removed Voice Engine's pricing from the marketing materials it published today, documents viewed by TechCrunch list Voice Engine as costing $15 per one million characters, or roughly 162,500 words. That would fit Dickens' “Oliver Twist” with a little room to spare. (An “HD” quality option costs twice that, but confusingly, an OpenAI spokesperson told TechCrunch that there is no difference between HD and non-HD voices. Make of that what you will.)
That equates to approximately 18 hours of audio, putting the price at a little under $1 per hour. It's certainly cheaper than what one of the more popular rival vendors, ElevenLabs, charges: $11 per 100,000 characters per month. But it comes at the expense of some customization.
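The arithmetic behind that comparison, using only the figures reported here:

```python
# Back-of-the-envelope cost comparison, using the figures cited above.

VOICE_ENGINE_PER_MILLION = 15.00  # $ per 1,000,000 characters (per documents viewed by TechCrunch)
ELEVENLABS_PER_100K = 11.00       # $ per 100,000 characters on the cited plan
HOURS_PER_MILLION_CHARS = 18      # approximate audio yield of one million characters

per_hour = VOICE_ENGINE_PER_MILLION / HOURS_PER_MILLION_CHARS
elevenlabs_per_million = ELEVENLABS_PER_100K * 10  # scaled to the same million characters

print(f"Voice Engine: ${VOICE_ENGINE_PER_MILLION:.2f} per 1M chars, about ${per_hour:.2f}/hour")
print(f"ElevenLabs:   ${elevenlabs_per_million:.2f} per 1M chars at the $11-per-100k rate")
# Voice Engine: $15.00 per 1M chars, about $0.83/hour
# ElevenLabs:   $110.00 per 1M chars at the $11-per-100k rate
```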
Voice Engine doesn't offer controls to adjust the tone, pitch, or cadence of a voice. In fact, it doesn't offer any fine-tuning knobs or dials at the moment, although Harris notes that any expressiveness in the 15-second voice sample will carry through subsequent generations (for example, if you speak in an excited tone, the resulting synthetic voice will sound consistently excited). We'll see how the quality of the reading compares with other models once they can be compared directly.
Voice talent as a commodity
Voice actor salaries on ZipRecruiter range from $12 to $79 per hour, far more expensive than Voice Engine even at the low end (actors with agents command much higher prices per project). Were it to catch on, OpenAI's tool could commoditize voice work. So where does that leave actors?
The talent industry has been grappling with the existential threat of generative AI for some time. Voice actors are increasingly being asked to sign away the rights to their voices so that clients can use AI to generate synthetic versions that could eventually replace them. Voice work, particularly cheap entry-level work, is at risk of being eliminated in favor of AI-generated speech.
Some AI voice platforms are currently trying to strike a balance.
Replica Studios last year signed a somewhat contentious deal with SAG-AFTRA to create and license copies of union members' voices. The organizations said the arrangement established fair and ethical terms and conditions to ensure performer consent while negotiating terms for the use of synthetic voices in new works, including video games.
ElevenLabs, meanwhile, hosts a marketplace for synthetic voices that lets users create a voice, verify it, and share it publicly. When others use a voice, the original creator receives compensation, a set dollar amount per 1,000 characters.
OpenAI will establish no such labor union deals or marketplaces, at least not in the near term. It requires only that users obtain “explicit consent” from the people whose voices are cloned, make “clear disclosures” indicating which voices are AI-generated, and agree not to use the voices of minors, deceased people, or political figures in their generations.
“How this intersects with the voice actor economy is something we're watching closely and are very interested in,” Harris said. “I think there's a lot of opportunity to expand your potential as a voice actor through this kind of technology. But all of this is going to depend on people actually adopting the technology and playing around with it a bit, and on what we learn.”
Ethics and deepfakes
Voice cloning apps can and have been misused in ways that go beyond threatening actors' livelihoods.
4chan, the message board notorious for its conspiratorial content, used ElevenLabs' platform to share hateful messages mimicking celebrities like Emma Watson. The Verge's James Vincent was able to tap AI tools to maliciously, and quickly, clone voices, generating samples containing everything from violent threats to racist and transphobic remarks. And at Vice, reporter Joseph Cox documented generating a voice clone convincing enough to fool a bank's authentication system.
There are fears that bad actors will attempt to sway elections with voice clones, and they're not unfounded. In January, a phone campaign employing a deepfaked President Biden was used to deter New Hampshire residents from voting, prompting the FCC to move to make future such campaigns illegal.
So, beyond banning deepfakes at the policy level, what steps is OpenAI taking to prevent Voice Engine from being abused? Harris mentioned a few.
First, Voice Engine is only being made available to an exceptionally small group of developers, around 10, at launch. Harris said OpenAI is prioritizing use cases that are “low risk” and “socially beneficial,” such as those in healthcare and accessibility, in addition to “responsible” synthetic media experimentation.
A few early Voice Engine adopters: edtech company Age of Learning, which is using the tool to generate voice-overs from previously cast actors, and HeyGen, a storytelling app leveraging Voice Engine for translation. Livox and Lifespan are using Voice Engine to create voices for people with speech impairments and disabilities, and Dimagi is building a Voice Engine-based tool to give feedback to health workers in their primary languages.
(The original article embeds Voice Engine audio samples from Lifespan and Livox here.)
Second, clones created with Voice Engine are watermarked using a technique OpenAI developed that embeds an inaudible identifier in recordings. (Other vendors, including Resemble AI and Microsoft, employ similar watermarks.) Harris didn't promise that there's no way around the watermark, but he described it as “tamper resistant.”
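OpenAI hasn't said how its watermark works. But the general idea of an inaudible identifier can be illustrated with a classic spread-spectrum scheme: mix a faint pseudorandom carrier, derived from a secret key, into the audio, then detect it later by correlation. The toy sketch below shows that generic technique, not OpenAI's method; real systems use far subtler, perceptually masked signals.

```python
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.01) -> np.ndarray:
    """Mix in a low-amplitude pseudorandom carrier derived from `key`."""
    carrier = np.random.default_rng(key).standard_normal(audio.shape[0])
    return audio + strength * carrier

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 0.005) -> bool:
    """Check for the key's carrier via correlation with the audio."""
    carrier = np.random.default_rng(key).standard_normal(audio.shape[0])
    score = float(np.dot(audio, carrier)) / audio.shape[0]
    return score > threshold

# Watermark one second of stand-in audio at 16 kHz and verify it.
clip = 0.1 * np.random.default_rng(42).standard_normal(16_000)
marked = embed_watermark(clip, key=1234)
assert detect_watermark(marked, key=1234)    # watermarked clip is flagged
assert not detect_watermark(clip, key=1234)  # clean clip is not
```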
“If there's an audio clip out there, it's really easy for us to look at that clip and figure out that it was generated by our system and which developer actually did that generation,” Harris said. “Right now, it isn't open sourced; we have it internally for now. We're curious about making it publicly available, but obviously that comes with added risks in terms of exposure and breaking it.”
Third, OpenAI plans to give members of its Red Teaming Network, a contracted group of experts who help inform the company's AI model risk assessment and mitigation strategies, access to Voice Engine so they can probe for malicious uses.
Some experts argue that red teaming AI isn't exhaustive enough and that it's incumbent on vendors to develop tools to defend against the harms their AI might cause. OpenAI isn't going quite that far with Voice Engine, but Harris insists that the company's “top principle” is releasing the technology safely.
General availability
Depending on how the preview progresses and Voice Engine's public reception, OpenAI may release the tool to a broader developer base, but the company is reluctant to commit to anything specific at this point.
However, Harris did give a glimpse of Voice Engine's roadmap, revealing that OpenAI is testing a security mechanism that has users read randomly generated text as proof that they're present and aware of how their voice is being used. This could give OpenAI the confidence it needs to bring Voice Engine to more people, Harris said, or it might just be the beginning.
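That mechanism resembles a standard liveness challenge: the service issues a random, single-use phrase, the user records themselves reading it, and a transcription check confirms the match, so pre-recorded or stolen audio can't pass. Below is a hypothetical sketch of that flow, not OpenAI's implementation; the word list and matching rule are illustrative.

```python
import secrets

WORDS = ["amber", "circuit", "meadow", "pylon", "quartz", "violet"]

def issue_challenge(n_words: int = 4) -> str:
    """Generate an unpredictable phrase for the user to read aloud."""
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))

def verify_reading(challenge: str, transcript: str) -> bool:
    """Pass only if the recording's transcript matches the issued phrase.

    `transcript` would come from running speech-to-text on the user's
    recording; because the phrase is random and single-use, replayed or
    stolen audio can't satisfy the check.
    """
    return transcript.strip().lower() == challenge.lower()

challenge = issue_challenge()
print(f"Please read aloud: {challenge!r}")
# ...record the user, transcribe the audio, then:
# assert verify_reading(challenge, transcript)
```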
“What's going to keep pushing the technology forward in terms of actual voice matching will depend on what we learn from the pilots, the safety issues that are uncovered, and the mitigations we put in place,” he said. “We don't want people to confuse synthetic voices with real human voices.”
And we can all agree on that last point.