DeepL has made a name for itself with its online text translations, which it claims are more nuanced and accurate than those of services like Google Translate. That pitch has catapulted the German startup to a $2 billion valuation and more than 100,000 paying customers.
Now, as expectations for AI services continue to rise, DeepL is adding another mode to its platform: audio. DeepL Voice allows users to hear someone speaking in one language and have it automatically translated into another language in real time.
Currently, the languages that DeepL Voice can “listen” to are English, German, Japanese, Korean, Swedish, Dutch, French, Turkish, Polish, Portuguese, Russian, Spanish, and Italian. Translated captions are available in all 33 languages currently supported by DeepL Translator.
Image credit: DeepL
DeepL Voice currently stops short of delivering the results as audio or video files. The service is aimed at real-time, live conversations and video conferencing, and the translations are delivered as text rather than audio.
For live conversations, you can set translations to appear in a “mirror” view on a smartphone: the idea is to place the phone between two people at a conference table so each side can read the translated words. Alternatively, you can view the translation as a transcription that you share side by side with someone. For video conferencing, translations appear as subtitles.
That could change over time, the company's founder and CEO Jarek Kutylowski (pictured above) hinted in an interview. This is DeepL's first voice product, but it's unlikely to be its last. “[Voice] is where translation will really roll out next year,” he added.
There is other evidence to support that statement. Google, one of DeepL's biggest competitors, has also started incorporating real-time translated captions into Meet, its video conferencing service. And there is a crowd of AI startups building voice translation services, such as ElevenLabs (with ElevenLabs Dubbing), the AI voice specialist, and Panjaya, which creates translations using “deepfaked” voices and video that matches them.
The latter uses ElevenLabs' API, and Kutylowski said ElevenLabs itself uses DeepL's technology to power its translation services.
Audio output isn't the only feature that hasn't been released yet.
There is also no API for the voice product at this time. DeepL's main business is focused on B2B, and Kutylowski said the company works directly with partners and customers.
Integration options are also limited. The only video calling service that currently supports DeepL's subtitles is Microsoft Teams, which “covers most customers,” Kutylowski said. There's no word on when or if DeepL Voice will be integrated into Zoom or Google Meet.
For DeepL users, the product may feel like a long time coming, and not just because of all the other AI voice translation services already on the market: according to Kutylowski, voice translation has been a top customer request since DeepL launched in 2017.
One reason for the wait is DeepL's fairly conservative approach to building its products. Unlike many AI applications that fine-tune other companies' large language models (LLMs), DeepL aims to build its services from the ground up. In July, the company released an LLM optimized for translation, which it says outperforms GPT-4 as well as models from Google and Microsoft precisely because its primary purpose is translation. The company has also continued to improve the quality of its written translations and its glossaries.
Similarly, one of DeepL Voice's unique selling points is that it works in real time. That matters because many “AI translation” services on the market actually operate with a lag, which makes them difficult or impossible to use in live situations, the very use case DeepL is targeting.
Kutylowski hinted that this is another reason the new voice product focuses on text-based translation: text translations can be computed and generated very quickly, while processing and AI architecture still face challenges before speech, let alone video, can be produced just as fast. Audio and video output, presumably, will come in time.
Video conferencing and meetings are natural use cases for DeepL Voice, but Kutylowski pointed to another big one the company envisions: the service industry, where frontline workers in restaurants, for example, could use the service to communicate with customers more easily.
Useful as that might be, it also highlights one of the service's rough edges. In a world where we are all suddenly more conscious of data protection and more concerned about how new services and platforms capture personal and proprietary information, it remains to be seen how keen people will be to have their voices picked up and used in this way.
Kutylowski said that while the audio is sent to DeepL's servers to be translated (none of the processing happens on the device), nothing is retained by its systems, nor is it used to train its LLMs. And ultimately, DeepL works with its customers to make sure they don't violate GDPR or other data protection regulations.