Gladia, a French startup that provides speech recognition application programming interfaces (APIs), has raised $16 million in a Series A funding round. Essentially, Gladia's API allows you to convert any audio file to text with a high level of accuracy and a short turnaround time.
Amazon, Microsoft, and Google all offer speech-to-text APIs as part of their cloud hosting product suites, but they don't match the performance of newer models from specialized startups.
There has been significant progress in this area over the past few years, especially after the release of Whisper by OpenAI. Gladia competes with other well-funded companies in the space, including AssemblyAI, Deepgram, and Speechmatics.
Gladia originally offered a tweaked version of Whisper's speech-to-text model, with some much-needed improvements. For example, the startup supports out-of-the-box diarization. It can detect when a conversation has multiple speakers and separate the recording and transcribed text depending on who is speaking.
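To make diarization concrete, here is a minimal sketch of what an application might do with speaker-labeled output. The `Segment` shape, the field names, and the sample data are assumptions made up for illustration; they are not Gladia's actual response format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape of one diarized chunk; not Gladia's real schema.
    speaker: int   # speaker label assigned by diarization
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed text for this chunk


def merge_turns(segments):
    """Merge consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Same speaker is still talking: extend the current turn.
            turns[-1] = Segment(last.speaker, last.start, seg.end,
                                f"{last.text} {seg.text}")
        else:
            # Speaker changed (or first segment): start a new turn.
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


def format_transcript(turns):
    """Render turns as one readable line per speaker turn."""
    return "\n".join(
        f"[{t.start:.1f}s] Speaker {t.speaker}: {t.text}" for t in turns
    )


# Made-up sample data standing in for a diarized API response.
SAMPLE = [
    Segment(0, 0.0, 1.5, "Hi, thanks for joining."),
    Segment(0, 1.6, 3.0, "Can you hear me?"),
    Segment(1, 3.2, 4.0, "Yes, loud and clear."),
    Segment(0, 4.1, 5.0, "Great."),
]

if __name__ == "__main__":
    print(format_transcript(merge_turns(SAMPLE)))
```

Grouping by consecutive speaker rather than by speaker overall keeps the back-and-forth order of the conversation, which is usually what a meeting recorder or note-taking tool wants to display.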
Gladia supports 100 languages and various accents. This reporter has been using Gladia to transcribe some interviews, and accents weren't an issue, so we can confirm that it works.
The startup offers its speech-to-text model as a hosted API that users can consume in their own applications and services. More than 600 companies use Gladia, including meeting recorders and note-taking assistants such as Attendance, Circleback, Method Financial, Recall, Sana, and Veed.io.
This particular use case is interesting because many companies need to chain API calls. They first convert speech to text, then feed the transcript into a large language model (LLM) such as GPT-4o or Claude 3.5 Sonnet to extract knowledge from vast walls of text.
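The chained pattern described above can be sketched roughly as follows. Everything here is an assumption for illustration: `fake_transcribe` and `fake_summarize` are hypothetical stand-ins for a speech-to-text API and an LLM API, and no real vendor SDK is being called.

```python
from typing import Callable, Dict


def transcribe_then_summarize(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    summarize: Callable[[str], str],
) -> Dict[str, str]:
    """Chain two separate services: speech-to-text first, then an LLM."""
    transcript = transcribe(audio)    # call 1: speech-to-text provider
    summary = summarize(transcript)   # call 2: third-party LLM provider
    return {"transcript": transcript, "summary": summary}


def fake_transcribe(audio: bytes) -> str:
    # Hypothetical stub: a real implementation would upload the audio
    # to a transcription API and return the recognized text.
    return "We agreed to ship on Friday. Marketing will draft the announcement."


def fake_summarize(text: str) -> str:
    # Hypothetical stub: a real implementation would prompt an LLM.
    # Here we simply keep the first sentence as a single bullet point.
    first_sentence = text.split(". ")[0].rstrip(".")
    return f"- {first_sentence}."


if __name__ == "__main__":
    result = transcribe_then_summarize(b"\x00\x01", fake_transcribe, fake_summarize)
    print(result["summary"])
```

Each step is a separate network round trip to a separate provider, with its own credentials, billing, and failure modes, which is the overhead a single combined call would remove.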
With the new funding, Gladia hopes to simplify this kind of pipeline by consolidating audio intelligence and LLM-based tasks into a single API call. For example, customers could get a summary of a conversation as a few bullet points without relying on a third-party LLM API.
Another problem that Gladia is trying to solve is latency. You may have seen a demo of real-time voice conversations with an AI-based calling agent (there's a great demo on the 11x website). To make these conversations sound human-like, these systems need to transcribe speech as close to real time as possible.
“We realized that real-time quality wasn't very good across the market, and people had weird use cases. They wanted to do real-time processing and then take the audio and run it in batch. We wondered, ‘Why are you doing this?’ They told us, ‘The quality isn't as good with real-time processing, so we'll process it later in batches,’” co-founder and CEO Jean-Louis Quéguiner (pictured above, right) told TechCrunch.
Gladia chose to tackle this problem and is now able to transcribe live conversations with less than 300 milliseconds of latency. The company claims that its real-time transcription is now roughly equivalent in quality to its default asynchronous batch transcription API, but it's difficult to tell without proper testing. As Quéguiner says, the startup aims for “batch quality with real-time capabilities.”
AI call agents aside, imagine a call center using these real-time features to help call agents find relevant information during a call. “Our single API is compatible with all existing technology stacks and protocols, including SIP, VoIP, FreeSwitch, and Asterisk,” co-founder and CTO Jonathan Soto (pictured above, left) said in a statement.
XAnge is leading the Series A funding round. Illuminate Financial, XTX Ventures, Athletico Ventures, Gaingels, Mana Ventures, Motier Ventures, Roosh Ventures and Soma Capital also participated.
Gladia believes that a “ChatGPT moment” for audio applications is just around the corner. GPT technology has been around for years, but ChatGPT really popularized LLMs with its consumer-facing, chat-like interface.
As Apple and Google begin to incorporate transcription models within iOS and Android, consumers will begin to see the value of automatic transcription within the apps they use. Developers will then integrate audio functionality into their products, and that's where API providers like Gladia come into the picture.