Chang She, previously VP of Engineering at Tubi and a Cloudera veteran, has years of experience building data tools and infrastructure. But when he started working in AI, he quickly ran into problems with traditional data infrastructure that prevented him from putting AI models into production.
“Machine learning engineers and AI researchers are often locked into a substandard development experience,” he told TechCrunch in an interview. “Data infrastructure companies don't understand the machine learning data problem at a fundamental level.”
So Chang, one of the co-creators of the wildly popular Python data science library Pandas, co-founded LanceDB with software engineer Lei Xu.
LanceDB builds the eponymous open source database software. It's designed to support multimodal AI models, that is, models trained to generate images, videos, and more in addition to text. Backed by Y Combinator, LanceDB raised $8 million in seed funding this month in a round led by CRV, Essence VC, and Swift Ventures, bringing its total funding to $11 million.
“If multimodal AI is critical to your company's future success, you want your very expensive AI team to focus on bridging models, AI, and business value,” Chang said. “Unfortunately, AI teams currently spend most of their time dealing with low-level data infrastructure details. LanceDB provides the foundation that AI teams need, so they are free to focus on what truly matters and can bring AI products to market much faster than they otherwise would.”
LanceDB is essentially a vector database: a database that stores sequences of numbers (“vectors”) that encode the meaning of unstructured data (images, text, etc.).
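To make the idea concrete, here is a minimal sketch of what a vector lookup does, written in plain Python. It is not LanceDB's API or implementation: the three-number "embeddings," the `store` dictionary, and the `nearest` helper are all hypothetical, chosen only to illustrate that items with similar meaning end up with vectors pointing in similar directions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings. Real embeddings come from a trained model
# and typically have hundreds or thousands of dimensions.
store = {
    "photo of a cat": [0.9, 0.1, 0.0],
    "photo of a dog": [0.8, 0.3, 0.1],
    "quarterly sales report": [0.0, 0.2, 0.9],
}


def nearest(query, k=2):
    """Return the labels of the k stored vectors most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# A query vector that points in roughly the same direction as the pet photos.
print(nearest([1.0, 0.0, 0.0]))
```

This brute-force scan compares the query against every stored vector. Production vector databases generally rely on approximate indexes to avoid that cost at scale.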
As my colleague Paul Sawers recently wrote, vector databases are having a moment as the AI hype cycle reaches its peak. That's because they're useful for all kinds of AI applications, from content recommendations on e-commerce and social media platforms to hallucination mitigation.
Competition among vector databases is fierce: Qdrant, Vespa, Weaviate, Pinecone, and Chroma, to name a few vendors (excluding big tech companies). So what's unique about LanceDB? According to Chang, it offers increased flexibility, performance, and scalability.
For one, LanceDB, which is built on Apache Arrow, uses Lance Format, a custom data format optimized for multimodal AI training and analysis, Chang said. Lance Format enables LanceDB to handle up to billions of vectors and petabytes of text, images, and video, and lets engineers manage the various forms of metadata associated with that data.
“Until now, there has been no system that can integrate training, exploration, search, and large-scale data processing,” Chang said. “Lance Format gives AI researchers and engineers a single source of truth and blazing-fast performance across their AI pipelines. It's more than just storing vectors.”
LanceDB makes money by selling fully managed versions of its open source software with added features such as hardware acceleration and governance controls, and business appears to be strong. The company's customer list includes text-to-image platform Midjourney, chatbot unicorn Character.ai, self-driving car startup WeRide, and Airtable.
However, Chang insists that LanceDB's recent VC backing won't shift attention away from the open source project, which he says currently receives about 600,000 downloads per month.
“We wanted to build something that would make it 10 times easier for AI teams to work with large-scale multimodal data,” he said. “LanceDB offers, and will continue to offer, a very rich set of ecosystem integrations to minimize deployment effort.”