The AI industry is increasingly moving toward generative AI models with longer contexts. But models with large context windows tend to be computationally expensive. Or Dagan, product lead at AI startup AI21 Labs, argues that this doesn't necessarily have to be the case, and his company has released a generative model to prove it.
Context, or context window, refers to the input data (such as text) that a model considers before producing output (such as additional text). Models with small context windows tend to forget the content of even very recent conversations, whereas models with large context windows avoid this pitfall and, as an added benefit, better grasp the flow of the data they take in.
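To make the idea concrete, here is a minimal sketch (not AI21's code) of why a small context window causes a model to "forget": anything that falls outside the window is simply never shown to the model. The whitespace tokenizer and helper function below are purely illustrative assumptions.

```python
# Illustrative sketch: a small context window drops the oldest tokens,
# so the model never sees the earlier parts of the conversation.
# Tokenization here is a naive whitespace split; real models use subword tokenizers.

def build_prompt(turns: list[str], context_window: int) -> list[str]:
    """Keep only the most recent tokens that still fit in the window."""
    tokens: list[str] = []
    for turn in turns:
        tokens.extend(turn.split())
    return tokens[-context_window:]  # everything before the cutoff is lost

conversation = [
    "My name is Alice and I live in Lisbon.",
    "Earlier you asked about my travel plans.",
    "What city did I say I live in?",
]

print(build_prompt(conversation, context_window=8))   # the early details are cut off
print(build_prompt(conversation, context_window=64))  # the whole exchange fits
```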
AI21 Labs' Jamba, a new text generation and analysis model, can perform many of the same tasks as models such as OpenAI's ChatGPT and Google's Gemini. Trained on a combination of public and proprietary data, Jamba can produce text in English, French, Spanish, and Portuguese.
Jamba can process up to 140,000 tokens while running on a single GPU with at least 80 GB of memory (such as a high-end Nvidia A100). That equates to about 105,000 words, or 210 pages, roughly the length of a decent-sized novel.
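The conversion behind those figures is simple back-of-the-envelope arithmetic; the ratios below (about 0.75 words per token and about 500 words per page) are common rules of thumb, not numbers from AI21.

```python
# Rough conversion from tokens to words and pages, using rule-of-thumb ratios.
tokens = 140_000
words_per_token = 0.75   # assumption: typical average for English text
words_per_page = 500     # assumption: typical manuscript page

words = tokens * words_per_token   # 105,000 words
pages = words / words_per_page     # 210 pages
print(f"{words:,.0f} words, about {pages:.0f} pages")
```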
By comparison, Meta's Llama 2 has a 32,000-token context window (on the small side by today's standards), but only requires a GPU with about 12 GB of memory to run. (Context windows are typically measured in tokens, which are bits of raw text or other data.)
At first glance, Jamba is unremarkable. There are a number of generative AI models available for free download, from Databricks' recently released DBRX to the aforementioned Llama 2.
But what makes Jamba unique is what's under the hood. It uses a combination of two model architectures: transformers and state-space models (SSMs).
Transformers are an ideal architecture for complex reasoning tasks, powering models such as GPT-4 and Google's Gemini. Transformers have several defining characteristics, but by far the most distinctive is their “attention mechanism.” For each piece of input data (such as a sentence), a transformer weighs the relevance of every other input (other sentences) and draws on them to produce an output (a new sentence).
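That weighing step is easiest to see in a toy implementation. The sketch below is a generic scaled dot-product attention in plain NumPy, not the actual code inside GPT-4, Gemini, or Jamba; the names and dimensions are illustrative assumptions.

```python
# Toy scaled dot-product attention: each position scores every other position,
# then mixes their values according to those scores.
import numpy as np

def attention(queries, keys, values):
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)          # relevance of every input to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the inputs
    return weights @ values                           # weighted blend of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))   # 5 token embeddings of width 16
out = attention(x, x, x)       # self-attention: every token attends to every other
print(out.shape)               # (5, 16)
```

Because every token scores every other token, the cost of this step grows quadratically with sequence length, which is one reason long contexts are expensive for pure transformer models.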
SSMs, on the other hand, combine several characteristics of older types of AI models, such as recurrent neural networks and convolutional neural networks, to create a more computationally efficient architecture capable of handling long sequences of data.
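At their core, most SSM layers step a fixed-size hidden state through the sequence, which is why their cost grows linearly with length. The sketch below is a bare-bones linear state-space recurrence for illustration only; Mamba's actual implementation adds input-dependent parameters and a hardware-efficient parallel scan.

```python
# Bare-bones linear state-space recurrence:
#   h_t = A @ h_{t-1} + B @ x_t ;  y_t = C @ h_t
# The fixed-size state h summarizes everything seen so far.
import numpy as np

def ssm_scan(A, B, C, inputs):
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in inputs:            # cost grows linearly with sequence length
        h = A @ h + B @ x_t
        outputs.append(C @ h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(8, 8))   # state transition
B = rng.normal(size=(8, 4))              # input projection
C = rng.normal(size=(2, 8))              # output projection
seq = rng.normal(size=(1000, 4))         # a long input sequence
print(ssm_scan(A, B, C, seq).shape)      # (1000, 2)
```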
SSMs have their limitations. But some early incarnations, including an open-source model called Mamba from researchers at Princeton University and Carnegie Mellon University, can handle larger inputs than their transformer-based equivalents while outperforming them on language generation tasks.
In fact, Jamba uses Mamba as part of its core model, and Dagan claims it delivers three times the throughput on long contexts compared to transformer-based models of comparable size.
“There are some early academic examples of SSM models, but this is the first commercial-grade, production-scale model,” Dagan said in an interview with TechCrunch. “This architecture offers great efficiency and throughput potential, in addition to being innovative and interesting for further research by the community.”
Jamba is released under the Apache 2.0 license, an open source license with relatively few usage restrictions, but Dagan stresses that it is a research release, not intended for commercial use. The model has no safeguards to prevent the generation of harmful text and no mitigations to address potential bias; a fine-tuned, ostensibly “safer” version is expected to be available in the coming weeks.
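For readers who want to experiment with the research release, the sketch below shows how a downloadable checkpoint like this is typically loaded with the Hugging Face transformers library. The repository id and loading flags are assumptions rather than details from the article, so check AI21's official model card first; actually running a model of this size also requires a suitably large GPU, as noted above.

```python
# Hedged sketch of loading a downloadable checkpoint with Hugging Face transformers.
# The repo id "ai21labs/Jamba-v0.1" is an assumption; consult AI21's model card
# for the actual name, hardware requirements, and recommended settings.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "State-space models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```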
However, Dagan argues that Jamba is demonstrating the promise of SSM architecture even at this early stage.
“The added value of this model is that it can easily fit onto a single GPU, both due to its size and its innovative architecture,” he said. “We believe that additional tweaks to Mamba will further improve performance.”