Typically, Cloud Next, one of Google's two major annual developer conferences (the other being I/O), features almost exclusively managed, closed source products and services gated behind locked-down APIs. But this year, whether to foster developer goodwill or advance its ecosystem ambitions (or both), Google has debuted a number of open source tools primarily aimed at supporting generative AI projects and infrastructure.
The first, MaxDiffusion, which Google actually quietly released back in February, is a collection of reference implementations of various diffusion models (models like the image generator Stable Diffusion) that run on XLA devices. “XLA” stands for Accelerated Linear Algebra, an admittedly awkward acronym referring to a technique that optimizes and speeds up certain types of AI workloads, including fine-tuning and serving.
Google's own tensor processing units (TPUs) are XLA devices, as are recent Nvidia GPUs.
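To give a sense of what “running on XLA devices” means in practice, here is a minimal, generic sketch using JAX, one of the frameworks that compiles to XLA. It is not MaxDiffusion code; the function and shapes are arbitrary examples.

```python
# Generic illustration of XLA compilation via JAX (not MaxDiffusion code).
import jax
import jax.numpy as jnp

@jax.jit  # compiles and fuses the function with XLA for the available device (CPU/GPU/TPU)
def scaled_matmul(a, b, scale):
    return scale * (a @ b)

a = jnp.ones((1024, 1024))
b = jnp.ones((1024, 1024))

# The first call triggers XLA compilation; subsequent calls reuse the compiled program.
out = scaled_matmul(a, b, 0.5)
print(out.shape)
```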
Beyond MaxDiffusion, Google is launching JetStream, a new engine for running generative AI models, specifically text-generating models (so not Stable Diffusion). Currently limited to supporting TPUs, with GPU compatibility supposedly coming in the future, JetStream offers up to 3x higher “performance per dollar” for models like Google's own Gemma 7B and Meta's Llama 2, Google claims.
“As customers bring their AI workloads to production, there's an increasing demand for a cost-efficient inference stack that delivers high performance,” Mark Lohmeyer, Google Cloud's GM of compute and machine learning infrastructure, said in a blog post shared with TechCrunch. “JetStream helps with this need…and includes optimizations for popular open models such as Llama 2 and Gemma.”
Now, a “3x” improvement is quite a claim, and it's not exactly clear how Google arrived at that number. Which generation of TPU was used? Compared against which baseline engine? And how is “performance” being defined here in the first place?
We've asked Google all of these questions and will update this post if we hear back.
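For illustration, here is one plausible way a “performance per dollar” figure could be defined, as inference throughput per dollar of accelerator time. The metric and all numbers below are hypothetical; Google has not said whether this is how its 3x figure was computed.

```python
# Hypothetical "performance per dollar" metric; numbers are illustrative, not benchmarks.
def perf_per_dollar(tokens_per_second: float, instance_price_per_hour: float) -> float:
    """Tokens generated per dollar of accelerator time."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / instance_price_per_hour

# Made-up numbers to show how a "3x" figure could arise on identical hardware pricing.
baseline = perf_per_dollar(tokens_per_second=400, instance_price_per_hour=4.00)
optimized = perf_per_dollar(tokens_per_second=1200, instance_price_per_hour=4.00)
print(optimized / baseline)  # -> 3.0
```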
Second to last on Google's list of open source contributions are new additions to MaxText, Google's collection of text-generating AI models targeting TPUs and Nvidia GPUs in the cloud. MaxText now includes Gemma 7B, OpenAI's GPT-3 (the predecessor to GPT-4), Llama 2 and models from AI startup Mistral, all of which Google says can be customized and fine-tuned to developers' needs.
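MaxText is built on JAX, so fine-tuning ultimately comes down to XLA-compiled training steps. The following is a conceptual sketch of a single fine-tuning step with a toy stand-in model; it is not MaxText's actual API, architecture, or configuration system.

```python
# Conceptual sketch of a JAX fine-tuning step (toy model, not MaxText's code).
import jax
import jax.numpy as jnp

def loss_fn(params, tokens, targets):
    # Toy "language model": one embedding plus a projection back to the vocabulary.
    logits = jnp.take(params["embed"], tokens, axis=0) @ params["proj"]
    one_hot = jax.nn.one_hot(targets, logits.shape[-1])
    return -jnp.mean(jnp.sum(jax.nn.log_softmax(logits) * one_hot, axis=-1))

@jax.jit  # the whole update is compiled with XLA, which is what runs on TPUs/GPUs
def train_step(params, tokens, targets, lr=1e-3):
    loss, grads = jax.value_and_grad(loss_fn)(params, tokens, targets)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss

key = jax.random.PRNGKey(0)
vocab, dim = 256, 64
params = {
    "embed": 0.02 * jax.random.normal(key, (vocab, dim)),
    "proj": 0.02 * jax.random.normal(key, (dim, vocab)),
}
tokens = jnp.array([[1, 2, 3, 4]])
targets = jnp.array([[2, 3, 4, 5]])
params, loss = train_step(params, tokens, targets)
print(float(loss))
```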
“We've significantly optimized [the models’] performance on TPUs and also partnered closely with Nvidia to optimize performance on large GPU clusters,” Lohmeyer said. “These improvements maximize GPU and TPU utilization, leading to higher energy efficiency and cost optimization.”
Finally, Google has collaborated with AI startup Hugging Face to create Optimum TPU, which provides tooling to bring certain AI workloads to TPUs. The goal, Google says, is to reduce the barrier to entry for getting generative AI models, particularly text-generating models, onto TPU hardware.
But Optimum TPU is a bit bare-bones for now. The only model it works with is Gemma 7B. And Optimum TPU doesn't yet support training generative models on TPUs, only running them.
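For context, this is the kind of workflow Optimum TPU is meant to simplify, shown here with the standard Hugging Face transformers API for Gemma 7B. This sketch does not use Optimum TPU's own entry points, and it assumes you have access to the gated google/gemma-7b checkpoint and a suitable runtime.

```python
# Sketch of Gemma 7B text generation with the standard transformers API
# (Optimum TPU aims to make this kind of workflow run on TPU hardware).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

inputs = tokenizer("Open source on TPUs is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```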
Google is promising improvements down the road.