Gemini 1.5 Pro, Google's most capable generative AI model, is now available in public preview on Vertex AI, Google's enterprise AI development platform. The company announced the news at its annual Cloud Next conference in Las Vegas this week.
Gemini 1.5 Pro was launched in February and joins Google's Gemini family of generative AI models. Arguably its headline feature is the amount of context it can handle: from 128,000 tokens up to 1 million tokens. Here, “tokens” refers to subdivided bits of raw data, such as the syllables “fan”, “tas” and “tic” in the word “fantastic”.
One million tokens equates to approximately 700,000 words or roughly 30,000 lines of code. That is about four times the amount of data that Anthropic's flagship model, Claude 3, can accept as input, and about eight times the maximum context of OpenAI's GPT-4 Turbo.
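For a rough sense of how real inputs map to token counts, the Vertex AI Python SDK exposes a count_tokens helper on its generative models. Below is a minimal sketch; the project ID, region and preview model name are placeholder assumptions and may differ from what your account exposes.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project and region; substitute your own GCP values.
vertexai.init(project="my-project", location="us-central1")

# Preview model name is an assumption and may vary by release.
model = GenerativeModel("gemini-1.5-pro-preview-0409")

# count_tokens reports how many tokens an input would consume, which is
# handy for checking how close a document sits to the 1M-token ceiling.
response = model.count_tokens("The quick brown fox jumps over the lazy dog.")
print(response.total_tokens)
```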
A model's context, or context window, refers to the initial set of data (such as text) that the model considers before producing output (such as additional text). A simple question – “Who won the 2020 US presidential election?” – serves as context, as does a movie script, email, essay, or e-book.
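In practice, supplying context can be as simple as prepending a document to the question in a single prompt. A minimal sketch, assuming the same Vertex AI setup as above; the file name and question are purely illustrative.

```python
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-1.5-pro-preview-0409")  # preview name, assumed

# With a 1M-token window, an entire script or e-book can serve as context.
with open("movie_script.txt") as f:  # illustrative placeholder file
    script = f.read()

prompt = f"{script}\n\nBased on the script above, summarize the protagonist's arc."
print(model.generate_content(prompt).text)
```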
Models with small context windows tend to “forget” even recent conversations and veer off topic. That is not necessarily true of models with large contexts. Larger-context models can also better grasp the narrative flow of the data they ingest, produce richer, context-aware responses, and (at least hypothetically) reduce the need for fine-tuning and factual grounding techniques.
So what exactly can you do with a million-token context window? Google promises a lot, including analyzing code libraries, “reasoning” over long documents, and holding long conversations with chatbots.
Gemini 1.5 Pro is multilingual, and multimodal in the sense that it can understand images, video and, as of Tuesday, audio streams in addition to text. The model can analyze and compare media content such as TV shows, movies, radio broadcasts and conference call recordings, across different languages. One million tokens equals approximately one hour of video or about 11 hours of audio.
Thanks to its audio processing capabilities, Gemini 1.5 Pro can also produce transcriptions of video clips, though the jury is still out on the quality of those transcriptions.
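Media analysis and transcription follow the same generate_content pattern, with files referenced by Cloud Storage URI. A hedged sketch under the same assumptions as above; the bucket path, file and prompt are placeholders.

```python
from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro-preview-0409")  # preview name, assumed

# Media is passed by Cloud Storage URI; bucket and file are placeholders.
clip = Part.from_uri("gs://my-bucket/broadcast.mp4", mime_type="video/mp4")

# A single request can mix media and text, e.g. asking for both a summary
# and a transcript of the spoken audio.
response = model.generate_content(
    [clip, "Summarize this broadcast and transcribe the dialogue."]
)
print(response.text)
```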
In a pre-recorded demo from earlier this year, Google showed Gemini 1.5 Pro searching the transcript of the Apollo 11 moon landing telecast (roughly 400 pages) for humorous quotes, and finding a scene in movie footage from a rough pencil sketch.
1. Break down and understand long videos
We uploaded last night's entire NBA dunk contest and asked which dunk had the highest score.
Gemini 1.5 was incredibly able to find a specific perfect 50 dunk and its details just from understanding a long contextual video. pic.twitter.com/01iUfqfiAO
— Rowan Cheung (@rowancheung) February 18, 2024
Google says early users of Gemini 1.5 Pro, including United Wholesale Mortgage, TBS and Replit, are taking advantage of the large context window for tasks spanning mortgage underwriting; automating metadata tagging of media archives; and generating, explaining and transforming code.
Gemini 1.5 Pro can't process a million tokens at the snap of a finger, however. In the aforementioned demo, each search took between 20 seconds and a minute to complete, far longer than the average ChatGPT query.
However, Google has previously said that latency is an area of focus and that it is working on “optimizing” Gemini 1.5 Pro over time.
Notably, Gemini 1.5 Pro is gradually making its way into other parts of Google's enterprise product ecosystem. The company announced on Tuesday that the model (in private preview) will power new features in Gemini Code Assist, Google's generative AI coding assistance tool. Google says developers will now be able to make “large-scale” changes across codebases, such as updating cross-file dependencies and reviewing large chunks of code.