As companies move from experimenting with generative AI in limited prototypes to putting it into production, they are becoming increasingly price-conscious. After all, using large language models isn't cheap. One way to reduce costs is to return to the old concept of caching. Another is to route simpler queries to smaller, more cost-effective models. AWS announced both of these features for its Bedrock LLM hosting service today at the re:Invent conference in Las Vegas.
First, let's talk about prompt caching. "Say you have a document and multiple people are asking questions about the same document, and you're paying money every time," Atul Deo, director of product at Bedrock, told me. "And these context windows are getting longer and longer. For example, with Nova, 300k [tokens of] context and 2 million [tokens of] context. I think it could go even higher by next year."
Image credit: AWS
Caching essentially eliminates the need to pay for models to perform repetitive work or reprocess the same (or substantially similar) queries over and over again. According to AWS, this can reduce costs by up to 90%, and a byproduct is that the latency of getting a response from the model drops significantly as well (by up to 85%, according to AWS). Adobe tested prompt caching for some of its generative AI applications on Bedrock and saw a 72% reduction in response time.
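For developers wondering what this looks like in practice, here is a minimal sketch using the boto3 Converse API. It is only illustrative: the model ID, document variable, and the exact shape of the cache marker are assumptions and may differ from what AWS ships.

```python
import boto3

# A sketch of prompt caching with the Bedrock Converse API (boto3).
# Assumptions: the model ID below supports caching and the cachePoint
# content block is accepted in this form.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

long_document = open("contract.txt").read()  # the large, shared context

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # illustrative
    messages=[
        {
            "role": "user",
            "content": [
                {"text": long_document},
                # Everything before this marker is eligible to be cached,
                # so repeat questions over the same document are not
                # reprocessed (and re-billed) from scratch.
                {"cachePoint": {"type": "default"}},
                {"text": "What are the termination clauses in this document?"},
            ],
        }
    ],
)

print(response["output"]["message"]["content"][0]["text"])
```

Subsequent calls that reuse the same document prefix would hit the cache, which is where the cost and latency savings AWS describes would come from.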
Another major new feature is Bedrock's intelligent prompt routing. This allows Bedrock to automatically route prompts to different models within the same model family, helping businesses strike the right balance between performance and cost. The system automatically predicts (using small language models) how each model will perform for a given query and routes requests accordingly.
Image credit: AWS
"Sometimes my query is very simple. Do I really need to send that query to the most capable model, which is very expensive and slow? Probably not. So basically, we want to create the concept where, at runtime, based on the incoming prompt, we send the right query to the right model," Deo explained.
Of course, LLM routing is not a new concept. Startups like Martian and a number of open source projects are also working on this, but AWS argues that what sets its service apart is that the router can intelligently direct queries with little human input. One limitation, though, is that it can only route queries to models within the same model family. In the long term, Deo said, the team plans to expand the system and give users more customization options.
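In practice, routing would likely mean pointing an existing Converse call at a router instead of a specific model. The sketch below is an assumption about how that could look; the router ARN is hypothetical and would come from the Bedrock console once routing is enabled for an account.

```python
import boto3

# A sketch of intelligent prompt routing: the request targets a prompt
# router rather than a single model, and Bedrock picks a model within
# the same family based on the incoming prompt.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical router identifier; real values are obtained from the
# Bedrock console for the chosen model family.
router_arn = (
    "arn:aws:bedrock:us-east-1:123456789012:"
    "default-prompt-router/anthropic.claude:1"
)

response = client.converse(
    modelId=router_arn,  # the router stands in for a concrete model ID
    messages=[
        {"role": "user", "content": [{"text": "What time zone is Las Vegas in?"}]}
    ],
)

# A simple question like this should be routed to a smaller, cheaper
# model in the family; a harder prompt would go to a more capable one.
print(response["output"]["message"]["content"][0]["text"])
```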
Image credit: AWS
Finally, AWS is also launching a new marketplace for Bedrock. The idea here, Deo said, is that while Amazon partners with many of the major model providers, there are now hundreds of specialized models that may have only a few dedicated users. Because those customers have asked the company to support such models, AWS is launching a marketplace for them, where the only major difference is that users will have to provision and manage their own infrastructure capacity, something Bedrock typically handles automatically. In total, AWS will offer about 100 of these emerging and specialized models, with more to come.