In its pitch to investors last spring, Anthropic said it would build AI to power virtual assistants that can perform research, respond to emails and handle other back-office tasks on their own. The company calls it a “next generation algorithm for AI self-learning” and believes that if all goes according to plan, it could one day automate large parts of the economy.
It took a while, but that AI is starting to arrive.
Anthropic on Tuesday released an upgraded version of its Claude 3.5 Sonnet model that can understand and interact with any desktop app. Via a new “Computer Use” API, currently in open beta, the model can mimic keystrokes, button clicks, and mouse gestures, essentially emulating a person sitting at a PC.
“We trained Claude to see what's happening on a screen and then use the available software tools to carry out tasks,” Anthropic wrote in a blog post shared with TechCrunch. “When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what's visible to the user, then counts how many pixels vertically or horizontally it needs to move the cursor in order to click in the correct place.”
Developers can experiment with computer use through Anthropic's API, Amazon Bedrock, and Google Cloud's Vertex AI platform. The new 3.5 Sonnet, without computer use, is rolling out to the Claude apps and brings various performance improvements over the outgoing 3.5 Sonnet model.
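Concretely, developers opt in by declaring a "computer" tool when calling the messages API. The sketch below, assuming Anthropic's Python SDK and the tool type and beta flag named in the launch documentation (both identifiers could change during the beta), shows the shape of such a request without actually sending it.

```python
# Illustrative request shape for the Computer Use beta, per Anthropic's
# launch documentation. Identifiers like "computer_20241022" are beta
# values and may change; treat this as a sketch, not a stable contract.

def computer_tool(width_px: int, height_px: int) -> dict:
    """Tool definition telling the model the screen size it will see."""
    return {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": width_px,
        "display_height_px": height_px,
    }

def build_request(prompt: str) -> dict:
    """Keyword arguments for a beta messages.create(...) call."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [computer_tool(1280, 800)],
        "messages": [{"role": "user", "content": prompt}],
        "betas": ["computer-use-2024-10-22"],
    }

# With the SDK installed and an API key set, the call itself would be:
#   import anthropic
#   client = anthropic.Anthropic()
#   response = client.beta.messages.create(**build_request("Fill out this form."))
```

Declaring the display dimensions up front matters because, as Anthropic describes, the model reasons about cursor movement in raw pixel counts against the screenshots it receives.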
App automation
Tools that can automate tasks on your PC aren't a new idea. Countless companies offer such tools, from decades-old established RPA vendors to startups like Relay, Induced AI, and Automat.
In the race to develop so-called “AI agents,” the field is only getting more crowded. “AI agent” remains an ill-defined term, but it generally refers to AI that can automate software.
Some analysts say AI agents could offer companies an easy path to monetizing the billions of dollars they are pouring into AI. Companies seem to agree. According to a recent study by Capgemini, 10% of organizations are already using AI agents and 82% plan to integrate them within the next three years.
Salesforce made some splashy announcements about its AI agent technology this summer, and Microsoft yesterday touted new tools for building AI agents. OpenAI, which is planning its own brand of AI agent, sees the technology as a step toward superintelligent AI.
Anthropic calls its take on the AI agent concept an “action-execution layer” that lets the new 3.5 Sonnet perform desktop-level commands. Thanks to its ability to browse the web (not a first for AI models, but a first for Anthropic), 3.5 Sonnet can use any website and any application.
Anthropic's new AI can control apps on a PC. Image Credits: Anthropic
“Humans remain in control by providing specific prompts that direct Claude's actions, like ‘use data from my computer and online to fill out this form,'” an Anthropic spokesperson told TechCrunch. “People enable access and limit access as needed. Claude breaks down the user's prompts into computer commands (e.g. moving the cursor, clicking, typing) to accomplish that specific task.”
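That decomposition can be pictured as a loop: the model emits low-level actions, and the host executes them and returns a fresh screenshot. The toy executor below is purely illustrative; the action names mirror those in Anthropic's beta documentation, but the string-returning bodies stand in for real OS control (a production host would drive a display, e.g. a sandboxed virtual desktop).

```python
# Stand-in executor for the kinds of low-level actions the model emits
# when breaking a prompt into computer commands. Action names follow
# Anthropic's beta docs; the bodies here only log what would happen.

def execute_action(action: dict) -> str:
    kind = action["action"]
    if kind == "screenshot":
        return "captured screenshot"
    if kind == "mouse_move":
        x, y = action["coordinate"]
        return f"moved cursor to ({x}, {y})"
    if kind == "left_click":
        return "clicked left mouse button"
    if kind == "type":
        return f"typed {action['text']!r}"
    if kind == "key":
        return f"pressed {action['text']}"
    raise ValueError(f"unsupported action: {kind}")

# A form-filling prompt might decompose into a sequence like this:
steps = [
    {"action": "screenshot"},
    {"action": "mouse_move", "coordinate": [412, 230]},
    {"action": "left_click"},
    {"action": "type", "text": "Jane Doe"},
]
log = [execute_action(s) for s in steps]
```

In the real loop, each executed action is followed by a new screenshot sent back to the model, which is what lets it self-correct when a click lands in the wrong place.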
Software development platform Replit has used an early version of the new 3.5 Sonnet model to build an “autonomous verifier” that can evaluate apps while they're being built. Meanwhile, Canva says it is exploring ways the new model could support its design and editing processes.
But how is this different from other AI agents? That's a fair question. Rabbit, a consumer gadget startup, builds web agents that can do things like buy movie tickets online. Adept, recently acquired by Amazon, trains models to browse websites and interact with software. Twin Labs uses off-the-shelf models such as OpenAI's GPT-4o to automate desktop processes.
Anthropic claims the new 3.5 Sonnet is simply a more robust model that performs better on coding tasks than even OpenAI's flagship o1, per the SWE-bench Verified benchmark. Despite not being explicitly trained to do so, the upgraded 3.5 Sonnet self-corrects and retries tasks when it encounters obstacles, and it can work toward objectives that require dozens or hundreds of steps.
The performance of the new Claude 3.5 Sonnet model on various benchmarks. Image Credits: Anthropic
But don't fire your secretary just yet.
In an evaluation designed to test an AI agent's ability to help with airline booking tasks, such as modifying a flight reservation, the new 3.5 Sonnet successfully completed less than half of the tasks. In a separate test involving tasks such as initiating a product return, 3.5 Sonnet failed roughly a third of the time.
Anthropic admits that the upgraded 3.5 Sonnet struggles with basic actions like scrolling and zooming, and that it can miss “short-lived” actions and notifications because of the way it takes screenshots and pieces them together.
“Claude's computer usage remains slow and error-prone,” Anthropic wrote in its post. “We encourage developers to start exploring with low-risk tasks.”
Risky business
But is the new 3.5 Sonnet dangerously capable? Probably.
A recent study found that models without the ability to use desktop apps, such as OpenAI's GPT-4o, were willing to engage in harmful “multi-step agent behavior,” like ordering a fake passport from someone on the dark web, when “attacked” using jailbreaking techniques. According to the researchers, jailbreaks led to high rates of success in performing harmful tasks even for models protected by filters and safeguards.
One can imagine that a model with desktop access could cause even greater harm, say, by exploiting vulnerabilities in apps to compromise personal information (or by storing chats in plaintext). Leaving aside the software levers at its disposal, the model's online and app connections could open up avenues for malicious jailbreakers.
Anthropic doesn't deny that releasing the new 3.5 Sonnet carries risks. But the company argues that the benefits of observing how the model is used in the wild ultimately outweigh those risks.
“We think it's far better to give computer access to today's more limited, relatively safer models,” the company wrote. “This means we can begin to observe and learn from any potential issues that arise at this lower level, building up computer-use capabilities and safety mitigations gradually and simultaneously.”
Image Credits: Anthropic
Anthropic also says it has taken steps to deter misuse, such as not training the new 3.5 Sonnet on users' screenshots and prompts, and preventing the model from accessing the web during training. The company says it developed classifiers to steer 3.5 Sonnet away from actions perceived as high-risk, such as posting on social media, creating accounts, and interacting with government websites.
With the U.S. general election approaching, Anthropic says it is focused on mitigating election-related abuse of its models. The U.S. AI Safety Institute and the U.K. Safety Institute, two separate but allied government agencies dedicated to assessing AI model risk, tested the new 3.5 Sonnet prior to its deployment.
Anthropic told TechCrunch that it has the ability to restrict access to additional websites and features “as needed” to protect against spam, fraud, misinformation, and the like. As a safety precaution, the company retains any screenshots captured by Computer Use for at least 30 days, a retention period that might alarm some developers.
We asked Anthropic under what circumstances it would pass on screenshots to third parties (such as law enforcement) if requested. A spokesperson said the company “will comply with requests for data pursuant to valid legal process.”
“There are no foolproof methods, and we will continuously evaluate and iterate on our safety measures to balance Claude's capabilities with responsible use,” Anthropic said. “Those using the computer-use version of Claude should take appropriate precautions to minimize these kinds of risks, including isolating Claude from particularly sensitive data on their computer.”
Hopefully, that will be enough to prevent the worst from happening.
A cheaper model
The upgraded 3.5 Sonnet model may have been today's headliner, but Anthropic also said an updated version of Haiku, the cheapest and most efficient model in its Claude series, is on the way.
Claude 3.5 Haiku, due in the coming weeks, will match the performance of Claude 3 Opus, Anthropic's former state-of-the-art model, on certain benchmarks, at the same cost and “approximate speed” of Claude 3 Haiku.
“With low latency, improved instruction following, and more accurate tool use, Claude 3.5 Haiku is well suited for user-facing products, specialized sub-agent tasks, and generating personalized experiences from huge volumes of data, like purchase history, pricing, or inventory data,” Anthropic wrote in a blog post.
3.5 Haiku will be available initially as a text-only model and later as part of a multimodal package that can analyze both text and images.
3.5 Haiku's performance on various benchmarks. Image Credits: Anthropic
So, will there be any reason to use 3 Opus once 3.5 Haiku becomes available? What about 3.5 Opus, the successor to 3 Opus that Anthropic teased back in June?
“Every model in the Claude 3 model family has individual applications for customers,” said an Anthropic spokesperson. “Claude 3.5 Opus is on our roadmap and we plan to share more details as soon as possible.”