Hello everyone, welcome to TechCrunch's regular AI newsletter. If you'd like to have this sent to your inbox every Wednesday, sign up here.
Synthetic data took center stage in AI this week.
Last Thursday, OpenAI introduced Canvas, a new way to interact with ChatGPT, its AI-powered chatbot platform. Canvas opens a window with a workspace for writing and coding projects. Users can generate text or code in Canvas and, if needed, highlight sections for ChatGPT to edit.
From a user's perspective, Canvas is a big quality-of-life improvement. But what's most interesting about the feature, to us, is the fine-tuned model powering it. OpenAI says it tuned its GPT-4o model with synthetic data to “enable new user interactions” in Canvas.
“We used novel synthetic data generation techniques, such as distilling outputs from OpenAI’s o1-preview, to fine-tune GPT-4o to open canvases, make targeted edits, and leave high-quality comments inline,” said Nick Turley, head of product for ChatGPT. “This approach allowed us to rapidly improve the model and enable new user interactions, all without relying on human-generated data.”
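OpenAI hasn’t published the pipeline behind this, but the distillation recipe Turley describes is conceptually simple: have a stronger “teacher” model produce example completions, collect them into a supervised dataset, and fine-tune the “student” on it. Here’s a minimal sketch using the OpenAI Python SDK; the prompts, file names, and the choice of “gpt-4o-2024-08-06” as the student are illustrative assumptions, not details from the announcement.

```python
# Sketch of distillation-style synthetic data generation: a "teacher"
# model produces completions that become fine-tuning data for a
# "student" model. Prompts and model choices are illustrative only.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = [
    "Open a canvas and draft a short blog post about tidepools.",
    "Make a targeted edit: tighten the second paragraph below.",
]

with open("synthetic_train.jsonl", "w") as f:
    for prompt in prompts:
        # The teacher generates the "gold" behavior to imitate.
        teacher_out = client.chat.completions.create(
            model="o1-preview",
            messages=[{"role": "user", "content": prompt}],
        )
        # Each line becomes one supervised fine-tuning example.
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant",
                 "content": teacher_out.choices[0].message.content},
            ]
        }) + "\n")

# Upload the dataset and start a fine-tune of the student model.
training_file = client.files.create(
    file=open("synthetic_train.jsonl", "rb"), purpose="fine-tune"
)
client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-4o-2024-08-06"
)
```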
OpenAI isn't the only Big Tech company increasingly relying on synthetic data to train its models.
In developing Movie Gen, a suite of AI-powered tools for creating and editing video clips, Meta relied in part on synthetic captions generated by a derivative of its Llama 3 model. The company hired a team of human annotators to fix errors in and add detail to these captions, but much of the groundwork was automated.
OpenAI CEO Sam Altman has argued that AI will one day generate enough synthetic data to effectively train itself. This could be an advantage for companies like OpenAI, which spend a lot of money on human annotators and data licenses.
Meta also used synthetic data to fine-tune the Llama 3 models themselves. And OpenAI is said to be sourcing synthetic training data from o1 for its next-generation model, code-named Orion.
But taking a synthetic-data-first approach comes with risks. As a researcher recently pointed out to me, the models used to generate synthetic data inevitably hallucinate (i.e., make things up) and carry their own biases and limitations. These flaws end up in the data the models produce.
Using synthetic data safely, then, requires thoroughly curating and filtering it, just as standard practice dictates for human-generated data. Failing to do so can lead to model collapse, in which a model becomes less “creative” and more biased in its outputs, eventually seriously compromising its functionality.
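As a rough illustration of what that curation can look like, here’s a toy filtering pass in Python. The heuristics and thresholds are invented for illustration; production pipelines layer on classifier-based quality scoring, decontamination, and human spot checks.

```python
# Toy curation pass over synthetic samples: exact-duplicate removal
# plus a few heuristic quality gates. Thresholds are illustrative,
# not recommendations.
def curate(samples: list[str]) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for text in samples:
        normalized = " ".join(text.split()).lower()
        if normalized in seen:  # drop exact duplicates
            continue
        if len(normalized) < 40:  # drop degenerate short outputs
            continue
        words = normalized.split()
        if len(set(words)) / len(words) < 0.3:  # drop repetitive loops
            continue
        seen.add(normalized)
        kept.append(text)
    return kept

if __name__ == "__main__":
    good = "A detailed synthetic caption describing a beach at dusk."
    raw = [good] * 3 + ["ok", "word " * 50]
    print(f"kept {len(curate(raw))} of {len(raw)} samples")  # kept 1 of 5
```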
That’s no easy task at scale. But as real-world training data grows more costly (not to mention harder to obtain), AI vendors may come to see synthetic data as the only viable path forward. Let’s hope they’re cautious in adopting it.
News
Ads in AI Overviews: Google says it will soon begin showing ads in AI Overviews, the AI-generated summaries it supplies for certain Google Search queries.
Google Lens now has video: Lens, Google's visual search app, has been upgraded to answer questions about your surroundings in near real time. You can capture video through Lens and ask questions about objects of interest in it. (There will probably be ads for this, too.)
From Sora to DeepMind: Tim Brooks, one of the leads on OpenAI's video generator Sora, has left for rival Google DeepMind. In a post on X, Brooks announced that he'll be working on video generation technology and a “world simulator.”
FLUX goes public: Black Forest Labs, the Andreessen Horowitz-backed startup behind the image generation component of xAI's Grok assistant, has opened its API in public beta and released a new model.
Not so transparent: California's recently passed bill AB-2013 requires companies developing generative AI systems to publish a high-level summary of the data they used to train those systems. So far, few companies have said whether they'll comply. The law gives them until January 2026.
This week's research paper
Apple researchers have been hard at work on computational photography for years, and a key part of the process is depth mapping. Originally this was done with specialized hardware such as stereo cameras or lidar units, but those tend to be expensive, complex, and take up valuable internal space. Doing it entirely in software is preferable in many ways, and that's what this paper, “Depth Pro,” is all about.
Aleksei Bochkovskii and colleagues share a method for zero-shot monocular depth estimation with high levels of detail. That means it uses a single camera, doesn't need to be trained on specific things (it works on, say, a camel despite never having seen one), and catches even tricky features like tufts of hair. It's almost certainly in use on iPhones right now (though probably in an improved, custom-built version), but if you want to try a little depth estimation of your own, you can do so using the code at this GitHub page.
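If you do, usage looks roughly like the following, based on the example documented in the apple/ml-depth-pro repository's README (the interface may have changed since, so check the current README); "example.jpg" is a placeholder for your own photo.

```python
# Depth estimation with Depth Pro, following the usage documented in
# the apple/ml-depth-pro README. Requires the package installed from
# that repo plus its downloaded checkpoint.
import depth_pro

# Load the pretrained model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an RGB image; f_px is the focal length in pixels, if the
# file's EXIF data provides one.
image, _, f_px = depth_pro.load_rgb("example.jpg")

# Run inference: returns metric depth plus an estimated focal length.
prediction = model.infer(transform(image), f_px=f_px)
depth_m = prediction["depth"]            # per-pixel depth in meters
focal_px = prediction["focallength_px"]  # estimated focal length
print(depth_m.shape, float(focal_px))
```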
This week's model
Google has released a new model in its Gemini family, Gemini 1.5 Flash-8B, which the company claims is among its best-performing.
A “distilled” version of Gemini 1.5 Flash, which was already optimized for speed and efficiency, Gemini 1.5 Flash-8B costs 50% less to use, has lower latency, and comes with 2x higher rate limits in AI Studio, Google's AI-focused developer environment.
“Flash-8B nearly matches the performance of the 1.5 Flash model launched in May across many benchmarks,” Google wrote in a blog post. “Our models [will] continue to be informed by developer feedback and our own testing of what is possible.”
Google says Gemini 1.5 Flash-8B is well suited for chat, transcription, translation, and other “simple,” “high-volume” tasks. In addition to AI Studio, the model is also available for free through Google's Gemini API, rate-limited to 4,000 requests per minute.
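Calling the model looks something like this with the google-generativeai Python SDK; the model identifier follows Google's announcement, but verify the current name in the Gemini API docs before relying on it.

```python
# Calling Gemini 1.5 Flash-8B through the Gemini API with the
# google-generativeai Python SDK. Model name per Google's
# announcement; confirm the current identifier in the API docs.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash-8b")

# A "simple, high-volume" style task: a short translation.
response = model.generate_content(
    "Translate to French: The batch job finished overnight."
)
print(response.text)
```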
Grab bag
Speaking of cheap AI, Anthropic launched the Message Batches API, a new feature that lets developers process large numbers of AI model queries asynchronously for less money.
Similar to batching requests against Google's Gemini API, developers using Anthropic's Message Batches API can submit batches of up to 10,000 queries. Each batch is processed within 24 hours and costs 50% less than standard API calls.
Anthropic says the Message Batches API is ideal for “large-scale” tasks such as dataset analysis, classification of large datasets, and model evaluation. “For example, analyzing an entire corporate document repository, which might involve millions of files, becomes more economically viable by leveraging [this] batching discount,” the company wrote.
The Message Batch API is available in public beta with support for Anthropic's Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku models.
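A submission looks roughly like this with the anthropic Python SDK. At launch the endpoint lived under the SDK's beta namespace (newer SDK versions may drop the `.beta` prefix), and the custom IDs and prompts here are illustrative.

```python
# Submitting a batch via Anthropic's Message Batches API using the
# anthropic Python SDK. Endpoint namespace and model ID reflect the
# public beta; check the current docs before use.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

batch = client.beta.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",  # your key for matching results
            "params": {
                "model": "claude-3-5-sonnet-20240620",
                "max_tokens": 256,
                "messages": [
                    {"role": "user",
                     "content": f"Classify the sentiment of document {i}."}
                ],
            },
        }
        for i in range(3)  # up to 10,000 requests per batch
    ]
)
# Results become available asynchronously, within 24 hours.
print(batch.id, batch.processing_status)
```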