OpenAI's legal battle with The New York Times over the data used to train its AI models may still be ongoing, but OpenAI is busy signing deals with other publishers, including some of the largest news publishers in France and Spain.
OpenAI announced Wednesday that it has signed agreements with Le Monde and Prisa Media to bring French and Spanish language news content to ChatGPT. OpenAI said in a blog post that the partnerships will put recent coverage from brands like El País, Cinco Días, As, and El Huffpost in front of ChatGPT users where relevant, and that the publishers' content will contribute to OpenAI's ever-expanding pool of training data.
OpenAI writes:
In the coming months, ChatGPT users will be able to interact with relevant news content from these publishers through selected summaries with attribution and enhanced links to the original articles, giving users access to additional information and related articles. We continually improve ChatGPT to support the news industry's important role in delivering real-time, trusted information to users.
The licensing agreements OpenAI has disclosed so far cover only a small number of content providers, which makes now a good time to take stock of them:
- Shutterstock (stock media library; images, videos, and music training data)
- The Associated Press
- Axel Springer (owner of Politico and Business Insider, among others)
- Le Monde
- Prisa Media
How much is OpenAI paying each partner? It hasn't said, at least not publicly. But we can make an estimate.
The Information reported in January that OpenAI is offering publishers between $1 million and $5 million annually for access to their archives to train its GenAI models. That doesn't tell us much about the Shutterstock partnership, but on the article-licensing side, assuming The Information's reporting is accurate and the figures haven't changed since then, OpenAI is paying between $4 million and $20 million a year for news.
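The back-of-the-envelope arithmetic behind that range can be sketched as follows; this is only an illustration, assuming the four news-licensing partners listed above (the Associated Press, Axel Springer, Le Monde, and Prisa Media) each pay within the per-publisher range The Information reported:

```python
# Rough estimate of OpenAI's annual news-licensing spend.
# Assumptions: four news partners, each paid $1M-$5M per year
# (the range reported by The Information); Shutterstock is excluded
# since that deal covers media rather than news articles.
NEWS_PARTNERS = ["Associated Press", "Axel Springer", "Le Monde", "Prisa Media"]
PER_PUBLISHER_USD = (1_000_000, 5_000_000)  # reported annual range per publisher

low = len(NEWS_PARTNERS) * PER_PUBLISHER_USD[0]
high = len(NEWS_PARTNERS) * PER_PUBLISHER_USD[1]
print(f"Estimated annual news spend: ${low:,} to ${high:,}")
# → Estimated annual news spend: $4,000,000 to $20,000,000
```

The real figures almost certainly vary by publisher, so this just bounds the total under a uniform-range assumption.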
This may be a drop in the bucket for OpenAI, which has raised over $11 billion in funding and whose annualized revenue recently topped $2 billion (per the Financial Times). But as Hunter Walk, a partner at Homebrew and co-founder of Screendoor, recently mused, these costs have the potential to price out AI rivals that are also pursuing licensing deals.
Walk wrote on his blog:
[I]f experimentation is gated by nine-figure licensing deals, we're doing a disservice to innovation… The checks being cut to "owners" of training data create a significant barrier to entry for challengers. If Google, OpenAI, and other large tech companies can set the cost high enough, they implicitly prevent future competition.
Now, it's debatable whether those barriers to entry exist today. Many, if not most, AI vendors have chosen not to license the data they use to train their AI models, incurring the ire of IP holders. For example, there is evidence that the art-generation platform Midjourney trains on stills from Disney movies, yet Midjourney has no agreement with Disney.
A more difficult question is whether licensing should simply be a cost of doing business, and of experimenting, in the AI field.
Walk would argue it shouldn't be. He advocates for a regulator-imposed "safe harbor" that protects any AI vendor, not just small startups and researchers, from legal liability so long as they adhere to certain transparency and ethical standards.
Interestingly, the UK recently sought to codify something along these lines, exempting text and data mining for AI training from copyright consideration so long as it was for research purposes. But those efforts ultimately failed.
Personally, I'm not sure I'd go as far as Walk's "safe harbor" proposal, given the impact AI could have on an already destabilized news industry. A recent model in The Atlantic found that if a search engine like Google integrated AI into search, it would answer a user's query 75% of the time without requiring a click-through to the source website.
But perhaps there is room for a carve-out.
Publishers should be compensated, and compensated fairly. But is there an outcome where they're paid, and where AI challengers as well as academics gain access to the same data as the incumbents? I'd like to think so. Grants are one way. Larger VC checks are another.
It's hard to say there's an easy solution, especially given that courts have yet to decide whether, and to what extent, fair use protects AI vendors from copyright claims. But it's important to hash these things out. Otherwise, the industry could end up with a kind of "brain drain," where only a few powerful companies have access to vast pools of valuable training data.