Reddit's prospects as a publicly traded company may depend more than expected on its relationships with AI vendors like OpenAI.
In its IPO prospectus filed today with the U.S. Securities and Exchange Commission, Reddit repeatedly emphasized how much it has gained, and expects to keep gaining, from data licensing agreements with companies that train AI models on its more than 1 billion posts and more than 16 billion comments.
“In January 2024, we entered into certain data license agreements with a total agreement amount of $203 million and terms ranging from two to three years,” the prospectus states. “We expect to recognize at least $66.4 million in revenue for the year ending December 31, 2024 and the remainder thereafter.”
So far, it's a mystery which AI vendors are licensing data from Reddit. Earlier this week, Bloomberg and Reuters reported that an “unknown large AI company” (likely Google) had signed a licensing deal worth about $60 million annually. But OpenAI wouldn't be a surprising customer either, especially considering that OpenAI CEO Sam Altman owns an 8.7% stake in Reddit (making him its third-largest shareholder) and was once a member of the company's board of directors.
Why is Reddit's data valuable? Reddit explains that AI models “learn” to write essays, code, emails, articles, and more from examples, and vendors like OpenAI scrape these examples from the web, collecting millions to billions of them and adding them to training sets. Some examples are in the public domain. Others, as in the case of Reddit content, may be subject to restrictive licenses that require citation or certain forms of compensation.
Reddit had not previously restricted access to its data for AI training purposes. But the company reversed course last year, asserting that its data, in the words of CEO Steve Huffman, shouldn't be “[given] free to some of the world's largest companies.”
“[Our] data APIs can provide real-time access to evolving and dynamic topics such as sports, movies, news, fashion, and the latest trends,” the prospectus continues. “We believe that Reddit's vast corpus of conversation data and knowledge will continue to play a role in training and improving language models at scale. Our content is updated and growing daily, so we expect models to reflect these new ideas and update their training using Reddit data.”
Content creators, from stock media libraries to news publishers, are increasingly relying on data licensing agreements with AI vendors as chatbots like OpenAI's ChatGPT and Google's Gemini threaten to dry up their traffic. A recent model from The Atlantic found that if search engines like Google integrate AI into search, they could answer users' queries 75% of the time without requiring a click-through to a website.
Meanwhile, vendors are facing a slew of lawsuits alleging that they have no legal right to train models on data without permission or payment, which is prompting them to seek licensing agreements. The New York Times recently accused OpenAI of using its copyrighted material to effectively create a competitor to news publishers, to the detriment of the Times' business.
OpenAI, for example, has licensing deals with stock image library Shutterstock as well as publishers including Politico and Axel Springer, owner of Business Insider. However, the licenses are reportedly quite small, worth at most $5 million per year.