Meta CEO Mark Zuckerberg appears to have used YouTube's battle to remove pirated content to defend his company's use of a dataset containing copyrighted e-books, according to newly released excerpts from a deposition he gave late last year.
The deposition, which was included in a complaint filed with the court by plaintiffs' lawyers, relates to the AI copyright case Kadrey v. Meta. It is one of many similar cases in the U.S. court system pitting AI companies against authors and other intellectual property holders. In most of these lawsuits, the AI companies named as defendants claim that training on copyrighted content is “fair use.” Many copyright holders disagree.
“For example, I think YouTube might end up hosting content that people pirate for a period of time,” Zuckerberg said during the deposition, adding that YouTube is trying to remove that content. “And I think most of what's on YouTube is reasonably good and licensed.”
Excerpts from Zuckerberg's deposition provide insight into his thinking on copyrighted content and fair use. However, it should be noted that the full transcript of the deposition has not been made public. TechCrunch has reached out to Meta for additional information and will update the article if we hear back from the company.
Based on the excerpts of testimony, Zuckerberg appears to be defending Meta's use of LibGen, a dataset of e-books, to develop its family of AI models known as Llama. Meta's Llama competes with flagship models from AI companies such as OpenAI.
LibGen describes itself as a “link aggregator” and provides access to works from publishers such as Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued multiple times for copyright infringement, ordered to shut down, and fined tens of millions of dollars.
According to court filings made public this week, Zuckerberg is said to have authorized the use of LibGen to train at least one of Meta's Llama models, despite concerns among the company's AI executives and research team about the legal ramifications.
Lawyers for the plaintiffs, who include best-selling authors Sarah Silverman and Ta-Nehisi Coates, allege that Meta employees called LibGen “a data set that is known to be pirated” and warned that its use could undermine Meta's negotiating position with regulators, according to legal filings.
Zuckerberg claimed during his deposition that he had “actually never heard of” LibGen.
“I know you're trying to get me to give an opinion on LibGen, but I haven't heard much about it,” Zuckerberg said during his deposition. “It's just that I don't have the knowledge about that specific thing.”
Under questioning from David Boies, one of the plaintiffs' lawyers, Zuckerberg explained why it would be unreasonable to ban the use of datasets like LibGen.
“So, do we want to have a policy that prohibits the use of YouTube because some content may be copyrighted? No,” he said. “[T]here are cases in which such a blanket ban is not appropriate.”
Zuckerberg said Meta should be “very careful” about training on copyrighted material.
“You know, [if there's] someone offering a website and they are intentionally trying to violate people's rights … obviously, that's something we want to be cautious about and careful about how we engage with, and in some cases even prevent our teams from engaging with it,” Zuckerberg said in his deposition, the transcript shows.
New allegations
Plaintiffs' lawyers in Kadrey v. Meta have amended the complaint several times since it was filed in 2023 in the U.S. District Court for the Northern District of California, San Francisco Division. The latest amended complaint, filed by plaintiffs' attorneys late Wednesday, includes new allegations against Meta, among them that the company cross-referenced certain pirated books in LibGen with copyrighted books available for license. The lawyers argue that Meta did this to determine whether it made sense to pursue licensing agreements with publishers.
According to the amended filing, Meta used LibGen to train its latest family of Llama models, Llama 3. The plaintiffs also claim that Meta is using the dataset to train its next-generation Llama 4 models.
The amended filing also alleges that Meta researchers tried to conceal that Llama models were trained on copyrighted material by inserting “supervised samples” into Llama's fine-tuning. It further alleges that as recently as April 2024, Meta downloaded pirated e-books for Llama training from another source, Z-Library.
Z-Library (or Z-Lib) has been the subject of a number of legal actions brought by publishers, including domain seizures and takedowns. In 2022, the Russian nationals who allegedly maintained it were indicted on charges of copyright infringement, wire fraud, and money laundering.