Plaintiffs' attorneys in a copyright lawsuit filed against Meta allege that Meta CEO Mark Zuckerberg gave the team behind the company's Llama AI models the green light to use a dataset of pirated e-books and articles for training.
The case, Kadrey v. Meta, is one of many lawsuits against major technology companies developing AI that accuse them of training models on copyrighted works without permission. In most cases, defendants like Meta have claimed they are protected by fair use, a U.S. legal doctrine that allows copyrighted works to be used to create something new, as long as the result is sufficiently transformative. Many creators reject that argument.
In newly unredacted documents filed late Wednesday in the U.S. District Court for the Northern District of California, the plaintiffs in Kadrey v. Meta, who include best-selling authors Sarah Silverman and Ta-Nehisi Coates, recount Meta's testimony from late last year. That testimony revealed that Zuckerberg approved Meta's use of a dataset called LibGen for Llama-related training.
LibGen describes itself as a “link aggregator” and provides access to works from publishers such as Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued multiple times for copyright infringement, ordered to shut down, and fined tens of millions of dollars.
According to Meta's testimony, as relayed by the plaintiffs' attorneys, Zuckerberg cleared the use of LibGen to train at least one of Meta's Llama models, despite concerns from Meta's AI executive team and others at the company. The filing quotes a Meta employee as calling LibGen a “dataset known to be pirated” and warning that its use “may undermine [Meta's] negotiating position with regulators.”
The filing also cites a memo to Meta AI decision-makers, which states that after “escalation to MZ,” Meta's AI team “[was] approved to use LibGen.” (MZ here is evidently shorthand for “Mark Zuckerberg.”)
The details appear to be consistent with a New York Times report from April last year that suggested Meta was cutting corners in collecting data for AI. At one point, Meta was hiring contractors in Africa to aggregate summaries of books and was considering acquiring the publisher Simon & Schuster, the paper said. But company executives decided that licensing negotiations would take too long and reasoned that fair use was a solid defense.
Wednesday's filing contains new accusations, including that Meta may have tried to conceal its alleged infringement by stripping attribution from the LibGen data.
According to the plaintiffs' lawyers, Nikolay Bashlykov, a Meta engineer working on the Llama research team, wrote a script that removed copyright information, including the words “copyright” and “acknowledgments,” from LibGen's e-books. Separately, Meta allegedly removed copyright markers from scientific journal articles and “source metadata” from the training data used for Llama.
“This discovery suggests that Meta strips [copyright information] not just for training purposes,” the filing states, “but also to conceal its copyright infringement, because stripping copyrighted works' [copyright information] prevents Llama from outputting information that might alert Llama users and the public to Meta's infringement.”
According to the latest filing, Meta also revealed during a deposition that it torrented LibGen, a move that gave some Meta research engineers pause. Torrenting, a method of distributing files on the web, requires torrenters to simultaneously “seed” or upload the files they are trying to retrieve.
Plaintiffs' attorneys argue that by torrenting LibGen, Meta effectively committed another form of copyright infringement and helped spread its content. Meta also tried to hide its activities by minimizing the number of files it uploaded, lawyers allege.
According to the filing, Meta's head of generative AI, Ahmad Al-Dahle, “cleared the path” for torrenting LibGen, brushing aside Bashlykov's reservations that doing so might not be “legally OK.”
“Had Meta bought plaintiffs' works in a bookstore or borrowed them from a library and trained its Llama models on them without a license, it would have committed copyright infringement,” plaintiffs' lawyers wrote in the filing. “Meta's decision to circumvent legal methods of obtaining books and knowingly participate in illegal torrent networks constitutes evidence of copyright infringement.”
The case against Meta has not yet been decided. For now, it concerns only Meta's earliest Llama models, not its recent releases. And if the court is persuaded by Meta's fair use argument, it may well rule in the company's favor.
But the allegations reflect poorly on Meta, as the judge presiding over the case, Judge Thomas Hixson, noted in an order Wednesday denying Meta's request to redact much of the filing.
“It is clear that Meta's sealing request is not designed to protect against the disclosure of sensitive business information that competitors could use to their advantage,” Hixson wrote. “Rather, it is designed to avoid negative publicity.”
We have reached out to a Meta spokesperson for comment and will update this article if we hear back.