
Newly unsealed court documents indicate that Meta employees debated internally whether to acquire copyrighted materials through questionable means to train the company’s AI models.
These documents were submitted as part of the Kadrey v. Meta lawsuit, one of several legal battles surrounding AI and copyright issues. Meta argues that training its AI models on IP-protected works, particularly books, falls under “fair use,” while plaintiffs—including authors Sarah Silverman and Ta-Nehisi Coates—strongly disagree.
Earlier filings in the case suggested that Meta CEO Mark Zuckerberg had approved the use of copyrighted content for AI training and that the company had halted negotiations with publishers regarding data licensing. However, the latest disclosures, including internal work chats, provide a more detailed picture of how Meta may have incorporated copyrighted data into its training sets.
Internal discussions on copyrighted data usage
According to chat logs from February 2023, Meta research engineer Xavier Martinet suggested bypassing licensing agreements with publishers and instead purchasing books outright to build training datasets. He advocated a “move fast and ask for forgiveness later” approach and proposed that executives make the final decision on the matter.
While some employees warned that using copyrighted materials without proper authorization could lead to legal challenges, Martinet downplayed the risk, arguing that numerous startups were likely already training AI models on pirated books.
Meanwhile, Melanie Kambadur, a senior manager on Meta’s Llama research team, mentioned that the company was in talks with platforms like Scribd for licensing agreements. She also noted that Meta’s legal team had become “less conservative” in approving the use of publicly available data for AI training, signaling a shift in the company’s risk tolerance.
Libgen and the use of unauthorized content
The court filings further reveal discussions about leveraging Libgen, a controversial site that provides access to copyrighted books without publisher authorization. Despite numerous legal actions against Libgen, Meta employees debated whether using its data could give the company a competitive edge.
In an email to Meta AI VP Joelle Pineau, product management director Sony Theakanath described Libgen as “essential” for achieving state-of-the-art (SOTA) performance across AI benchmarks. He also outlined potential strategies to mitigate legal risks, including filtering out files explicitly marked as “stolen” or “pirated” and refraining from publicly disclosing the use of Libgen data in training.
Additionally, Meta’s AI team reportedly configured models to reject potentially risky copyright-related prompts, such as requests to reproduce specific pages of books or disclose training data sources.
New allegations and Meta’s legal defense
The filings also suggest that Meta may have used Reddit data for training purposes, possibly by replicating the data collection behavior of third-party apps. This comes after Reddit announced in April 2023 that it would begin charging AI companies for access to its content.
Further discussions within Meta indicate that the company’s proprietary training datasets—including user-generated content from Facebook and Instagram—were insufficient for building competitive AI models. As a result, Meta leadership considered reversing previous decisions that had restricted the use of licensed books, scientific articles, and other high-quality data sources.
Plaintiffs in the case argue that Meta systematically compared pirated books with legally available ones to determine whether acquiring licenses would be necessary.
Recognizing the high stakes of the case, Meta has strengthened its legal team by adding Supreme Court litigators from the law firm Paul Weiss. As of now, Meta has not issued a formal statement on the matter.