Meta Pirated 81.7TB of Books to Train AI

Meta’s ongoing legal battle over its alleged misuse of copyrighted books to train AI models has taken a significant turn. Newly unsealed emails suggest that the social media giant engaged in large-scale torrenting of pirated books from shadow libraries, potentially undermining its legal defense.

The lawsuit, filed by a group of book authors, claims that Meta illegally trained its AI models using data obtained from sites like LibGen and Z-Library. While Meta previously admitted to downloading datasets from these sources, the full extent of the activity was unclear until now. Court filings reveal that Meta torrented at least 81.7 terabytes of data from shadow libraries, including 35.7 terabytes specifically from Z-Library and LibGen. The filings also allege that Meta had previously torrented a separate 80.6 terabytes from LibGen alone.

The scale of the alleged infringement is staggering. The authors’ legal team emphasized that even vastly smaller instances of data piracy have led to criminal investigations. This raises the stakes for Meta, which now faces not only civil litigation but potential legal consequences beyond financial penalties.

Did Meta Know It Was Breaking the Law?

Internal communications from Meta staff suggest growing concerns about the legality of their data acquisition methods. In an April 2023 email, Meta research engineer Nikolay Bashlykov voiced unease about using corporate laptops to torrent pirated books, even adding a smiley emoji to his message:

"Torrenting from a corporate laptop doesn’t feel right. 🙂"

By September 2023, Bashlykov escalated his concerns, warning that using torrents would involve “seeding” the files—essentially sharing the pirated content with others—which could have serious legal implications.

Despite these warnings, the company allegedly continued its activities, taking steps to obscure its methods. Meta researcher Frank Zhang described efforts to keep the torrenting in “stealth mode”, while Meta executive Michael Clark admitted in a deposition that settings were modified to minimize seeding.

Meta’s Defense: Fair Use or Strategic Concealment?

Meta has consistently argued that its AI training practices fall under fair use, a legal doctrine that allows limited use of copyrighted material under specific circumstances. However, the new evidence challenges this stance.

One of the most damaging revelations is that Meta allegedly took deliberate steps to hide its torrenting activity, avoiding the use of Facebook servers so that the downloads could not be traced back to the company. This contradicts earlier testimony and suggests a conscious effort to circumvent copyright laws.

Additionally, the authors claim that Meta CEO Mark Zuckerberg was involved in the decision to use LibGen, despite his previous statements distancing himself from the matter. Unredacted messages reportedly show that the decision was made after an escalation to Zuckerberg, raising questions about executive-level awareness and approval of the practice.

What’s Next for Meta?

The latest evidence has prompted the authors to demand further depositions of Meta employees involved in the decision-making process. Meta, on the other hand, maintains that the allegations are overblown, arguing that there’s no proof that any copyrighted books were redistributed to third parties as a result of its torrenting.

Still, the legal landscape is shifting. With the new information bolstering claims of direct copyright infringement, Meta may find it increasingly difficult to rely on its fair use defense. And with discovery on the seeding issue still ongoing, the company may have more legal battles ahead as it attempts to “set the record straight.”

For now, Meta faces a deepening legal crisis that could have significant implications for how AI companies acquire and train on copyrighted material. If the court finds Meta’s actions unlawful, it could set a precedent affecting the entire AI industry’s approach to data sourcing and copyright compliance.
