Research Suggests OpenAI Utilizes Copyrighted Material for AI Model Training

Thursday, Apr 3, 2025

A recent investigation by the AI Disclosures Project has brought to light concerns regarding the datasets employed by OpenAI for training its large language models (LLMs). The study suggests that the GPT-4o model from OpenAI shows a "strong recognition" of proprietary and protected information from O’Reilly Media publications.

Helmed by technologist Tim O’Reilly and economist Ilan Strauss, the AI Disclosures Project aims to mitigate the adverse societal effects of AI's commercial proliferation by pushing for more corporate and technological openness. Their paper draws attention to the lack of transparency in AI practices, comparing it to financial disclosure standards that help maintain healthy securities markets.

The research used a legally sourced collection of 34 copyrighted O'Reilly Media books to assess whether OpenAI's LLMs were trained on copyrighted material without permission. The team applied the DE-COP membership inference attack, which tests whether a model can distinguish original human-authored O'Reilly texts from LLM-generated paraphrases of them.
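The core idea behind DE-COP is a multiple-choice quiz: the model is shown a verbatim excerpt alongside paraphrases and asked to pick the verbatim one. If it identifies the original far more often than chance, that is evidence the passage appeared in its training data. The sketch below illustrates the mechanic, assuming a hypothetical `pick_option` callable standing in for an LLM call; it is an illustration of the general approach, not the paper's actual implementation.

```python
import random

def decop_quiz(original: str, paraphrases: list[str], pick_option) -> bool:
    """Shuffle the original passage in among paraphrases, ask the model
    (via pick_option) which option is verbatim, and report a hit/miss."""
    options = [original] + paraphrases
    random.shuffle(options)
    labels = "ABCD"[: len(options)]
    prompt = "Which option is the verbatim excerpt from the book?\n"
    for label, text in zip(labels, options):
        prompt += f"{label}) {text}\n"
    answer = pick_option(prompt, labels)  # hypothetical LLM wrapper
    return options[labels.index(answer)] == original

def guess_rate(passages, pick_option, trials: int = 100) -> float:
    """Fraction of quizzes where the model picks the original.
    With 4 options, rates well above 0.25 suggest memorization."""
    hits = 0
    for _ in range(trials):
        original, paraphrases = random.choice(passages)
        hits += decop_quiz(original, paraphrases, pick_option)
    return hits / trials
```

A model with no knowledge of the book should score near the 25% chance baseline, while one trained on the text will score substantially higher; the gap between the two is the membership signal.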

The investigation yielded several notable findings.

The researchers suggest the books may have been obtained via the LibGen database, where all 34 O'Reilly titles were available. They also note that while newer LLMs are better at distinguishing human-authored from machine-generated language, this improvement does not undermine the method's ability to classify training data.

The study acknowledges the possibility of "temporal bias" in the results, since language evolves over time. As a countermeasure, the researchers evaluated two models (GPT-4o and GPT-4o Mini) trained on data from the same period.

The findings, though specific to OpenAI and the O'Reilly Media works, likely point to a broader issue surrounding the use of copyrighted data in AI training. The report argues that relying on uncompensated training material could erode the internet's content quality and diversity as the financial avenues for professional content production decline.

The AI Disclosures Project underscores the need for stronger accountability around the data used to train AI models. It proposes liability provisions that encourage companies to disclose their data sources, a step toward enabling commercial markets for training-data licensing and compensation.

The EU AI Act's disclosure requirements, if precisely defined and enforced, could set off a virtuous cycle of disclosure standards. For content creators, knowing when their intellectual property has been used in model development is essential to the emergence of functioning markets for AI training data.

Despite signs that AI companies may be acquiring data unlawfully for model training, a market has begun to form in which AI developers pay for content through licensing arrangements. Companies like Defined.ai facilitate such purchases by securing consent from data suppliers and stripping out personal information.

The report concludes that using data from 34 proprietary O’Reilly Media books substantiates claims that OpenAI's GPT-4o was likely trained using non-public, copyrighted content.

(Image by Sergei Tokmakov)
