
Did OpenAI’s GPT-4o Model Use O’Reilly Media’s Paywalled Books Without Permission?

A new report by the AI Disclosures Project suggests that OpenAI may have trained its GPT-4o model using paywalled and non-public books from O’Reilly Media without permission.

Researchers tested 13,962 excerpts from 34 O’Reilly books to see how well GPT-4o and other models could recognize the content. The results show that GPT-4o recognized far more paywalled content than GPT-3.5 Turbo, OpenAI’s earlier model.

This raises serious concerns that OpenAI may have used unlicensed data in its training process. According to the paper, O’Reilly Media has not granted OpenAI any license to use its content.

The study used a method called DE-COP, which helps determine whether specific content was likely part of a model’s training data.

However, the authors acknowledge that this is not definitive proof: some of the data may have reached OpenAI because users pasted it into ChatGPT.
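The idea behind a DE-COP-style test can be sketched in a few lines. This is a rough illustration, not the researchers' code. It assumes the commonly described setup in which a model is shown one verbatim passage mixed with three paraphrases and asked to pick the original; a pick rate well above chance hints that the passage was seen in training. The names `build_quiz`, `detection_rate`, and `pick_option` are hypothetical.

```python
import random

# Assumption: four options per quiz (1 verbatim + 3 paraphrases),
# so random guessing succeeds 25% of the time.
CHANCE_RATE = 0.25


def build_quiz(verbatim, paraphrases, rng):
    """Shuffle the verbatim passage among its paraphrases.

    Returns (options, index_of_verbatim_passage).
    """
    options = [verbatim] + list(paraphrases)
    rng.shuffle(options)
    return options, options.index(verbatim)


def detection_rate(pick_option, quizzes):
    """Fraction of quizzes in which the model picks the verbatim passage.

    pick_option is a hypothetical stand-in for a model call: it takes the
    list of options and returns the index it believes is the original.
    """
    hits = sum(1 for options, answer in quizzes if pick_option(options) == answer)
    return hits / len(quizzes)
```

A rate close to `CHANCE_RATE` would suggest the model cannot tell the original from a paraphrase, while a much higher rate across many excerpts would point toward the text having been in the training data. As the authors note, a high rate is still not proof of how the text got there.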

OpenAI has not commented on the report. The company has previously faced lawsuits over its use of copyrighted material. It is also known to have signed paid licensing agreements with certain content providers and introduced opt-out mechanisms for others.

As AI companies compete for high-quality training data, such allegations raise new questions about their compliance with legal and ethical standards.
