OpenAI: Give Us Your Content or Die

April 30, 2024 E Jazz Reporter

The Financial Times announced a deal with OpenAI on Monday to license its world-class journalism for training and informing ChatGPT’s models. It joins Axel Springer and the Associated Press, which struck similar deals in which OpenAI reportedly offers millions for the right to use content. However, ChatGPT was trained on lots of other web-scraped content that OpenAI did not pay for. So why is OpenAI paying for some datasets and not others?

OpenAI’s licensing deals seem to send a clear message: we’re going to use your content anyway, so sign a deal with us or get left behind. The main perk of a licensing deal seems to be a prominent spot in ChatGPT’s answers. Some publishers may also want to solidify a relationship with the next big information distribution channel before it takes over. However, it seems OpenAI is using a lot of publishers’ content anyway.

OpenAI already trains its AI models in part on “publicly available data,” according to CTO Mira Murati, a phrase that seems purposefully vague. What is publicly available data anyway? The phrase assumes anything free to read on the internet is also free to build into ChatGPT. For instance, Gizmodo is part of OpenAI’s “publicly available data.” Our website was cached over 34,000 times in GPT-2’s WebText dataset, the last dataset OpenAI disclosed using to train an AI model.

Gizmodo is free for readers largely due to the ads on this webpage. If readers can access our content through ChatGPT, that breaks our business model. The New York Times, which appears significantly more often in GPT-2’s WebText dataset, sued OpenAI for copyright infringement over this very matter.

A content licensing deal with OpenAI seems like the only way for publishers to stay relevant in the AI era. In a press release, the Financial Times Group CEO John Ridding says this deal “will broaden the reach” of their work while offering “early insights into how content is surfaced through AI.”

“The thing about AI is it’s not really artificial intelligence,” said Matthew Butterick, a lawyer representing Sarah Silverman and other book authors suing OpenAI, in an interview with Gizmodo. “It’s human intelligence which has been harvested from one place, divorced from its creators, then this big tech company puts a price tag on it and sells it to someone else.”

Butterick represents the plaintiffs in six copyright lawsuits against AI companies. He’s also a writer, coder, and designer, so he says he understands how AI can threaten these industries. Generally speaking, his cases center on a claim that AI simultaneously uses the work of creators and threatens their livelihood.

OpenAI’s licensing deals have raised eyebrows about the content ChatGPT uses for free. Tech companies have argued that generative AI is a “fair use” of copyrighted works because it transforms them into something new. The AI world has also argued that it’s using a similar model to Google Search, which caches copyrighted content to create a useful, information-finding tool. Similar to Google, AI chatbots have recently started including hyperlinks. Ultimately, a court will have to decide whether generative AI is a “fair use.”

OpenAI did not immediately respond to Gizmodo’s request for comment.

Book authors and publishers are not the only ones OpenAI seems to be taking content from. The New York Times recently reported that OpenAI trained GPT-4 on over one million hours of transcribed YouTube videos. Days before the report came out, YouTube’s CEO said using its videos for AI training would be a “clear violation” of its policies.

OpenAI’s content licensing deals muddy the waters of the discussion. The company is using some internet content for free while paying other publishers for their work. Other tech companies, such as Apple, have reportedly been more proactive about paying for all their training data. Adobe reportedly paid $3 per minute of video to train its AI video generator.

However, it’s unclear if even a one-time payment for obtaining AI training data is sufficient. We’re talking about a tool that could potentially upend the media industry for writers, audio and video producers, and more. Signing a deal with OpenAI could guarantee you a good spot in ChatGPT’s results, but it seems like the AI chatbot may have been using your content anyway. At least for now, AI companies are keen to use everything on the internet and ask questions about the legality of it all later.