Last updated on March 21st, 2024 at 08:04 am

Artificial intelligence firms face mounting pressure regarding their training data

OpenAI has stated that creating tools like its revolutionary chatbot ChatGPT would be unfeasible without access to copyrighted material, amid increasing pressure on artificial intelligence firms regarding their training data.

Chatbots like ChatGPT and image generators such as Stable Diffusion are “trained” using a vast dataset sourced from the internet, much of which is protected by copyright law—a legal safeguard against unauthorized use of someone’s work.

Last month, the New York Times filed a lawsuit against OpenAI and Microsoft, OpenAI's key investor, which uses OpenAI's tools in its own products, alleging "unlawful use" of its content to develop their products.

OpenAI, in a submission to the House of Lords communications and digital select committee, stated that training large language models like its GPT-4 model, which powers ChatGPT, would not be possible without access to copyrighted material.

“Because copyright now extends to nearly all forms of human expression—encompassing blog posts, images, forum discussions, snippets of code, and official documents—it would be unfeasible to train modern AI models without utilizing copyrighted content,” OpenAI explained in its submission, as initially reported by the Telegraph.

OpenAI further stated that restricting training data to books and drawings that are out of copyright would result in insufficient AI systems: “Restricting training data to public domain works created over a century ago might lead to an intriguing experiment but would not produce AI systems suitable for today’s requirements.”

In a blog post addressing the NYT lawsuit published on its website on Monday, OpenAI remarked, “We endorse journalism, collaborate with news organizations, and consider the New York Times lawsuit to be unfounded.”

Previously, the company stated that it respected “the rights of content creators and owners.” AI companies often rely on the legal doctrine of “fair use” to justify the use of copyrighted material, which permits the use of content in specific situations without obtaining the owner’s permission. In its submission, OpenAI expressed its belief that “legally, copyright law does not prohibit training.”

The lawsuit by The New York Times is one of several legal actions brought against OpenAI. In September, 17 authors, including John Grisham, Jodi Picoult, and George RR Martin, sued OpenAI, accusing the company of “systematic theft on a mass scale.”

Getty Images, which owns one of the world's largest photo collections, is suing Stability AI, the creator of Stable Diffusion, in the US and in England and Wales for alleged copyright infringement. In the US, a consortium of music publishers, including Universal Music, is suing Anthropic, the Amazon-backed company behind the Claude chatbot, for allegedly misusing a vast number of copyrighted song lyrics to train its model.

In its House of Lords submission, OpenAI expressed support for independent analysis of its security measures in response to a query about AI safety. The submission endorsed “red-teaming” of AI systems, a process in which third-party researchers evaluate the safety of a product by simulating the actions of malicious actors.

OpenAI is one of the companies that have agreed to collaborate with governments to conduct safety tests on their most advanced models both before and after deployment. This agreement was reached at a global safety summit in the UK last year.