A Matter of Ethics, Copyright, and Innovation
Multiple news outlets have barred the bot that "crawls" and scans the internet for ChatGPT's training material from accessing their content, a move that could severely limit the data available to train OpenAI's models.
As the conversation about the ethics and legality of web scraping intensifies, the decision by prominent publishers like the New York Times and CNN to block OpenAI's web crawler, GPTBot, from scraping their content deserves serious attention. This move may set a precedent for how we negotiate the boundaries between technological advancement and ethical considerations in the digital age.
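For readers curious about what such a block can look like in practice: the reports cited here do not describe each publisher's exact mechanism, but a common way to turn away a crawler is a rule in the site's robots.txt file. The sketch below assumes that approach. The robots.txt content, the news.example URL and the ExampleReaderBot name are hypothetical, chosen only for illustration; GPTBot is the user-agent name of OpenAI's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher (not copied from any real site).
# It tells the crawler that identifies itself as "GPTBot" to stay away from every path.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://news.example/2023/some-article"  # hypothetical URL
    print("GPTBot allowed:", is_allowed("GPTBot", article))
    print("ExampleReaderBot allowed:", is_allowed("ExampleReaderBot", article))
```

A robots.txt rule is a request rather than a technical barrier: it works only if the crawler chooses to honour it, which is part of why the question is as much about ethics as about engineering.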
Large language models (or LLMs), such as ChatGPT, require a staggering volume of data to simulate human-like interactions. While the prospect of a highly advanced, conversational AI is tantalising, the methodology behind these AI systems raises concerns. Companies like OpenAI are often evasive about whether copyrighted material forms part of their vast training datasets.
As first reported by The Verge, the New York Times swiftly revised its terms of service to prohibit the use of its content to train machine learning or AI systems. Such a move adds fuel to ongoing debates about intellectual property rights in the digital age, which go beyond OpenAI to encompass broader concerns about the unauthorised use of content. NPR reports that the New York Times is even considering legal action against OpenAI, suggesting that the paper might start a trend among publishers in how they respond to data scraping for AI training.
But one can't ignore the elephant in the room: the ethics surrounding mass data scraping, particularly when the companies involved are vague about the presence of copyrighted content.
CNN confirmed that it recently blocked GPTBot, while the Guardian reports that Reuters, another major player, emphasised that intellectual property is its "lifeblood" and must be protected. That position makes a strong point. In an age where content is increasingly digitised, traditional news outlets find themselves struggling to maintain revenue streams. Allowing potentially copyrighted material to be used to train AI models can be seen as another blow to the already beleaguered journalism industry.
At the other end of the spectrum, tech industry advocates in Australia and elsewhere argue for a more lenient approach toward AI and copyright laws. They caution that stringent copyright regimes could hinder technological advancement and economic investment in AI.
Herein lies the dilemma: How do we reconcile the need for innovation with ethical and legal imperatives?
Google has proposed that AI systems should be able to scrape the work of publishers unless they explicitly opt out.
In a recent update to its privacy policy, Google announced that it may use publicly available information to train its AI models and build products and features such as Google Translate, Bard and Cloud AI. The company also submitted recommendations to the Australian government, advocating for copyright systems that permit the "appropriate and fair use" of copyrighted material for AI training, along with options for opting out.
"We may collect information that's publicly available online or from other public sources to help train Google's AI models and build products and features, like Google Translate, Bard and Cloud AI capabilities." - Google PDF July 2023
Google's stance emphasises the need for a balanced copyright system that doesn't stifle innovation. However, the core issue remains: How do we balance the rapid advancement of AI technologies with ethical considerations and the rights of content creators?
Google’s policy update and its advocacy for flexible copyright laws in Australia hint at the broader challenges we face in establishing a regulatory framework that supports both technological innovation and ethical responsibility.
The decision by publishers to block or allow OpenAI's web crawlers could very well set a precedent.
It’s a complex issue that goes beyond the question of whether large language models like ChatGPT should be trained on copyrighted text.