The Truth Behind the LLM Battle
A recent lawsuit filed by The New York Times (NYT) against OpenAI, the creator of ChatGPT, has sparked a new conflict between major technology companies and news publishers. The NYT has raised concerns about the fundamental principles behind the training of the large language models (LLMs) that power tools like ChatGPT.
The case could have significant implications for news media organizations, because it may influence how courts value their contribution to the training of language models. While many generative AI companies are addressing copyright issues retroactively, Apple has taken a proactive approach by engaging in commercial discussions with publishers and signing multimillion-dollar agreements to use licensed content.
OpenAI’s Approach and the New York Times Lawsuit
We will analyze the contrasting strategies employed by OpenAI and Apple when it comes to utilizing news content for training LLMs.
The copyright infringement lawsuit filed by the New York Times against OpenAI and its primary investor, Microsoft, alleges that OpenAI used millions of articles published by the news organization to train its chatbots, and accuses it of engaging in “wide-scale copying.” According to the lawsuit, OpenAI’s chatbots now compete with the New York Times as a source of information. The lawsuit also states that Google and Wikipedia, along with the New York Times, are among the largest sources of data scraped from the internet by Common Crawl, a non-profit web crawler, and that data from these sources was used to partially train OpenAI’s GPT-3 models.
The New York Times claims that OpenAI’s generative AI tools can produce output that directly quotes Times content, closely summarizes it, or mimics its expressive style. The lawsuit further asserts that OpenAI’s tools undermine the relationship between the Times and its readers, resulting in a loss of subscription, licensing, advertising, and affiliate revenue for the media company. The New York Times also states that its paywall was breached, directly affecting its business model.
This is the first legal action of its kind taken by a prominent news publisher, although music labels, authors, and others have previously filed lawsuits over alleged copyright infringement.
OpenAI’s Openness vs. Apple’s Ecosystem in the LLM Race
According to a report by the New York Times on December 22, Apple has proposed multiyear agreements valued at a minimum of $50 million to obtain licenses for news article archives. The tech giant has reached out to media companies like Conde Nast, which publishes renowned magazines such as Vogue and The New Yorker, as well as NBC News and IAC, the owner of People, The Daily Beast, and Better Homes and Gardens.
While concerns remain about some of the terms Apple has offered, executives at publishing firms have expressed approval of Apple’s approach of seeking permission to use content before building its generative AI model. This stands in contrast to other platforms that seek deals only after they have already trained their LLMs, as reported by the New York Times.
Several news organizations, including the NYT, BBC, CNN, Vox Media, and Reuters, have already blocked OpenAI’s crawler from accessing their content. Meanwhile, OpenAI has formed a partnership with Axel Springer, a German multinational media company that publishes Business Insider and Politico, to allow their content to be shared with ChatGPT. The Associated Press entered into a comparable agreement in July.
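Publishers typically block a crawler through a site's robots.txt file, which names a crawler's user agent and the paths it may not fetch. The sketch below is illustrative only: the robots.txt content and the example URL are hypothetical and not taken from any publisher named here, and it assumes the crawler identifies itself as GPTBot, the user agent name OpenAI documents for its web crawler. It uses Python's standard-library robots.txt parser to check whether a given crawler would be permitted to fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher (not a real site's file).
# It blocks the GPTBot user agent everywhere and allows all other crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL on a reserved example domain.
    article_url = "https://news.example.com/2023/12/some-article"
    print(is_allowed("GPTBot", article_url))        # False: blocked by the GPTBot rule
    print(is_allowed("SomeOtherBot", article_url))  # True: falls through to the * rule
```

Note that robots.txt is a voluntary convention: it signals a publisher's wishes to well-behaved crawlers but does not technically prevent access, which is part of why disputes like this one end up in licensing talks or in court.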
Spill the beans! What’s the whole story?
Generative AI companies use a technique known as web scraping to gather vast amounts of data from the internet. This data serves as the raw material for training their LLMs. Once trained, these models enable chatbots such as ChatGPT, Google’s Bard, and xAI’s Grok to hold human-like conversations with users. Generative models can also produce images and sounds from prompts.
In June 2023, OpenAI was hit with a lawsuit accusing the company of extracting more than 300 billion words from the internet to train its software. The extracted content reportedly includes articles, posts, websites, books, and personal information, all obtained without user consent. The New York Times (NYT) revealed that it had initially approached OpenAI and Microsoft in April, expressing concerns about the usage of its material. The NYT even explored the possibility of establishing a commercial agreement and implementing “technological guardrails” for generative AI products. Unfortunately, these discussions failed to reach a resolution, leading the publisher to take legal action.
The New York Times (NYT) case, along with several similar lawsuits in the United States, aims to tackle this precise issue. Legal experts emphasize that these lawsuits will ultimately test the copyright laws of different jurisdictions. “Most copyright laws take into account the fair use doctrine when determining whether there is an infringement or not…the methods of testing this doctrine may differ, but the general concept is that certain uses of copyrighted material may be considered fair.”