Deciphering The New York Times Lawsuit Against OpenAI And Microsoft: The Reality Is Always Different From The Press Release
According to LangChain’s State of AI 2023, OpenAI and OpenAI on Azure are the two most used LLM providers. The New York Times believes it deserves a slice of the LLM pie. OpenAI and the NYT have been in discussions over compensation since early this year (2023). While OpenAI has reached agreements with many other outlets, they haven’t reached one with the NYT.
Apple has paid as much as $50 million for multi-year deals with publishers. That’s probably close to what OpenAI offered, and my speculation is the NYT wants more. More could mean a longer deal or profit sharing, but the bottom line is the NYT wants a lot more money than OpenAI was offering, or they wouldn’t have taken this route.
Fighting For A New Pricing Model
In my AI Product Management Certification course, I teach pricing strategies for data products like the NYT dataset (I use X and Reddit for my use cases). I harp on pricing so much because all the companies that have taken the flat pricing offers will wish they hadn’t. They don’t realize how much their data is really worth or how small their monetization window is. In 2-3 years, their datasets will be unnecessary for foundational model companies. (If there’s interest in understanding why, let me know in the comments. I’ll build another post to explain.)
Profit sharing or royalties is a pricing model that makes the most sense for data products that foundational model providers use. It values the data in line with how the foundational model provider monetizes it. In the digital paradigm, companies charge for API access (subscription) or sell the dataset download as a one-time charge.
Data is a novel asset class, and businesses must price data products to reflect novel monetization paradigms. One dataset can be used to train multiple foundational models that are monetized multiple times. Selling access to the data with the digital pricing paradigms leaves a lot of money on the table. Profit sharing or royalties gives dataset owners recurring revenue that aligns with the way foundational model providers get paid.
In my view, that’s what the NYT wants and what OpenAI isn’t willing to give up. Every other data provider will want something similar if they reach a royalties deal with the NYT. Those deals will eat into profit margins for foundational models that are already expensive to build and operate.
The Lawsuit And Setting Data Pricing Standards
The NYT filed suit in the Southern District of New York against Microsoft and OpenAI. The first piece that jumps out at me is the request for a jury trial. The NYT believes a jury is more likely to side with it on a royalties deal than a judge is.
Judges are more rational and act from precedents. Juries are much more unpredictable, and their awards are often higher when they see a case of a big business taking advantage of others. The request for a jury trial increases the uncertainty factor around the potential outcome and makes it more likely that a new precedent could be set.
The NYT also went out and found top-tier attorneys. Susman Godfrey has one of the best track records against Big Tech. This lawsuit has been well thought out, and there’s more substance here than most have discussed. Most discussions are one-sided and miss the deeper implications.
The NYT doesn’t claim that OpenAI stole its data. It believes OpenAI got the data from the Common Crawl or other legal methods (like internet searches). The suit states that the Common Crawl is the most highly weighted dataset in GPT-3, so the NYT dataset significantly contributes to the LLM’s training and capability.
What’s being laid out is the extent to which OpenAI’s models owe some of their accuracy to the NYT’s data. This supports the NYT’s case for royalties vs. a flat fee or subscription. They will likely argue that the contribution should correspond to the revenue share. In talking to colleagues who are lawyers, the NYT is setting up a rebuttal to any OpenAI claims that it attempted to negotiate an agreement in good faith.
The NYT understands that it must support its valuation vs. agreements already signed with other data providers. If the NYT establishes its valuation in court, OpenAI (and every other foundational model provider) will be on the hook for much higher data contract rates. I wonder if that’s what the NYT is genuinely after.
Today, a small group of foundational model providers is setting the price below fair market value because they treat data like a digital asset when it’s not. The NYT stands to gain a significant sum but also is advocating for every other content provider and setting a new standard. The NYT is attempting to tie the dataset’s value to the accuracy improvement it creates in the model.
It’s a version of the with-and-without valuation method. Foundational model performance is benchmarked before adding a dataset (without). It is retrained or rebuilt with the dataset, and a new benchmark is taken (with). The difference between the benchmarks (with vs. without) determines the share of royalties due to the data provider.
There are several proposed approaches to data valuation. Here are 2 articles if you want to go deeper into them. Article 1. Article 2.