Is Data Still A Competitive Advantage?
Early bird pricing on my instructor-led AI Product Management Certification ends Friday. Act now to take advantage of the discount. Subscriber office hours this Wednesday will be an hour later than usual, 9 am PT. I’m attending the launch of SAP’s next-generation data and AI platform. You can join the live stream too. The link remains the same. I hope to see you all there.
Scott DeGeest (Principal Data Scientist at Interos and part of my current Data and AI Product Management Certification cohort) created a detailed rundown of the case for and against data as a competitive advantage. He asked me to respond to the points on both sides and explain my position relative to them.
The question and his treatment of the subject are so good that I want to share my answer with this community. It addresses Scott’s main points and the possibilities he considered as potential answers.
In my self-paced courses, I emphasize that questions are often more valuable than the content itself. This is an excellent example of why. Questions allow me to contextualize and personalize the frameworks with specific examples. Students show mastery of the concepts when they begin to synthesize the frameworks into more significant concepts like this one.
Here’s his original breakdown, or skip the italics to get straight into my response.
The core of the question I’m working on right now: When and how is data generation a source of competitive advantage?
“Data is a source of competitive advantage” (Pro argument):
We’ve been talking about how data generation is a potential source of competitive advantage, and I’ve seen similar arguments that because AI models are easy enough to replicate, data is the sole source of competitive advantage in the current paradigm.
“Data isn’t a reliable source of competitive advantage” (Con argument):
And yet, I’m open to the argument that data moats are not really all that useful in generating such a competitive advantage.
https://creativeventures.vc/2021/01/14/the-fall-of-data-moats/
https://a16z.com/the-empty-promise-of-data-moats/
I don’t yet believe in a contradiction here – if one can contain multitudes, then so can one’s ideas about data.
So, I’m trying to think about context here and parse the more subtle structures of meaning.
Possibilities I am considering here:
Though both arguments use the term “data,” each means something different.
There’s a difference between data acquisition and data mobilization.
From that idea, the difference between having data and using it, I would probably map this to your capabilities maturity model.
Maybe something like: “Data acquisition puts an org at L1 or L2 maturity, and data mobilization puts an org in late L2 or L3 maturity.”
Though both arguments use the term “competitive advantage,” each means something different.
Similar to the point you made about the end of sustainable competitive advantage, the window during which access to a dataset is itself a competitive advantage is shorter than it used to be.
A data capability, the ability to leverage new datasets as they become useful, provides a longer-term advantage.
Both articles I reference frame network effects as the primary source of competitive advantage and say data doesn’t, in and of itself, produce network effects.
I also think this issue ties back to my thoughts about the difference between data generation and context generation from data.
One of the things I say to data scientists I train is that there’s always a software engineer better than you at data algorithms, so your superpower must be generating context from data concatenation.
Another way to say that is “data scientists have to be good at getting disparate data sets to talk to each other to deliver information.”
So, from there, I can reframe the point here in terms of your data generation maturity model.
A good SWE will know more about L0: generating a lot of data and storing it efficiently.
A good DS will know how to move an organization to L2+ by concatenating the right data sets to produce new information.
Though both arguments have merit, the rate of technological change has obviated these circa 2021 arguments about data moats.
This claim is the hardest for me to accept.
The idea here would be that the rapid development of technology in the space has changed the way data creates value.
Another thing I’m thinking about here is that training data for an LLM and data used with an already-trained LLM are different use cases, and the technological shift toward easily acquiring pre-trained models has changed the value of one’s unique datasets.
I could frame this idea around your arc of disruption for opportunity discovery – what had a “no, it’s not feasible” answer for data moats in 2021 has changed to a “yes, it’s feasible” in 2024.
While I’m not sure whether I have any of this framing right, I did want to take a more aggressive stab at applying your models to a question that’s relevant to me and that I’m currently trying to sort through.
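Before working through the counter cases, here’s a quick illustration of the data concatenation point Scott raises. This is a minimal, hypothetical sketch in pandas; the datasets, fields, and IDs are invented for illustration, but the pattern of joining first-party behavioral data with an outside dataset to produce information neither holds on its own is the one he describes.

```python
import pandas as pd

# Hypothetical first-party data: behavior generated natively on the platform.
usage = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "sessions_last_30d": [4, 19, 7],
})

# Hypothetical third-party data: attributes bought from an outside vendor.
firmographics = pd.DataFrame({
    "customer_id": [101, 102, 104],
    "industry": ["retail", "logistics", "healthcare"],
})

# Getting the two data sets to talk to each other: a join on a shared key.
combined = usage.merge(firmographics, on="customer_id", how="inner")

# Information neither data set held on its own: engagement by industry segment.
engagement_by_industry = combined.groupby("industry")["sessions_last_30d"].mean()
print(engagement_by_industry)
```

The value isn’t the join itself; it’s that the combined view answers a question neither source could answer alone, which is what separates mobilizing data from merely having it.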
Working Through The Counter Cases
Creative Ventures' primary point is, “Thanks to better representations and algorithms, it’s now possible to do a lot more with a lot less data.” There have been some changes since CV wrote this opinion. The Generative AI wave is one of them. In the article, CV says, “Outside of a limited number of fairly specialized applications, data moats are becoming less and less secure.”
Generative AI breaks the specialized applications part of this argument. Data is a critical component of large and small language models. LLMs need massive, high-quality data volumes, and SLMs need high-quality, comprehensive, domain-specific datasets. The core thesis is broken in both directions.
CV believed that synthetic data from models and simulations would be the primary source of training data. While they’re right about synthetic data’s growing role, building the simulations that generate it requires substantial work and data. LLMs are making some interesting inroads into high-quality synthetic dataset generation. Again, data lives at the heart of both.
The Andreessen Horowitz article makes a strong point that’s still relevant today. “The cost of adding unique data to your corpus may actually go up while the value of incremental data goes down!” They use the example of the Netflix content recommender, which shows how data creates a moat.
Netflix has access to customer content consumption data at a very low cost. Consumption happens on the platform as part of the standard customer workflow, so the incremental cost of gathering data is fixed. The cost of gathering data about increasingly granular customer segments and content consumption behaviors doesn’t rise over time.
Increasingly accurate recommendations for even niche customer segments are one way Netflix can retain those customers better than a streaming competitor that can’t match them. Data has a flywheel effect and, over time, creates a moat. This is the power of first-party data. When workflows (either internal operations or external customer workflows) happen on the platform, data is generated natively, and the cost of data gathering doesn’t increase over time.
So why is AH’s point still relevant? The cost of acquiring data from second and third parties increases as that data becomes more niche or covers events that happen infrequently. Fewer companies have the data, and they sell it at a premium. Businesses without direct access to the data-generating process have higher input costs than companies with access.