Updates: I am launching a reading club for subscribers. During Monday and Friday office hours, I will give a short presentation to extend one of the week’s articles and take questions. I will discuss this article on Monday, and I hope to see you there.
I have released the next cohort of my Data and AI Product Management course. The first class is on Saturday, July 15th at 8 am. Each class is 90 minutes, followed by an hour of Q&A to personalize the content to your situation. The cohort runs for 6 weeks, and each session is recorded for you to refer back to. Enroll here. I hold a few seats back for subscribers, so email me if the class shows sold out, and I will enroll you.
Causal Discovery. Where To Start?
People who dive head-first into causal inference, causal machine learning, and causal discovery often lose sight of these methods' purpose. The sciences introduced causal methods to support experimental design and validation. There's a purpose behind all the acronyms and math.
The experiment itself creates a deeper connection to the physical world. Data can easily become a barrier, obscuring the physical world that generates it. In data science, causal methods are taught as a means to generate more accurate models. The purpose is tied to data and not the physical world.
I’ll start this article and series by introducing a framework from my data and AI product management course. I understand how strange that starting point must sound, but the product focus makes a connection to the physical world. Whether we’re talking about data, models, or causal methods, it must all connect to something physical, or it lacks purpose and context.
To build a causal model, you must understand why you're building it in the first place. A disconnect from physical systems is fatal for causal models: once researchers lose sight of what is being modeled, their efforts are futile. The connection to a physical system, both the thing being modeled and the motivation for the causal model, is crucial.
Connecting Data And Models To The Real World
A model is a twin to something in the real world. Model accuracy improves as it becomes an increasingly accurate representation of the original physical twin. Gravity is a good example of a physical force that we have some successful models for.
When researchers propose experiments to understand gravity better, there’s little resistance. The value of understanding gravity is well-known and accepted. It explains how planets move, black holes behave, and satellites get into and stay in orbit.
There are similar forces in the business world. There’s little resistance to understanding customer behavior or the supply chain better. Both have accepted and well-understood business value. It’s worth reframing initiatives for the rest of the business so they see an immediate connection to something they care about.
Physicists at CERN don’t discuss how their work furthers our understanding of quantum chromodynamics. They explain how the experiments produce data we’ve never been able to gather before. That’s why we need the billion-euro collider, massive compute infrastructure, and a small army of scientists. That data will help us understand atoms and fundamental particles better. We don’t have an easier way to create the data set, so these experiments and the tools that facilitate them are essential.
No one would put over a billion euros into understanding quantum chromodynamics, but reframe the outcome as understanding the nature of fundamental particles, and the money spigot opens. We need a framework to translate data science into business value. It turns out that data science also requires a framework to translate business systems into data science.
Enterprise Data Science Is A Multi-Body Problem
The business won’t fund data gathering for causal discovery with LLMs but will put resources into better understanding the supply chain. The data team cannot successfully create a causal model without the connection to the business system being modeled. We can’t even succeed at data gathering until we have a framework that helps us identify the data we should be gathering.
No causal model survives without a connection to the real-world twin it is built to simulate. No experiment can produce data that leads to a causal model without a basic understanding of what is known and unknown about the system. We can’t develop a basic understanding of the system without data generated by the system.
A disconnect between the data-gathering processes and data-generating systems leads to data sets that cannot be used to build reliable models. If we don’t have metadata tracking back to the system that generated the data, we don’t understand what the data can be used for. That changes data engineering and requires a new approach to data set curation.
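To make the idea concrete, here is a minimal sketch of what provenance-aware data set curation could look like. All names here (`Provenance`, `CuratedDataset`, the field names) are hypothetical illustrations, not a prescribed schema; the point is that every curated data set carries metadata tracing it back to the business system that generated it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


# Hypothetical sketch: provenance metadata that traces a data set
# back to the generating business system.
@dataclass(frozen=True)
class Provenance:
    source_system: str      # the business system that generated the data
    collected_at: datetime  # when the data was gathered
    method: str             # how it was gathered (export, event stream, survey)


@dataclass
class CuratedDataset:
    name: str
    records: list
    provenance: Provenance

    def is_traceable(self) -> bool:
        # A data set with no named source system can't be connected
        # back to the physical process being modeled.
        return bool(self.provenance.source_system)


orders = CuratedDataset(
    name="weekly_orders",
    records=[{"sku": "A-1", "qty": 3}],
    provenance=Provenance(
        source_system="order-management",
        collected_at=datetime.now(timezone.utc),
        method="nightly export",
    ),
)
```

A curation step could then reject any data set where `is_traceable()` is false, enforcing the link between data gathering and the data-generating system before modeling begins.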
The challenge is always in aligning multiple objectives, which is why I call this a multi-body problem. In physics, the multi-body problem circles back to gravity. The Earth, Moon, and Sun each exert strong gravitational forces on each other. Understanding gravity helps us explain how they interact. It's the glue that explains a lot of their behavior.
We need a single framework to align the multiple bodies in a business, and I use the data maturity model. While it’s a data and AI product management framework, it serves multiple purposes. Let’s be honest. This article is worthless if you can’t put any of it into practice.
The business must fund data gathering, or this doesn’t work. The data being gathered must have a well-defined connection to a business system, or this doesn’t work. The links between value creation, business systems, data gathering, and causal discovery are all necessary. That’s why I’m starting with this framework.