AI Won’t Be Worth Much Until We Build Products For People vs. Tasks
The data science field is starting to build products. When I began working on machine learning products in 2012, there were only a few examples of them to learn from. There were successful pilot projects and case studies but few explorations of data products.
Just building the model was hard enough at that time. That challenge became my primary focus, so I put all my time and effort into making the model do a task. Accuracy was the measure of success.
However, I quickly realized accuracy is only one part of the picture. The model also had to be reliable. I arrived at the concept of model reliability out of necessity. I could spend months grinding incremental accuracy improvements out of a model, so the question became: when is the model ready to be put into production? Most users were non-data scientists who didn’t understand model accuracy metrics, so I needed a different way for them to answer that question.
Reliability was the first time I began to incorporate the concept of a user and their needs into my model design and development. That shift started me down the rabbit hole of understanding the paradigm of human-machine teaming. People don't work with models the same way they do with other digital products. I have talked about this extensively in other posts, but it significantly impacts how we build and the directions we need to pivot the field toward.
Focusing on accuracy has an interesting impact on how we build data science products. It makes our objectives very task-centric. We create a model to deliver inference and power some software application. We think about the solution as a task done in isolation, and that is where we start down the wrong road.
We think we're replacing a person, but that's not the case. We may be doing something for a person. However, the person is still part of the workflow. They will be the ones using the application and the model to accomplish some goal, and that goal is theirs. Focusing on the task ignores the reality of the person.
Predicting The Weather And How People React To Predictions
Weather forecasting is one of the few examples of a model-supported product that existed back in 2012. The focus was on building the most accurate weather model possible. Weather models have a very short time horizon for feedback. If the model says it's going to rain tomorrow and it doesn’t, we have feedback that it is inaccurate.
The more often the weather model is wrong about rain or sunshine, the more likely we are to abandon it and build something new. That feedback loop between people and models is critical. With the weather model, it's simple to look outside and say, “Well, the weather model was right,” or “the weather model was wrong.” There's another dataset being built that we don't often acknowledge.
When the weather model is wrong, the way people react based on its inference also changes. If evacuation warnings were constantly issued to people in a city about a hurricane that never came, they would begin to ignore those warnings. This is a very simplistic behavioral feedback loop. It gets a lot more complicated from here, and understanding how people react to inference served by models helps us understand how to design model-supported products.
Hurricane forecasting is an excellent example. Today, when weather channels forecast a hurricane, they provide a cone. The cone represents the region the eye is most likely to hit, and it gets wider the further into the future the model looks. It's an effective visualization that helps people who aren't necessarily model literate understand how best to use the prediction they're being served.
Even the name is informative. It’s called the ‘cone of uncertainty.’ Forecasters are immediately calling out the nature of forecasting. There is significant uncertainty, and the prediction will probably change over time. Those are powerful trust-building aspects of weather model design that have nothing to do with the task of predicting a hurricane’s path.
Small user-centric changes can have significant impacts on how people use model-supported products and how much they trust them. It took several iterations before forecasters figured out that this cone was the most effective way of showing people the data.
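To make the idea concrete, here is a minimal sketch of how a cone-style visualization could be rendered. The track and spread numbers are made up for illustration; a real hurricane model would supply the predicted positions and the uncertainty around them.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical forecast: the predicted latitude of the storm's eye over 5 days.
hours = np.arange(0, 121, 6)          # forecast lead time in hours
track = 25.0 + 0.05 * hours           # made-up predicted position
spread = 0.5 + 0.02 * hours           # uncertainty grows with lead time

plt.plot(hours, track, label="Forecast track")
plt.fill_between(hours, track - spread, track + spread,
                 alpha=0.3, label="Cone of uncertainty")
plt.xlabel("Forecast lead time (hours)")
plt.ylabel("Latitude (degrees)")
plt.legend()
plt.show()
```

The widening band communicates the same thing the forecasters' cone does: the further out the prediction, the less the user should rely on a single line.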
Making Data Science More User-Centric
Machine learning will be more successful when we build models to complete a task with people instead of alone. In the weather forecast example, an accurate model is only one part of the solution. The predictions interact with people. Consider 2 questions:
What is the weather like right now?
What will the weather be like tomorrow?
Asking for a weather report is an example of human-machine tooling. I can look outside and figure it out for myself, but it’s more convenient for me to get an aggregate report from a tool. The report is based on sensors delivering real-time data and is very accurate. I have a good understanding of what’s being measured and how.
Asking for a weather forecast is an example of human-machine teaming. I can’t look outside and make the prediction myself, so I rely on the model to provide an accurate forecast. I don’t understand all the data that goes into the forecast or the process used to generate it. I am surrendering some autonomy, the part of my workflow I can’t do myself, over to the model.
If the forecast is presented the same way as the report, it hides the reality that a report and a forecast are very different. The user could be naïve to the differences, and the solution must be implemented with that in mind. ML-based products fail to be adopted because we build for model autonomy or human-machine tooling vs. human-machine teaming.
ChatGPT is a good example. It was built to answer questions alone, not answer questions with people. When I ask ChatGPT a question, I have an objective, and the solution isn’t designed to capture my objective. ChatGPT needs the ability to ask follow-up questions to understand my needs.
Right now, ChatGPT delivers an answer for the average person vs. the person asking the question. Consider the differences in how we are expected to answer questions. We are taught to think about the person asking us a question before answering. Giving a highly technical answer to a non-technical person is frowned upon.
We ask people to read the room and understand their audience, but ChatGPT wasn’t designed with that ability. It was built with technical functionality in mind but not human utility. We see this anti-pattern repeated in most data products.
Developing The Platform To Better Understand Users
Applied machine learning research teams need Data Product Managers and Model User Design Engineers. We need people to study the differences and define human-machine teaming design principles. To do that, we need more user engagement experiments.
This part of product engineering is widely overlooked because we have spent so much effort improving our models. If I’m being honest, data science wasn’t ready for real-world applications until about 2017. Before then, descriptive models dominated successful implementations, and the applied machine learning research I worked on was costly. Only the largest companies had problems at a sufficient scale to justify the investment.
Today we have the tools, infrastructure, experience, and model architectures to make machine learning more practical. Model reliability isn’t the biggest challenge. Adoption and utility are the next engineering challenges we should take on.
Data Product Managers and Model User Design Engineers work together to facilitate engagement experiments. Some of these are simple A/B tests, while others are complex behavioral studies like the ones that uncovered how to best present hurricane predictions. The ability to run experiments should be built into the solution.
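As a minimal sketch of what one of those simple A/B tests could look like, assume we are comparing workflow completion rates between two hypothetical presentation variants; the counts below are made up.

```python
from statistics import NormalDist

# Hypothetical results: variant A shows raw model output,
# variant B adds an uncertainty display.
completed_a, users_a = 420, 1000
completed_b, users_b = 468, 1000

p_a, p_b = completed_a / users_a, completed_b / users_b
p_pool = (completed_a + completed_b) / (users_a + users_b)
se = (p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b)) ** 0.5

z = (p_b - p_a) / se                              # two-proportion z-test
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Completion rate A={p_a:.1%}, B={p_b:.1%}, z={z:.2f}, p={p_value:.3f}")
```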
Platforms are data-gathering powerhouses. People interacting with the platform create rich datasets. Companies like Adobe have sophisticated data gathering and customer behavioral tracking capabilities built into their platforms. They use these datasets to understand the most important features for each customer segment.
We can discover points of friction that prevent users from completing their workflows. I put tracking in place that evaluates workflow abandonment rates. If there are 7 steps in a defined workflow, how often does a user stop before reaching the last step? How often do they return and attempt to complete the workflow?
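A sketch of that abandonment tracking, assuming a hypothetical event log with one row per completed step in each workflow attempt (the field names and data are invented for illustration):

```python
import pandas as pd

# Hypothetical event log: one row per step a user completes in an attempt.
events = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    "attempt_id": [10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 12, 12],
    "step":       [1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 1, 2],
})
LAST_STEP = 7  # the workflow has 7 defined steps

# Furthest step reached in each attempt.
progress = events.groupby("attempt_id")["step"].max()

abandonment_rate = (progress < LAST_STEP).mean()
print(f"Abandonment rate: {abandonment_rate:.0%}")

# Where do attempts stall? Share of attempts ending at each step.
print(progress.value_counts(normalize=True).sort_index())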
We can discover a loss of trust or interest over time. I track changes in the number of workflows completed by each user, adjusted for things like seasonality.
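One way to look for that decline, sketched with made-up monthly completion counts and a crude seasonal adjustment (subtracting the platform-wide average for each month):

```python
import pandas as pd

# Hypothetical monthly workflow completions for two users over six months.
monthly = pd.DataFrame({
    "user_id":   [1] * 6 + [2] * 6,
    "month":     list(range(1, 7)) * 2,
    "workflows": [12, 11, 9, 8, 6, 5,        # user 1: steady decline
                  10, 11, 10, 12, 11, 12],   # user 2: stable
})

# Crude seasonal adjustment: remove the platform-wide average for each month.
monthly["adjusted"] = (monthly["workflows"]
                       - monthly.groupby("month")["workflows"].transform("mean"))

# Average month-over-month change; a persistent negative drift is a
# possible loss-of-trust or loss-of-interest signal.
drift = monthly.sort_values("month").groupby("user_id")["adjusted"].apply(
    lambda s: s.diff().mean())
print(drift)
```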
I can run experiments on training effectiveness. How much faster do users who recently attended training complete their workflows? How does training impact platform usage and workflow completion? Training is an intervention, and I have learned a lot about preparing users to work with models from these types of experiments.
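A sketch of that comparison, assuming hypothetical completion times in minutes for users who recently attended training and those who did not:

```python
from scipy import stats

# Hypothetical minutes to complete the same workflow.
untrained = [34, 41, 29, 38, 45, 33, 40, 36]
trained   = [27, 30, 25, 31, 28, 26, 33, 29]

t_stat, p_value = stats.ttest_ind(trained, untrained, equal_var=False)
speedup = 1 - (sum(trained) / len(trained)) / (sum(untrained) / len(untrained))

print(f"Trained users are ~{speedup:.0%} faster (t={t_stat:.2f}, p={p_value:.3f})")
```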
Data Product Managers working with Model User Design Engineers can integrate experimental hooks into platforms. Data PMs are critical members of the experimental design and review team. User and business impacts should always be assessed before approving an experiment.
Building An Ecosystem On The Platform
Companies like OpenAI will get high-value datasets from building a user ecosystem on their growing platform. They have incredible research and early adopter communities but need a DevRel approach to develop their user community.
Like user design and product management, the ML version of a DevRel requires a different set of capabilities. Only a few unicorns have successfully built an ecosystem around true ML products where people pay money for inference.
Products like DALL-E 2 and ChatGPT are released to the public with a few use cases in mind. The public’s usage and imagination surface additional use cases. There’s a new product management paradigm at play.
Large models rarely have just one or two applications. There are typically dozens or even hundreds. Data Product Managers are challenged to think them up independently, especially when they involve customers in new markets. Opportunity size must be assessed, which requires even greater access to customers.
Open releases are an intelligent strategy, but they don’t succeed without a large, engaged community. That’s where the AIRel Engineer or Evangelist adds value. They manage all the activities required to build an ecosystem around the platform. They give Data PMs access to the community and users. AIRel Engineers are constantly surfacing opportunities and needs from their interactions with the ecosystem. They are educators who prepare people to adopt and use the platform.
Data science needs to add roles focusing on how people want to interact with models and what the new human-machine teaming paradigm means for product design. We need roles that focus on educating and preparing users for model-supported products.
These new needs are a sign that data science is maturing into an applied field. The technology works well, and it's ready for users. Models don't fit into the existing digital paradigm, and we'll fail if we try to force them. We need to prioritize applications research.
Our field should spend more time thinking about how to prepare people for the range of possible solutions instead of hoping they'll figure it out themselves.