The motto of most data science teams should be, 'You can't be doing it wrong if nobody knows what you're doing.' Pull back the covers on most models, and you'll find deeply flawed methods. But no one in the business knows enough to call them out.
This is where data science gets dangerous. Few people are data or model literate. They don't understand:
- What data science should look like: Rigorous methods and foundationally sound approaches.
- How to evaluate model performance: Connecting business metrics to model metrics.
- How to articulate their reliability requirements: How stable the model must be to meet their needs.
Data literacy training is part of the problem. The focus is training people to be data analysts in 6-8 weeks. That's impossible, but former data scientists and analysts are willing to sacrifice credibility to make money in the training space.
Businesspeople leave the training with a false sense of competence. They apply worst practices to self-service analytics. Nearly all lack the knowledge required to evaluate the models the data science team delivers.
The business isn't blameless. Leadership decides to move forward without a strategy in place. When the data contradicts their strongly held beliefs, they refuse to act on it. If data uncovers problems, leaders often ask data scientists to alter it.
Data Science Has Lost Its Way
The data science team can claim whatever they want. They ship models to production and tell the business how to measure success. They are their own QA and QC. When something goes wrong, it's an MLOps problem, not a model reliability problem.
In reality, MLOps is a code word for a dysfunctional model. The MLOps framework would be much slimmer if our models were built correctly. We've come to accept some ridiculous notions as best practices.
When users don't adopt, it's because they don't get it. It's a common refrain in transformation. The users resist technology because they aren't technical enough or are set in their ways. In reality, most data products are non-functional. People don't adopt what doesn't work.
Data scientists should know better, and many do. However, put one competent data scientist on an otherwise incompetent team, and their voice gets drowned out. It's unlikely they'll be hired in the first place because their answers to interview questions would be considered wrong.
That story is repeated by hard sciences PhDs who transition out of academia. They must wade through several nonsensical interviews before finding a team that understands their value. My favorite story comes from a data scientist who entered the field after completing a Ph.D. in physics. They had a director say, "You don't understand modeling."
Peak hubris is thinking that our field is at the leading edge of modeling. Data science is guilty of reinventing and taking credit for established principles in other fields. The combination of research methods, modeling, and machine learning is powerful. We have advanced data science's utility by partnering with other fields. Still, many data scientists hold to disproven methods.
Is It All Data Scientists?
It's frustrating to call the entire field out because a significant group has adopted rigorous methods. There are exceptional data teams. The problem is that businesses don't know the difference.
It's equally frustrating to call out all bootcamps because some teach rigorous methods. There are exceptional online programs. The problem is that students don't know the difference. The same is true for data literacy training.
I don't enjoy calling out data leadership. Again, there are exceptional leaders, but most don't have the data science or leadership background required to do the job. I have seen CDOs with fewer than 5 years of data science experience and little leadership experience. There are other CDOs with significant leadership experience but no data science background. Few from either group can succeed in the role they have been promoted or hired into.
Leadership plays a large role in dysfunctional hiring processes. Job descriptions don't make sense, and teams aren't structured properly. The data science lifecycles and workflows are neither defined nor standardized. The definition of each role is arbitrary. Interviews are not standardized, and assessment criteria are subjective.
Leadership should improve the team, remove roadblocks, facilitate resources, translate business goals, and build relationships with external teams. Without leadership experience, training, and mentorship, few technical leaders know those are critical functions of the role.
Ask data scientists if they like their jobs. One group tells stories of being overworked because they are asked to do the work of a data scientist, data engineer, ML engineer, and MLOps engineer. Another group talks about being bored and doing almost no data science. A tiny segment is happy in their roles and doing the work they expected to.
Businesses don't know the distinction between roles or how to build data teams based on business needs. External leadership can't step in and implement solutions to problems they don't know exist. However, they also play a prominent role in this mess.
The Business Has Lost Its Mind
I called Bob Chapek, the recently ousted CEO of Disney, one of the most data-driven leaders in the business world. His abrupt dismissal from Disney came after he asked the CFO to manipulate financial data. It's an all too common practice. People are only as data-driven as their circumstances allow. Bob knew how hard the stock would tumble if the true extent of Disney+'s losses were revealed, so he moved some costs into a different business unit to cover it up.
Data scientists are constantly forced into this position and don't have the same clout as the CFO. In Disney's case, the CFO went to the board, and Chapek was booted on a Sunday. He got fired on his day off, revealing the seriousness of manipulating data.
Most data manipulation requests go unreported. As self-service solutions become more pervasive, fudging the data to say whatever people want it to will become just as pervasive. People will take the same liberties data science teams have as soon as they get the chance.
Their motivations are no different than Chapek's. If the data makes them look bad and they are staring down the consequences, they'll take the easy way out. These problems are foundational as well.
Most business operations, and the metrics used to evaluate their effectiveness, are flawed. Businesses refuse to address problems when the culture punishes failure instead of rewarding improvement. Data has become a massive elephant in the room.
Snap and Uber paint an early picture of what's coming. They were at the leading edge of the current layoff cycle. Productivity rose after their cuts. The average revenue per employee has increased, indicating that both businesses were overstaffed.
Most companies have a similar problem. Although tech is in focus right now, this is a common theme. Middle management has bloated over the last 10 years, creating inefficiencies that justify its own jobs. Requests to alter data have their roots in self-preservation. The people in charge also lack oversight, but that's changing. Data gives C-level leaders visibility into productivity and efficiency. It's getting difficult for low performers to hide.
When processes are mapped and analyzed, the elephant shows up. Businesses are very inefficient. When the focus is on growth at all costs, no one pays attention. The language C-level executives use has shifted from growth to productivity and efficiency.
Data scientists have a lot of power to expose inefficiency and threaten low-performing fiefdoms. Data teams drive automation initiatives, which are the execution side of productivity and efficiency. There's minimal oversight, and we have control over measuring success.
Change Is Coming
The harsh reality is the first change wave. Competition is increasing, and inefficient businesses will go under. Companies perpetuating this cycle won't survive the next 12-24 months. I don't wish that on anyone, from CEOs to investors to employees. I've never been part of a company that failed, but I have worked with many people who were. They spent years working through the aftermath, both professionally and personally.
Audits and oversight are the next change wave. As soon as the business sees revenue growth slow or reverse, every dollar spent gets put under a microscope. Auditors show up, and technical SMEs are new additions to outside consulting firms. Finding the skeletons hiding in the data team's closet doesn't take long. Data organizational leadership changes, and people connected with failed projects are let go.
Data teams don't get downsized. From my experience, cuts are implemented with the intent to hire new data professionals. We won't see a decline in demand but will see a shift in job requirements. Companies are hiring more data engineers and data analysts. Data scientists are expected to have applied research capabilities, and those who don't are migrating into data analyst, data engineering, or ML engineering roles. MLOps is shifting its focus toward upfront quality rather than production containment.
It's early, but there's a trend breaking. When companies need to cut costs and focus on the bottom line, a CFO replaces the CEO. Companies that need to improve operations and execution bring in a COO. Businesses that are struggling with transformation will look for technical leadership. Watch for CTOs, CIOs, and some CDOs to filter into CEO roles in traditional companies. This trend has already started in technology-first businesses and will spread to non-tech industries.
What Should We Do About It?
We must do self-evaluations and decide which role our capabilities and experience fit. While the business doesn't understand the delineations, we do. It's up to us to get our titles to reflect reality.
For many of us, that will mean changing labels. Most data scientists are more accurately classified as data analysts, data engineers, or machine learning engineers. Some data analysts are doing data science and should make a case for a title change that reflects their responsibilities.
Some data scientists are business, customer, or process-facing. They need to be reclassified as data product managers and project managers.
Applied machine learning researchers and pure researchers need the same honest self-assessment. Data scientists don't all build for business applications or do publishable research. I went to the applied research side because my work was rigorous enough for business, but I would get laughed out of academia. The opposite is true for pure researchers. Both have value, but finding a company that knows how to monetize our capabilities is critical to our careers' success.
It's time to face reality. The titles and responsibilities that fit aren't always the ones that fit our perceptions. We're in a business cycle where getting ahead of the inevitable change is better than what happens if we practice a comfortable self-deception.
I'll use myself as an example. Am I still an applied ML researcher? Nope. Can I do the job? Yes, but others have more current applied knowledge of new approaches. Am I still an ML engineer? No. Can I still write production-grade code? Yes, but others have a deeper familiarity with the frameworks and more recent experience with best practices.
Hopefully, you see where I am going. I could fake my way through several roles, but that doesn't change my capabilities or value to a business. I am a technical strategist because that is what I am most qualified to do.
The rest of my capabilities aren't current enough to meet the standard for practitioners. Even if most teams don't know the difference, I do. It's time to hold ourselves and those around us accountable for professional standards. Otherwise, we will lead our companies over a cliff and be no better than those who ran FTX or any other techno-scam.