Data Engineering Is Evolving, But Most Data Engineers Aren’t
Defining The Need And Frameworks For Generative Data Engineering
Most data can’t be used for analytics or for training machine learning models. I have delivered that slice of bad news to every client for the last 8 years, and it’s the primary topic of my most popular seminar. The assumption is that all data is useful, but that’s only partially accurate. We think about data quality and utility through a digital prism, and that’s held data engineering back for over a decade.
Most data has utility for digital use cases, but only data with context has utility for data and AI use cases. This article will explain how data is leveraged in those two use case categories. I begin with nontechnical explanations and get deeper into the technical implementation in the following articles. Look for part two tomorrow.
Getting deep and technical is no exaggeration. The complete picture pulls from multiple fields, and it’s asking a lot for one role or individual to build it all. Implementing this is a team sport, not a solo initiative. If I see all these capabilities in a single job description, let’s just say I won’t be pleased. This series isn’t meant to define a new unicorn data engineer.
This may not seem like an extension of the Generative AI In Production Series, but it is. Everyone says data is still critical for Generative AI, but that’s only true if the data is properly structured. We talk about integrating data, analytics, and machine learning into every aspect of work, yet rising data engineering costs mean the unit economics don’t work for most applications.
A growing body of research supports the thesis that data quality and structure affect model reliability and generalization more than data quantity or model complexity. Data quality is intrinsically linked to the structure built around data as it is gathered. In the past, building that structure was prohibitively expensive.
Then Generative AI took a leap forward with GPT-3.5, Claude 2, and the open-source ecosystem of LLMs. I jumped on causal discovery early on, and that’s been the focus of a parallel series. I’m wrapping that in as well. The third post in this series will explain how it fits and how to implement the new approach with Generative AI.
In part four, I will explain more technical concepts. Drift is expensive to manage with our current model training and data engineering approach. It’s one reason so many use cases are infeasible. Instead of handling drift as part of data engineering, it’s pushed to the end of the lifecycle and treated as a maintenance or MLOps function. Pulling it forward reduces costs and improves model reliability.
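As a minimal sketch of what pulling drift detection forward into the data engineering layer might look like, a pipeline step could score each incoming batch against a training baseline before the data ever reaches a model. The function names, the Population Stability Index choice, and the 0.2 threshold below are my own illustrative assumptions, not a prescribed implementation:

```python
import math

def psi(baseline, batch, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a new batch.

    Buckets come from the baseline's range; out-of-range batch values are
    clamped into the edge buckets, and eps avoids log-of-zero for empty ones.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            i = max(min(int((v - lo) / width), bins - 1), 0)
            counts[i] += 1
        return [c / len(values) + eps for c in counts]

    b, n = fractions(baseline), fractions(batch)
    return sum((nf - bf) * math.log(nf / bf) for bf, nf in zip(b, n))

def check_batch(baseline, batch, threshold=0.2):
    """Ingestion gate: score the incoming batch and flag drift before any
    model training or inference sees the data."""
    score = psi(baseline, batch)
    return {"psi": score, "drifted": score > threshold}
```

Run as part of ingestion, a gate like this turns drift from a post-deployment maintenance surprise into a routine data engineering check.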
Generative AI will be one of the most impactful tools we’ve ever seen for data engineers. What I’ll teach you to build in this series is a cheat code for deep learning models. Switching data engineering from a digital to an analytics and data science-centric approach will enable you to build more reliable models faster and with less data.
From a business standpoint, why should you care? I have helped clients reduce the costs of data governance and engineering by half. Now you see why this is my most popular seminar. Businesses that adopt a data and AI-centric approach to data engineering have faster delivery times and more opportunities to automate the data lifecycle/workflow. The data sets this approach curates are of higher value than those developed with the digital approach. How?
How can we keep our careers relevant with all the changes happening in the data science field and the products coming online?
The technical skills and domain knowledge we rely on become replaceable and commoditized. I needed a new category of capabilities to break through the plateau.
I teach Data and AI Strategist Certification and Data and AI Product Manager courses, and the next cohorts start soon. Enroll today to acquire the strategic capabilities to move your career forward in a rapidly changing professional landscape.
BI, Analytics, And Data Science Have Different Use Cases And Data Needs
I start client engagements with an initial assessment, and one artifact from that process is a data monetization catalog. It connects data sets and models to use cases. The data monetization catalog is my starting point for changing the assumption that all data is valuable. The process starts by reframing data as a novel asset class.
Just because data is used doesn’t mean that the data is a value creator. Usernames and passwords are common enterprise data types that support login functionality. For a login popup or screen, the digital functionality is the primary value creator, not the username and password. The data has digital utility and value.
Those data points have no value for data and AI use cases because they lack context. In the data monetization catalog, username and password are connected to login use cases but have no value. However, adding context to the usernames and passwords will change that.
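To make the catalog idea concrete, here is a minimal sketch of what one entry might look like. The schema, field names, and the boolean notion of "value" are hypothetical illustrations of the concept, not a prescribed format: a data set is linked to its use cases, carries digital value on its own, and gains data/AI value only once context is attached.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One entry in a data monetization catalog: a data set tied to use cases."""
    dataset: str
    use_cases: list
    digital_value: bool = False  # supports app functionality (e.g., login)
    context: dict = field(default_factory=dict)  # attached contextual attributes

    @property
    def data_ai_value(self) -> bool:
        # Without context, the data set has no analytics/ML utility.
        return len(self.context) > 0

# Usernames and passwords: digital utility, but no data/AI value on their own.
creds = CatalogEntry(dataset="usernames_passwords",
                     use_cases=["login"],
                     digital_value=True)

# Attaching context (e.g., login timestamps, device, geography) is what opens
# data and AI use cases like fraud detection.
creds.context["login_events"] = "timestamp, device, geo"
```

The point of the sketch is the asymmetry: `digital_value` is a property of the raw fields, while `data_ai_value` only flips once context is attached.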