Data Engineering For Analytics And Machine Learning Use Cases: Migrating From Data Catalogs To Ontologies And Knowledge Graphs
Ontologies are complex data models. Why should you care? Enterprise data is structured for BI but not for analytics and data science use cases. As a result, most data cannot be used for analytics or data science. Data engineering remains a cost center until it matures from data catalogs to ontologies for knowledge management. I explain more in this article.
In a previous article, I explained the technical drivers for making the change, introduced OWL, and showed some of what Generative AI can do to accelerate the ontology engineering process. This article fills in details for many of the concepts I introduced and referenced. For readability, I’ve broken this into two parts.
Part I -
I cover 8 companies and explain what each one is doing with ontologies to support data science and get more value from their enterprise data.
Each example is an insider’s look at the use cases and justifications for migrating from simpler data models to ontologies. This section frames ontologies, and the rest of the article makes much more sense with this context.
I’ll provide a more detailed explanation of ontologies by discussing how they are built today. This section will introduce key concepts.
Terms Lists, Thesauri, and Authority Files
Nodes, Edges, and Vertices
I wrap up this article with examples of how LLMs can create these top-level ontology components with basic or semantic prompts.
Part II -
I get into ontology engineering and explain prominent methodologies.
Knowledge Organization Strategies
Building Knowledge Representations For Enterprise Data
Data Mapping and Integration (AKA Data Harmonization)
Ontology Standards and Testing
Starting with a finished ontology is a hard way to learn about them. It’s easier to come at it from the engineering and modeling process. You’ll learn by watching me build small ontologies first, then more complete ones.
I extend ontology engineering methods by introducing the tools landscape. There’s a growing ecosystem that supports ontology engineering, testing, and maintenance.
Finally, I will introduce LLMs to the development process. Generative AI can help reduce the cost and overhead of ontology engineering. I will introduce the key concepts and continue to dive deeper into them in the next post in this series.
Companies That Have Already Made The Switch And Why
Is this real, and why would any business undertake the transition from data catalogs to ontologies? The best way to answer that question is with tangible examples of companies working with ontologies and taxonomies. The healthcare, biomedical, pharmaceutical, retail/eCommerce, legal, and social media industries are the most common users of ontologies. However, ontologies are becoming the standard for data and AI maturity across industries.
GSK is building the Onyx Research Data Platform and developing capabilities to support new drug discovery. It’s part of a larger effort to create a best-in-class data science team focused on predictive models and use cases. Engineering ontologies is one branch of this effort and supports everything from basic analytics and reporting to advanced machine learning models.
Tesla’s annotation team uses an ontology to support Autopilot model training. They leverage a mature process that begins with identifying scenarios where the model fails to meet reliability requirements. The annotation team is challenged to find ways to improve model reliability through data rather than architectural changes. That involves data discovery, curation, and more advanced labeling frameworks. All three benefit from an ontology.
Capital One’s domain ontology is an emerging program to integrate semantic technologies into the business’s products and services. One of the business’s top-level goals is to optimize existing data models. The program is combined with machine learning and data governance.
What drove Capital One, and many businesses launching a similar program, to explore this line of data modeling is data discoverability or search. As data sets increase in size, accessing domain knowledge gets more complex. Traditional data models become an impediment, and that’s when businesses turn to ontologies.
JPMorgan Chase is at an earlier stage in the journey. Its data architecture and governance teams are working to transition from financial industry standard ontologies to custom-built ones. Most of the team’s work is focused on organization, classification, and standardization.
ADP is developing an ontology to support its new, customer-facing virtual assistant. The company’s vast customer engagement and interaction data is being structured to train that assistant. They aren’t just building to train LLMs. ADP’s ontologists are also using LLMs to help develop knowledge structures, one of the areas I will cover in depth.
Etsy uses ontologies to model knowledge about its inventory, buyers, shops, listings, collections, and more. One of Etsy’s primary goals is personalizing the shopping experience. Its business spans a complex marketplace. That drives the move from traditional to more advanced data models and knowledge management strategies. The same pattern repeats in every example: complexity and scale force businesses to embrace ontologies.
Bloomberg leverages ontologies for content management. It’s no coincidence that the company was an early entrant into the LLM space, deploying a financial Generative AI model and virtual assistant.
Lennar is implementing an enterprise data fabric, and developing ontologies is part of that initiative. They are also in the early stages of ontology engineering. The goal is to connect data from multiple sources, many of which are siloed. Lennar is worth mentioning because they are jumping on a growing data 360 or analytics 360 trend. The goal is for any user to have access to the data they need when they need it, no matter where the data lives. Uniting data from multiple systems and sources drives a lot of interest in ontology engineering and modeling.
How Ontologies Are Built Today – Core Concepts
The top-level concepts are controlled vocabularies, taxonomies, and the ontology itself. Each one plays a part in managing enterprise data. These terms are often mashed together, and people use them interchangeably. Some differences are worth understanding.
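To make the distinction concrete, here is a minimal Python sketch of the three layers side by side. It is an illustration only: the retail terms, the relationship names, and the helper functions (`preferred_term`, `ancestors`, `related`) are all hypothetical, and real projects would typically express these layers in standards such as SKOS or OWL rather than plain dictionaries.

```python
from typing import Optional

# Hypothetical retail example; none of these names come from a real system.

# 1. Controlled vocabulary: an agreed list of preferred terms,
#    each with the variant labels that should map to it.
VOCABULARY = {
    "customer": {"client", "buyer", "account holder"},
    "product": {"item", "sku", "listing"},
    "order": {"purchase", "transaction"},
}


def preferred_term(raw: str) -> Optional[str]:
    """Return the preferred term for a raw label, or None if it is unknown."""
    label = raw.strip().lower()
    for term, variants in VOCABULARY.items():
        if label == term or label in variants:
            return term
    return None


# 2. Taxonomy: terms arranged in a parent/child hierarchy (child -> parent).
TAXONOMY = {
    "product": None,  # root of the hierarchy
    "physical product": "product",
    "digital product": "product",
    "ebook": "digital product",
}


def ancestors(term: str) -> list:
    """Walk up the hierarchy and return every broader term, nearest first."""
    chain = []
    parent = TAXONOMY.get(term)
    while parent is not None:
        chain.append(parent)
        parent = TAXONOMY.get(parent)
    return chain


# 3. Ontology: typed relationships between concepts, stored here as
#    (subject, predicate, object) triples.
TRIPLES = {
    ("customer", "places", "order"),
    ("order", "contains", "product"),
    ("ebook", "is_a", "digital product"),
}


def related(subject: str, predicate: str) -> set:
    """Return every object linked to the subject by the given relationship."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}
```

The vocabulary only answers "which word do we use?", the taxonomy adds "what is this a kind of?", and the triples add arbitrary named relationships, which is the part that a catalog or a hierarchy alone cannot express.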
The simplest concept is a controlled vocabulary. It’s the first component of an ontology that must be defined.