Enterprise Knowledge Management: Introduction To Ontology Engineering For Machine Learning Use Cases
In part 1, I opened the door with the basic concepts that explain what an ontology is and how it supports the transition from enterprise data management to enterprise knowledge management. This article covers the ontology engineering process and my approach to developing ontologies incrementally. The tools landscape is also part of this article.
I intended to write a post about standard ontology engineering but realized how few of those processes apply to data engineers and scientists. We need an applied process, and learning best practices for a different field doesn’t make sense. I will explain the ontology development process using a case study from a project I worked on.
As I explain the process, two main points will emerge. First, some ontologies have causal structures. I call ontologies cheat codes for machine learning because, once discovered, those causal structures can be used to build highly reliable models and explain the inferences those models serve. Second, some ontologies change over time. In data science, we call this drift. This article covers a mostly static ontology. The next in the series will introduce dynamic ontologies with a different use case.
The payoff for all this is a more mature data engineering framework that manages knowledge instead of data. Knowledge engineering and management have direct ties to ROI. Managing aggregated knowledge graphs also costs less than managing sprawling data sets.
The first section is a giant asterisk about defining knowledge in the scope of data engineering or enterprise data management vs. defining knowledge more rigorously. Applied data science gets away with practices that will not fly anywhere else. It’s critical to call these out so we move forward with our assumptions laid out.
The next sections introduce ontology engineering basics with a supplier discovery case study:
Knowledge Organization Strategies
Building Knowledge Representations For Enterprise Data
Data Mapping and Integration (AKA Data Harmonization)
Causal Structures
I cover tools and standards immediately after. This section has links to more detailed information and tutorials. I intend to provide resources to frame where these tools belong in the ontology engineering process and why they are necessary.
The last section introduces dynamic ontologies and sets up the next article in the series, which explains dynamic ontologies in greater depth and introduces LLMs into the development process.
Defining Knowledge – The Elephant I Want You To Ignore
Nothing that comes next is as simple as I will present it, and I would be negligent to ignore the vast body of work these concepts rest on. Ontologies walk the line between science and a deeper exploration of knowledge and the nature of reality. Epistemology is a rabbit hole where Descartes and Plato live. And I will step over it like most data scientists do.
I begin and end this article with epistemology. I’m about to define a process for managing knowledge without defining what knowledge is in the first place. Pay no attention to that elephant staring us down while I continue. At the end of the article, I will return to epistemology and confront two applications we must engineer solutions for: the dynamic nature of some ontologies and reconciling conflicting ontologies.
For this article, knowledge is constrained by business context, which is the domain expertise required to operate a business. Think in terms of workflows and the expertise required to complete those workflows and deliver high-quality work products. Workflows have two sides: the tasks and the decisions made as part of those tasks.
Mostly static ontologies have few decisions. When decision-making enters a workflow, the rate of change or dynamism increases dramatically. This use case has very low dynamism, so I can focus on the fundamentals of ontology engineering. At the same time, you’ll get a sense of how valuable small, static ontologies are.
Organizing Knowledge Based On Simple Workflows
In skilled trades like CNC machining, people are taught dexterity, software, and steps to deliver a work product. There’s often little independent judgment or decision-making involved in each workflow to ensure a high-quality, consistent work product. For these workflows, domain knowledge is defined by the CNC machine’s operations and the steps necessary to deliver a work product.
One of my earliest machine learning projects involved an ontology that connected machines to the parts they could manufacture. The ontology was developed after several attempts using more traditional machine learning approaches. The ontology engineering approach made the product feasible when traditional data engineering couldn’t.
The image below from this paper shows what looks a lot like a process flow.
The legend calls out the reference vocabularies (controlled vocabularies) used to build this lightweight ontology. Products have parts or modules, and modules have components. Each component requires one or more activities to develop, and the recursive arrow indicates the potential for multiple activities. Activities require machines, and there must be a station with that resource to support each activity.
As you can see in the bottom left corner, if the person has the skill necessary to operate the resource or machine, they can complete the activity. The activity is a black box in this ontology with respect to what the person does while operating the resource. There’s an undocumented assumption that if the person can operate the station, they can complete the activity. In manufacturing, that is almost universally true.
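To make this concrete, here is a minimal sketch of that lightweight ontology in Python with rdflib. The namespace, class names, and property names are my own illustrative choices, not the paper’s exact vocabulary.

```python
# A minimal sketch of the lightweight ontology in rdflib. The namespace,
# class names, and property names are illustrative, not the paper's
# exact vocabulary.
from rdflib import Graph, Namespace, RDF, RDFS

MFG = Namespace("http://example.org/mfg#")
g = Graph()
g.bind("mfg", MFG)

# Classes drawn from the controlled vocabulary
for cls in ["Product", "Module", "Component", "Activity",
            "Resource", "Station", "Person", "Skill"]:
    g.add((MFG[cls], RDF.type, RDFS.Class))

# Relationships mirroring the process flow in the figure
edges = [
    ("hasModule", "Product", "Module"),
    ("hasComponent", "Module", "Component"),
    ("requiresActivity", "Component", "Activity"),  # recursive: 1..n activities
    ("requiresResource", "Activity", "Resource"),
    ("supportsResource", "Station", "Resource"),
    ("hasSkill", "Person", "Skill"),
    ("operates", "Person", "Resource"),
]
for prop, domain, rng in edges:
    g.add((MFG[prop], RDF.type, RDF.Property))
    g.add((MFG[prop], RDFS.domain, MFG[domain]))
    g.add((MFG[prop], RDFS.range, MFG[rng]))
```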
Connecting Ontologies With Enterprise Data
Each entity or class is a workflow concept that should correspond with data. The products table should have fields/lists/arrays named ‘modules’ and ‘components.’ At a supplier business (most are very immature and don’t keep this sort of data, so please suspend disbelief with me), the shopfloor table would have fields/lists/arrays named ‘stations’ and ‘persons.’
The controlled vocabulary for this workflow domain maps to enterprise objects or entities. There’s an overlap with HR workflows in the ‘person’ entity. HR would use the term ‘employee’ for its entity, and there would be a connection to the ‘person’ entity. However, they would be different.
All ‘employees’ are ‘persons’ (for now), but ‘employees’ have HR-related attributes, while ‘persons’ have people-related attributes like ‘skills.’ The ‘skills’ entity indicates an overlap between the shopfloor and HR. The ‘skills’ used to complete the activity are the same ‘skills’ HR uses in job descriptions.
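A short sketch of that overlap, again with hypothetical namespaces: HR’s ‘employee’ is a subclass of the shopfloor’s ‘person,’ and both domains reference the same ‘skill’ entity.

```python
# Sketch of the HR/shopfloor overlap with hypothetical namespaces:
# every employee is a person, but the two carry different attributes,
# and both domains reference the same skill entity.
from rdflib import Graph, Namespace, RDF, RDFS

MFG = Namespace("http://example.org/mfg#")
HR = Namespace("http://example.org/hr#")
g = Graph()

g.add((HR.Employee, RDFS.subClassOf, MFG.Person))     # all employees are persons
g.add((HR.hasSalaryBand, RDFS.domain, HR.Employee))   # HR-related attribute
g.add((MFG.hasSkill, RDFS.domain, MFG.Person))        # people-related attribute

# Job descriptions reference the same skills the shopfloor needs
g.add((HR.requiresSkill, RDFS.domain, HR.JobDescription))
g.add((HR.requiresSkill, RDFS.range, MFG.Skill))
```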
The enterprise has many long-chain workflows and rarely realizes it. Ontologies make those connections much more apparent. In the HR to shopfloor chain, hiring or recruiting bottlenecks will lead to a bottleneck in production output. This knowledge could help HR decide to offer a signing bonus to prevent the supplier from missing a contractual deadline.
Enterprise planning benefits from long-chain workflows and dependencies. Following the ontology’s graph structure makes the connections clearer. It can support automated triggers or recommendations to help HR (upstream process) see the impacts of lengthy time to hire on the shopfloor’s ability to deliver on time. CxOs have visibility into the ROI for hiring bonuses or funding hiring process improvement initiatives.
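Here is a sketch of what that traversal could look like as a SPARQL property-path query, reusing the hypothetical namespaces from the earlier sketches and assuming the graph has been populated with instance data.

```python
# Sketch: follow the graph from a skill to the products it ultimately
# gates. SPARQL property paths make the long HR-to-shopfloor chain a
# single traversal. `g` is the rdflib Graph from the earlier sketches,
# assumed here to hold instance data.
query = """
PREFIX mfg: <http://example.org/mfg#>
SELECT DISTINCT ?skill ?product WHERE {
    ?product   mfg:hasModule/mfg:hasComponent ?component .
    ?component mfg:requiresActivity ?activity .
    ?activity  mfg:requiresResource ?resource .
    ?person    mfg:operates ?resource ;
               mfg:hasSkill ?skill .
}
"""
for row in g.query(query):
    print(f"Skill {row.skill} sits on the critical path for {row.product}")
```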
The ontology should contain edges between enterprise entities and the locations where they are stored. Databases, tables, and fields should be entities defined by an ontology just like any other. Applications and code can be mapped in an ontology, too. Once you begin down the ontology path, you’ll find applications everywhere. Domain knowledge must be stored and accessible.
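A sketch of that mapping follows, with hypothetical database, table, and field names.

```python
# Sketch: databases, tables, and fields as first-class ontology entities,
# with an edge from a domain entity to where its data lives. All names
# are hypothetical.
from rdflib import Graph, Namespace, RDF

META = Namespace("http://example.org/meta#")
MFG = Namespace("http://example.org/mfg#")
g = Graph()

g.add((META.erp_db, RDF.type, META.Database))
g.add((META.products_table, RDF.type, META.Table))
g.add((META.products_table, META.inDatabase, META.erp_db))
g.add((META.modules_field, RDF.type, META.Field))
g.add((META.modules_field, META.inTable, META.products_table))

# The domain entity points at its storage location
g.add((MFG.Module, META.storedIn, META.modules_field))
```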
An Ontology Engineering Example From Supplier Discovery
In that early machine learning project, the primary innovation (not mine) was that multiple activity flows could produce the same component. Companies traditionally searched for suppliers that could handle one activity flow because they did not track alternatives.
Some alternative activity flows required different machines and could be handled by suppliers that hadn’t been considered before. In other cases, the alternate activity flows were less expensive or took less time. Alternate activity flows were only created when they were absolutely necessary or suggested by suppliers. It was a labor-intensive process.
A model was developed to accept a CAD drawing and return the activity flows that could produce the component. In traditional data engineering, the data set is curated in advance. Data scientists move forward after it’s built and may need to return to the data engineering team for additional data. The data engineers’ work isn’t seen as part of the finished product, and they’re often labeled as a cost center.
With ontology engineering, data engineering work is connected to ROI. Developing the ontology is part of the data science life cycle and product development processes. Without the ontology, and I’ll explain this further a bit later, the model development process would be much more expensive.
The activity generation model supported supplier discovery and supply chain resilience. Another model estimated the cost of each activity flow and chose the least expensive production process.
Developing Models From Ontologies
Where did the activity generation model fit in the earlier image? The relationship between components and activities contains more information than the simple label suggests. Hidden behind that label is the workflow for translating the component’s CAD drawing into the activities necessary to manufacture it. The activity generation model transformed the CAD drawing into data points and, based on the materials, determined what machine(s) could deliver the component.
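As a hypothetical sketch of the model’s contract only: the names, types, and toy rule below are illustrative stand-ins, since the real system was a learned model, not a lookup.

```python
# Hypothetical contract for the activity generation model. The names,
# types, and toy rule are illustrative stand-ins; the real system was
# a learned model, not a lookup.
from dataclasses import dataclass

@dataclass
class ActivityFlow:
    activities: list[str]   # ordered manufacturing activities
    resources: list[str]    # machine required by each activity

def generate_activity_flows(brep_features: dict, material: str) -> list[ActivityFlow]:
    """BREP data points plus material in, candidate activity flows out."""
    # Toy rule: a hole feature in aluminum maps to drilling on a CNC mill
    if "hole" in brep_features and material == "aluminum":
        return [ActivityFlow(["drilling"], ["cnc_mill"])]
    return []
```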
Building this model requires data, and the ontology helps us decide what data is relevant to this problem space. The model must learn the relationship between a component and the activities required to manufacture that component. Based on the ontology, I know components and activities are named entities. Activities can be more granularly defined by the entities: skills, stations, field devices, resources, and people.
I need more granular definitions because the named entity ‘activity’ alone does not contain the causal knowledge about what activities could create the component. I need a more granular understanding of the activity to make novel connections between components and activities that are outside the current component and activity data set.
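As a sketch, here is a query that assembles component/activity pairs alongside the granular entities that carry the causal signal, reusing the hypothetical namespaces and graph from the earlier sketches.

```python
# Sketch: assemble component/activity pairs alongside the granular
# entities that carry the causal signal. Reuses the hypothetical
# namespaces and graph `g` from the earlier sketches.
training_query = """
PREFIX mfg: <http://example.org/mfg#>
SELECT ?component ?activity ?resource ?skill WHERE {
    ?component mfg:requiresActivity ?activity .
    ?activity  mfg:requiresResource ?resource .
    OPTIONAL { ?person mfg:operates ?resource ; mfg:hasSkill ?skill . }
}
"""
rows = [
    (str(r.component), str(r.activity), str(r.resource),
     str(r.skill) if r.skill else None)
    for r in g.query(training_query)
]
```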
The same applies to components, but the ontology doesn’t represent this knowledge. CAD drawings contain more granular features that hold the causal knowledge about the component. We need another ontology, and this paper has one.
The image describes the boundary representation (BREP) ontology that converts CAD drawings into data points. BREP lets us capture the component’s geometry and topology based on its CAD drawing. The geometric entities in red/pink are numerical values, and the topological entities in blue order them into a hierarchical representation. The data points in BREP format define the causal information about the part.
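A simplified sketch of a BREP-style hierarchy in Python; real BREP ontologies define many more topological and geometric entity types than shown here.

```python
# Simplified sketch of a BREP-style hierarchy. Topological entities
# (blue in the figure) order geometric values (red/pink); real BREP
# ontologies define many more entity types than shown here.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    point: tuple[float, float, float]  # geometric entity: a 3D point

@dataclass
class Edge:
    vertices: tuple[Vertex, Vertex]    # topological ordering
    curve_type: str                    # geometric entity, e.g. "line", "arc"

@dataclass
class Face:
    edges: list[Edge]
    surface_type: str                  # geometric entity, e.g. "plane", "cylinder"

@dataclass
class Solid:                           # top of the topological hierarchy
    faces: list[Face] = field(default_factory=list)
```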
With these ontologies, we can answer the questions:
What causes or defines a component?
What causes or defines an activity?
The hypothesis behind the supplier discovery model’s innovation is that there is some overlap between what causes the component and what causes the activity. The resource is related to the component’s geometry and topology. There are two ways we could map or model the relationship.
And Ontologies From Models
A resource can be defined in BREP by the topological and geometric ranges it can produce and the compounds it can work with. However, that data set would take up a ridiculous amount of space. Do we have enough data about activities and components to learn each resource’s capabilities? Probably not, and even if we did, there’s a smarter approach.
What causes the resource’s capabilities? Its specifications. Multiple ontologies can capture the domain knowledge about the resource’s capabilities. In most cases, the most efficient ontology for a domain represents the causal features and structure.
Using traditional data engineering, the data science team would probably define the resource backward. Although a resource can be defined by the range of components it can produce, it should be defined by the specifications that determine what it can produce. The alternate approach is easier to see from an ontology view than a traditional data engineering view.
The difference between the two is direct measurement (resource specifications) vs. indirect measurement (component geometry and topology ranges). Data scientists often measure a system indirectly, through observations (or data gathered from observations) of how the system they care about interacts with another system. In this case, we attempted to understand resources by measuring the interactions between resources, activities, components, and CAD drawings.
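A sketch of the direct-measurement approach: derive capability from the resource’s specifications rather than from every component it has ever produced. The specification fields and thresholds below are hypothetical.

```python
# Sketch of direct measurement: derive capability from the resource's
# specifications instead of inferring it from every component the
# resource has ever produced. Fields and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class ResourceSpec:
    work_envelope_mm: tuple[float, float, float]  # max x, y, z travel
    materials: set[str]
    min_feature_mm: float                         # smallest feature it can cut

def can_produce(spec: ResourceSpec,
                bbox_mm: tuple[float, float, float],
                material: str,
                smallest_feature_mm: float) -> bool:
    """Feasibility from specifications, not from historical component data."""
    fits = all(b <= e for b, e in zip(bbox_mm, spec.work_envelope_mm))
    return (fits
            and material in spec.materials
            and smallest_feature_mm >= spec.min_feature_mm)
```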
With the ontology view, the process changes. We have ontologies for activities, components, and CAD drawings but not for resources. Step 1 should be to define our ontologies completely. The ontology indicates what data is missing and is useful for developing our model.
If you want to explore a resource ontology, this paper provides a good one. For the purposes of this article, I’m wrapping up the use case here.
Ontology Engineering And Knowledge Organization Strategy
The example illustrates a knowledge organization strategy through an ontology engineering exercise. There are many ways to approach this problem, and no agreed-upon “right way” or “perfect strategy.” Ontology engineering begins with a domain. In my approach, I limited the scope of the domain based on a problem space. It’s equally efficient to limit the domain to a workflow.
After defining the domain, a controlled vocabulary must be chosen or developed. In my example, the controlled vocabularies are prebuilt by industry and academic organizations. They are relatively well-established, agreed-upon standards. If a vocabulary doesn’t exist, we can consult domain experts to document entities or classes and name them with the most agreed-upon labels.
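When a vocabulary has to be built from expert input, SKOS is the standard for capturing it. Here is a minimal sketch with rdflib, using hypothetical URIs and labels: one preferred label plus the alternates experts actually use.

```python
# Sketch: capturing an expert-sourced vocabulary in SKOS, with one
# preferred label and the alternates experts actually use. URIs and
# labels are hypothetical.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

VOC = Namespace("http://example.org/vocab#")
g = Graph()
g.bind("skos", SKOS)

g.add((VOC.Resource, RDF.type, SKOS.Concept))
g.add((VOC.Resource, SKOS.prefLabel, Literal("resource", lang="en")))
g.add((VOC.Resource, SKOS.altLabel, Literal("machine", lang="en")))
g.add((VOC.Resource, SKOS.altLabel, Literal("equipment", lang="en")))
```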
The controlled vocabulary forms the basis for the ontology engineering process. In the first image, the product ontology is defined in a limited way to answer a single question: How are products made? Obviously, there is a lot more domain knowledge required to describe a product completely. However, for this workflow, the rest is irrelevant.
Constraining the domain with a workflow or problem space makes this a feasible process. Defining a business unit’s complete knowledge domain is too big an undertaking. Connecting development to workflows or problem spaces focuses ontology engineering on the domain knowledge required to run the business. It breaks the effort down into smaller chunks developed to support a product.
This strategy limits the ontology to systems and processes that could generate data for the business to gather. Not all workflows currently generate data. Some should, and others never will. In my book, ‘From Data to Profit,’ I call workflows that don’t generate data ‘opaque,’ and most of the business fits this description.
Leadership will insist they have high visibility into the business, but drill down and the reality is much different. Most domain knowledge required to operate the business is opaque to everyone except a handful of people responsible for those workflows. In the past, this was business as usual. Data science changes that, and the business’s highest-value workflows should be transparent.
Process mapping or discovery is a digital transformation activity that data engineers can benefit from. Process mapping should include entity or class identification and definition phases. These frameworks make moving forward with ontology engineering easier when an initiative eventually targets the workflow. It’s not a requirement, but efficiencies and cost savings can be realized if the business is already working on process mapping.
The Tools Landscape For Knowledge Graphs And Ontologies
We need tools to develop, store, query, and test ontologies. Each link here goes to the most detailed information available for the tool. Most tutorials dive deeper into how the tools leverage or manage ontologies. Some are direct links to the product page.
Graph databases are the most common way to store ontologies. This list isn’t an endorsement of any one solution. These are the most commonly used based on job descriptions and requirements.
Cosmos DB (Azure)
Triplestores. Example: Databricks + Stardog.
There are higher-level tools that handle more of the ontology engineering lifecycle. If you get the sense that these products are just coming into their own, you’re right. Few enterprise solutions comprehensively support ontology development. This tools landscape is in the early days, even though many tools have been around for a very long time. These are all links to product pages.
Protégé. This is an open-source ontology editor maintained by Stanford. It’s ubiquitous.
Just what you needed: another query language. SPARQL is a standard tool for querying graphs and, specifically, ontologies. The following are the structural standards for representing ontologies; a short validation sketch follows the list.
RDF and RDFS
SKOS (Controlled Vocabularies)
SHACL (Conditional Graph Validation)
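Here is the validation sketch: a hypothetical SHACL shape requiring every activity to name at least one resource, checked with the pySHACL library against a deliberately incomplete data graph.

```python
# The validation sketch: a hypothetical SHACL shape requiring every
# activity to name at least one resource, checked with pySHACL.
from rdflib import Graph
from pyshacl import validate

shapes = Graph().parse(data="""
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix mfg: <http://example.org/mfg#> .
    mfg:ActivityShape a sh:NodeShape ;
        sh:targetClass mfg:Activity ;
        sh:property [ sh:path mfg:requiresResource ; sh:minCount 1 ] .
""", format="turtle")

data = Graph().parse(data="""
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix mfg: <http://example.org/mfg#> .
    mfg:deburring rdf:type mfg:Activity .
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)  # False: the activity names no resource
print(report)
```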
When Ontologies Change
In the supply chain example, the ontologies are largely static. New products, modules, and components will create new CAD drawings and data, but BREP will support them without changes. New resources will be built, and data sets will be updated with their specifications, but the top-level ontology will support them without changes. The model that translates resource specifications to CAD drawings in BREP will also be stable.
When ontologies contain causal structures, models built with them are typically as stable as the ontology. Not every ontology is linked to a real-world system that’s as stable as this one. Social media networks and marketplaces are much more dynamic systems with more complex ontologies.
I chose manufacturing because it is well-defined enough to support complex supply chains that span multiple businesses and countries. The shared ontology enables data to flow between diverse points in the supply chain. Its order is built from necessity, and businesses are experiencing the same need for structure across multiple domains. It turns out that facilitating data flows across business units is very similar to a supply chain.
When large numbers of people interact, it’s rarely as well-ordered. I was asked at an HRTech conference how the industry could ensure that bias didn’t enter the hiring process through the use of AI. My response was, “Give me a definition of bias and get 5 people around you to agree with you. I’ll implement it.”
Dynamic ontologies and conflicting ontologies defy causal structure. In the 70s, gender norms were enforced in hiring, but perceptions were changing. Men slowly entered nursing, and more women entered corporate leadership. Even today, if I say, “I had a bad experience with my nurse during a blood test,” my parents will respond, “What did she do?” If I say, “I had a bad experience with my doctor at my physical,” they would assume the doctor was male.
There’s no malice. For that generation, the ontology with rigid gender roles is still the default. My parents don’t intend to offend anyone and have no problem with female doctors or male nurses.
In sales, former athletes are preferred by many companies because they seem to do better in the role. I don’t see a problem with that bias, but look deeper, and there are more male college athletes than female college athletes. Data science prefers graduates from computer science and hard sciences degree programs. It sounds rational, but here again, there are fewer female graduates than males.
Bias surrounds us, and some biases are the foundations of high-performing heuristics. When is a bias negative, and when does a bias become domain knowledge? Epistemology has been attempting to answer questions about the nature of reality and knowledge for thousands of years. The digital, information, and intelligent ages make that even more complex.
What’s Next?
In the next post, I will explain more complex ontology engineering processes to account for the elephant I ignored. I introduced causal structures and will continue connecting ontologies to causal graphs. It’s a more significant challenge when viewed through the lens of dynamic ontologies and multiple, conflicting ontologies.
These cases are where LLMs can help the most. Dynamism is represented in large bodies of text, images, audio, and video. People cannot process all that data alone. We need a translation and automation layer to help detect potential changes as they happen and respond in near real-time.