
Knowledge Graphs in Engineering: A New Perspective

Posted August 10, 2022 | Technology | Amplify
Amplify, Vol. 35, No. 7
ABSTRACT

Michael Eiden, Philippe Monnot, and Armand Rotaru illustrate several prominent, real-world KG applications, then detail how they designed a KG to ensure vertical traceability within a systems engineering context. They developed an ML model that consumed features derived from the KG and mimicked the way an independent safety assessment auditor works in practice. Evaluating the model with precision and recall uncovered software requirement specifications that had previously been labeled incorrectly. They also found that combining graph-based features with text-based ones boosts classification accuracy significantly, showing real promise for augmenting human safety assessors in the future. They end the article with specific advice on using KGs, including unlocking new insights, extracting more from the data you have, and starting small with the intention of scaling quickly.

 

In recent years, the closely related terms “artificial intelligence” (AI) and “machine learning” (ML) have become staples of corporate jargon. As management consultants, we have noticed that many customers have an incorrect or incomplete understanding of these buzzwords, including when and how to apply the concepts and how to recognize their inherent limitations. Such confusion is an expected consequence of the accelerating adoption and integration of data-driven approaches into business processes.

As discussed in our Amplify article last year, several key factors drive the success or failure of an ML project.1 Having access to quality data in sufficient quantity is critical, but this aspect is commonly overlooked or underestimated by the decision makers leading these endeavors. Such misfocus stems from the significant hype around ML algorithms. The press (and the technical literature) invariably present notable achievements of new, sophisticated algorithms without describing the data that powers them and allows them to reach unprecedented levels of performance. Consequently, for most companies wanting to venture into the world of AI, this misplaced focus means that building a team of talented individuals to develop fancy new algorithms for unique business problems will be costly and unlikely to render a positive ROI.

A more effective and productive option is to focus on formulating the problem correctly, building the appropriate infrastructure that allows you to gather informative and unbiased data, and using state-of-the-art algorithms.

Historically, the default option for storing and retrieving data has been relational databases, which represent data in a tabular format. Recently, however, many companies have begun migrating to knowledge graphs (KGs), an alternative solution for representing and querying data.

The good news in this shift? Building a graph is as easy as connecting dots with lines.

Graphs in a Nutshell

The geometric nature of graphs makes them intuitively accessible. In their simplest form, graphs are described by graph theory as “networks of dots and lines”2 — meaning they can be intuitively represented with drawings. Most of us have drawn a graph at least once in our lives. Who among us has never written ideas on a whiteboard or piece of paper and then connected them together? (If you haven’t, you’ve probably at least watched TV detectives do that to catch a criminal.)

Leaving behind the formalities and strict language of graph theory, a graph is composed of nodes (dots) and relationships (lines) that connect them. Practically speaking, nodes represent entities: things or concepts that can be described by a set of properties.

Let’s jump right in with a simple example of a labeled graph, the most common type. Figure 1 shows various entities in the context of an organizational diagram. Employees, managers, and the company are shown as circles (nodes).

Figure 1. Information about company employees and departments, represented in KG format (source: Arthur D. Little)

Each node contains one or more properties. For example, the bottom-right node has an Employee label, with Paul and Male as properties that describe it. Natalie is also an Employee, as well as a Manager and a Female. Note that nodes can have multiple labels. Relationships between nodes are named and directed, meaning that they have a start node and an end node. Eva, an Employee, has Natalie as her Manager. Therefore, a REPORTS_TO relationship links them to each other. Paul and Eva work together as a binome on projects. Thus, two IS_PAIRED_WITH relationships connect them (indicated by a double arrow). Finally, relationships can also have properties.
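To make the structure concrete, here is a minimal sketch of Figure 1’s graph in plain Python. The `Node`/`Relationship` classes and the node identifiers are our own illustrative constructs, not a graph database API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set   # e.g., {"Employee", "Manager"}
    props: dict   # e.g., {"name": "Natalie", "gender": "Female"}

@dataclass
class Relationship:
    start: str      # id of the start node
    end: str        # id of the end node
    rel_type: str   # e.g., "REPORTS_TO"
    props: dict = field(default_factory=dict)  # relationships can have properties too

# Nodes from Figure 1; note that Natalie carries multiple labels
nodes = {
    "paul":    Node({"Employee"}, {"name": "Paul", "gender": "Male"}),
    "eva":     Node({"Employee"}, {"name": "Eva"}),
    "natalie": Node({"Employee", "Manager"}, {"name": "Natalie", "gender": "Female"}),
}

# Named, directed relationships; IS_PAIRED_WITH appears once per direction
rels = [
    Relationship("eva", "natalie", "REPORTS_TO"),
    Relationship("paul", "eva", "IS_PAIRED_WITH"),
    Relationship("eva", "paul", "IS_PAIRED_WITH"),
]

# A simple traversal: who does Eva report to?
managers = [nodes[r.end].props["name"] for r in rels
            if r.start == "eva" and r.rel_type == "REPORTS_TO"]
print(managers)  # ['Natalie']
```

Graph databases such as Neo4j expose exactly these primitives (labeled nodes, typed directed relationships, properties on both), but with indexing and a query language on top.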

Figure 2 shows how the organization graph might be structured if defined in a relational database. One can appreciate how information is duplicated and more difficult to visualize when compared to a graph representation.

Figure 2. Organizational diagram (equivalent to Figure 1) represented through a relational database containing three tables — top-left table contains the name of entities and their properties; top-right table contains the role of each entity; bottom-right table contains how employees are related to each other (source: Arthur D. Little)

Graphs in Daily Life

Given their generality, graphs (or networks) have a variety of applications, ranging from modeling the progression of Alzheimer’s disease to finding the optimal route between two cities. Table 1 contains a short selection of some of the most well-known applications.3

Table 1. Prominent, real-world applications of graphs (source: Arthur D. Little)

Recommendation systems, an example in Table 1, are a frequently encountered application of graphs. Such systems track user behavior (e.g., products the user bought or films the user watched) to predict new content the user might be interested in. From a graph perspective, this means inferring relationships of the type “X might be interested in Y,” where X is a user (e.g., Jane Doe), and Y is a piece of content (e.g., Star Wars). To do so, the recommendation system looks at similarities between users and between pieces of content, measured in terms of shared neighbors (i.e., nodes connected to a given node) and relationship types. For instance, if users Alice and Bob have similar behaviors (e.g., they both watched The Matrix, Dark City, and The Crow), and Alice watched Equilibrium (but Bob did not), then the engine should infer that Bob might want to watch Equilibrium. Following the same logic, if Equilibrium is in the same genre/has similar aesthetics to The Matrix/Dark City/The Crow, and Alice watched The Matrix/Dark City/The Crow (but not Equilibrium), the system should infer that Alice might want to watch Equilibrium.
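The shared-neighbor logic above can be sketched in a few lines of Python. The watch histories and the similarity threshold below are hypothetical, chosen to mirror the Alice/Bob example:

```python
# Hypothetical watch histories, matching the article's example
watched = {
    "Alice": {"The Matrix", "Dark City", "The Crow", "Equilibrium"},
    "Bob":   {"The Matrix", "Dark City", "The Crow"},
}

def jaccard(a: set, b: set) -> float:
    """Similarity as shared items over the union of items."""
    return len(a & b) / len(a | b)

def recommend(user: str, threshold: float = 0.5) -> set:
    """Suggest items watched by sufficiently similar users but not yet by `user`."""
    suggestions = set()
    for other, items in watched.items():
        if other != user and jaccard(watched[user], items) >= threshold:
            suggestions |= items - watched[user]
    return suggestions

print(recommend("Bob"))  # {'Equilibrium'}
```

Production recommenders replace this user-to-user similarity with richer graph signals (content similarity, relationship types, embeddings), but the inference pattern — proposing a new “X might be interested in Y” edge from existing edges — is the same.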

Assessing Safety Through Graphs

As graphs gain in popularity, novel applications in traditional business settings are more likely to come up. Thus, it is crucial to make graphs accessible and part of the standard toolset for AI practitioners. In the remainder of this article, we show how we’ve used graphs to power a recommendation system that supports independent safety assessments (ISAs) of safety-critical systems.

In highly regulated industries, safety is always at the top of the agenda. Industries like aerospace, railway, and nuclear have had dark track records when it comes to in-service failures, and these failures often lead to significant injuries and/or loss of life. The complexity and cost of the systems, combined with their notoriously long development cycles, make them error-prone, so even small errors can have big consequences. To help mitigate these risks, NASA developed a methodology called “systems engineering,” which has been widely adopted:

Systems engineering … focuses on defining customer needs and required functionality early in the development cycle, documenting requirements, and then proceeding with design synthesis and system validation …. Systems engineering considers both the business and the technical needs of all customers with the goal of providing a quality product that meets the user needs.4

When applied early on and at the right level, systems engineering can significantly limit cost overruns by reducing the odds of making ill-formed decisions throughout development. At each stage of this iterative methodology, commonly represented as a V-shaped lifecycle (see Figure 3), key artifacts are systematically generated to ensure traceability, design rationale, documentation, and verification.

Figure 3. A systems engineering V-model that represents a systems development lifecycle — on the left side and starting at the top, customer requirements are captured and the design is defined with more and more granularity as we progress down the V; on the right side, going up the V, the system is tested at a component, subsystem, and system level to ensure the as-built system is compliant with the as-designed system while meeting the initial customer requirements (source: Arthur D. Little)

Regulatory bodies make ISAs mandatory, with the intention of inspecting and reviewing internal processes (e.g., variations of the systems engineering methodology) and the outputs of those processes (e.g., system and software specifications, safety analyses, verification activities, and testing evidence).

In official terms, an ISA is defined as “the formation of a judgment, separate and independent from any system design, and development, that the safety requirements for the system are appropriate … and that the system satisfies those safety requirements.”5 The ISA therefore targets safety-critical systems (software and/or hardware) by auditing the documentation (i.e., artifacts), with the aim of assessing safety, robustness, and completeness.

Limitations

The inherent complexity of the systems targeted by ISAs means that development and safety demonstration usually rely on a large amount of documentation. The entirety of the documentation supplied can rarely be reviewed in the context of an ISA audit. Therefore, auditors — usually domain experts — manually perform their assessment by randomly sampling the artifacts to gain sufficient confidence in their quality.

Depending on the initial outcome, the auditor might continue to sample the artifacts or follow his or her experience/intuition and target some specific ones. The inherent nature of ISAs, and the context in which an ISA is performed, mean that total confidence cannot be realistically expected as an outcome. A residual safety risk always remains present. In Arthur D. Little’s (ADL’s) Digital Problem Solving (DPS) practice, we have used KGs and AI to reduce this residual risk and demonstrate how the technology successfully augments traditional ISA approaches.

Use Case: Vertical Traceability Analysis

DPS partnered with ADL’s Risk practice to run a proof-of-concept in parallel to a live ISA audit, aimed at a railway signaling system undergoing a major overhaul. The use case was limited to a single aspect of the auditing process: ensuring the vertical traceability between software requirement specifications (SRSs) and software component specifications (SCSs).

Vertical traceability analysis examines the various levels of specification of a system. Specifications, depending on the level at which they sit, can be vague and general (e.g., a customer requirement) or specific and detailed (e.g., a specific behavior that a component must follow). Figure 4 provides a simple specification tree for a generic software system.

Figure 4. Generic specification tree for a software system (source: Arthur D. Little)

The vertical traceability between two layers of specifications is ensured if all three key criteria are met: correctness, completeness, and acceptable refinement (see Table 2). Note that these criteria must be validated both ways — down and up the specification tree.

Table 2. Vertical traceability criteria (source: Arthur D. Little)

The objective was to accurately predict whether or not the vertical traceability analysis of a given SRS would be flagged as a PASS or FAIL by a human ISA auditor. A raised FAIL would mean the ISA auditor believes there’s a potential safety issue with a given SRS. The artifacts specific to the live case were provided by the team that had recently completed the ISA. They also provided the outcome of their vertical traceability analysis: whether each SRS was a PASS or FAIL. Out of the 199 SRSs provided, the ISA team flagged 46 as FAIL and 153 as PASS.

Methodology

Predicting a binary outcome (PASS or FAIL) for each specification was the key objective of this use case. Framing the problem in this manner made it a great candidate for supervised ML. In ML jargon, this would be referred to as a “classification-type” problem. Readers with some exposure to ML will recognize the approach shown in Figure 5, used to develop the model for the task at hand, going from raw data to ISA-specific insights. It shouldn’t come as a surprise that Step 2, graphical representation, was added to the typical data engineering and modeling pipeline. Let’s now dive into each step of the methodology.

Figure 5. Methodology followed during the use case: from the original artifacts to predicting the outcome of a vertical traceability analysis (source: Arthur D. Little)

The sections below present an overview of each step while expanding on steps where the graph plays a differentiating part.

Artifact Ingestion & Extraction

As the reader might expect, the documentation received was not stored in a well-structured, queryable database. Instead, it comprised a blend of PDFs, Word documents, spreadsheets, embedded images, and embedded formulas, totaling 20,000 individual files. Although this is not uncommon, additional effort was required before any modeling could begin: data had to be hosted, staged, and processed. With the help of natural language processing (NLP) and domain expertise, the processing was done programmatically, extracting only data relevant to the use case. This included all SRSs and SCSs in addition to related specifications, their descriptions, their context, and how they related to each other.
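As an illustration of this kind of programmatic extraction, the sketch below pulls specification identifiers and their links out of raw text with regular expressions. The `SRS-nnnn`/`SCS-nnnn` identifier formats and the sample excerpt are invented for the example; real artifacts would first need text extraction from PDF/Word formats:

```python
import re

# Hypothetical excerpt from an extracted artifact
text = """
SRS-0042: The system shall log every command issued by the operator.
This requirement is refined by SCS-0117 and SCS-0118.
"""

# Capture specification identifiers by pattern
srs_ids = re.findall(r"SRS-\d{4}", text)
scs_ids = re.findall(r"SCS-\d{4}", text)

# A naive "refined by" extraction: pair every SCS mentioned with the SRS paragraph
links = [(s, c) for s in srs_ids for c in scs_ids]
print(links)  # [('SRS-0042', 'SCS-0117'), ('SRS-0042', 'SCS-0118')]
```

These identifier-link pairs become the relationships loaded into the graph in the next step.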

Graphical Representation

The next step is to define an ontology specific to the use case. Ontologies are data models that define what type of entities exist in the domain of interest, the set of properties that describe them, and the relationships that link them.6

Creating an ontology is usually time-consuming and requires in-depth domain expertise. In this case, the ontology was implicitly defined and documented through the supplier’s artifacts and development process, which closely follows the systems engineering standard approach/terminology. Figure 6 shows a snapshot of the ontology employed in the use case.

Figure 6. Partial ontology used for the use case, where specifications are entities, and relationships are based on where along the specification tree they sit; CRS, SSS, and SCTS are other entities linked to the specifications, namely SRS and SCS; they provide information on the wider context around which both entities sit (source: Arthur D. Little)

Specifications such as SRS and SCS represent entities. The properties attached to them include their unique identifier (UID), description, and label. The label refers to the vertical traceability analysis outcome provided by the ISA team (PASS or FAIL). Other entities are also defined to provide the wider context in which the specifications sit. Using the ontology as a template, the data previously ingested and extracted was used to construct a graph representing the use case domain. Figure 7 shows a portion of that graph, with only nodes and relationships related to SRSs (blue circles) and SCSs (orange circles) displayed.

Figure 7. Sub-portion of the complete graph built to model the use case; only SRSs (blue nodes) and SCSs (orange nodes) are represented (source: Arthur D. Little)

By representing the problem as a graph, patterns and clusters quickly appear. For example, there seems to be an agglomeration of interconnected SRSs and SCSs in the middle of Figure 7. However, a high number of satellite groups disconnected from the central aggregate are also present around the edge of the graph. Adding more entities and relationships to the graph helped reveal insights into the interdependencies between entities, essentially revealing the inner workings of the audited system.

Feature Engineering & Model Development

Feature engineering is one of the most critical steps of the process because it’s responsible for providing the model with informative features that help it accurately and precisely perform the task. The traditional way to approach the problem would be to try to generate semantic features by assessing whether Table 2 criteria are respected, as an ISA auditor would do. These features are referred to as text-based features in Table 3.

Table 3. Main features derived from the data using NLP and graph algorithms (source: Arthur D. Little)

Representing the problem as a graph lets us instead extract features that describe the intrinsic architecture and interdependency of the data. As shown in Table 3, such features were generated using standard graph algorithms, namely community and centrality measures.7 A well-known centrality algorithm is PageRank, named after Larry Page, cofounder of Google. It is the algorithm that originally allowed the Google search engine to rank the Web pages returned to the user.8
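For the curious reader, PageRank itself reduces to a short power iteration. The following is a self-contained sketch over a toy adjacency dictionary (not the implementation we ran on the audit graph):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over a directed adjacency dict {node: [targets]}."""
    nodes = set(adj) | {n for out in adj.values() for n in out}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Teleportation term, shared by all nodes
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        # Each node distributes its rank evenly across its out-links
        for src, outs in adj.items():
            if outs:
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
        # Nodes with no out-links redistribute their rank uniformly
        dangling = sum(rank[n] for n in nodes if not adj.get(n))
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return rank

# Toy link structure: two pages point at a "hub", which links back to one of them
adj = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
ranks = pagerank(adj)
assert ranks["hub"] == max(ranks.values())  # the most-linked-to node ranks highest
```

In the use case, scores like these become per-node numeric features, fed to the classifier alongside the text-based ones.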

The model development process, which also incorporates feature engineering, follows the iterative process shown in Figure 8. Most of the time, features are removed, tweaked, or created based on the performance achieved and desired. The outcome is a trained, tested ML model that classifies the vertical traceability analysis outcome of SRSs, as an ISA auditor would (PASS or FAIL).

Figure 8. Iterative model development methodology followed to train and test an ML model that classifies the vertical traceability outcome of SRSs, as an ISA auditor would (PASS or FAIL) (source: Arthur D. Little)

Results

The model’s performance was evaluated using standard metrics: precision and recall. Precision is the proportion of SRSs flagged as FAIL by the model that truly FAIL; recall is the proportion of truly FAILed SRSs that the model flags. Combining the graph-based features with the text-based features (i.e., features solely inspired by the vertical traceability criteria from Table 2) gave the best-performing model, boosting both precision and recall by approximately 10% compared to using only text-based predictors.
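For reference, both metrics are straightforward to compute from labeled predictions. The six SRSs below are hypothetical (the real evaluation set had 199, of which 46 were FAIL):

```python
def precision_recall(y_true, y_pred, positive="FAIL"):
    """Precision: of the SRSs the model flags FAIL, how many truly FAIL.
    Recall: of the truly FAILed SRSs, how many the model flags."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ISA labels vs. model predictions for six SRSs
y_true = ["FAIL", "FAIL", "PASS", "PASS", "PASS", "FAIL"]
y_pred = ["FAIL", "PASS", "PASS", "FAIL", "PASS", "FAIL"]
p, r = precision_recall(y_true, y_pred)
print(p, r)  # ~0.67 precision, ~0.67 recall
```

The asymmetry between the two matters here: in a safety context, a missed FAIL (low recall) is typically far more costly than a spurious flag (low precision).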

Of 14 additional SRSs examined, six were found to have been incorrectly flagged by the ISA team: four wrongly flagged as FAIL and two wrongly flagged as PASS. Such findings demonstrate the model’s ability to uncover safety failures not immediately obvious to the ISA team. These results show how powerful graphs can be when used to represent highly interconnected data, and how informative the features extracted from them are. Additional methods, such as graph embeddings, can also be used to derive features from the graph’s architecture.9

During a live ISA, both the graph and the model can be directly employed by the auditor to help him or her effectively perform the audit. First, the ML model would provide an ISA auditor with a prioritized list of potential safety issues. The highest-ranked issues would have the highest probability (as assessed by the model) of being actual FAILS and should be quickly investigated by the auditor. The model is not replacing the auditor; it provides a non-random, principled way to sample artifacts for analysis, making the best use of the auditor’s time on potentially safety-critical issues.

The auditor could also interact directly with the graph through a user-friendly interface to explore the audited artifact and get additional insights. The visual, accessible nature of graphs makes them great mediums for exploration. One could also imagine the ML model being updated on the fly as the auditor progresses his or her audits, or by leveraging patterns uncovered by the auditor.

Finally, putting in place a good ontology meant that expanding the use case was easy: the graph could readily be extended to accommodate new nodes/relationships. It also meant the data feeding the ML model developed for the use case would not be affected by the graph scaling. Indeed, sub-portions of the graph can easily be isolated through simple queries, keeping them virtually insulated from the complete graph.

What to Expect From Graphs

The world’s giant tech companies jumped on the “graph train” a while ago and now power some of the best-known tools, platforms, and services through graphs: the World Wide Web, social media, Web stores, and search engines. However, this does not mean companies should immediately start replacing all relational databases with graphs. New AI technologies tend to be initially seen as miracle solutions that will solve most problems (e.g., deep learning).

When deciding whether to use only graphs, only relational databases, or a combination, make sure to ask some key questions. For example, how important is rapid data exploration? How crucial is the speed at which data can be added to, and/or retrieved from, the data store? If graphs still come out as highly viable candidates, here are a few advantages you can expect from using them in your next data-driven project.

Visualize Your Data to Unlock New Insights

Because graphs are so easy to visualize, it takes little effort to find all the information associated with a node and the direct/indirect relations that link two nodes. This property of KGs both simplifies data exploration and provides richer insights into the data. The multiple dimensions of the KG can easily be explored by slicing it across one or more of them. In the use case, visualizing the SRS-SCS architecture (see Figure 7) led to a key hypothesis of the problem: the way specifications are linked and clustered together is closely related to the vertical traceability analysis outcome. An SRS connected to a failed SRS through nearby elements is more likely to be flagged as a FAIL.

Extract More From Your Data

The inherent interdependencies hidden in your data can be brought to light and leveraged by running algorithms on your graph. As seen in the use case, they can be used to compute several metrics on the whole graph or a sub-portion that can help make sense of your connected data and their inner workings. These metrics can then be used as features to power an ML model.

Start Small, Scale Fast

You might initially decide to build a graph that models a small portion of your domain space. That’s fine. Nothing prevents you from later expanding it to answer new questions or because more data becomes available.

Within graphs, it is easy to add a new type of node property or relationship. That is, the new property/relationship can be applied to a (potentially small) subset of nodes. If you have many node properties and/or relationships that apply only locally, KGs will be both much smaller and faster to process than their corresponding relational databases. Multiple graphs can also be combined if they share or have related entities, limiting high rearchitecting costs and enabling you to quickly grow your solution.

Our parting thought: if the world’s big data is a mountain of dots, knowledge graphs will help you connect them all.

References

Papadopoulos, Michael, and Philippe Monnot. “Why Do Machine Learning Analytics Projects Fail?” Cutter Business Technology Journal (renamed Amplify), Vol. 34, No. 6, 2021.

Trudeau, Richard J. Introduction to Graph Theory. Dover Publications, 1993.

Barabási, Albert-László. Network Science. Cambridge University Press, 2016.

Walden, David D., et al. INCOSE Systems Engineering Handbook: A Guide for System Life Cycle Processes and Activities. 4th edition. Wiley, 2015.

“What Is Independent Safety Assessment (ISA)?” ISA Working Group, 2011.

Robinson, Ian, Jim Webber, and Emil Eifrem. Graph Databases: New Opportunities for Connected Data. O’Reilly Media, 2015.

Needham, Mark, and Amy E. Hodler. Graph Algorithms: Practical Examples in Apache Spark & Neo4j. O’Reilly Media, 2019.

Needham and Hodler (see 7).

Needham and Hodler (see 7).

About The Authors
Michael Eiden
Michael Eiden is a Cutter Expert, Partner and Global Head of AI & ML at Arthur D. Little (ADL), and a member of ADL’s AMP open consulting network. Dr. Eiden is an expert in machine learning (ML) and artificial intelligence (AI) with more than 15 years’ experience across different industrial sectors. He has designed, implemented, and productionized ML/AI solutions for applications in medical diagnostics, pharma, biodefense, and consumer…
Philippe Monnot
Philippe Monnot is a Cutter Expert, a Data Scientist with Arthur D. Little’s (ADL’s) UK Digital Problem Solving practice, and a member of ADL’s AMP open consulting network. He’s passionate about solving complex challenges that impact people’s livelihood through the use of data, statistics, and machine learning (ML). Mr. Monnot enjoys developing accessible solutions that customers will adopt through effective data storytelling and explainable…
Armand Rotaru
Armand Rotaru is an AI/ML data scientist with Arthur D. Little’s (ADL’s) Digital Problem Solving (DPS) practice and has been involved in a variety of projects that have a natural language processing (NLP) component, predominantly in the petrochemical, transportation, and biomedical sectors. He is also responsible for maintaining/expanding the NLP section of ADL’s DPS Training Portal and mentoring junior team members. Mr. Rotaru has a master of…