Knowledge Graphs – Structure Versus Meaning
We are at an inflection point in the process of designing and building computerized systems. Let’s be honest – even with the explosion of new data technologies we are still building the same unit record processing systems from the 1960s, although with a bit more sophistication than an 80-column punch card. You could take most relational database designs, dump each table to a tape, hang them all on tape drives and get about the same results. This seems harsh (and maybe a bit of an exaggeration) but our systems are as dumb as a box of rocks in terms of the meaning of the data and focus mostly on reading and writing records and presenting data to the user on a screen.
When we started using relational databases, we adopted the Entity-Relationship Modeling approach to design. We were going to design databases based on what the data means with tables representing real world objects rather than cramming as many fields into one record as possible. Chris Date talked about “what, not how” – meaning that SQL should specify what data you want, not how programmatically the database should retrieve it. These were exciting times.
Entity Relationship Modeling
Database design adopted Entity-Relationship Modeling methodologies and used a normalization design process that stressed reducing data redundancy while increasing the flexibility and ease of data access. If the data is normalized we can produce any output the user desires! While this is true and works well, it has focused the design discussions on table “structures” with data meaning pushed to the side.
Data models are traditionally divided into 3 “layers” – conceptual, logical, and physical:
- Conceptual: Conceptual models in theory were meant to describe the entities in business terms (i.e. what the data means) but in practice the conceptual model was defined more as a limited scope logical model with such directives as “you don’t need to resolve the many to many relationships” or “don’t include technical attributes”. Rarely has an organization produced and sustained a useful conceptual model of an application database. I won’t even mention “enterprise conceptual” models.
- Logical: The logical model is meant to expand on the conceptual model, filling in needed keys, constraints and attributes while maintaining third normal form and remaining database neutral. This is a useful exercise but typically is used only as a precursor to what everyone really wants – the physical design. Organizations are moderately successful in maintaining logical models.
- Physical: The physical model is the database-dependent implementation of the logical model with some denormalizations introduced as needed. Relational databases have improved so much in terms of performance that many database designs are straight implementations of the logical model.
While I believe that today’s application system database designs are a great improvement over flat file systems (I was just kidding about that tape drive remark), I am still disappointed that this technology hasn’t delivered on the greater promise of our databases being a model of reality. Object-oriented programming made the exact same claim (design objects to represent real world things) but also failed to deliver. Complex hierarchies of classes are really about program structure (e.g. Model-View-Controller) with data classes relegated to a persistence framework. End users can’t make sense of a UML class diagram any more than they can an ERD.
The reality is that ERDs and class diagrams are far more about structure than meaning.
Emerging Semantic Approach
Back in 2006 I delivered a presentation at the DAMA Symposium entitled “Bridging The Gap – Adding Semantic Awareness To Today’s Application Systems”. You can find the presentation here: https://singerlinks.com/presentations/
In this presentation, I noted several technologies that seemed to be converging towards a more semantically aware computing stack:
- Enterprise Content Management (ECM) servers with emphasis on taxonomy/thesaurus to manage the vocabularies used to search the content
- Semantic Web servers built around the W3C standards based on logic and ontology
What surprised me at the time was how little each of these worlds was aware of the other within the organization. Furthermore, traditional N-Tier application and database developers were completely unaware of either ECM or Semantics, and I’m not sure any of this has changed in the intervening 11 years.
What we are seeing now is a revolution in AI where some very difficult problems are gaining traction with real world implementable solutions (voice and image recognition, natural language processing, machine learning).
When Google introduced Knowledge Graphs they coined the phrase “Things not Strings”. This simple three-word phrase captures the essence of both the problem and the solution.
- Problem: (strings) We store data as bits conforming to some datatype organized into a structure.
- Solution: (things) We need to store and process data like humans do – as a network of inter-related concepts.
This signifies a shift in computing away from managing structures of data towards managing data based on its meaning.
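As a rough illustration of the contrast above, the following Python sketch stores the same fact twice: once as a record of strings, and once as identified things linked by a named relationship. The `Thing` class, the `CAPITAL_OF` relationship name and the sample data are all hypothetical, chosen only for this example; they are not taken from Google's Knowledge Graph.

```python
from dataclasses import dataclass, field

# "Strings": a record is just typed values arranged in a structure.
record = {"name": "Paris", "country": "France"}


@dataclass
class Thing:
    """A hypothetical 'thing': an identified concept with a type and links."""
    id: str
    kind: str
    label: str
    # relationship name -> list of related Thing objects
    links: dict = field(default_factory=dict)

    def relate(self, relationship, other):
        """Record that this thing is connected to another thing."""
        self.links.setdefault(relationship, []).append(other)


france = Thing("place/france", "Country", "France")
paris_city = Thing("place/paris", "City", "Paris")
paris_person = Thing("person/paris-of-troy", "Person", "Paris")

paris_city.relate("CAPITAL_OF", france)

# As strings the two are indistinguishable; as things they are not.
same_string = paris_city.label == paris_person.label
same_thing = paris_city.id == paris_person.id
capital_of = [t.label for t in paris_city.links["CAPITAL_OF"]]
```

Here `same_string` is true while `same_thing` is false: the label "Paris" alone cannot tell the city from the person, but the identified things and their relationships can.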
Meaning not Structure
I would like to expand this thought (things not strings) to “Meaning not Structure” and I believe this transition will be the driving force behind IT for decades to come. Data modeling will no longer be about how to organize strings into normalized structures, but how to define things in terms of their classifications, categorizations, descriptive properties and most importantly their relationships to other things. I highly recommend Thomas Frisendal’s book “Graph Data Modeling for NoSQL and SQL” which goes into detail on the history of data modeling and the transition to meaning versus structure modeling.
All this raises the question of what exactly “meaning” is and how software can “know what something means”. The answer to these questions is coming from the relatively new field of Cognitive Science.
Cognitive science is the interdisciplinary, scientific study of the mind and its processes. It examines the nature, the tasks, and the functions of cognition. Cognitive scientists study intelligence and behavior, with a focus on how nervous systems represent, process, and transform information.
Cognitive science seeks to determine how humans understand the world around them and communicate this understanding with others. This is a “renaissance” that is combining scientific research from a number of fields:
- a behavioral understanding of how we think and communicate
- a biological understanding of how our brain and body function
- a technical understanding of how language encodes knowledge
Our goal, and the IT transition that is just now beginning, is to organize data in the computer the way humans organize data in their minds, and to create software that mimics a person’s ability to reason about, communicate about, and act upon that understanding. After all, millions of years of evolution can’t be wrong. Why shouldn’t we build systems that mimic the way we work?
Knowledge Graphs represent a starting point for practitioners interested in this new approach. The knowledge graph uses a simple linguistic model (subject – verb – object) to represent propositions about the world. We can easily model the world in terms of the “things” we are interested in and how they inter-relate. Using a basic ETL approach, we can load data from source systems into a graph database like Neo4j and build a simple linguistic model of the facts that interest us. These graphs can act as a “map” to the underlying detail used to create them, providing faceted search and guided queries. Stay tuned to this blog as we get into some “how-to” specifics on building Knowledge Graphs.
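To make the subject – verb – object idea a little more concrete, here is a minimal Python sketch of the ETL step and a simple guided query. It uses an in-memory set of triples as a stand-in for a graph database and does not call Neo4j or any of its APIs. The source rows, the verb names (`WORKS_IN`, `LOCATED_IN`) and the helper functions are assumptions invented for illustration.

```python
from collections import Counter

# Hypothetical rows as they might be extracted from a source system.
source_rows = [
    {"employee": "Alice", "department": "Research", "city": "Boston"},
    {"employee": "Bob", "department": "Research", "city": "Denver"},
    {"employee": "Carol", "department": "Sales", "city": "Boston"},
]


def rows_to_triples(rows):
    """Transform step: turn each row into (subject, verb, object) facts."""
    triples = set()
    for row in rows:
        triples.add((row["employee"], "WORKS_IN", row["department"]))
        triples.add((row["employee"], "LOCATED_IN", row["city"]))
    return triples


def match(triples, subject=None, verb=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (verb is None or t[1] == verb)
        and (obj is None or t[2] == obj)
    )


def facet_counts(triples, verb):
    """Count facts per object for one verb, as a faceted search might."""
    return Counter(o for _, v, o in triples if v == verb)


# Load step: here the "graph" is simply the set of triples in memory.
graph = rows_to_triples(source_rows)

# Guided query: who works in Research?
research_staff = [s for s, _, _ in match(graph, verb="WORKS_IN", obj="Research")]

# Facets: how many people are located in each city?
city_facets = facet_counts(graph, "LOCATED_IN")
```

In a real graph database the triples would be stored as nodes and relationships and the pattern would be written in the database's own query language; the sketch only shows the shape of the model.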
This article originally appeared at DATAVERSITY on October 18th, 2017