X

Best Practices from Oracle Development's A‑Team

Relational to RDF Road Trip - Part 1

John Featherly
Cloud Native Architect

A Tale of Two or more Relationships

Knowledge

So you, thought you, might like to create a knowledge graph. Why? To feel the warm thrill of confusion? That Space Cadet glow? Maybe, likely, you are in pursuit of knowledge. Looking to collect, relate and represent it in order to make it more accessible to humans and accessible period, to computers.

The Trip Plan

You’ve been hearing about this great travel destination called knowledge representation. You have data scattered all over, growing rapidly and increasingly difficult to make sense of. Is the growing pile of data useful? Is it meaningful? On this road trip we will be looking at data and semantic information contained in and around relational database systems. Similar excursions are possible with data held in non-relational form. Enroute to our destination, a knowledge graph, we’ll see tools and techniques to architect and build an RDF graph and how semantic information makes the journey from obscurity to accessibility.

Leaving the Driveway

Our premise for the trip is a relational database system with two pieces of table data that look like:

1 Joe Montana 1
2 Joe Namath 2

 

1 Forty Niners San Francisco
2 Jets New York

 

This is a symbolic representation of table data absent any information about how it is encoded and represented inside any specific database management system. There is no knowledge represented, only data. Consumers of the data will potentially project meaning onto it of various degrees and bias depending on the context in which it is consumed. A skeptical, unbiased consumer is left to wonder what the creator meant by the contents in the table cells. “What did you mean by …?” questions.

“The numerals look like integer numbers. Is that what you meant? Does it matter if I use 32-bit BCD? How about ASCII encoded numeral characters? Will ‘2’ appear again in the left column when more rows are added?”

“Is ‘Joe Namath’ the same as ‘Broadway Joe’? Is that ‘Joe Montana’ from Calistoga or my brother Joe from Missoula, Montana?”

“Do you mean the ‘Joe Namath’ restaurant?”

“Don’t the 49ers play at Levi’s Stadium in Santa Clara? I just saw the Jets play a home game in New Jersey? Aren’t the Jets a gang in West Side Story?”

Notice different levels of “What is meant by?”. What’s going on with the numerals is relevant to a DBA implementing a database and largely irrelevant to an NFL fan using the database. Semantic information exists at all the levels. These levels are up and down the abstraction ladder of meaning and existence. At the top of the abstraction ladder is the concept that things exist independent of descriptive language and symbolic representation. Near the bottom of the ladder is the knowledge that your computer can represent integers between -2,147,483,648 and 2,147,483,647 in one interpretation of 32 bits of storage and integers between 0 and 4,294,967,295 in a different interpretation. What we are after with a knowledge graph from existing data is to collect the intended meaning from metadata, code, developers etc. into the base level ontology, and then federate with other ontologies at various levels of abstraction.

First Stop, Temple of Athena Pronaia in Delphi

Temple of Athena© Federica Grassi / Moment / Getty Images

Wisdom, knowledge or prophecy? ask Pythia hopped up on methane a few hundred meters away at Apollo’s temple. Prophecy is only data. Data can be confusing in combination with limited knowledge. Wisdom is awareness of the gray areas around any single truth, awareness of larger context than the point of focus. Knowledge tracks relationships and relevance amongst the components, gained by experience, recorded perhaps and passed around and refined via language. Did you hear about the philosophers over in Athena’s city Athens? “A human is a featherless biped”, said one. The Cynic plucked all the feathers off a chicken and brought the “human” to show and tell.
Knowledge graphs are a representation of relationships. On our road trip we are specifically interested in the things that have information about them stored in a relational database. The first artifact to collect on the trip is identification of those entities and what the database creators and maintainers had in mind about those entities. For our data, pretty much anyone can guess “non-fictional people” is one entity. Probably just as many might guess “US cities” are involved. A domain expert or even casual observer that knows anything about the NFL can tell us we are dealing with NFL teams and NFL players. A domain expert will quickly understand the database is about what players belong to what teams. That is the initial artifact, knowledge that the database is about two entities, players and teams and that players “play for” teams. Cities are interesting but not at the same level as players and teams.

primary entities

Next Stop, Virginia City Montana

Virginia City

Something was happening here once but it is a ghost town now.

Get the data. Data is everything. Once we have the data we will know everything. After the gold rush plays out, move on.

In addition to data in a relational database system there is fundamental knowledge of the datatypes for each column stored as DDL in the database.

DDL knowledge

There is direct and latent semantic information from the DDL on display. The table names give a good indication of what is meant by the ‘name’ columns in each table, the entities. The other column names give some clues to the intended meaning of their contents. Latent semantic information that a human would notice but not a computer is that ‘id’ is meant to be a primary key for ‘player’, ‘idteam’ is meant to be a primary key for ‘team’ and ‘team_id’ is meant to be a foreign key pointing to the ‘team’ table.

A number of tools can read DDL from a relational database and map it to an RDF graph. of the. The following RDF mapping graph was created by D2RQ from the DDL sample above.

D2RQ RDF graph

The direct semantic information, that which is likely to help answer the DBA’s “what is meant by …?” questions can be harvested from the database system in an ontology. For our two tables we’ll get something like the above. Notice the table names are identified as classes, the column names as properties and varchar types as “rdfs:range xsd:string”. The integer column data types are not identified. For the most part we get a collection of the table names and the column names which is the core semantic content of our DDL. Adding the integer datatype information to the ontology is an example of a human’s role to add semantic information if it is relevant or ‘meaningful’. An example when it is not relevant would be if the player id had no independent value as a property of the player in the graph and where it is relevant would be if the database ever needed to be recreated from the graph.

The above ontology is itself a graph. In triple form there are 44 triples, a few of which are listed below. Triple form is subject, predicate, object as indicated by the s, p, o column headings displayed.

RDF Graph Triple Form

At the simplest level of relational to RDF graph design, often called direct mapping, each data cell of a relational table maps to a triple with the value in the cell the object, the column name the predicate, and a unique row identifier, normally the primary key, the subject. Table cells with null values are typically designed to not produce a triple. Considering a full table, that is no nulls, with n-rows and c-columns and one of the columns a primary key we expect n(c – 1) triples of the form


tablename:pkvn        column name        cell value

as shown
relational to triple mapping

In order to build the graph (generate the data triples) we need a design that captures the basic mapping scenario described above. R2RML is a W3C standard markup language designed to model mapping of  relational data to an RDF graph. The two-step process is to create a design in R2RML by hook or by crook then connect to the database and drive table data through the R2RML design to produce a graph. This basic, direct mapping is simple enough to get a computer generated starter by connecting to the database and reading DDL. Doing that with the above simple DDL gets something like the following.

R2RML mapping

Remember, we have no keys defined in the DDL, primary or foreign. Absent primary keys the subject map for each table defines a BlankNode or anonymous resource with identifiers built from concatenating all values in the row. Here is the opportunity for a human to step in and recognize the latent semantic information in the ‘id’ and ‘idteam’ columns and edit the mapping using them as primary keys. This also applies to ‘team_id’ which has latent semantic content of being a foreign key.

Rest Stop

rest area

Just to see what we get, let’s generate the graph described by the R2RML design.

blank node graph

Not that useful. We did get blank nodes (_:) for the subjects as expected and they are ugly and pretty much useless.

Next Stop, New York City

New York City

It looks like the people that built this know what they are doing. From this distance, it looks like they know what relates to what, where to put things and how to identify and keep track of everything.

Instead of doing the human intervention on the graph design above let’s surface the latent semantic information in the keys and define primary and foreign keys in the DDL. If all goes well, we can use the resulting R2RML as a cheat sheet to manually add the information to a graph design when we don’t have access to cleaning up the DDL.

keys DDL

Let’s see what an R2RML mapper does with the additional DDL.

Key DDL improved R2RML

Much better, we have actual URI’s for the two subjectMaps and the foreign key relationship is captured. Naming that relationship “player:ref-team_id” can certainly use a human touch and change it to something a bit more meaningful like “player:playsfor”. Also, it is common to remove the “pkname=” from the subject map and just use the pk value.

rr:subjectMap [ rr:class rdfdev:team; rr:template "rdfdev/team/{`idteam`}"; ];

instead of

rr:subjectMap [ rr:class rdfdev:team; rr:template "rdfdev/team/idteam={`idteam`}"; ];

The produced graph, n-triples now look like

n-triples

Loading them in a triple store together with the ontology, a sample incorporating prefix substitution looks like the following.

sample piece of graph

and an interesting SPARQL query

SPARQL query

gives

SPAEQL results

The intention of the database creators, to represent players and teams, and that players are members of teams, could be scattered. Possibly buried in application code where a programmer knew to make a join on the appropriate id columns.  Whether implicit or explicit we have brought it forward in creating the predicate player:playsfor .

Display (from https://github.com/fatestigma/ontology-visualization)
ontology visualization

Note that player:team_id and team:idteam are extraneous and do not need to be here.

Next Stop, Part 2

In Part 2 of the road trip we'll look at federation with established ontologies. Players are people and a well known “people” ontology is foaf. How would we incorporate foaf into our graph? We'll also take a look at inference and entailments.

I know

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha