This is the first of a series of blog posts on modeling Knowledge Graphs in RDF. More specifically, it covers how one should approach modeling, since for most developers and project managers RDF is quite foreign and rather exotic compared to the relational modeling with which they are familiar.
For starters, all modeling in RDF is done by specifying triples that describe modeling logic. Triples take the form of Subject --> Predicate --> Object. Consequently, both your data and your "schema" use the same triple-based persistence mechanism. For historical reasons, the RDF universe distinguishes between the two types of triples by separating them into what is known as the ABox versus the TBox. The ABox (Assertion Box, or data) is where instances of classes are stored, i.e. your traditional data (e.g. Product123 hasColor Red). The TBox (Terminological Box, or schema) is where classes, ontologies, and rules are stored (e.g. Bicycle is_a_type_of Product). In Oracle's implementation of RDF, the TBox and ABox triples are stored in separate partitions in the database.
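To make the ABox/TBox split concrete, here is a minimal sketch in plain Python that treats triples as tuples and partitions them by predicate. The predicate set and the function name are assumptions made for illustration only; this is not how Oracle partitions its storage.

```python
# Triples as (subject, predicate, object) tuples. All names are illustrative.
TBOX_PREDICATES = {"is_a_type_of"}  # assumption: predicates treated as schema-level

triples = [
    ("Bicycle", "is_a_type_of", "Product"),  # TBox: a statement about classes
    ("Product123", "hasColor", "Red"),       # ABox: a statement about an instance
]

def split_boxes(triples):
    """Partition triples into (tbox, abox) based on their predicate."""
    tbox = [t for t in triples if t[1] in TBOX_PREDICATES]
    abox = [t for t in triples if t[1] not in TBOX_PREDICATES]
    return tbox, abox

tbox, abox = split_boxes(triples)
print("TBox:", tbox)
print("ABox:", abox)
```

Note that both lists hold the same kind of object: schema and data are both just triples.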
The combination of TBox and ABox is what is known as a Knowledge Graph. You need both in order to have a usable knowledge graph, otherwise all you would have is a bunch of ABox data in triples format which is no better than storing it in relational form. Further, it is the logic maintained by the TBox that enables inferencing/reasoning to occur (e.g. the brother of my father is my uncle) -- and that is the real magic that RDF can deliver. Specifically, your application doesn't need to manage or extend the model -- the model can (and should) manage itself. This makes your application much less brittle and aligns much better with the well-known MVC pattern.
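The uncle example can be sketched as a single hand-written rule applied to a set of triples. This is only an illustration in plain Python; the predicate names (hasFather, hasBrother, hasUncle) and the people are hypothetical, and a real triple store would express the rule declaratively rather than in application code.

```python
# Hypothetical ABox facts as (subject, predicate, object) tuples.
facts = {
    ("Alice", "hasFather", "Bob"),
    ("Bob", "hasBrother", "Carl"),
}

def infer_uncles(facts):
    """Apply the rule: x hasFather y, y hasBrother z  =>  x hasUncle z."""
    inferred = set()
    for (x, p1, y) in facts:
        if p1 != "hasFather":
            continue
        for (y2, p2, z) in facts:
            if p2 == "hasBrother" and y2 == y:
                inferred.add((x, "hasUncle", z))
    return inferred

new_triples = infer_uncles(facts)
print(new_triples)
```

The hasUncle triple was never loaded as data; it follows from the rule plus the existing facts, which is the point of keeping such logic in the model rather than in the application.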
Going forward, this series of blogs will focus on RDF modeling which mostly involves the TBox.
While ABox and TBox are useful historical distinctions, for real-world projects there is a very real grey area between the ABox and TBox. This grey area we will henceforth call the CBox* (Category Box, or taxonomies). Implemented this way, we can now specify different subdomains for the URIs of each of the TBox, CBox, and ABox triples. This will minimize any one group's deliverables from interfering with any other group's deliverables. Examples of such base URIs might look like:
But of course, you are free to use any URI pattern you deem fit. (Heads up: URIs should not change once you go into production)
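As a purely hypothetical illustration of separating the three boxes by URI, the sketch below uses placeholder example.com subdomains. These are assumptions for demonstration and not the author's original examples.

```python
# Hypothetical base URIs, one subdomain per box (example.com is a placeholder).
BASE_URIS = {
    "tbox": "https://ontology.example.com/",
    "cbox": "https://taxonomy.example.com/",
    "abox": "https://data.example.com/",
}

def mint_uri(box, local_name):
    """Build a full URI for a resource in the given box."""
    return BASE_URIS[box] + local_name

def box_of(uri):
    """Return which box a URI belongs to, or None if it matches no base URI."""
    for box, base in BASE_URIS.items():
        if uri.startswith(base):
            return box
    return None

print(mint_uri("tbox", "Product"))
print(box_of("https://data.example.com/Product123"))
```

With a scheme like this, the owner of any triple's subject can be read straight off its URI.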
From a dependency point of view, the ABox depends on everything else being in place prior to any data wrangling, so data transformation into RDF should ideally be the last thing you do. Likewise, the CBox has an obvious dependency on the TBox. As such, you should always start from the ontology on down. Many people disagree with this advice, however. I believe that is because they envision an ontology as being this huge monolithic deliverable (e.g. FIBO). While some projects may in fact need such complexity, the majority of projects don't and shouldn't. Seriously.
Thus the assumption here is that your top level ontology will be compact, limited in scope, domain-specific but still general-purpose, reusable/flexible, easily documented, and most of all not be overly complex or hyper-detailed. In short, your ontology should be "doable" and not take months to implement, otherwise, you will inevitably encounter scope creep. For the majority of projects, don't do that.
Once your core concepts/ontology is implemented, next up is to implement your various formal and informal taxonomies, which will be built by extending the above-mentioned core concepts. Note that a formal taxonomy is one where all children are "types" of their parents. If not, then such a taxonomy would be considered informal. For example, France/Paris/EiffelTower is an informal taxonomy because Paris is not a type of France, whereas Mammal/Dog/Poodle is formal, since a poodle is a type of dog and a dog is a type of mammal. If implementing your taxonomies reveals missing features in your core concepts (which should never happen, assuming they were done right in the first place), you will need to go back to your ontologist and ask them how they missed providing support for the taxonomies you needed to implement. Seriously, this should never happen unless you are extremely unlucky or are dealing with bad karma.
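The formal/informal distinction can be sketched as a check on the relation that links each child to its parent. The relation names below (is_a_type_of, is_located_in) are assumptions chosen for illustration.

```python
# Each edge is (child, relation, parent). Relation names are illustrative.
animals = [
    ("Poodle", "is_a_type_of", "Dog"),
    ("Dog", "is_a_type_of", "Mammal"),
]
places = [
    ("EiffelTower", "is_located_in", "Paris"),
    ("Paris", "is_located_in", "France"),
]

def is_formal(taxonomy):
    """A taxonomy is formal when every child is a type of its parent."""
    return all(relation == "is_a_type_of" for _, relation, _ in taxonomy)

print("animals formal?", is_formal(animals))
print("places formal?", is_formal(places))
```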
Next up would be to implement any inferencing rules and whether or not they will be materialized (in Oracle's implementation of RDF, the default is that inferences, as defined in the RulesBase, are materialized in the RDF_LINK$ Partition 3). These decisions should involve the whole team since they affect functionality, usability, scalability, as well as performance. In general one does not implement every possible inference simply because that would likely generate an exponential amount of data. The choice of which inference to implement should be based solely on addressing documented use-cases for your end-users and no more. Finally, once the TBox, CBox, and inferencing rules are complete, you can start adding your data to the ABox.
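To get a feel for why materializing inferences costs storage, here is a small sketch that materializes a single transitive relation over a hypothetical chain of five classes. It is plain Python with made-up class names, not Oracle's RulesBase mechanism, and it only shows the growth from one rule.

```python
def materialize_closure(asserted):
    """Return the asserted pairs plus every pair inferred by transitivity."""
    closure = set(asserted)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Hypothetical chain: C0 is a type of C1, C1 of C2, C2 of C3, C3 of C4.
asserted = {(f"C{i}", f"C{i + 1}") for i in range(4)}
materialized = materialize_closure(asserted)
print(len(asserted), "asserted,", len(materialized), "after materialization")
```

Even this one rule more than doubles the stored pairs for a tiny chain, and every additional rule compounds the effect, which is why each inference should be justified by a documented use-case.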
From a governance point of view, managing the TBox is the domain of the ontologist, whereas managing the CBox is the domain of the taxonomist. Likewise, managing the ABox is the domain of the content creators and/or data wranglers who typically use tools such as R2RML mappers in order to ETL relational data into RDF.
The following diagram shows the relationships of the various "boxes" to each other.
Note the pyramid shape: An ideal top level ontology consisting of core concepts should only have hundreds, not thousands (much less tens of thousands!) of classes being defined. In stark contrast to this advice, I have seen ontologies with 300k or more classes defined (e.g. FIBO) -- and you can bet your unlucky stars that such over-designed ontologies take years and teams of people to master and understand. Do yourself a favor: hire a competent ontologist to define a compact but useful set of core concepts but keep the scope limited to just the small top of the pyramid. Additionally, I strongly suggest adopting a proven upper-level ontology such as gist (see: https://www.semanticarts.com/gist/) to help bootstrap your project and keep your TBox development on track and within scope.
Likewise, engage your taxonomist(s) to focus on the CBox definitions, which will cover all sorts of "bucket-like" structures from taxonomies, controlled vocabularies, master lists & arrays, and tag-based folksonomies. This is where your formal and informal taxonomies will be defined.
Because inferencing requires database resources (both storage and CPU), everyone, including the DBA, should be involved in deciding how much inferencing your application will need to implement and whether or not such inferences will be materialized.
Finally, the instances of the data are the domain of the data wranglers. Typically, they will be using tools such as R2RML to extract, transform, and load relational data into RDF. Needless to say, they will be mapping the data to your existing ontology/taxonomies, without which they would have to invent an arbitrary mapping (a.k.a. a "putative ontology") -- something you would certainly have to eventually abandon or significantly refactor**. Other tools such as RML (a superset of R2RML) allow data wranglers to ETL data from non-relational sources such as big data, spreadsheets, JSON, etc.
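The idea of mapping relational rows onto existing ontology terms can be sketched as follows. This is not R2RML itself, only a hand-rolled illustration in plain Python; the row layout, the base URIs, and the hasColor property are all assumptions.

```python
# Hypothetical rows, as they might come from a relational PRODUCT table.
rows = [
    {"id": 123, "type": "Bicycle", "color": "Red"},
]

ABOX = "https://data.example.com/"      # assumed base URI for instances
TBOX = "https://ontology.example.com/"  # assumed base URI for classes and properties

def row_to_triples(row):
    """Map one relational row to triples that reuse existing ontology terms."""
    subject = f"{ABOX}Product{row['id']}"
    return [
        (subject, "rdf:type", TBOX + row["type"]),
        (subject, TBOX + "hasColor", row["color"]),
    ]

triples = [t for row in rows for t in row_to_triples(row)]
for t in triples:
    print(t)
```

The mapping only works because the class and property it points at already exist, which is the dependency on the TBox and CBox described above.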
There you have it. At a high level, you now know who to hire, what their deliverables are, the appropriate scope of the project that they are responsible for, and what the order of implementation should be. Next up: using gist upper-level ontology to implement useful, reusable micro-patterns that developers need for most any application.
* Coined by Dave McComb of Semantic Arts
** The one exception to this is when you have an existing 3rd party ontology to which you must adapt (e.g. FIBO). In this case, it is likely a better strategy to create a putative ontology directly from your data first. Then the scope of the project becomes: map the putative ontology to the 3rd party ontology. Then refactor.