Incomplete Information in RDF

Incomplete information has been studied in-depth in relational databases and knowledge representation. It is also an important issue in Semantic Web frameworks such as RDF, description logics, and OWL 2 especially given that all these systems rely on the Open World Assumption (OWA). In the context of the Web, incomplete information has recently been studied in detail for XML. There have also been some recent papers in the area of Semantic Web, but these are either specific to a domain (e.g., temporal) or they deal with more general issues, such as questioning whether the SPARQL query language is appropriate for RDF given the OWA that is typically associated with the framework.

In the context of TELEIOS, we extended RDF with the ability to define a new kind of literals for each datatype that can be used to represent values of properties that exist but are unknown or partially known. In the proposed extension of RDF, called RDFi, such literals are allowed to appear only in the object position of triples. RDFi allows partial information regarding property values to be expressed by a quantifier-free formula of a first-order constraint language.

Following ideas from the incomplete information literature, we developed a semantics for RDFi databases and SPARQL query evaluation. The semantics defines the set of possible RDF graphs corresponding to an RDFi database and the fundamental concept of certain answers for SPARQL query evaluation over an RDFi database. We transferred the well-known concept of representation system to the case of RDFi, and showed that CONSTRUCT queries without blank nodes in their templates and using only operators AND, UNION, and FILTER or the restricted fragment of graph patterns corresponding to the well-designed patterns can be used to define a representation system for RDFi.

We also showed some interesting monotonicity properties for CONSTRUCT queries. We defined the fundamental concept of certain answers to SPARQL queries over RDFi databases and presented an algorithm for its computation. Finally, we presented preliminary complexity results for computing certain answers by considering equality, temporal, and spatial constraint languages, and the class of CONSTRUCT queries of our representation system. Our results showed that the data complexity of evaluating a query of this class over RDFi databases increases from LOGSPACE (the upper bound for evaluating queries from this class over RDF graphs) to coNP-complete for the case of equality and temporal constraints. This result is in line with similar complexity results for querying incomplete information in relational databases. The same coNP-completeness bound is shown for the case of spatial constraints on rectangles in ℚ2. For topological constraints over more general spatial regions (regular closed subsets of ℚ2), the best upper bound that we could show is EXPTIME. It is an open problem whether a better complexity bound can be achieved in this case.

Our work on RDFi goes beyond the proposals of the geospatial extensions of SPARQL, stSPARQL and GeoSPARQL, which cannot query incomplete geospatial information. While GeoSPARQL provides a vocabulary for asserting topological relations (the topology vocabulary extension), the complexity of query evaluation over RDF graphs in this case has not been investigated so far in any detail and remains an open problem. To the best of our knowledge, all current implementations of GeoSPARQL (e.g., in geospatial RDF stores such as Strabon and Parliament) have so far omitted the implementation of the topology vocabulary extension of the standard. The functionality provided by the topology vocabulary extension is an important feature of GeoSPARQL that goes beyond the querying capabilities offered by geospatial extensions to SQL in today's spatially-enabled geospatial RDBMS (e.g., PostGIS). It enables the representation and querying of topological information among spatial regions for which we might not have exact geometric information (e.g., vernacular geography regions, that is, imprecise regions that capture an ordinary citizen's perception of some areas) or for which topological relations to other spatial regions are fixed so we would prefer to store them explicitly, instead of recomputing them every time a relevant need arises (e.g., administrative geography information for various countries).

Similar issues in the representation and querying of topological information expressed in RDF arise in spatial description logics and their reasoners, such as RacerPro and PelletSpatial.

The functionality of the topology vocabulary extension of GeoSPARQL needs to be supported in the new generation of geospatial RDF stores given the existence of very large linked geospatial datasets such as the Administrative Geography of Great Britain published by Ordnance Survey that contain a large number of binary topological relations among spatial regions.

In the context of TELEIOS, and specifically in deliverables D4.2 and D4.3, we showed that offering this functionality is very challenging and cannot be provided using off-the-shelf state of the art reasoners for topological relations that have been in use in the Qualitative Spatial Reasoning (QSR) community for a long time.

In the above deliverables, we presented the components and features of GeoSPARQL and discussed in detail how one could use it to answer queries that involve qualitative (i.e., topological relations) and quantitative (i.e., geometries) geospatial information. Leaving geometrical information aside, since we know how to store and query it (due to our work on Strabon), we concentrated on the topology vocabulary extension of GeoSPARQL. By way of examples, we showed that an implementation of this extension can utilize techniques and reasoners developed in the QSR community for checking the consistency of and computing entailments from a set of binary topological relations among spatial regions. In pass we pointed out that although the topology vocabulary extension of GeoSPARQL has been designed only to provide a vocabulary for asserting and querying binary topological relations among spatial regions, a set of such relationships can actually entail other indefinite (disjunctive) topological relationships that cannot be represented as triples, although they can be inferred by available topological spatial reasoners. Therefore, an extension of the GeoSPARQL standard is in order for covering this case.

Secondly, we discussed possible ways to implement the topology vocabulary extension of GeoSPARQL in geospatial RDF stores like Strabon and Parliament that support this OGC standard. For this, we relied on state of the art qualitative spatial reasoners such as Renz's, PyRCC8 and PyRCC8-TRIANGLE – that have been implemented in our group and are discussed below –, and PelletSpatial. We presented in detail the implementation of this GeoSPARQL component in our own geospatial RDF store Strabon using a modified version of Renz's reasoner. Then, we evaluated the Strabon implementation experimentally and compared it with alternative implementations such as PelletSpatial, PyRCC8, PPyRCC8, and an implementation of the basic algorithm (path-consistency) implemented by these reasoners in the relational DBMS PostgreSQL.

Our detailed experimental evaluation makes use of large linked geospatial datasets that are publicly available and contain a large number of binary topological relations among spatial regions (e.g., the dataset for the Administrative Geography of Great Britain and the Global Administrative Areas dataset). The main lesson learned from our extensive experimental evaluation is negative: none of the implementations considered can scale to more than 50000 binary topological relations and 10000 regions, thus they are not up to dealing with the datasets publicly available today. Finally, we analyzed the structural characteristics of the real-world linked geospatial datasets that we considered, and pointed out techniques that might offer significant improvements to the state-of-the art. Implementation and evaluation of these ideas is left to future work.

Apart from the aforementioned deliverables, the above work has been presented in the following papers.

  1. Charalampos Nikolaou and Manolis Koubarakis: "Incomplete Information in RDF". In the 7th International Conference on Web Reasoning and Rule Systems (RR 2013), Mannheim, Germany, July 27-29, 2013.

  2. Charalampos Nikolaou and Manolis Koubarakis: "Querying Incomplete Geospatial Information in RDF". In the 13th International Symposium on Spatial and Temporal Databases (SSTD 2013), Munich, Germany, August 21-23, 2013.

  3. Charalampos Nikolaou, Manolis Koubarakis: Incomplete Information in RDF. CoRR abs/1209.3756 (2012)