Our goal in GeoKnow is to bring geospatial data into the Linked Data Web. Among other things, our work will provide the means for efficiently managing and querying geospatial data in RDF stores.
But how can we measure our success? And what is the yardstick we will compare ourselves against?
While many popular RDF stores advertise geospatial capabilities, we found little or no quantitative information in the relevant literature regarding the following:
- Conformance to GeoSPARQL. Given GeoSPARQL’s status as an OGC standard and the ongoing work of various research groups on integrating it into other Semantic Web technologies, there is a clear need to examine the GeoSPARQL compliance of RDF stores. In the standards-driven domain of GIS, conformance with open standards is a widely advertised and desirable property for open and proprietary software alike. End users select products based on features that have been verified through open tests anyone can reproduce. For GeoSPARQL to reach the mainstream and be actively adopted by the Semantic Web and GIS communities alike, we need a similar open process for compliance testing of the various products and systems.
- Performance under realistic geospatial workloads. Although there are several acknowledged benchmarks for evaluating RDF stores (e.g. BSBM), the only benchmark handling geospatial RDF is SSWB. While SSWB includes some important geospatial query types, it is based on synthetic data and evaluates only one aspect of performance (query duration). There is a clear lack of a feature-complete benchmark based on realistic data and workloads that measures and highlights both the novel capabilities geospatial RDF stores provide and their support for typical everyday applications of geospatial data management.
- Comparison with geospatial RDBMSs. Geospatial databases are used across the globe by a diverse community that produces, queries, and analyzes geospatial data. The performance, integration capabilities, and overall maturity of geospatial RDBMSs form the natural baseline of the GIS community regarding data management, and geospatial RDF stores will be compared along these lines. Of course, RDF stores also provide functionality that typical GIS and RDBMSs cannot support, which is easy to advertise and to convince the GIS community about. However, we have no answers about the side effects of applying a geospatial RDF store in a real-world setting. RDF stores might never reach the native performance of an RDBMS, but is that lack of speed balanced by increased interoperability? Similar questions arise for scaling, compatibility with existing tools, etc. Potential users need accurate information that enables them to reach informed decisions regarding these trade-offs.
Since incorporating geospatial capabilities into RDF stores is a subject of ongoing research and commercial effort, we feel it is important for the community to have access to a concrete methodology that describes the resources and steps required for objectively and realistically evaluating geospatial RDF stores.
With these goals, we introduced a benchmarking methodology for evaluating geospatial RDF stores, which will form the basis for a formal, feature-complete benchmark in the future.
Based on this methodology, Athena completed a first-cut evaluation of five RDF stores with geospatial support: Virtuoso Universal Server, Parliament, uSeekM, OWLIM-SE and Strabon. Further, we compared these RDF stores with two prominent geospatial RDBMSs: Oracle Spatial and PostGIS.
- For our evaluation we used real geospatial data, and specifically, OpenStreetMap data for Great Britain covering England, Scotland, and Wales (as of 05/03/2013).
- We transformed this dataset into RDF triples (~26M) using TripleGeo.
- We tested a total of 21 SPARQL queries, covering several geospatial features: location queries, range queries, spatial joins, nearest neighbor queries and spatial aggregates. For each query we created a variant compatible with each RDF store.
- In parallel, we tested the same set of queries on the geospatial RDBMSs.
- Apart from query times, we also measured other quantities, such as number of loaded triples, loading time and index creation time.
- We evaluated the conformance of each RDF store to the GeoSPARQL standard.
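To make the steps above more concrete, here is a minimal Python sketch of what a single range-query measurement could look like. It is an illustration only, not the harness used in the evaluation: the function names (`bbox_to_wkt`, `range_query`, `time_query`), the choice of `geof:sfWithin`, the use of the median over several repeats, and the `run` callable (standing in for whatever client submits a query to a given store) are all assumptions. The prefixes are the standard GeoSPARQL namespaces; stores that do not support GeoSPARQL would need their own query variant.

```python
import statistics
import time

# Standard GeoSPARQL namespaces.
GEO = "http://www.opengis.net/ont/geosparql#"
GEOF = "http://www.opengis.net/def/function/geosparql/"


def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat):
    """Return a closed WKT polygon for a bounding box (lon lat order)."""
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first corner to close the ring
    ]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON(({ring}))"


def range_query(min_lon, min_lat, max_lon, max_lat):
    """Build a GeoSPARQL range query (hypothetical example, not one of the 21)."""
    wkt = bbox_to_wkt(min_lon, min_lat, max_lon, max_lat)
    return (
        f"PREFIX geo: <{GEO}>\n"
        f"PREFIX geof: <{GEOF}>\n"
        "SELECT ?feature WHERE {\n"
        "  ?feature geo:hasGeometry ?geom .\n"
        "  ?geom geo:asWKT ?wkt .\n"
        f'  FILTER(geof:sfWithin(?wkt, "{wkt}"^^geo:wktLiteral))\n'
        "}"
    )


def time_query(run, query, repeats=5):
    """Run `query` through the caller-supplied `run` callable `repeats`
    times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(query)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)
```

In use, `run` would wrap the client of the store under test, for example a function that posts the query string to the store's SPARQL endpoint and consumes the full result set, so that the same timing code can be reused across stores and across the RDBMS baselines.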
A short description of the test queries used is given in the table below:
Our evaluation surfaced several interesting issues (some already known and others newly found), as well as valuable intuition and directions for improving the functionality of geospatially enabled RDF stores.
- First of all, it became clear that considerable effort from industry and academia is needed to add baseline functionality to current RDF stores, regarding both the geospatial features and the geospatial standards they support. For example, some stores support only point geometries, and only a few support the GeoSPARQL standard.
- Another important issue is the efficiency of current RDF stores in terms of loading, indexing and querying times, as compared to RDBMSs.
- Finally, several possible enhancements to both functionality and efficiency are interesting research problems in their own right; addressing them could improve the current performance of RDF stores and establish them as first-class research and industrial data management systems.
Detailed results, as well as a thorough survey of current geospatial and semantic standards, can be found in our public deliverable D2.1.1 ‘Market and Research Overview’.
You can also find more information regarding our testing environment in this post. The complete set of pre-built Virtual Machines, queries, and data sets we used is also freely available for download.