Members of InfAI’s GeoKnow team attended ISWC 2013 in Sydney, Australia. There, they presented several papers pertaining to the project. The first talk was centered around ORCHID, a scalable link discovery approach for geo-spatial resources which deals with WKT data. ORCHID was shown to outperform the state of the art w.r.t. scalability and is now an integral part of the link discovery framework LIMES. Furthermore, an approach to semi-automatically improve the schema of knowledge bases was integrated into DL-Learner and shown at the conference. This helps knowledge engineers structure their data more easily, a problem which is often perceived as a bottleneck in achieving the Semantic Web vision. A paper on another timely topic, data quality assurance via crowdsourcing, was also accepted and demonstrated at the conference. In the fourth talk, the GeoKnow team presented DAW, a duplicate-aware solution for federated SPARQL queries. DAW can be used in combination with any federated SPARQL query engine to optimize the number of sources it selects, thus reducing the overall network traffic as well as the query execution time of existing engines. More information on DAW can be found here. We are also very happy that GeoKnow won the Big Data Prize at the Semantic Web Challenge. The paper can be found here and the demo is here.
Many RDF datasets contain implicit references to geographic data. For example, music datasets such as Jamendo include references to locations of record labels, places where artists were born or have been, etc. The aim of the spatial mapping component, dubbed GeoLift, is to retrieve this information and make it explicit.
Geographical information can be mentioned in three different ways within Linked Data:
- Through dereferencing: Several datasets contain links to datasets with explicit geographical information such as DBpedia or LinkedGeoData. For example, in a music dataset, one might find information such as http://example.org/Leipzig owl:sameAs http://dbpedia.org/resource/Leipzig. We call this type of reference explicit. We can now use the semantics of RDF to fetch geographical information from DBpedia and attach it to the resource in the other ontology, as http://example.org/Leipzig and http://dbpedia.org/resource/Leipzig refer to the same real-world object.
- Through linking: It is known that the Web of Data contains an insufficient number of links. The latest approximations suggest that the Linked Open Data Cloud alone consists of 31+ billion triples but only contains approximately 0.5 billion links (i.e., less than 2% of the triples are links between knowledge bases). The second intuition behind GeoLift is thus to use link discovery to map resources in an input knowledge base to resources in a knowledge base that contains explicit geographical information. For example, given a resource http://example.org/Athen, GeoLift should aim to find a resource such as http://dbpedia.org/resource/Athen to map it to. Once the link between the two resources has been established, GeoLift can fall back on the dereferencing approach described above.
- Through Natural Language Processing: In some cases, the geographic information is hidden in the objects of datatype properties. For example, some datasets contain biographies, textual abstracts describing resources, comments from users, etc. The idea here is to use this information by extracting Named Entities and keywords using automated Information Extraction techniques. Semantic Web frameworks such as FOX have the main advantage of providing URIs for the keywords and entities that they detect. These URIs can then be linked with the resources to which the datatype properties were attached. Finally, the geographical information can be dereferenced and attached to the resources whose datatype properties were analyzed.
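The dereferencing idea from the first item can be sketched in a few lines of Python. This is only an illustration, not GeoLift's implementation: triples are modelled as plain tuples, the remote dataset is an in-memory set standing in for data fetched from DBpedia, the function name enrich_by_dereferencing is hypothetical, and the coordinate values are illustrative placeholders.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
GEO_LAT = "http://www.w3.org/2003/01/geo/wgs84_pos#lat"
GEO_LONG = "http://www.w3.org/2003/01/geo/wgs84_pos#long"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
GEO_PROPERTIES = {GEO_LAT, GEO_LONG}


def enrich_by_dereferencing(local: Set[Triple], remote: Set[Triple]) -> Set[Triple]:
    """Copy geographic triples from `remote` onto local resources via owl:sameAs."""
    enriched = set(local)
    for subject, predicate, target in local:
        if predicate != OWL_SAME_AS:
            continue
        # `remote` stands in for the result of dereferencing `target`.
        for r_subject, r_predicate, r_object in remote:
            if r_subject == target and r_predicate in GEO_PROPERTIES:
                enriched.add((subject, r_predicate, r_object))
    return enriched


local = {
    ("http://example.org/Leipzig", OWL_SAME_AS, "http://dbpedia.org/resource/Leipzig"),
}
# Placeholder values for illustration only, not actual DBpedia data.
remote = {
    ("http://dbpedia.org/resource/Leipzig", GEO_LAT, "51.33"),
    ("http://dbpedia.org/resource/Leipzig", GEO_LONG, "12.38"),
    ("http://dbpedia.org/resource/Leipzig", RDFS_LABEL, "Leipzig"),
}

result = enrich_by_dereferencing(local, remote)
for triple in sorted(result):
    print(triple)
```

Only the properties listed in GEO_PROPERTIES are copied, so the label of the remote resource is left behind. The linking and NLP approaches would both end in this same step once they have produced a link to a resource with explicit geographic data.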
The idea behind GeoLift is to provide a generic architecture that contains means to exploit these three characteristics of Linked Data. In the following, we present the technical approach underlying GeoLift.
GeoLift was designed to be a modular tool which can be easily extended and re-purposed. In its first version, it provides two main types of artifacts:
- Modules: These artifacts are in charge of generating geographical data based on RDF data. To this aim, they implement the three intuitions presented above. The input for such a module is an RDF dataset (in Java, a Jena Model). The output is also an RDF dataset enriched with geographical information (in Java, an enriched Jena Model).
- Operators: The idea behind operators is to enable users to define a workflow for processing their input dataset. Thus, if users know the type of enrichment that is to be carried out (for example, linking followed by dereferencing the discovered links), they can define the sequence of modules that must be used to process their dataset. Note that the format of the input and output of modules is identical. Thus, users are empowered to create workflows of arbitrary complexity by simply connecting modules.
The corresponding architecture is shown below. The input layer allows reading RDF in different serializations. The enrichment modules are in the second layer and allow adding geographical information to RDF datasets by different means. The operators (which will be implemented in a future version of GeoLift) will combine the enrichment modules and allow defining a workflow for processing information. The output layer serializes the results in different formats. The enrichment procedure will be monitored by a controller, which will also be added in a future version of GeoLift.
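Because modules consume and produce the same kind of object, chaining them amounts to function composition. The sketch below illustrates this in Python rather than with Jena Models; the names sequence, linking_module and dereferencing_module are hypothetical, the "linking" is a toy string rewrite rather than real link discovery, and the lookup table holds placeholder data.

```python
from functools import reduce
from typing import Callable, Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Dataset = FrozenSet[Triple]
Module = Callable[[Dataset], Dataset]  # same type in and out

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
GEO_LAT = "http://www.w3.org/2003/01/geo/wgs84_pos#lat"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Placeholder stand-in for a remote knowledge base (illustrative value).
GEO_LOOKUP: Dict[str, str] = {"http://dbpedia.org/resource/Leipzig": "51.33"}


def sequence(*modules: Module) -> Module:
    """Operator: run the given modules one after another."""
    def run(dataset: Dataset) -> Dataset:
        return reduce(lambda data, module: module(data), modules, dataset)
    return run


def linking_module(dataset: Dataset) -> Dataset:
    """Toy linker: map example.org resources to same-named DBpedia resources."""
    prefix = "http://example.org/"
    links = {
        (s, OWL_SAME_AS, "http://dbpedia.org/resource/" + s[len(prefix):])
        for s, _, _ in dataset
        if s.startswith(prefix)
    }
    return dataset | links


def dereferencing_module(dataset: Dataset) -> Dataset:
    """Attach latitude values found behind owl:sameAs links."""
    geo = {
        (s, GEO_LAT, GEO_LOOKUP[o])
        for s, p, o in dataset
        if p == OWL_SAME_AS and o in GEO_LOOKUP
    }
    return dataset | geo


workflow = sequence(linking_module, dereferencing_module)
source = frozenset({("http://example.org/Leipzig", RDFS_LABEL, "Leipzig")})
result = workflow(source)
for triple in sorted(result):
    print(triple)
```

Since sequence itself returns a module, workflows can be nested, e.g. sequence(workflow, another_module), which is one way to read "workflows of arbitrary complexity".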
For our evaluation of current geospatial RDF stores (which you can read more about here), we have set up five pre-built Virtual Machines. Each one contains a working installation of one of these geospatial RDF stores on Debian 6.x “squeeze”:
- Virtuoso Universal Server (7.0, ColumnStore edition)
- Parliament (2.7.4 quickstart)
- uSeekM (1.2.0-a5, on top of PostgreSQL 8.4 and PostGIS 1.5)
- OWLIM-SE (Trial version 5.3.5849)
- Strabon (3.2.3, on top of PostgreSQL 8.4 and PostGIS 1.5)
The VMs are available at: ftp://firstname.lastname@example.org/ (username:guest, password: be-my-guest).
The images are normal XEN domU disk images, which can be used to create a functional domU guest. Of course, you should also provide:
- your own guest configuration file at /etc/xen/<the-vm>.cfg (or wherever you have chosen to configure your domU guests)
- a swap image file
Use root:root as the root credentials for your newly created machine (and, of course, change the password right after your first login). After creating the machine, you should log in via the serial console (e.g. xm console <the-vm>) and:
- configure your network interfaces to adapt to your local network
- configure your hostname and your /etc/hosts file
- restart your network interfaces
We have also uploaded to the same FTP location the dataset we used for our evaluation: OpenStreetMap data for Great Britain covering England, Scotland, and Wales (as of 05/03/2013) in ESRI shapefile format. Of the available OSM layers, only those concerning the road network (roads), points of interest (points), and natural parks and waterbodies (natural) were actually used. From each original layer, only the most important attributes were retained: shape, name, osm_id, type.
You can find more information on how we used these VMs for our evaluation in the public deliverable D2.1.1 ‘Market and Research Overview’.
Our goal in GeoKnow is to bring geospatial data into the Linked Data Web. Among other things, our work will provide the means for efficiently managing and querying geospatial data in RDF stores.
But how can we measure our success? And what is the yard-stick we will compare ourselves to?
While many popular RDF stores advertise geospatial capabilities, we found little or no quantitative information in the relevant literature regarding the following:
- Conformance to GeoSPARQL. Given GeoSPARQL’s status as an OGC standard and the ongoing work of various research groups regarding its integration into other Semantic Web technologies, there is a clear need for examining the GeoSPARQL compliance of RDF stores. In the highly standards-driven domain of GIS, conformance with open standards is a highly advertised and desirable property for open and proprietary software alike. End users select products based on features which have been validated under open tests that anyone can reproduce. In order for GeoSPARQL to reach the mainstream and be actively adopted by the Semantic Web and GIS communities alike, we need a similar open process for compliance testing of the various products and systems.
- Performance under realistic geospatial workloads. Although there are several acknowledged benchmarks for evaluating RDF stores (e.g. BSBM), the only benchmark handling geospatial RDF is SSWB. SSWB, while it includes some important geospatial query types, is based on synthetic data and evaluates only one aspect of performance (query duration). There is a clear lack of a feature-complete benchmark based on realistic data and workloads that measures and highlights both the novel capabilities geospatial RDF stores provide and their support for typical everyday applications of geospatial data management.
- Comparison with geospatial RDBMSs. Geospatial databases are used across the globe by a diverse community that produces, queries, and analyzes geospatial data. The performance, integration capabilities, and overall maturity of geospatial RDBMSs form the natural baseline of the GIS community regarding data management. Geospatial RDF stores will be compared along these lines. Of course, they also provide functionality that typical GIS and RDBMSs cannot support. This is something we can easily advertise and convince the GIS community about. However, we have no answers about the side effects of deploying a geospatial RDF store in a real-world setting. Such stores might never reach the native performance of an RDBMS, but is that lack of speed balanced by increased interoperability? Similar questions arise for scaling, compatibility with existing tools, etc. Potential users would like accurate information that enables them to reach informed decisions regarding these trade-offs.
We feel that since incorporating geospatial capabilities into RDF stores is a subject of ongoing research and commercial efforts, it is important for the community to have access to a concrete methodology that describes the required resources and steps for objectively and realistically evaluating geospatial RDF stores.
With these goals, we introduced a benchmarking methodology for evaluating geospatial RDF stores, which will form the basis for a formal, feature-complete benchmark in the future.
Based on this methodology, Athena completed a first-cut evaluation of five RDF stores with geospatial support: Virtuoso Universal Server, Parliament, uSeekM, OWLIM-SE and Strabon. Further, we compared these RDF stores with two prominent geospatial RDBMSs: Oracle Spatial and PostGIS.
- For our evaluation we used real geospatial data, and specifically, OpenStreetMap data for Great Britain covering England, Scotland, and Wales (as of 05/03/2013).
- We transformed this dataset into RDF triples (~26M) using TripleGeo.
- We tested a total of 21 SPARQL queries, covering several geospatial features: location queries, range queries, spatial joins, nearest neighbor queries and spatial aggregates. For each query we created a proper representation that would be compatible with each RDF store.
- In parallel, we tested the same set of queries on the geospatial RDBMSs.
- Apart from query times, we also measured other quantities, such as number of loaded triples, loading time and index creation time.
- We evaluated the conformance of each RDF store to the GeoSPARQL standard.
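To make two of the query types above concrete, here is a minimal Python sketch of a range query and a nearest neighbor query over named points. It only illustrates what such queries return; it is not how the evaluated stores answer them (they rely on spatial indexes and SPARQL or SQL extensions). The function names, the sample points and the planar coordinate system are assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) in an arbitrary planar coordinate system


def range_query(points: Dict[str, Point],
                min_x: float, min_y: float,
                max_x: float, max_y: float) -> List[str]:
    """Return the names of all points inside the given bounding box."""
    return sorted(
        name for name, (x, y) in points.items()
        if min_x <= x <= max_x and min_y <= y <= max_y
    )


def nearest_neighbors(points: Dict[str, Point], query: Point, k: int) -> List[str]:
    """Return the k point names closest to `query` (brute force, Euclidean)."""
    def distance(name: str) -> float:
        x, y = points[name]
        return math.hypot(x - query[0], y - query[1])
    return sorted(points, key=distance)[:k]


# Made-up sample data, not taken from the OpenStreetMap dataset.
points = {
    "cafe": (1.0, 1.0),
    "museum": (2.0, 3.0),
    "station": (5.0, 5.0),
    "park": (9.0, 1.0),
}

in_box = range_query(points, 0.0, 0.0, 5.0, 5.0)
closest = nearest_neighbors(points, (2.0, 2.0), k=2)
print(in_box)   # ['cafe', 'museum', 'station']
print(closest)  # ['museum', 'cafe']
```

Both functions scan every point, which is fine for four points but not for millions of geometries; that gap is exactly why loading, indexing and query times are measured separately in an evaluation like this one.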
A short description of the test queries used is given in the table below:
Our evaluation surfaced several interesting issues (some already known and others newly found), as well as valuable intuition and directions towards improving the functionality of geospatially-enabled RDF Stores.
- First of all, it became clear that considerable effort from industry and academia is needed for adding baseline functionality on current RDF stores, regarding both the geospatial features and the geospatial standards they support. For example, some stores support only point geometries, while only a few of them support the GeoSPARQL standard.
- Another important issue is the efficiency of current RDF stores, concerning loading, indexing and querying times, as compared to RDBMSs.
- Finally, several enhancements regarding both functionality and efficiency raise very interesting research issues; addressing them could improve the current performance of RDF stores and establish them as first-class research and industrial data management systems.
Detailed results, as well as a thorough survey of current geospatial and semantic standards can be found in our public deliverable D2.1.1 ‘Market and Research Overview’.
You can also find more information regarding our testing environment in this post. The complete set of pre-built Virtual Machines, queries, and data sets we used is also freely available for download.
In the past month (April 2013), we invited geospatial data consumers and providers, GIS experts and Semantic Web specialists to participate in our Geospatial Data Users Survey. The goal of this survey was to collect general use cases and user requirements from people outside the GeoKnow consortium. We publicised the survey using mailing lists and social networks, and it was available for 25 days. During this period we received 122 responses, of which 51 were complete and 71 incomplete. Since we were interested in good-quality responses, we performed a manual check, which left 39 useful responses; not too bad. In this blog post, we aim to show some interesting results from our survey. If you are interested in learning more about the results of this survey, you can check the public deliverable available here.
One of the goals of this survey was to learn about use cases different from those we already consider in the project. Thus, we asked participants how they use geospatial data in their work. To analyse this question, we grouped the answers into different types, which are shown in the graph at the right. Most of the scenarios were about visualisation and analysis, followed by geospatial data creation scenarios.
We asked users for the most popular tools they use at their work. The most frequent answers were OSM and Google Maps/Earth, as well as other GIS. When we then asked which features they like most about these tools, participants expressed a preference for easy-to-use and free tools, referring to their popular choices of Google Maps or OSM. Having an API to interact with the application was also considered important, and participants appreciated applications that provide data that can be integrated. GIS applications were considered difficult to use. Integration and interoperability were mentioned as goals. Besides the previous question, we were also interested in knowing which missing functionalities might improve their work. A list of these functionalities, grouped by the related work package within GeoKnow, is presented in the image below.
This survey allowed us to learn about different use cases, the main features used, and desired functionalities, all of which are to be considered in the creation of the GeoKnow Generator. Some important high-level findings from the survey were the emphasis on interoperability and reusability through open APIs and approachable visualisation components, support for common geospatial data formats and geodatabases, and the necessity of simple tools to support data integration/reuse from geospatial LOD sources.
We also found that some of the ideas of the GeoKnow project are further supported by user requirements like the integration of private and public data and the importance of using the web as an integration platform.