Members of InfAI’s GeoKnow team attended ISWC 2013 in Sydney, Australia, where they presented several papers pertaining to the project. The first talk centered on ORCHID, a scalable link discovery approach for geospatial resources which deals with WKT data. ORCHID was shown to outperform the state of the art w.r.t. scalability and is now an integral part of the link discovery framework LIMES. Furthermore, an approach to semi-automatically improving the schema of knowledge bases was integrated into DL-Learner and shown at the conference. This helps knowledge engineers structure their data more easily – a problem which is often perceived as a bottleneck in achieving the Semantic Web vision. Another timely topic – data quality assurance via crowdsourcing – was also accepted and demonstrated at the conference. In the fourth talk, the GeoKnow team presented DAW, a duplicate-aware solution for federated SPARQL queries. DAW can be used in combination with any federated SPARQL query engine to optimize the number of sources it selects, thus reducing the overall network traffic as well as the query execution time of existing engines. More information on DAW can be found here. We are also very happy that GeoKnow won the Big Data Prize at the Semantic Web Challenge. The paper can be found here and the demo is here.
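DAW's source-selection idea, pruning sources that would mostly return duplicate results, can be illustrated with a small sketch. The following Python toy is not DAW's actual implementation (the paper describes that in detail), and all data here is invented; it uses a MinHash-style signature to estimate the overlap between two sources without comparing their full result sets:

```python
import hashlib

def minhash(items, k=64):
    """Return a MinHash signature: the k smallest hash values of the item set."""
    hashes = sorted(int(hashlib.md5(i.encode()).hexdigest(), 16) for i in items)
    return set(hashes[:k])

def overlap_estimate(sig_a, sig_b, k=64):
    """Estimate Jaccard overlap of two sets from their MinHash signatures:
    take the k smallest hashes of the union and count how many are in both."""
    merged = sorted(sig_a | sig_b)[:k]
    return len(set(merged) & sig_a & sig_b) / k

# Toy example: two sources answering the same triple pattern,
# with roughly one third of their results in common.
source1 = {f"triple{i}" for i in range(100)}
source2 = {f"triple{i}" for i in range(50, 150)}

sig1, sig2 = minhash(source1), minhash(source2)
est = overlap_estimate(sig1, sig2)
# A duplicate-aware engine could skip source2 if est exceeds a threshold.
```

The point of the sketch is that the signatures are tiny compared to the result sets, so overlap can be estimated before any data is transferred.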
At this stage of the GeoKnow project, we are shaping our dissemination and exploitation strategy. In order to stay as close as possible to the needs of our potential users, we have created a survey to find out how we can help your business most.
If you work with geospatial data, your participation would be greatly appreciated. The survey will take no longer than 10 minutes to complete. There is a little incentive as well: If you leave your email address, you will be automatically entered to win one of three Amazon vouchers worth 50 Euro.
Please click on this link to take part: GeoKnow exploitation plan survey
Many RDF datasets contain implicit references to geographic data. For example, music datasets such as Jamendo include references to locations of record labels, places where artists were born or have been, etc. The aim of the spatial mapping component, dubbed GeoLift, is to retrieve this information and make it explicit.
Geographical information can be mentioned in three different ways within Linked Data:
- Through dereferencing: Several datasets contain links to datasets with explicit geographical information such as DBpedia or LinkedGeoData. For example, in a music dataset, one might find information such as http://example.org/Leipzig owl:sameAs http://dbpedia.org/resource/Leipzig. We call this type of reference explicit. We can then use the semantics of RDF to fetch geographical information from DBpedia and attach it to the resource in the other ontology, as http://example.org/Leipzig and http://dbpedia.org/resource/Leipzig refer to the same real-world object.
- Through linking: It is well known that the Web of Data contains an insufficient number of links. Recent estimates suggest that the Linked Open Data Cloud alone consists of more than 31 billion triples but contains only approximately 0.5 billion links (i.e., less than 2% of the triples are links between knowledge bases). The second intuition behind GeoLift is thus to use link discovery to map resources in an input knowledge base to resources in a knowledge base that contains explicit geographical information. For example, given a resource http://example.org/Athen, GeoLift should aim to find a resource such as http://dbpedia.org/resource/Athen to map it to. Once the link between the two resources has been established, GeoLift can fall back on the dereferencing approach described above.
- Through Natural Language Processing: In some cases, the geographic information is hidden in the objects of datatype properties. For example, some datasets contain biographies, textual abstracts describing resources, comments from users, etc. The idea here is to use this information by extracting named entities and keywords using automated Information Extraction techniques. Semantic Web frameworks such as FOX have the main advantage of providing URIs for the keywords and entities that they detect. These URIs can then be linked with the resources to which the datatype properties were attached. Finally, the geographical information can be dereferenced and attached to the resources whose datatype properties were analyzed.
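To make the dereferencing intuition concrete, here is a minimal Python sketch. It models both graphs as plain sets of triples and copies geographic properties across owl:sameAs links; the property names are simplified placeholders, and a real implementation (GeoLift builds on Jena) would dereference the target URI over HTTP rather than receive the remote triples directly:

```python
# Simplified placeholder property names, for illustration only.
SAME_AS = "owl:sameAs"
GEO_PROPS = ("geo:lat", "geo:long")

def enrich_by_dereferencing(local_triples, remote_triples):
    """Copy geographic properties from sameAs targets back to local resources.

    Both graphs are modelled as sets of (subject, predicate, object) tuples.
    """
    enriched = set(local_triples)
    for s, p, o in local_triples:
        if p == SAME_AS:
            for rs, rp, ro in remote_triples:
                if rs == o and rp in GEO_PROPS:
                    # Attach the geo data to the local resource as well.
                    enriched.add((s, rp, ro))
    return enriched

local = {("http://example.org/Leipzig", SAME_AS,
          "http://dbpedia.org/resource/Leipzig")}
remote = {("http://dbpedia.org/resource/Leipzig", "geo:lat", "51.3397"),
          ("http://dbpedia.org/resource/Leipzig", "geo:long", "12.3731")}
result = enrich_by_dereferencing(local, remote)
# result now also carries the coordinates on http://example.org/Leipzig
```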
The idea behind GeoLift is to provide a generic architecture that contains means to exploit these three characteristics of Linked Data. In the following, we present the technical approach underlying GeoLift.
GeoLift was designed to be a modular tool which can be easily extended and re-purposed. In its first version, it provides two main types of artifacts:
- Modules: These artifacts are in charge of generating geographical data based on RDF data. To this end, they implement the three intuitions presented above. The input for such a module is an RDF dataset (in Java, a Jena Model). The output is the same RDF dataset enriched with geographical information (in Java, an enriched Jena Model).
- Operators: The idea behind operators is to enable users to define a workflow for processing their input dataset. Thus, if users know the type of enrichment that is to be carried out (for example, linking followed by dereferencing), they can define the sequence of modules that must be used to process their dataset. Note that the input and output formats of modules are identical, so users are empowered to create workflows of arbitrary complexity by simply connecting modules.
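Because every module consumes and produces the same format, chaining them is trivial. The following Python sketch (the two stand-in modules and their enrichment behaviour are invented for illustration; GeoLift itself is Java/Jena) shows the idea:

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Module = Callable[[Set[Triple]], Set[Triple]]  # RDF in, enriched RDF out

def run_workflow(dataset: Set[Triple], modules: Iterable[Module]) -> Set[Triple]:
    """Apply enrichment modules in sequence; identical input/output
    formats mean modules compose into workflows of arbitrary length."""
    for module in modules:
        dataset = module(dataset)
    return dataset

# Two stand-in modules (names and behaviour are hypothetical):
def linking_module(ds):
    return ds | {("http://example.org/Athen", "owl:sameAs",
                  "http://dbpedia.org/resource/Athen")}

def dereferencing_module(ds):
    extra = {(s, "geo:lat", "37.98") for s, p, o in ds if p == "owl:sameAs"}
    return ds | extra

result = run_workflow(set(), [linking_module, dereferencing_module])
```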
The corresponding architecture is shown below. The input layer allows reading RDF in different serializations. The enrichment modules form the second layer and allow adding geographical information to RDF datasets by different means. The operators (to be implemented in a future version of GeoLift) will combine the enrichment modules and allow defining a workflow for processing information. The output layer serializes the results in different formats. The enrichment procedure will be monitored by a controller, which will also be added in a future version of GeoLift.
In the last days of July, the second meeting of the GeoKnow project took place in Athens. GeoKnow members had the opportunity to meet again after the Leipzig kick-off meeting, discuss the work performed during the first 7 months of the project, and agree on the next steps. Apart from that, our fellow partners had the chance to stroll around some of the most historic and picturesque sites of Athens, like the old and the new parliament buildings, the National University of Athens, the Temple of Olympian Zeus, the Monastiraki and Plaka quarters and the Acropolis, and taste some of the most iconic Greek dishes!
The first part of the meeting focused on the work performed so far. Advances from each Work Package were presented, feeding discussions about (a) integrating currently developed tools, (b) utilizing these tools to manage and process use case datasets and (c) resolving open research issues and enhancing the functionality and efficiency of the developed solutions. All partners agreed that the project has advanced significantly since December: the first tools for managing and processing geospatial RDF data have already been developed, and very informative reports about the state of the art in geospatial and RDF data management, benchmarking and system requirements have been published as well.
Several discussions, during both meeting days, revolved around the system architecture and, specifically, the GeoKnow Generator (GKG). Concrete decisions were made about the GKG backend, which will include a set of loosely integrated components for consuming, processing and exposing geospatial Linked Data, based on the Virtuoso RDF store. We also considered issues regarding user management, the GKG’s front end, workflow processing and the implementation of gathered system and user requirements.
With respect to geospatial information management, where GeoKnow has already provided solutions for transforming conventional geospatial data into RDF and exposing it (Sparqlify, LinkedGeoData, TripleGeo), all partners agreed that there is the potential to build (based on the work performed in Tasks 2.1 and 1.3) a timely geospatial RDF benchmark that tests the efficiency and functionality of today’s RDF stores with geospatial support. Next steps were also discussed, with emphasis on further optimizing the geospatial query capabilities of the underlying RDF store (Virtuoso).
As far as semantic integration of geospatial data is concerned, tools developed within GeoKnow for enriching and interlinking geospatial RDF data (GeoLift, LIMES) were first presented to the consortium. These tools triggered further discussions about the fusion and aggregation solutions currently under design and development, as well as how these tools can be tested directly on commercial datasets from the use case partners. Finally, a large part of our discussions was dedicated to quality measures and quality assessment of geospatial data; although these tasks are scheduled for later periods of the project, they are of high importance for all the functionality being built in Work Package 3, since quality indicators of datasets constitute valuable input for processing steps such as interlinking and fusion.
After the presentation of GeoKnow tools for visualization and authoring of geospatial RDF data (implemented or under development), such as Facete, creative ideas were exchanged, covering both detailed technical implementation solutions and the functionality desired by end users. Some important aspects that were considered are the implementation and functionality of spatial authoring, the issue of public and spatial Linked Data co-evolution and the potential for spatial-social networking. Again, the discussions considered the use case scenarios of the project, that is, how the offered functionality can serve commercial and industrial Linked Data management and visualization needs.
In conclusion, during the GeoKnow meeting in Athens all partners exchanged interesting ideas about ongoing and future work and set more concrete objectives to achieve through the next months. We thank all the GeoKnow members for attending and contributing to this constructive meeting!
For our evaluation of current geospatial RDF stores (for which you can read more here), we have set up five pre-built Virtual Machines. Each one contains a working installation of one of these geospatial RDF stores (on Debian 6.x “squeeze”):
- Virtuoso Universal Server (7.0, ColumnStore edition)
- Parliament (2.7.4 quickstart)
- uSeekM (1.2.0-a5, on top of PostgreSQL 8.4 and PostGIS 1.5)
- OWLIM-SE (Trial version 5.3.5849)
- Strabon (3.2.3, on top of PostgreSQL 8.4 and PostGIS 1.5)
The VMs are available at: ftp://firstname.lastname@example.org/ (username:guest, password: be-my-guest).
The images are normal Xen domU disk images, which can be used to create a functional domU guest. Of course, you should also provide:
- your own guest configuration file at /etc/xen/<the-vm>.cfg (or wherever you have chosen to configure your domU guests)
- a swap image file
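As a starting point, a minimal domU configuration might look like the following sketch. Xen configuration files use Python-style assignments; all names, paths and sizes below are placeholders that you must adapt to your own host:

```
# /etc/xen/the-vm.cfg  (all values are placeholders)
name       = "geo-rdf-store"
memory     = 4096
vcpus      = 2
bootloader = "pygrub"        # or point kernel/ramdisk at a dom0 kernel image
disk       = ['file:/srv/xen/the-vm.img,xvda1,w',
              'file:/srv/xen/the-vm-swap.img,xvda2,w']
root       = '/dev/xvda1 ro'
vif        = ['bridge=xenbr0']
```

The second disk entry is the swap image file mentioned above; the bridge name depends on how networking is configured on your dom0.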
Use root:root as the root password for your newly created machine (and of course, change it right after your first login). After creating the machine, you should log in via the serial console (e.g. xm console <the-vm>) and:
- configure your network interfaces to adapt to your local network
- configure your hostname and your /etc/hosts file
- restart your network interfaces
We have also uploaded to the same FTP location the dataset we used for our evaluation: OpenStreetMap data for Great Britain covering England, Scotland, and Wales (as of 05/03/2013) in ESRI shapefile format. Of the available OSM layers, only those concerning the road network (roads), points of interest (points) and natural parks and waterbodies (natural) were used. From each original layer, only the most important attributes were retained: shape, name, osm_id, type.
You can find more information on how we used these VMs for our evaluation, in the public deliverable D2.1.1 ‘Market and Research Overview’.
Our goal in GeoKnow is to bring geospatial data into the Linked Data Web. Among other things, our work will provide the means for efficiently managing and querying geospatial data in RDF stores.
But how can we measure our success? And what is the yardstick we will compare ourselves against?
While many popular RDF stores advertise geospatial capabilities, we found little or no quantitative information in the relevant literature regarding the following:
- Conformance to GeoSPARQL. Given GeoSPARQL’s status as an OGC standard and the ongoing work of various research groups on its integration with other Semantic Web technologies, there is a clear need for examining the GeoSPARQL compliance of RDF stores. In the highly standards-driven domain of GIS, conformance with open standards is a highly advertised and desirable property for open and proprietary software alike. End users select products based on features validated under open tests that anyone can reproduce. In order for GeoSPARQL to reach the mainstream and be actively adopted by the Semantic Web and GIS communities alike, we need a similar open process for compliance testing of the various products and systems.
- Performance under realistic geospatial workloads. Although there are several acknowledged benchmarks for evaluating RDF stores (e.g. BSBM), the only benchmark handling geospatial RDF is SSWB. SSWB, while it includes some important geospatial query types, is based on synthetic data and evaluates only one aspect of performance (query duration). There is a clear lack of a feature-complete benchmark based on realistic data and workloads that measures and highlights both the novel capabilities geospatial RDF stores provide and their support for typical everyday applications of geospatial data management.
- Comparison with geospatial RDBMSs. Geospatial databases are used across the globe by a diverse community that produces, queries, and analyzes geospatial data. The performance, integration capabilities, and overall maturity of geospatial RDBMSs form the natural baseline of the GIS community regarding data management, and geospatial RDF stores will inevitably be compared along these lines. Of course, RDF stores also provide functionality that typical GIS and RDBMSs cannot support, which is something we can readily advertise to the GIS community. However, we have no answers yet regarding the side effects of applying a geospatial RDF store in a real-world setting. Such stores might never reach the native performance of an RDBMS, but is that lack of speed balanced by increased interoperability? Similar questions apply to scaling, compatibility with existing tools, and so on. Potential users want accurate information that enables them to reach informed decisions regarding these trade-offs.
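To hint at what the performance part of such a methodology involves, here is a minimal Python timing harness. The query executor is a stand-in lambda so the sketch is self-contained; in a real benchmark it would submit the SPARQL (or SQL) text to the store under test:

```python
import statistics
import time

def benchmark(run_query, queries, repetitions=5):
    """Time each query several times and keep the median duration in
    seconds; the median is less sensitive to warm-up and outliers
    than the mean."""
    results = {}
    for name, query in queries.items():
        durations = []
        for _ in range(repetitions):
            start = time.perf_counter()
            run_query(query)
            durations.append(time.perf_counter() - start)
        results[name] = statistics.median(durations)
    return results

# Stand-in executor so the sketch runs on its own; a real harness
# would send the query text to a SPARQL endpoint instead.
timings = benchmark(lambda q: sum(range(10_000)),
                    {"range_query": "...", "spatial_join": "..."})
```

A full methodology also covers loading and indexing times, not just query durations, as the posts below describe.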
We feel that since incorporating geospatial capabilities into RDF stores is a subject of ongoing research and commercial efforts, it is important for the community to have access to a concrete methodology that describes the required resources and steps for objectively and realistically evaluating geospatial RDF stores.
With these goals in mind, we introduced a benchmarking methodology for evaluating geospatial RDF stores, which will form the basis for a formal, feature-complete benchmark in the future.
Based on this methodology, Athena completed a first-cut evaluation of five RDF stores with geospatial support: Virtuoso Universal Server, Parliament, uSeekM, OWLIM-SE and Strabon. Further, we compared these RDF stores with two prominent geospatial RDBMSs: Oracle Spatial and PostGIS.
- For our evaluation we used real geospatial data, and specifically, OpenStreetMap data for Great Britain covering England, Scotland, and Wales (as of 05/03/2013).
- We transformed this dataset into RDF triples (~26M) using TripleGeo.
- We tested a total of 21 SPARQL queries covering several geospatial features: location queries, range queries, spatial joins, nearest neighbor queries and spatial aggregates. For each query we created an appropriate representation compatible with each RDF store.
- In parallel, we tested the same set of queries on the geospatial RDBMSs.
- Apart from query times, we also measured other quantities, such as number of loaded triples, loading time and index creation time.
- We evaluated the conformance of each RDF store to the GeoSPARQL standard.
A short description of the test queries used is given in the table below:
Our evaluation surfaced several interesting issues (some already known and others newly found), as well as valuable intuitions and directions for improving the functionality of geospatially enabled RDF stores.
- First of all, it became clear that considerable effort from industry and academia is needed to add baseline functionality to current RDF stores, regarding both the geospatial features and the geospatial standards they support. For example, some stores support only point geometries, while only a few of them support the GeoSPARQL standard.
- Another important issue is the efficiency of current RDF stores with respect to loading, indexing and querying times, as compared to RDBMSs.
- Finally, several enhancements regarding both functionality and efficiency constitute very interesting research directions that could raise the current performance of RDF stores, establishing them as first-class research and industrial data management systems.
Detailed results, as well as a thorough survey of current geospatial and semantic standards can be found in our public deliverable D2.1.1 ‘Market and Research Overview’.
You can also find more information regarding our testing environment in this post. The complete set of pre-built Virtual Machines, queries, and datasets we used is also freely available for download.
In attendance were many open and linked data enthusiasts keen to discuss the issues and challenges concerning the publishing, linking and, most importantly, use of open data.
The discussions concerning open data, and how to make it more accessible to all, whether by standardizing formats or by working with the formats already in use, were intense and fascinating. The many presentations concerning the efforts of various governments to open their data, as well as projects seeking to create practical ways of publishing or using data, were equally engaging.
GeoKnow was represented by Ontos and presented the challenges, motivations, and goals of the GeoKnow project. Feedback on the motivation for the project was positive and interest was expressed in the GeoKnow Generator as well as a possible collaboration in one of the use cases.
A further interesting development was the interest expressed by INSPIRE stakeholders for exposing INSPIRE data as Linked Open Data. This is a timely development for GeoKnow since we will be providing tools for exposing INSPIRE data as LOD with the aim of re-purposing and leveraging the EU’s growing INSPIRE geospatial data in the LOD cloud.
A big thank you to the World Wide Web Consortium (W3C), the Open Data Institute, and the Open Knowledge Foundation for making the event happen. Another big thanks goes out to Google for hosting the event and providing the much appreciated sustenance and coffee over the two days!
Last month (April 2013), we invited geospatial data consumers and providers, GIS experts and Semantic Web specialists to participate in our Geospatial Data Users Survey. The goal of this survey was to collect general use cases and user requirements from people outside the GeoKnow consortium. We publicised the survey using mailing lists and social networks, and it was available for 25 days. During this period we received 122 responses: 51 full responses and 71 incomplete ones. Since we were interested in good-quality responses, we performed a manual quality check, which resulted in 39 useful responses – not too bad. In this blog post, we aim to show some interesting results from our survey. If you are interested in learning more about the results of this survey, you can check the public deliverable available here.
One of the goals of this survey was to learn about use cases different from those we already consider in the project. Thus, we asked participants how they use geospatial data in their work. To analyse this question, we grouped the answers into different types, shown in the graph on the right. Most of the scenarios were about visualisation and analysis, followed by geospatial data creation scenarios.
We asked users about the most popular tools they use in their work. The most common responses were OSM and Google Maps/Earth, followed by other GIS tools. When we asked about the features they like most about these tools, participants expressed a preference for free, easy-to-use tools, referring to their popular choices of Google Maps or OSM. Having an API to interact with the application is also important, and the fact that applications provide data that can be integrated was appreciated as well. GIS applications were considered difficult to use, and integration and interoperability were mentioned as goals. Finally, we were also interested in the missing functionalities that could improve their work. A list of these functionalities, grouped by the related Work Package within GeoKnow, is presented in the image below.
This survey allowed us to learn about different use cases, the main features used, and desired functionalities, all of which will be considered in the creation of the GeoKnow Generator. Some important high-level findings from the survey were the emphasis on interoperability and reusability through open APIs and approachable visualisation components, support for common geospatial data formats and geodatabases, and the need for simple tools supporting data integration and reuse from geospatial LOD sources.
We also found that some of the ideas of the GeoKnow project are further supported by user requirements like the integration of private and public data and the importance of using the web as an integration platform.
This month the GeoKnow consortium supported and sponsored the European Data Forum (EDF) in Dublin, Ireland. The event aims to bring together academia, industry and the public sector around Big Data topics, to exchange ideas, showcase new developments and devise actionable roadmaps for strengthening the European data economy.
GeoKnow was presented by members of the AKSW research group, who explained components of the recently started project and discussed a set of interesting topics such as:
- Related efforts in the GIS domain
- Visualization and interlinking of RDF data having multiple spatial relations to different entities (people, documents, buildings, …)
- Synergies with the ongoing LOD2 and recently finished LATC projects.
This time, our private-public co-evolution and interlinking plans attracted the most attention, as the integration and synchronization of open datasets with business ones is a crucial problem for many attendees of this forum.
As a result of these discussions we received a few feedback statements, such as:
- Address normalization should be considered as part of the integration efforts
- Workflows of how cleaned/reconciled data can be pushed back to the sources should be investigated
The EDF was also an excellent opportunity for networking and learning about topics of interest closely related to our efforts, bearing potential for future collaborations.
Furthermore, congratulations to Daimler, who won the first-ever “European Data Innovator Award”, a prize for those who “have shown exceptional vision and execution in the field of integrating and leveraging enterprise (open) data”.
We thank everyone involved in organizing this great event, especially Elena Simperl and Michael Hausenblas, and we look forward to next year’s event in Athens.
Additionally thanks to all the sponsors who supported this meet-up.
The Open Data on the Web Workshop will be held in London, at the Google Campus, on 23–24 April.
On the agenda are the following topics concerning Open Data:
- transformation (to other formats);
- combinations of data from different models (e.g. linked data and CSV);
- quality assessment and self-description;
- extracting human-readable “stories” from data.
Ontos, represented by Jon Jay Le Grange, will join in and present the technology and aims of the GeoKnow project.
Learn more about the workshop and its topics.