Geospatial information extraction and management
One of the main contributions of GeoKnow was to take geospatial data out of GIS silos and make it accessible on the Web. This opens the data to non-experts and provides self-explanatory spatial information structures that are reachable via standard Web protocols, while supporting ad-hoc definable, flexible information structures. GeoKnow developed and improved Sparqlify and TripleGeo, tools that transform geospatial data from several conventional formats into RDF triples compliant with several standards (GeoSPARQL, the Virtuoso vocabulary, etc.). Sparqlify has been tested for mapping the OpenStreetMap (OSM) database to RDF, and TripleGeo supports several geospatial databases (PostgreSQL, Oracle, MySQL, IBM DB2) as well as file formats (ESRI shapefiles, GML, KML).
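To illustrate the kind of output such a transformation produces, the following minimal sketch turns a conventional geometry record into GeoSPARQL-compliant RDF using the Python rdflib library. The feature URI and coordinates are invented for demonstration and do not reflect TripleGeo's internal implementation.
\begin{verbatim}
# Minimal sketch: emitting a GeoSPARQL-style geometry in RDF with rdflib.
# The resource URIs and the point coordinates are illustrative only.
from rdflib import Graph, Literal, Namespace, RDF

GEO = Namespace("http://www.opengis.net/ont/geosparql#")
SF = Namespace("http://www.opengis.net/ont/sf#")
EX = Namespace("http://example.org/resource/")   # hypothetical namespace

g = Graph()
g.bind("geo", GEO)
g.bind("sf", SF)

feature = EX["Leipzig"]          # hypothetical feature
geometry = EX["Leipzig_geom"]

g.add((feature, RDF.type, GEO.Feature))
g.add((feature, GEO.hasGeometry, geometry))
g.add((geometry, RDF.type, SF.Point))
g.add((geometry, GEO.asWKT,
       Literal("POINT(12.3731 51.3397)", datatype=GEO.wktLiteral)))

print(g.serialize(format="turtle"))
\end{verbatim}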
As the need for location data within Linked Data applications has increased, RDF triple stores have accordingly been required to support multiple geometry types at Web scale. At the beginning of the GeoKnow project the Virtuoso RDF Quad Store supported only the Point geometry type; during the course of the project it was enhanced to support some 14 additional geometry types (including Pointlist, Ring, Box, Linestring, Multilinestring, Polygon, Multipolygon, Collection, Curve, Closedcurve, Curvepolygon and Multicurve) and the associated functions, with near-full compliance with the GeoSPARQL/OGC standards now in place. Support for the GEOS (Geometry Engine - Open Source) library has also been implemented in Virtuoso, further enhancing its geospatial capabilities.
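The sketch below shows the kind of GeoSPARQL query that such support enables, issued from Python via SPARQLWrapper. The endpoint URL, graph contents and bounding polygon are placeholders, not taken from any GeoKnow deployment.
\begin{verbatim}
# Hedged sketch: a GeoSPARQL spatial filter query against a Virtuoso (or other
# GeoSPARQL-capable) SPARQL endpoint. Endpoint URL and data are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")   # hypothetical endpoint
sparql.setQuery("""
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?feature ?wkt WHERE {
  ?feature geo:hasGeometry ?g .
  ?g geo:asWKT ?wkt .
  # keep only geometries intersecting an (illustrative) bounding polygon
  FILTER (geof:sfIntersects(?wkt,
    "POLYGON((12.2 51.2, 12.5 51.2, 12.5 51.5, 12.2 51.5, 12.2 51.2))"^^geo:wktLiteral))
}
LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["feature"]["value"], row["wkt"]["value"])
\end{verbatim}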
The Virtuoso query optimiser has been enhanced to improve geospatial query performance, including the parallelisation of query execution. In addition, RDF storage optimisation has been improved by reorganising physical data placement according to geospatial properties, implementing structure-aware RDF storage based on Characteristic Sets. Over the course of the project, annual benchmarking of the Virtuoso Quad Store was performed with the GeoBench tool to demonstrate the improvements over the state of the art in geospatial querying. In the final report Virtuoso was benchmarked against the newer OSM dataset. The Description of Work envisioned scalability up to 25 billion triples with query times below 0.5 s, whereas the presented results used more than 50 billion triples with an average query execution time of 0.46 s for a power run; the results thus exceeded expectations.
Spatial knowledge aggregation and fusing
The GeoKnow project also aimed at enriching the Web of Data with the geospatial dimension, and therefore contributed interlinking and fusing methods adapted to spatial information. Two of the first tools achieving this are the Data Extraction and Enrichment Framework (DEER), formerly known as GeoLift, and LIMES. DEER adds the spatial dimension to a dataset by describing locations found in its links or in unstructured data (using an NLP component). LIMES was extended to enable linking of complex resources in geospatial datasets (e.g., polygons, line strings, etc.). Furthermore, these geo-linking tools were extended to scale using MapReduce algorithms in a cloud-based architecture. To evaluate these developments, corresponding benchmarks were created (see \ref{sec:benchmarking}). This experimental benchmark consisted of linking cities from DBpedia and LinkedGeoData. Initial results suggested that by using the geospatial dimension and a mean distance when linking datasets, a perfect linking accuracy could be achieved. This research was recognised with the Best Research Paper award at ESWC 2013.
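The following sketch illustrates the underlying idea of distance-based geospatial linking: two resources become link candidates when their geometries lie within a distance threshold. The sample records and threshold are invented; LIMES itself operates on full link specifications and on complex geometries rather than single points.
\begin{verbatim}
# Illustrative sketch of distance-based link discovery between two datasets.
# Sample coordinates and the 1 km threshold are assumptions for demonstration.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two WGS84 points in kilometres."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

dbpedia = {"dbpedia:Leipzig": (12.3731, 51.3397)}          # hypothetical sample
linkedgeodata = {"lgdo:node_12345": (12.3744, 51.3402)}    # hypothetical sample

THRESHOLD_KM = 1.0
links = [
    (a, b)
    for a, (lon_a, lat_a) in dbpedia.items()
    for b, (lon_b, lat_b) in linkedgeodata.items()
    if haversine_km(lon_a, lat_a, lon_b, lat_b) <= THRESHOLD_KM
]
print(links)
\end{verbatim}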
For working directly with geometries, the FAGI framework was created to fuse different RDF representations of geometries into a consistent map. The tool receives as input two datasets and a set of links that interlink entities between them, and produces a new dataset in which each pair of linked entities is fused into a single entity. The fusion is performed for each pair of matched properties between two linked entities, according to a selected fusion action, and considers both spatial and non-spatial properties (metadata). Fusing geospatial data can be very time-consuming, so improvements were proposed and implemented to optimise several processes (focusing on minimising data transfer and exploiting graph-joining functionality), and a benchmark was designed to evaluate those improvements. FAGI was also extended with additional functionality that supports exploration, manual authoring, several options for batch fusion actions and link discovery, as well as learning mechanisms for recommending fusion actions and annotation classes for fused entities.
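A minimal sketch of the per-property fusion step is given below. The entity data and the action table are illustrative; FAGI offers a much richer set of fusion actions, including geometry-specific ones, and learns recommendations for them.
\begin{verbatim}
# Minimal sketch of per-property fusion for a pair of linked entities.
# Entities and fusion actions below are invented for illustration.
def fuse(left, right, actions):
    """Fuse two property maps according to a per-property fusion action."""
    fused = {}
    for prop in set(left) | set(right):
        action = actions.get(prop, "keep-left")
        if action == "keep-left":
            fused[prop] = left.get(prop, right.get(prop))
        elif action == "keep-right":
            fused[prop] = right.get(prop, left.get(prop))
        elif action == "keep-both":
            fused[prop] = [v for v in (left.get(prop), right.get(prop))
                           if v is not None]
    return fused

left  = {"rdfs:label": "Leipzig", "geo:asWKT": "POINT(12.3731 51.3397)"}
right = {"rdfs:label": "Leipzig, Saxony", "geo:asWKT": "POINT(12.3744 51.3402)"}
actions = {"rdfs:label": "keep-both", "geo:asWKT": "keep-left"}
print(fuse(left, right, actions))
\end{verbatim}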
Quality Assessment
GeoKnow also worked on providing tools to improve the quality of existing datasets. The OSM community constantly contributes enrichments and enhancements to OSM maps using the tools provided for this purpose, and GeoKnow contributed to improving the quality of user-provided annotations by generating classification and clustering models that recommend categories for new entities inserted into OSM. OSMRec, the tool developed for this purpose, can recommend OSM categories for newly created geospatial entities based on already existing annotated entities in OSM. Further quality assessment of geospatial data was investigated, first by identifying metrics that can be used to assess the data with respect to aspects such as coverage, surface area and structuredness. These metrics were used to evaluate community-generated datasets, and their outcomes informed the creation of two software tools for assessing dataset quality. CROCUS produces statistics about the data in the form of three Data Cubes: the first addresses accuracy, while the second and third address the completeness and consistency of spatial data. The GeoKnow Quality Evaluator (GQE) reuses CROCUS and implements a set of geospatial data quality metrics (e.g., dataset coverage, coherence, average polygons per class, etc.) to compare datasets across these metrics. These tools were used to evaluate three datasets: LinkedGeoData, NUTS and GeoLinkedData; the results helped in understanding the overall structure of the datasets and the variety of their data. Another data assessment tool created in GeoKnow is the RDF Data Validation Tool, which is based on the integrity constraints defined by the RDF Data Cube vocabulary and is focused on statistical data.
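Two of the simpler metrics mentioned above can be sketched as follows; the in-memory sample records are invented, whereas GQE and CROCUS operate on full RDF datasets.
\begin{verbatim}
# Hedged sketch of two simple quality metrics: average entities per class and
# completeness of geometry annotations, computed over an invented sample.
from collections import Counter

records = [  # (class, has_geometry) - illustrative sample
    ("lgdo:City", True), ("lgdo:City", True),
    ("lgdo:River", True), ("lgdo:River", False),
]

per_class = Counter(cls for cls, _ in records)
avg_per_class = len(records) / len(per_class)
completeness = sum(1 for _, has_geom in records if has_geom) / len(records)

print(f"average entities per class: {avg_per_class:.2f}")
print(f"geometry completeness:      {completeness:.2%}")
\end{verbatim}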
Visualisation and Data Curation
The exploration and visualisation of data is a crucial task for end users. GeoKnow aimed at creating maps that are dynamically enriched and adapted to the needs of specific user communities, so modern software frameworks were explored to support the creation of such interfaces. GeoKnow developed reusable JavaScript libraries for interfacing with SPARQL endpoints. These libraries were used, for instance, in Mappify, a tool for easily generating and sharing maps as widgets, and in Facete, a faceted browser for spatial and non-spatial RDF data enhanced with editing support. The editing capabilities essentially consist of defining the interaction between an endpoint and the Facete UI. The RDF Edit eXtension (REX) tool interface was implemented to support two kinds of data editing: one dealing with geospatial data on a map, the other with editing triples. Furthermore, Lodtenant was developed to support curating RDF data by means of workflows realised as batch processes. After the curation process, one may need to save changes for later use or propagate them to other datasets. One of Unister's requirements was the capability to manage and synchronise changes between different versions of private and public interlinked datasets\footnote{\url{http://svn.aksw.org/projects/GeoKnow/Public/D4.3.1_Concept_for_public-private_Co-Evolution.pdf}}. This requirement led to the development of the Co-Evolution Service component, a web application with a REST interface that allows managing dataset changes.
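As an illustration of the faceted-browsing idea, the sketch below builds a SPARQL query from user-selected facet constraints, turning each (property, value) pair into a triple pattern. The property IRIs and values are placeholders and do not reflect Facete's actual query-generation logic.
\begin{verbatim}
# Illustrative sketch: translating facet selections into a SPARQL query.
# Property IRIs and values are placeholders, not Facete internals.
def facet_query(facets, limit=20):
    """Build a SPARQL SELECT query from (property IRI -> value term) facets."""
    patterns = [f"?s <{prop}> {value} ." for prop, value in facets.items()]
    return ("SELECT DISTINCT ?s WHERE {\n  "
            + "\n  ".join(patterns)
            + f"\n}}\nLIMIT {limit}")

facets = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type":
        "<http://linkedgeodata.org/ontology/City>",
    "http://linkedgeodata.org/ontology/population": '"500000"',
}
print(facet_query(facets))
\end{verbatim}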
Another component was developed for visualising spatio-temporal data: \textit{Exploratory Spatio-Temporal Analysis of Linked Data} (ESTA-LD) is a tool for the spatio-temporal analysis of linked statistical data that appear at different levels of granularity. Finally, mobile-based visualisation was also covered in GeoKnow. The GEM application enables faceted browsing that fully exploits the Linked Open Data paradigm: it allows browsing any number of SPARQL endpoints, filtering resources based on their type and on property constraints, and leveraging GPS positioning to deliver semantic routing.
GeoKnow Generator Workbench
The GeoKnow Generator Workbench provides a unified interface and data access to most of the tools described earlier in this section, and is available online for testing. It enables simple access to and interaction with the different components needed in the Linked Data lifecycle. The Workbench was designed according to the requirements specification of the GeoKnow use cases. In general, these requirements include:
- Scalability for working with large data sets
- Authentication, Authorisation and Role Management as a primary requirement in companies
- Data Provenance tracking for traceability of changes
- Job Monitoring and Robustness for applicability in production
- Modularity and Composability to provide flexibility with respect to integrating Linked Data tools