GeoKnow was present at the Extended Semantic Web Conference (which took place in Portoroz, Slovenia) in many ways. In particular, we presented a novel approach for automating RDF dataset transformation and enrichment in the machine learning track of the main conference. The presentation was followed by a constrictive discussion of results and insights plus hints pertaining to further possible applications of the approach.
Manifold RDF data contain implicit references to geographic data. For example, music datasets such as Jamendo include references to locations of record labels, places where artists were born or have been, etc. The aim of the spatial mapping component, dubbed GeoLift, is to retrieve this information and make it explicit.
Geographical information can be mentioned in three different ways within Linked Data:
- Through dereferencing: Several datasets contain links to datasets with explicit geographical information such as DBpedia or LinkedGeoData. For example, in a music dataset, one might find information such as http://example.org/Leipzig owl:sameAs http://dbpedia.org/resource/Leipzig. We call this type of reference explicit. We can now use the semantics of RDF to fetch geographical information from DBpedia and attach it to the resource in the other ontology as http://example.org/Leipzig and http://dbpedia.org/resource/Leipzig refer to the same realworld object.
- Through linking: It is known that the Web of Data contains an insufficient number of links. The latest approximations suggest that the Linked Open Data Cloud alone consists of 31+ billion triples but only contains approximately 0.5 billion links (i.e., less than 2% of the triples are links between knowledge bases). The second intuition behind GeoLift is thus to use link discovery to map resources in an input knowledge base to resources in a knowledge that contains explicit geographical information. For example, given a resource http://example.org/Athen, GeoLift should aim to find a resource such as http://dbpedia.org/resource/Athen to map it with. Once having established the link between the two resources, GeoLift can then resolve to the approach defined above.
- Through Natural Language Processing: In some cases, the geographic information is hidden in the objects of data type properties. For example, some datasets contain biographies, textual abstracts describing resources, comments from users, etc. The idea here is to use this information by extracting Named Entities and keywords using automated Information Extraction techniques. Semantic Web Frameworks such as FOX have the main advantage of providing URIs for the keywords and entities that they detect. These URIs can finally be linked with the resources to which the datatype properties were attached. Finally, the geographical information can be dereferenced and attached to the resources whose datatype properties were analyzed.
The idea behind GeoLift is to provide a generic architecture that contains means to exploit these three characteristics of Linked Data. In the following, we present the technical approach underlying GeoLift.
GeoLift was designed to be a modular tool which can be easily extended and re-purposed. In its first version, it provides two main types of artifacts:
- Modules: These artifacts are in charge of generating geographical data based on RDF data. To this aim, they implement the three intuitions presented above. The input for such a module is an RDF dataset (in Java, a Jena Model ). The output is also an RDF dataset enriched with geographical information (in Java, an enriched Jena Model ).
- Operators: The idea behind operators is to enable users to define a workflow for processing their input dataset. Thus, in case a user knows the type of enrichment that is to be carried out (using linking and then links for example), he can define the sequence of modules that must be used to process his dataset. Note that the format of the input and output of modules is identical. Thus, the user is empowered to create workflows of arbitrary complexity by simply connecting modules.
The corresponding architecture is shown below. The input layer allows reading RDF in different serializations. The enrichment modules are in the second layer and allow adding geographical information to RDF datasets by different means. The operators (which will be implemented in the future version of GeoLift) will combine the enrichment modules and allow defining a workflow for processing information. The output layer serializes the results in different format. The enrichment procedure will be monitored by implementing a controller, which will be added in the future version of GeoLift.