Author Archives: Hugh Williams

Linked Geospatial Data 2014 Workshop, Part 4: GeoKnow, London, Brussels, The Message

Last Friday (2014-03-14) I (Orri Erling) gave a talk about GeoKnow at the EC Copernicus Big Data workshop. This was a trial run for more streamlined messaging. I have, aside the practice of geekcraft, occupied myself with questions of communication these last weeks.

The clear take-home from London and Brussels alike is that these events have full days and 4 or more talks an hour. It is not quite TV commercial spots yet but it is going in this direction.

If you say something complex, little will get across unless the audience already knows what you will be saying.

I had a set of slides from Jens Lehmann, the GeoKnow project coordinator, for whom I was standing in. Now these are a fine rendition of the description of work. What is wrong with partners, work packages, objectives, etc? Nothing, except everybody has them.

I recall the old story about the journalist and the Zen master: The Zen master repeatedly advises the reporter to cut the story in half. We get the same from PR professionals, “If it is short, they have at least thought about what should go in there,” said one recently, talking of pitches and messages. The other advice was to use pictures. And to have a personal dimension to it.

Enter “Ms. Globe” and “Mr. Cube”. Frans Knibbe of Geodan gave the Linked Geospatial Data 2014 workshop’s most memorable talk entitled “Linked Data and Geoinformatics – a love story” (pdf)about the excitement and the pitfalls of the burgeoning courtship of Ms. Globe (geoinformatics) and Mr. Cube (semantic technology). They get to talking, later Ms. Globe thinks to herself… “Desiloisazation, explicit semantics, integrated metadata…” Mr. Cube, young upstart now approaching a more experienced and sophisticated lady, dreams of finally making an entry into adult society, “critical mass, global scope, relevant applications…” There is a vibration in the air.

So, with Frans Knibbe‘s gracious permission I borrowed the storyline and some of the pictures.

We ought to make a series of cartoons about the couple. There will be twists and turns in the story to come.Mr. Cube is not Ms. Globe’s first lover, though; there is also rich and worldly Mr. Table. How will Mr. Cube prove himself? The eternal question… Well, not by moping around, not by wise-cracking about semantics, no. By boldly setting out upon a journey to fetch the Golden Fleece from beyond the crashing rocks. “Column store, vectored execution, scale out, data clustering, adaptive schema…” he affirms, with growing confidence.

This is where the story stands, right now. Virtuoso run circles around PostGIS doing aggregations and lookups on geometries in a map-scrolling scenario (GeoKnow’s GeoBenchLab). VirtuosoSPARQL outperforms PostGIS SQL against planet-scale OpenStreetMap; Virtuoso SQL goes 5-10x faster still.

Mr Cube is fast on the draw, but still some corners can be smoothed out.

Later in GeoKnow, there will be still more speed but also near parity between SQL and SPARQL via taking advantage of data regularity in guiding physical storage. If it is big, it is bound to have repeating structure.

The love story grows more real by the day. To be consummated still within GeoKnow.

Talking of databases has the great advantage that this has been a performance game from the start. There are few people who need convincing about the desirability of performance, as this also makes for lower cost and more flexibility on the application side.

But this is not all there is to it.

In Brussels, the public was about E-science (Earth observation). In science, it is understood that qualitative aspects can be even more crucial. I told the story about an E-science-oriented workshop I attended in America years ago. The practitioners, from high energy physics to life sciences to climate, had invariably come across the need for self-description of data and for schema-last. This was essentially never provided by RDF, except for some life science cases. Rather, we had one-off schemes, ranging from key-value pairs to putting the table name in a column of the same table to preserve the origin across data export.

Explicit semantics and integrated metadata are important, Ms. Globe knows, but she cannot sacrifice operational capacity for this. So it is more than a DBMS or even data model choice — there must be a solid tool chain for data integration and visualization. GeoKnow provides many tools in this space.

Some of these, such as the LIMES entity matching framework (pdf) are probably close to the best there is. For other parts, the SQL-based products with hundreds of person years invested in user interaction are simply unbeatable.

In these cases, the world can continue to talk SQL. If the regular part of the data is in fact tables already, so much the better. You connect to Virtuoso via SQL, just like to PostGIS or Oracle Spatial, and talk SQL MM. The triples, in the sense of flexible annotation and integrated metadata, stay there; you just do not see them if you do not want them.

There are possibilities all right. In the coming months I will showcase some of the progress, starting with a detailed look at the OpenStreetMap experiments we have made in GeoKnow.

Linked Geospatial Data 2014 Workshop posts:

Linked Geospatial Data 2014 Workshop, Part 3: The Stellar Reach of OKFN

The Open Knowledge Foundation (OKFN) held a London Open Data Meetup in the evening of the first day of the Linked Geospatial Data 2014 workshop. The event was, as they themselves put it, at the amazing open concept office of OKFN at the Center for Creative Collaboration in Central London. What could sound cooler? True, OKFN threw a good party, with ever engaging and charismatic founder Rufus Pollock presiding. Phil Archer noted, only half in jest, that OKFN was so influential, visible, had the ear of government and public alike, etc., that it put W3C to shame.

Now, OKFN is a party in the LOD2 FP7 project, so I have over the years met people from there on and off. In LOD2, OKFN is praised to the skies for its visibility and influence and outreach and sometimes, in passing, critiqued for not publishing enough RDF, let alone five star linked data.

As it happens, CSV rules, and even the W3C will, it appears, undertake to standardize a CSV-to-RDF mapping. As far as I am concerned, as long as there is no alignment of identifiers or vocabulary, whether a thing is CSV or exactly equivalent RDF, there is no great difference, except that CSV is smaller and loads into Excel.

For OKFN, which has a mission of opening data, insisting on any particular format would just hinder the cause.

What do we learn from this? OKFN is praised not only for government relations but also for developer friendliness. Lobbying for open data is something I can understand, but how do you do developer relations? This is not like talking to customers, where the customer wants to do something and it is usually possible to give some kind of advice or recommendation on how they can use our technology for the purpose.

Are JSON and Mongo DB the key? A well renowned database guy once said that to be with the times, JSON is your data model, Hadoop your file system, Mongo DB your database, andJavaScript your language, and failing this, you are an old fart, a legacy suit, well, some uncool fossil.

The key is not limited to JSON. More generally, it is zero time to some result and no learning curve. Some people will sacrifice almost anything for this, such as the possibility of doing arbitrary joins. People will even write code, even lots of it, if it only happens to be in their framework of choice.

Phil again deplored the early fiasco of RDF messaging. “Triples are not so difficult. It is not true that RDF has a very steep learning curve.” I would have to agree. The earlier gaffes of theRDF/XML syntax and the infamous semantic web layer cake diagram now lie buried and unlamented; let them be.

Generating user experience from data or schema is an old mirage that has never really worked out. The imagined gain from eliminating application writing has however continued to fascinate IT minds and attempts in this direction have never really ceased. The lesson of history seems to be that coding is not to be eliminated, but that it should have fast turnaround time and immediately visible results.

And since this is the age of data, databases should follow this lead. Schema-last is a good point, maybe adding JSON alongside XML as an object type in RDF might not be so bad. There are already XML functions, so why not the analog for JSON? Just don’t mention XML to the JSON folks…

How does this relate to OKFN? Well, in the first instance this is the cultural impression I received from the meetup, but in a broader sense these factors are critical to realizing the full potential of OKFN’s successes so far. OKFN is a data opening advocacy group; it is not a domain-specific think tank or special interest group. The data owners and their consultants will do analytics and even data integration if they see enough benefit in this, all in the established ways. However, the widespread opening of data does create possibilities that did not exist before. Actual benefits depend in great part on constant lowering of access barriers, and on a commitment by publishers to keep the data up to date, so that developers can build more than just a one-off mashup.

True, there are government users of open data, since there is a productivity gain in already having the neighboring department’s data opened to a point; one does no longer have to go through red tape to gain access to it.

For an application ecosystem to keep growing on the base of tens of thousands of very heterogeneous datasets coming into the open, continuing to lower barriers is key. This is a very different task from making faster and faster databases or of optimizing a particular business process, and it demands different thinking.

Linked Geospatial Data 2014 Workshop posts:

Linked Geospatial Data 2014 Workshop, Part 2: Is SPARQL Slow?

I had a conversation with Andy Seaborne of Epimorphics, initial founder of the Jena RDF Framework tool chain and editor of many W3C recommendations, among which the two SPARQLs. We exchanged some news; I told Andy about our progress in cutting the RDF-to-SQL performance penalty and doing more and better SQL tricks. Andy asked me if there were use cases doing analytics over RDF, not in the business intelligence sense, but in the sense of machine learning or discovery of structure. There is, in effect, such work, notably in data set summarization and description. A part of this has to do with learning the schema, like one would if wanting to put triples into tables when appropriate. CWI in LOD2 has worked in this direction, as has DERI(Giovanni Tummarello‘s team), in the context of giving hints to SPARQL query writers. I would also mention Chris Bizer et al., at University of Mannheim, with their data integration work, which is all about similarity detection in a schema-less world, e.g., the 150M HTML tables in the Common Crawl, briefly mentioned in the previous blogJens Lehmann from University of Leipzig has also done work in learning a schema from the data, this time in OWL.

Andy was later on a panel where Phil Archer asked him whether SPARQL was slow by nature or whether this was a matter of bad implementations. Andy answered approximately as follows: “If you allow for arbitrary ad hoc structure, you will always pay something for this. However, if you tell the engine what your data is like, it is no different from executing SQL.” This is essentially the gist of our conversation. Most likely we will make this happen via adaptive schema for the regular part and exceptions as quads.

Later I talked with Phil about the “SPARQL is slow” meme. The fact is that Virtuoso SPARQL will outperform or match PostGIS SQL for Geospatial lookups against the OpenStreetMap dataset. Virtuoso SQL will win by a factor of 5 to 10. Still, the SPARQL is slow meme is not entirely without a basis in fact. I would say that the really blatant cases that give SPARQL a bad name are query optimization problems. With 50 triple patterns in a query there are 50-factorial ways of getting a bad plan. This is where the catastrophic failures of 100+ times worse than SQL come from. The regular penalty of doing triples vs tables is somewhere between 2.5 (Star Schema Benchmark) and 10 (lookups with many literals), quite acceptable for many applications. Some really bad cases can occur with regular expressions on URI strings or literals, but then, if this is the core of the application, it should use a different data model or an n-gram index.

The solutions, including more dependable query plan choice, will flow from adaptive schema which essentially reduces RDF back into relational, however without forcing schema first and with accommodation for exceptions in the data.

Phil noted here that there already exist many (so far, proprietary) ways of describing the shape of a graph. He said there would be a W3C activity for converging these. If so, a vocabulary that can express relationships, the types of related entities, their cardinalities, etc., comes close to a SQL schema and its statistics. Such a thing can be the output of data analysis, or the input to a query optimizer or storage engine, for using a schema where one in fact exists. Like this, there is no reason why things would be less predictable than with SQL. The idea of a re-convergence of data models is definitely in the air; this is in no sense limited to us.

Linked Geospatial Data 2014 Workshop posts:

Linked Geospatial Data 2014 Workshop, Part 1: Web Services or SPARQL Modeling?

The W3C (World Wide Web Consortium) and OGC (Open Geospatial Consortium) organized the Linked Geospatial Data 2014 workshop in London this week. The GeoKnow project was represented by Claus Stadler of Universität Leipzig, and Hugh Williams and myself (Orri Erling) from OpenLink Software. The Open Knowledge Foundation (OKFN) also held an Open Data Meetup in the evening of the first day of the workshop.

Reporting on each talk and the many highly diverse topics addressed is beyond the scope of this article; for this you can go to the program and the slides that will be online. Instead, I will talk about questions that to me seemed to be in the air, and about some conversations I had with the relevant people.

The trend in events like this is towards shorter and shorter talks and more and more interaction. In this workshop, talks were given in series of three talks with all questions at the end, with all the presenters on stage. This is not a bad idea since we get a panel-like effect where many presenters can address the same question. If the subject matter allows, a panel is my preferred format.

Web services or SPARQL? Is GeoSPARQL good? Is it about Linked Data or about ontologies?

Geospatial data tends to be exposed via web services, e.g., WFS (Web Feature Service). This allows item retrieval on a lookup basis and some predefined filtering, transformation, and content negotiation. Capabilities vary; OGC now has WFS 2.0, and there are open source implementations that do a fair job of providing the functionality.

Of course, a real query language is much more expressive, but a service API is more scalable, as people say. What they mean is that an API is more predictable. For pretty much any complex data task, a query language is near-infinitely more efficient than going back-and-forth, often on a wide area network, via an API. So, as Andreas Harth put it: for data publishers, make an API; an open SPARQL endpoint is too “brave,” [Andreas' word, with the meaning of foolhardy]. When you analyze, he continued, then you load it into a endpoint, but you use your own. Any quality of service terms must be formulated with respect to a fixed workload, this is not meaningful with ad hoc queries in an expressive language. Things like anytime semantics (return whatever is found within a time limit) are only good for a first interactive look, not for applications.

Should the application go to the data or the reverse? Some data is big and moving it is not self-evident. A culture of datasets being hosted on a cloud may be forming. Of course some linked data like DBpedia has for a long time been available as Amazon images. Recently, SindiceTech has made a similar packaging of Freebase. The data of interest here is larger and its target audience is more specific, on the e-science side.

How should geometries be modeled? I have met the GeoSPARQL and the SQL MM on which it is based with a sense of relief, as these are reasonable things that can be efficiently implemented. There are proposals where points have URIs, and linestrings are ordered sets of points, and collections are actual trees with RDF subjects as nodes. As a standard, such a thing is beyond horrible, as it hits all the RDF penalties and overheads full force, and promises easily 10x worse space consumption and 100x worse run times compared to the sweetly reasonable GeoSPARQL. One presenter said that cases of actually hanging attributes off points of complex geometries had been heard of but were, in his words, anecdotal. He posed a question to the audience about use cases where points in fact needed separately addressable identities. Several cases did emerge, involving, for example, different measurement certainties for different points on on a trajectory trace obtained by radar. Applications that need data of this sort will perforce be very domain specific. OpenStreetMap (OSM) itself is a bit like this, but there the points that have individual identity also have predominantly non-geometry attributes and stand for actually-distinct entities. OSM being a practical project, these are then again collapsed into linestrings for cases where this is more efficient. The OGC data types themselves have up to 4 dimensions, of which the 4th could be used as an identifier of a point in the event this really were needed. If so, this would likely be empty for most points and would compress away if the data representation were done right.

For data publishing, Andreas proposed to give OGC geometries URIs, i.e., the borders of a country can be more or less precisely modeled, and the large polygon may have different versions and provenances. This is reasonable enough, as long as the geometries are big. For applications, one will then collapse the 1:n between entity and its geometry into a 1:1. In the end, when you make an application, even an RDF one, you do not just throw all the data in a bucket and write queries against that. Some alignment and transformation is generally involved.

Linked Geospatial Data 2014 Workshop posts:

GeoKnow First Year Benchmark Results

A GeoKnow Benchmarking laboratory has been setup for comparison of benchmark results over the duration of the GeoKnow project. In it current form this takes for the form of original the LOD2 project GeoBench program which has been taken over and adopted for use in the GeoKnow project and available from the GeobenchLab GIT repo.

The current improvements in the GeoBench program are primarily related to the expansion of the benchmark, in order to make it employable not only to RDF data, but to relational data as well. This will open opportunities for a performance comparison between RDF and relational spatial data management  systems.

Below are some comparison results using the planet wide Open Street Map (OSM) datasets hosted in both Virtuoso and PostGIS.

Screenshot 2014-01-20 11.02.07

In summary, the result demonstrate that Virtuoso in both SQL and SPARQL outperformed PostGIS by significant factor. Specifically, all the queries in the power run were executed 33 times slower in PostGIS than in Virtuoso SQL (single server). If we compare PostGIS with Virtuoso SPARQL, the factor will be even greater: 131 for low zoom level queries, 23 for high zoom level queries, and 113 in total. If we correlate Virtuoso SPARQL and SQL (single server), we will conclude that the relational version is slower almost 4 times on low zoom level queries, while it is faster 23% on high zoom level. In total, SQL version is slower more than 3 times.

The full  descriptions and results of these benchmarks can be found in the GeoKnow D1.3.2 - Continuous Report on Performance Evaluation deliverable.