DOREMUS is a fine research project bringing together several producers…

Sparnatural as a simple data federation facade
Together with the FAIR data evangelists of the MSH Val-de-Loire, we recently worked on v2 of the OpenArchaeo portal, which uses Sparnatural as its core visual data exploration component (v2 is not yet visible, but will hopefully be finalized and announced soon!).
It uses a new feature of Sparnatural: the ability to act as a single UI facade to multiple SPARQL endpoints. The user visually writes a single query, the query is sent to multiple data sources, and the results are aggregated and presented to the user in a single result set that includes the provenance of each result.
This is depicted in the diagram below:
The user writes a query visually in Sparnatural, here « All archaeological sites where burials have been found, with the name of their discoverer, if known »:
The visual query is translated into SPARQL, and that SPARQL query is sent to each data source in the federation (in practice, the user can select which sources to query). This is depicted by the green arrows in the diagram, numbered « 1 ».
The SPARQL query looks like the following; note how it uses complex CIDOC-CRM property paths, such as the highlighted one, while the visual user query was simple:
Each SPARQL service returns a result. This is depicted by the orange arrows in the diagram, numbered « 2 ».
When every endpoint has answered, the results are aggregated into a single result set. This is number « 3 » in the diagram. During this aggregation, an extra column is added to the result set, containing the name of the source from which each result was retrieved.
The user sees the aggregated result; here, the name of the site, the name of its discoverer when known, and the source in which the result was found:
This is possible thanks to the catalog configuration of Sparnatural.
Sparnatural can be passed a catalog of the SPARQL endpoints in a federation. In that case, it sends the same SPARQL query to each endpoint and aggregates the results. This happens for the final query, of course, but also for the queries issued while the user selects values in the query UI.
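To make the fan-out and aggregation concrete, here is a minimal sketch in Python. It is not Sparnatural's own code: the function names `sparql_select` and `federated_select`, the shape of the `catalog` dictionary, and the `source` column name are all assumptions chosen for illustration. Only the HTTP exchange follows the standard SPARQL 1.1 Protocol and JSON results format.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A result row maps SPARQL variable names to their values.
Row = Dict[str, str]
# A callable that runs a query against one endpoint URL and returns its rows.
QueryRunner = Callable[[str, str], List[Row]]


def sparql_select(endpoint_url: str, query: str) -> List[Row]:
    """Run a SELECT query over HTTP POST (SPARQL 1.1 Protocol, JSON results)."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Unbound OPTIONAL variables are simply absent from a binding.
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def federated_select(
    catalog: Dict[str, str],
    query: str,
    run_query: QueryRunner = sparql_select,
) -> List[Row]:
    """Send the same query to every source and merge the rows.

    `catalog` (an assumed shape) maps a source name to its endpoint URL.
    """
    names = list(catalog)
    # Steps 1 and 2: query every source in parallel and wait for all answers.
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        per_source = list(
            pool.map(lambda name: run_query(catalog[name], query), names)
        )
    # Step 3: aggregate, adding a column that names the originating source.
    merged: List[Row] = []
    for name, rows in zip(names, per_source):
        for row in rows:
            merged.append({**row, "source": name})
    return merged
```

Calling `federated_select(catalog, query)` with a dictionary of real endpoint URLs would return one flat list of rows, each carrying its `source`. Passing a different `run_query` makes it easy to try the aggregation without any network access.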
This approach has two main limits:
- Limit 1: all sources in the federation must share the same data model, since the same query is sent to every source
- Limit 2: each source must be independent: there should be no links from one source to another, so that the query can be solved by each endpoint independently (so, in effect, no truly distributed linked data)
These are the reasons I have titled this post « simple federation facade ». Both assumptions hold in OpenArchaeo, and they also held for the (never released) prototype of the Europeana Linked Data taskforce. If you know of other data federation cases in which they hold, tell us! (We could actually try the same on a few DBpedia endpoints, using the dbo ontology as a pivot model.)
Now guess what? Sparnatural has a « query UI to SPARQL » transformation step, thanks to its SHACL configuration. Basically, we can map a UI property to an underlying property path. It would then not be too difficult to define this mapping on a source-by-source basis, so that a single query in the UI produces a different SPARQL query for each source. The result set structure would be the same, so result set aggregation could still happen. We would then overcome the first limit described above. That’s the next step!
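As a rough illustration of that idea, the sketch below keeps one property path per source for the same UI property and generates one query per source with identical result columns. Everything in it is hypothetical: the source names, the `ex:` prefix, the property paths, and the `build_query` helper are placeholders, not Sparnatural configuration or an existing API.

```python
from typing import Dict

# Hypothetical mapping: for each source, the property path that implements
# a given UI property. Paths and prefix are placeholders for illustration.
PATHS_BY_SOURCE: Dict[str, Dict[str, str]] = {
    "source-a": {"discoveredBy": "ex:discoveredBy"},
    "source-b": {"discoveredBy": "^ex:discovered/ex:carriedOutBy"},
}

PREFIXES = "PREFIX ex: <http://example.org/ns#>\n"


def build_query(source: str, ui_property: str) -> str:
    """Build the SPARQL query for one source from a single UI property.

    The SELECT clause is the same for every source, so the per-source
    result sets keep the same columns and can still be aggregated.
    """
    path = PATHS_BY_SOURCE[source][ui_property]
    return (
        PREFIXES
        + "SELECT ?site ?discoverer WHERE {\n"
        + f"  ?site {path} ?discoverer .\n"
        + "}"
    )


# One UI query, one generated SPARQL query per source.
queries = {source: build_query(source, "discoveredBy") for source in PATHS_BY_SOURCE}
```

The design choice here is that only the graph pattern varies per source; the projected variables stay fixed, which is what keeps the aggregation step unchanged.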