Sparna Blog » json-ld

European Parliament Open Data Portal : a SHACL-powered knowledge graph

Marie Muller — Wed, 09 Apr 2025 14:10:12 +0000

A second use case that Thomas wrote for Veronika Heimsbakk’s upcoming book, SHACL for the Practitioner, is about Sparna’s work for the European Parliament.

From validation of the data in the knowledge graph to further projects of data integration and dissemination, many different usages of SHACL specifications were explored…

… and more exploratory usages of SHACL are foreseen !

“

A knowledge-graph powered open data portal

The European Parliament Open Data Portal (EPODP) went live in January 2023. Its particularity is that it is not a mere aggregation of documents or dump files from business applications in custom formats, but rather a collection of datasets, each extracted from a central semantic knowledge graph that itself aggregates data migrated from approximately twenty business applications. The result is a semantically interoperable open data portal: the semantics of its data model are clearly defined and documented, and it reuses widely deployed existing ontologies. It already provides its data to different consumers (most notably the europarl website and the EU law tracker) in a context of cross-institution interoperability. The data captures the activity of the parliament: as co-legislator together with the Council of the EU, the European Parliament (EP) holds plenary sittings, in which reports originating from committees, as well as motions for resolutions, are amended and voted on; after the vote, the final adopted texts are published.

The focus on semantic interoperability of EPODP maximizes the potential for reuse and linkage of its datasets, and maximizes the quality of the offered data. It comes however at a cost when building the portal: deep analysis and understanding of the existing data and document structures is required to capture the business semantics. SHACL is the way to formally encode these business semantics, but how is it deployed in practice? How is it maintained? What are the different types of SHACL specifications used?

SHACL at the center of a model-driven approach

SHACL in the EPODP is at the basis of multiple model-driven usages depicted in the following diagram:

There were two key drivers for introducing SHACL in the EPODP project: validation of the data in the knowledge graph, and generation of public documentation of the models. The same SHACL specification that captures the business semantics is directly actionable, both to be published as documentation and to validate the data. The produced documentation is a set of public files, such as the ELI-EP application profile documentation and others accessible from the EPODP developer’s corner. The SHACL Play documentation generator is used to produce the documentation pages. Data validation happens at earlier stages, after data transformation steps.

Two additional usages of SHACL specifications were explored : one was to generate SPARQL queries to extract the content of datasets from the larger knowledge graph. The SHACL specification of a dataset content is interpreted to generate SPARQL CONSTRUCT queries, executed against the entire knowledge graph, to return a subset of data corresponding to the specification. The query generation was implemented in SHACL Play, however the EPODP chose to continue using manually crafted SPARQL queries to generate the datasets. The other usage was to complement the SHACL specifications with the mapping rules used to feed the corresponding properties or classes in the graph. This has the advantage that the mapping rules are documented and maintained alongside the specification and not in a separate document. This work is ongoing.
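To make the query-generation idea concrete, here is a minimal Python sketch of how a CONSTRUCT query could be derived from a shape. It is not the SHACL Play implementation: the shape is reduced to a target class and a flat list of property paths, and the IRIs are placeholders chosen for illustration.

```python
def build_construct_query(target_class, property_paths):
    """Build a SPARQL CONSTRUCT query from a simplified shape description.

    target_class: IRI of the class targeted by the shape (as in sh:targetClass).
    property_paths: IRIs of the properties declared on the shape (as in sh:path).
    """
    template = [f"  ?s a <{target_class}> ."]
    where = [f"  ?s a <{target_class}> ."]
    for i, path in enumerate(property_paths):
        template.append(f"  ?s <{path}> ?o{i} .")
        # OPTIONAL keeps entities that lack this particular property.
        where.append(f"  OPTIONAL {{ ?s <{path}> ?o{i} . }}")
    return "\n".join(["CONSTRUCT {", *template, "} WHERE {", *where, "}"])


# Hypothetical shape: placeholder IRIs, not the actual ELI-EP ones.
query = build_construct_query(
    "http://example.org/Work",
    ["http://example.org/title", "http://example.org/adopts"],
)
print(query)
```

A real generator would also have to follow nested node shapes and property paths, which this sketch ignores.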

More exploratory usages of SHACL are foreseen : generating a query user interface based on the SHACL specification, using the Sparnatural query builder, and also input forms to facilitate the creation of DCAT datasets descriptions. Additionally, automated generation of the JSON-LD context and the JSON schema of the API are foreseen.

Not « 1 SHACL to rule them all », but application profiles, dataset definitions, and migration specifications

The definition of the EPODP knowledge graph is not captured in a single SHACL specification, but rather in three different application profiles, each being a selection of classes and properties of one sub-domain : ELI-EP covers the description of documents and activities, ORG-EP covers the definitions of EP organisations (such as committees, political groups, etc.) and members of the parliament, and SKOS-EP covers how controlled vocabularies are structured. In addition, DCAT-EP is the specification for how dataset records are described in the EPODP catalog – but this is not part of the knowledge graph per se.

Together, ELI-EP, ORG-EP and SKOS-EP specify the structure of the entire knowledge graph from which the datasets are extracted. In addition, the structure of each dataset family available in the EPODP (such as adopted texts, plenary documents, parliamentary questions, etc.) is also described in SHACL, referred to as « DSD » for « Dataset Definition ». While the application profiles describe every possible property on generic shapes, the DSDs specify only the subset of properties used in a dataset, possibly with different cardinalities or ranges. For example, ELI-EP specifies that « a Work may have the property eli:adopts », with no minimum cardinality (eli:adopts is defined as « Indicates that the work represents the adopted work of one or several related works »). The DSD for adopted texts datasets specifies the shape of « Adopted texts » as a subset of the Works, and indicates that the minimum cardinality of eli:adopts is 1 for this particular subset. Besides, some properties, such as eli:amends, are not available for adopted texts, and are thus not declared in the DSD.

In addition, the conversion of some data sources is also specified in independent SHACL files. The articulation of these 3 kinds of SHACL files and the reused ontologies is depicted in the following diagram:
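The difference between an application profile and a DSD can be illustrated with a small Python sketch. It is only a stand-in for SHACL validation: each shape is reduced to a mapping from property name to minimum cardinality, and the entity is hypothetical.

```python
# Simplified stand-ins for SHACL property shapes: property name -> minimum count.
# Names and values only mirror the eli:adopts example in the text.
APPLICATION_PROFILE_WORK = {"eli:adopts": 0, "eli:amends": 0}
DSD_ADOPTED_TEXT = {"eli:adopts": 1}


def min_count_violations(entity, shape):
    """Return the properties of `shape` whose minimum cardinality is not met."""
    return [
        prop
        for prop, min_count in shape.items()
        if len(entity.get(prop, [])) < min_count
    ]


work_without_adopts = {"ex:title": ["Some report"]}  # hypothetical entity
print(min_count_violations(work_without_adopts, APPLICATION_PROFILE_WORK))
print(min_count_violations(work_without_adopts, DSD_ADOPTED_TEXT))
```

The same entity is acceptable as a generic Work but fails the stricter DSD for adopted texts, which is exactly the kind of difference the two levels of specification encode.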

There is currently no reuse or reference of shapes across the different specifications. Each is independent. A nice improvement would be to study how SHACL DSDs could be derived from the application profile SHACL, without redeclaring the identical constraints.

Editing SHACL in spreadsheets

In total, 16 SHACL specifications are currently published in the EPODP, and around 80 are used to validate data migrated from each individual source. The first step in the specification of each model is the design in a diagram such as the ones visible in the public documentation of the models. The EPODP team then uses spreadsheets to encode the specifications, adapted from the one provided in the SHACL Play suite. The spreadsheet is converted to SHACL using the xls2rdf converter. Spreadsheets provide a simple editing solution, with an easy learning curve, made even easier with a few formulas to compute cell values automatically. They even provide ways of editing advanced patterns (such as the ability to directly enter Turtle lists for sh:or, or blank nodes for property paths), but of course still limit the expressivity. The following screenshot shows what property shapes look like in the spreadsheet:

Results and future perspectives

The EPODP use case shows how SHACL can be applied in a systematic way in a data integration and dissemination project: at the data transformation step, at the knowledge graph level, and at the data dissemination step. Public documentation, data validation and data extraction are tasks that can be automated based on a SHACL specification. While the context is one of a large public institution, the same approach can be applied in industrial contexts. The SHACL specifications are a cornerstone of such projects, enabling semantic interoperability at large and a mutual understanding between business experts, data analysts, developers, and data consumers.

”

Veronika’s book will be divided into three parts :

1. Back to Basics
Introduction to logic and RDF, brief skimming of the topics. Also covering various world assumptions.

2. Getting to know the stuff
Introduction to SHACL, including core, sh-sparql, advanced features.

3. Working with the stuff
SHACL Stories. Use cases, user stories and implementations.

Image : © European Union, [2024] – EP

This article, European Parliament Open Data Portal : a SHACL-powered knowledge graph, first appeared on Sparna Blog.

Clean JSON(-LD) from RDF using Framing

Thomas Francart — Wed, 20 Jul 2022 06:56:36 +0000

Say you have a nice RDF knowledge graph based on an ontology, or maybe reusing ontologies, and maybe you have specified the structure of the knowledge graph with SHACL. And now you would like to expose your RDF in JSON in an API, for the average developer (or maybe you would like to produce a clean JSON to be indexed by Elastic). And the average developer (or Elastic) does not care about RDF and does not care about the “-LD” in “JSON-LD”, he just cares about JSON; and he is right! We are here to care about the “-LD” part for him.

So what you need is to produce a clean JSON structure from your raw RDF triples. And by “clean”, I mean:

no URIs. Nowhere. No URIs in JSON keys, no URIs in types of entities, no URIs in the value of properties controlled by a closed list; the only places where it is acceptable to see a URI are the id of an entity, and a reference to such an id within the graph; even in these cases the URIs can be shortened.
no fancy JSON-LD keys like @type, @value, @language, @id, etc.
indented.

You have 2 possibilities to do that :

You develop a custom script, to either generate a JSON export of your data, or to implement the API that will query the knowledge graph, parse the triples, and generate that clean JSON output.
You use JSON-LD framing to automate the production of a clean JSON(-LD) from RDF.

There are 2 nice things about the solution with JSON-LD framing :

it can be automated
you automatically retain the RDF compatibility – because your JSON will necessarily be JSON-LD. This means you can import your nice JSON directly in a triplestore.

The principle of JSON-LD framing is that you provide a JSON-LD @context with an additional frame specification that defines how the JSON should be structured (indented), which entity to include at each level (entities can be filtered based on some criteria), and also which properties to include in each entity.

To start with JSON-LD framing, what you need is JSON-LD. Any JSON-LD. Typically the raw JSON-LD serialization that any RDF library or triplestore will produce; that kind of ugly, messy, full-of-URIs-and-@language kind of JSON. So something like:

(Brrr, scary, no ?)

And then what you need is the JSON-LD playground with the “Framed” tab. This will allow you to test your context and frame specification.

And when deployed in production, what you will need is a JSON-LD library that is capable of implementing the JSON-LD framing algorithm. Implementations are listed here, and you need an implementation compatible with JSON-LD 1.1.

Example files

As an example, I use a JSON-LD file from the French National Library, the one from Les Misérables here : https://data.bnf.fr/fr/13516296/victor_hugo_les_miserables/ (download link at the bottom of the page).

You can download the initial JSON example, the frame specification, and the result in a zip. The zip also contains intermediate frame specifications.

The @context

We’ll start by specifying the JSON-LD context part.

Map @type to type and @id to id

The average developer will wonder what those @type and @id keys are. Re-map them straight away to type and id:

"type" : "@type", "id" : "@id",

Schema.org and a lot of other specifications do that.

What about @graph ?

If you have a named graph at the top, introduced by @graph, my suggestion would be to simply remap it to a fixed key, like « data », or « entities » :

"data" : "@graph",

Map RDF properties URIs to JSON keys

Get rid of any trace of URI or short URIs in JSON keys. Declare a term for every property in your graph. The simplest way to do this is to use the local part of the URI (after last “#” or “/”) as the term. Order the context by the alphabetical order of the terms. Terms for properties will usually start with a lowercase letter.

In corner cases you may end up with the same term for two properties (such as bnf-onto:subject and dcterms:subject in the example), and in that case you need a different key for one of them. I chose “bnf-subject” here for bnf-onto:subject and kept “subject” for dcterms:subject.

"creator" : "dcterms:creator", "date" : "dcterms:date", "dateOfWork" : "rdagroup1elements:dateOfWork", "depiction" : "foaf:depiction", "description" : "dcterms:description",

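The term-naming rule above (local part of the URI, alphabetical order, manual handling of clashes) can be sketched in a few lines of Python. This is only an illustration: the second « subject » URI uses a placeholder namespace, and clashes are merely reported so that a key such as “bnf-subject” can be picked by hand.

```python
def local_name(uri):
    """Return the part of the URI after the last '#' or '/'."""
    return uri.rstrip("#/").replace("#", "/").rsplit("/", 1)[-1]


def build_property_terms(uris):
    """Map local names to URIs; return (sorted context entries, clashing URIs)."""
    terms, collisions = {}, []
    for uri in uris:
        term = local_name(uri)
        if term in terms and terms[term] != uri:
            # Same local name already taken: needs a manually chosen key.
            collisions.append(uri)
        else:
            terms[term] = uri
    return dict(sorted(terms.items())), collisions


terms, collisions = build_property_terms([
    "http://purl.org/dc/terms/subject",
    "http://purl.org/dc/terms/creator",
    "http://example.org/bnf-onto/subject",  # placeholder namespace
    "http://xmlns.com/foaf/0.1/depiction",
])
print(terms)
print(collisions)
```
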
Map classes URIs to JSON terms

Now you want to do the same thing to get rid of any trace of URIs in the “type” of entities. Declare a term for every class in your ontology/application profile. List the classes in a different section than the properties. Terms for classes will usually start with an uppercase letter.

"Concept" : "skos:Concept", "Document" : "foaf:Document", "ExpositionVirtuelle" : "bnf-onto:ExpositionVirtuelle",

Declare object properties with “@type”: “@id”

Now you want to get rid of all those ugly “id” keys, as we are only interested in listing the values. To do that, modify the mapping of the property (here “depiction”) to state that its values are URIs. You need to change the mapping from

"depiction" : "foaf:depiction",

to

"depiction" : { "@id" : "foaf:depiction", "@type":"@id" },

And so parts like this:

"depiction": [ { "id": "https://gallica.bnf.fr/ark:/12148/btv1b8438568p.thumbnail" }, { "id": "https://gallica.bnf.fr/ark:/12148/btv1b9004781d.thumbnail" }, { "id": "https://gallica.bnf.fr/ark:/12148/bpt6k5545348q.thumbnail" }, { "id": "https://gallica.bnf.fr/ark:/12148/btv1b8438570r.thumbnail" } ]

Will be turned into

"depiction": [ "https://gallica.bnf.fr/ark:/12148/btv1b8438568p.thumbnail", "https://gallica.bnf.fr/ark:/12148/btv1b9004781d.thumbnail", "https://gallica.bnf.fr/ark:/12148/bpt6k5545348q.thumbnail", "https://gallica.bnf.fr/ark:/12148/btv1b8438570r.thumbnail" ]

Map datatypes

Now you want to get rid of the datatype information for literals (the @type key that accompanies @value). If the value of a property always uses the same datatype, which is the case 99.9% of the time, then you can change the mapping from

"property" : "http://myproperty",

to

"property" : { "@id": "http://myproperty", "@type":"xsd:date" }

(The example used does not have datatype properties.)

Map languages, with fixed language or when multilingual

Now let’s get rid of the @language. For this you have 2 choices. When the language is always the same for the value, you can indicate it in the context, the same way that you would do for the datatype but with the @language key. So you change from

"description" : "dcterms:description",

to

"description" : { "@id" : "dcterms:description", "@language" : "fr" }

You could even have different terms for different languages, such as :

"title_fr" : { "@id" : "dcterms:title", "@language" : "fr" }, "title_en" : { "@id" : "dcterms:title", "@language" : "en" }, "title" : { "@id" : "dcterms:title" },

Or, when you have multiple values in multiple languages, you can make the property a language map by declaring it this way:

"editorialNote" : { "@id" : "skos:editorialNote", "@container" : "@language" },

This will turn the language code into a key in the JSON output:

"editorialNote": { "fr": [ "BN Cat. gén. (sous : Hugo, comte Victor-Marie) : Les misérables. - . - BN Cat. gén. 1960-1969 (sous : Hugo, Victor) : idem. - . -", "Laffont-Bompiani, Oeuvres, 1994. - . - GDEL. - . -" ] },

In that case, watch out for cases where there is a value without language, it will generate a @none key.
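To see what a language map does to the values, here is a small Python sketch that mimics the grouping. It is a simplified imitation, not the JSON-LD compaction algorithm: it always produces lists, whereas a real processor may collapse single values depending on its options.

```python
def to_language_map(value_objects):
    """Group expanded JSON-LD value objects by language tag.

    Values without an @language end up under the "@none" key.
    """
    result = {}
    for value in value_objects:
        key = value.get("@language", "@none")
        result.setdefault(key, []).append(value["@value"])
    return result


# Hypothetical values for illustration.
notes = [
    {"@value": "Première note", "@language": "fr"},
    {"@value": "Deuxième note", "@language": "fr"},
    {"@value": "Note without language"},
]
print(to_language_map(notes))
```

If the « @none » key is unwelcome in your API, the cleanest fix is usually upstream: make sure every literal of that property carries a language tag.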

Map controlled list values to JSON terms

By now you already get a much cleaner JSON and almost all “unnecessary” URIs have disappeared. But we still have some URI references that we can clean up : the ones that are references to controlled lists with a finite number of values.

We can declare term mappings for those values just like we did to map properties and classes. BUT, and this is the trick, we need to change the “@type” of the property declaration from “@id” to “@vocab” for the replacement to happen. This is documented in the « Type coercion » section of the spec.

In our example, the references to languages and subjects are good candidates to be mapped to JSON terms. So we change

"language" : { "@id" : "dcterms:language", "@type":"@id" }, "subject" : { "@id" : "dcterms:subject", "@type":"@id" },

to

"language" : { "@id" : "dcterms:language", "@type":"@vocab" }, "subject" : { "@id" : "dcterms:subject", "@type":"@vocab" }, "fre" : "http://id.loc.gov/vocabulary/iso639-2/fre", "eng" : "http://id.loc.gov/vocabulary/iso639-2/eng",

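The effect of switching the coercion from « @id » to « @vocab » can be imitated with a short Python sketch. It only mirrors the observable behaviour (terms are used for values under « @vocab », not under « @id ») and is not the actual compaction algorithm.

```python
# Term definitions for controlled values, taken from the context above.
VALUE_TERMS = {
    "fre": "http://id.loc.gov/vocabulary/iso639-2/fre",
    "eng": "http://id.loc.gov/vocabulary/iso639-2/eng",
}


def compact_reference(iri, coercion, terms):
    """Replace an IRI value by a context term, but only under @vocab coercion."""
    if coercion == "@vocab":
        for term, target in terms.items():
            if target == iri:
                return term
    # Under "@id" coercion (or when no term matches) the IRI is kept.
    return iri


french = "http://id.loc.gov/vocabulary/iso639-2/fre"
print(compact_reference(french, "@id", VALUE_TERMS))
print(compact_reference(french, "@vocab", VALUE_TERMS))
```
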
Shorten remaining URI references

Now the only URIs left are the ids of the main entities in our graph, and references to those ids. References to controlled vocabularies with a limited number of values have been mapped to JSON terms. Although we cannot turn all the remaining URIs into JSON terms (because we can’t declare all possible entity URIs in the context), we can shorten them by adding a prefix mapping in the context, in our case:

"ark-https": "https://data.bnf.fr/ark:/12148/",

(I note that there are http:// and https:// URIs in the data, I don’t know why)
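The shortening performed by such a prefix mapping can be sketched in Python as follows. This is a simplified imitation of compact IRI creation, and the ARK identifier in the example is made up.

```python
def shorten(iri, prefixes):
    """Return prefix:suffix for the longest matching namespace, else the IRI."""
    best = None
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace) and len(iri) > len(namespace):
            if best is None or len(namespace) > len(prefixes[best]):
                best = prefix
    if best is None:
        return iri
    return f"{best}:{iri[len(prefixes[best]):]}"


PREFIXES = {"ark-https": "https://data.bnf.fr/ark:/12148/"}

# "cb00000000x" is a made-up identifier used only for illustration.
print(shorten("https://data.bnf.fr/ark:/12148/cb00000000x", PREFIXES))
print(shorten("http://data.bnf.fr/ark:/12148/cb00000000x", PREFIXES))
```

The second call leaves the URI untouched, which shows why the mix of http:// and https:// URIs matters: each scheme would need its own prefix declaration.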

The frame specification

So now we have clean values, no URIs, no fancy JSON-LD keys. But we still don’t have a structure indented the way the average developer would expect it; and this is where the frame specification comes into play.

Define indentation and filters (and reverse properties if needed)

The frame specification acts as both a filter/selection mechanism and as a structure definition. At each level you indicate the criteria for an object to be included. In our example we have a skos:Concept (the entry in the library catalog) that has a foaf:focus pointing to a Work (the Book « in the real world »), and that skos:Concept is the subject of many virtual exhibits. We want to have the Concept and the Work at the first level, and the exhibits under the Concept. But there is a trick: it is the virtual exhibits that point to the Concept with a dcterms:subject, and we want it in the other direction: Concept is_subject_of Exhibit, so we need a @reverse property.

To do that, add the following reverse mapping declaration (don’t modify the existing one):

"subject_of" : { "@reverse" : "dcterms:subject" },

Note the use of « @reverse » to indicate that the JSON key is to be interpreted from object to subject when turned into triples.

With that in place, we can write our frame specification, which goes right after the @context we have designed before:

"type" : ["Concept", "Work"], "subject_of" : { "type" : "ExpositionVirtuelle" }

Note how we use the terms defined in the context previously. This is to be understood the following way: « at the first level, take any entity with a type of either Concept or Work, then insert a subject_of key and put inside any value that has a type ExpositionVirtuelle ». This guarantees that the virtual exhibit objects will go under the Concept, and not above or at the same level. But this is not sufficient: if you apply that framing, you will notice that the Work is repeated under the « focus » property of the Concept and at the root level. This is because of the default behavior of the JSON-LD playground regarding object embedding (objects are always embedded when they are referenced).

Avoid embedding

To avoid embedding when it is undesired, we can set the « @embed » option to « @never » on the « focus » property, like so :

"type" : ["Concept", "Work"], "subject_of" : { "type" : "ExpositionVirtuelle" }, "focus" : { "@embed" : "@never", "@omitDefault": true }

This tells the framing algorithm to never embed the complete entity inside the focus property, just reference the URI instead.

Also, you will notice the use of « @omitDefault » to true; this tells the framing algorithm to omit the focus property when it has no value. Otherwise, since the Work does not have a foaf:focus property (only the Concept), then it will get a « focus » key set to null.
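To show how the context and the frame fit together in one document, here is a Python sketch that assembles them and serialises the result. The context is a heavily reduced version of the one built above, and the « ex » and « bnf-onto » namespaces are placeholders, not the real ones.

```python
import json

# Reduced context: only the terms needed by the frame below.
context = {
    "type": "@type",
    "id": "@id",
    "data": "@graph",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "ex": "http://example.org/ns#",              # placeholder namespace
    "bnf-onto": "http://example.org/bnf-onto/",  # placeholder namespace
    "Concept": "skos:Concept",
    "Work": "ex:Work",
    "ExpositionVirtuelle": "bnf-onto:ExpositionVirtuelle",
    "focus": {"@id": "foaf:focus", "@type": "@id"},
    "subject": {"@id": "dcterms:subject", "@type": "@vocab"},
    "subject_of": {"@reverse": "dcterms:subject"},
}

# The frame keys sit next to @context in the same JSON object.
frame = {
    "@context": context,
    "type": ["Concept", "Work"],
    "subject_of": {"type": "ExpositionVirtuelle"},
    "focus": {"@embed": "@never", "@omitDefault": True},
}

frame_json = json.dumps(frame, indent=2)
print(frame_json)
```

The resulting JSON can be pasted in the “Framed” tab of the playground, or handed to a JSON-LD 1.1 library together with the input document (for instance PyLD, whose `jsonld.frame` function takes a document and a frame).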

What about order of keys in the JSON ?

Well, I am sure this can be controlled, either by explicitly specifying all the keys you want, in the order you want them, in the frame specification, or by using an « ordered » parameter to the JSON-LD API, but that is not available in the playground.

If you list all keys explicitly in the frame specification, don’t forget to use wildcards so that any value will match; wildcards are empty objects, written « {} »:

"myProperty" : {}

The result

Much nicer no ? This is something you can put into the hand of an average developer.

Automate context generation from SHACL

Do you have a SHACL specification of the structure of your graph ? wouldn’t it be nice to automate the generation of the JSON-LD context from SHACL ? Maybe we could do that in SHACL-Play ? stay tuned !

Probably what we can automate is the context part, which can be global and unique for all your graph, but the framing specification should probably be different for each different API you need; each framing specification will then reference the same context by its URL.
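As a thought experiment, here is what such a generation could look like in Python. This is a speculative sketch, not an existing SHACL Play feature: property shapes are reduced to plain dictionaries whose « path », « node_kind » and « datatype » keys loosely stand for sh:path, sh:nodeKind and sh:datatype.

```python
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


def context_from_shapes(property_shapes):
    """Derive JSON-LD term definitions from simplified property shapes."""
    context = {"type": "@type", "id": "@id"}
    for shape in property_shapes:
        path = shape["path"]
        term = path.replace("#", "/").rsplit("/", 1)[-1]
        if shape.get("node_kind") == "IRI":
            # Object property: values are references to other entities.
            context[term] = {"@id": path, "@type": "@id"}
        elif shape.get("datatype") == RDF_LANGSTRING:
            # Language-tagged literals: expose them as a language map.
            context[term] = {"@id": path, "@container": "@language"}
        elif "datatype" in shape:
            context[term] = {"@id": path, "@type": shape["datatype"]}
        else:
            context[term] = path
    return context


shapes = [
    {"path": "http://purl.org/dc/terms/date",
     "datatype": "http://www.w3.org/2001/XMLSchema#date"},
    {"path": "http://xmlns.com/foaf/0.1/depiction", "node_kind": "IRI"},
    {"path": "http://www.w3.org/2004/02/skos/core#editorialNote",
     "datatype": RDF_LANGSTRING},
    {"path": "http://purl.org/dc/terms/title"},
]
generated = context_from_shapes(shapes)
print(generated)
```

Term clashes, controlled-list values (« @vocab ») and reverse properties are left out here; they would need extra information that is not always present in a SHACL file.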

Image : [Encadrement ornemental] ([1er état]) / .Io. MIGon 1544. [Jean Mignon] ; [d’après Le Primatice] https://gallica.bnf.fr/ark:/12148/btv1b53230250h


Semantic Markdown Specifications

Thomas Francart — Thu, 20 Feb 2020 14:46:34 +0000

Markdown (MD) has become the de facto standard syntax for writing on the web, pushed by Github and StackOverflow. It is heavily used every time one needs to enter a comment, or write a simple (document-style) HTML page. What if we could embed semantic annotations in a markdown document? We would get Semantic Markdown! Imagine the best of both worlds between human readable/writable documents and machine-readable/writable (RDF) structured data. We could feed an RDF knowledge graph that is coupled with our set of MD documents, and we would have an easy way to put structure in content.

I see a lot of potential in this, and already see some use cases. Unfortunately I don’t have the bandwidth, nor the full skills to make this happen. So I am just writing this in the hope that the idea is implemented by someone, or that someone tells me it is total nonsense…

Here are the semantic annotations use-cases I see with such a Semantic Markdown :

Annotate a span or title that corresponds to an entity ;
Annotate a piece of text with an existing URI for an entity;
Create some statements on an entity;

Note that I am not necessarily looking for a way to produce RDFa annotations on the generated HTML, although that would be nice for a schema.org use-case. Any conversion route from the original semantically annotated markdown to a set of triples would be fine.

My source of inspiration is essentially « Span Inline Attribute Lists » from the Kramdown syntax.

Annotate a span that corresponds to an entity

This piece of Semantic Markdown :

Tomorrow I am travelling to _Berlin_ {.schema:Place}

When interpreted by a Semantic Markdown parser, would produce this set of triples:

_:1 a <http://schema.org/Place> .
_:1 rdfs:label "Berlin" .

The span immediately preceding the « {.xxxx} » annotation is taken as the label of the entity. The use of rdfs:label to store the label of the entity could be subject to a parser configuration option.

One could imagine that a semantic markdown parser relies on the same RDFa Initial Context to interpret the « schema: » prefix without further declaration. But what about other ontologies ? we would need some kind of prefixes / vocab declaration somewhere in the document, just like in RDFa.

Note also that Markdown parser supporting the « {.xxxxx} » syntax will also insert this value as a CSS class on the corresponding span, so we win both on the CSS level and the semantic level.

Annotate a title

Similarly, we could annotate a title

### European Semantic Web Conference {.schema:Event}
Lorem ipsum...

In that case, the full content of the title is interpreted as the label of the entity :

_:1 a <http://schema.org/Event> .
_:1 rdfs:label "European Semantic Web Conference" .

Annotate with a known URI

Tomorrow I am travelling to [Berlin](https://www.wikidata.org/wiki/Q64) {.schema:Place}

Would yield

<https://www.wikidata.org/wiki/Q64> a <http://schema.org/Place> .
<https://www.wikidata.org/wiki/Q64> rdfs:label "Berlin" .
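To check that the span and link annotations are easy to extract, here is a rough Python sketch of a parser for these two cases only. It is a toy based on a regular expression, not a real Markdown parser, and it keeps the class as a prefixed name instead of resolving the « schema: » prefix.

```python
import re

# Either _emphasised text_ or [text](url), followed by a {.prefix:Class} annotation.
ANNOTATED_SPAN = re.compile(
    r"(?:_(?P<em>[^_]+)_|\[(?P<text>[^\]]+)\]\((?P<url>[^)]+)\))"
    r"\s*\{\.(?P<cls>[\w:-]+)\}"
)


def parse_annotations(markdown):
    """Return (subject, predicate, object) tuples for annotated spans and links."""
    triples = []
    blank_nodes = 0
    for match in ANNOTATED_SPAN.finditer(markdown):
        label = match.group("em") or match.group("text")
        if match.group("url"):
            # A link gives us a known URI for the entity.
            subject = f"<{match.group('url')}>"
        else:
            blank_nodes += 1
            subject = f"_:{blank_nodes}"
        triples.append((subject, "a", match.group("cls")))
        triples.append((subject, "rdfs:label", f'"{label}"'))
    return triples


print(parse_annotations("Tomorrow I am travelling to _Berlin_ {.schema:Place}"))
print(parse_annotations(
    "Tomorrow I am travelling to [Berlin](https://www.wikidata.org/wiki/Q64) {.schema:Place}"
))
```

Titles and lists would need a proper Markdown parser with access to the document structure, which is where the real work lies.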

Describe an entity

If a list follows an annotated entity, then it should be interpreted as a set of predicates with this entity as subject :

### Specifications Meeting {.schema:Event}

* Date : _11/10_{.schema:startDate}
* Place {.schema:location} : Our office, Street name, 75014 Paris
* Meeting participants : 
  {.schema:attendee}
  * Thomas Francart{.schema:Person}
  * [Someone else](https://www.wikidata.org/wiki/Q80)
  * Tim Foo
* Description : Some information not annotated

### Next title
Lorem ipsum...

Should yield :

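_:1 a <http://schema.org/Event> .
_:1 rdfs:label "Specifications Meeting" .
_:1 <http://schema.org/startDate> "11/10" .
_:1 <http://schema.org/location> "Our office, Street name, 75014 Paris" .
_:1 <http://schema.org/attendee> _:2 , <https://www.wikidata.org/wiki/Q80> , _:3 .

# attendee that is annotated : we know a type and a name
_:2 a <http://schema.org/Person> .
_:2 rdfs:label "Thomas Francart" .

# attendee that is annotated with a URI : we keep the URI and add a label to it (?)
<https://www.wikidata.org/wiki/Q80> rdfs:label "Someone else" .

# attendee that is not annotated - but we know he was an attendee
_:3 rdfs:label "Tim Foo" .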

If a list follows a title or a paragraph that contains an annotated entity…
then items in this list correspond to properties of this entity…
and can be annotated with a property.
The property annotation can be placed on an inline text, or right before or after a `:` or `=` character.
If the property annotation immediately precedes a list, then all items in this list are considered values for that property, and in that case can be either entities annotated with a type, entities identified by a URI, or entities not annotated (in which case we would consider them as blank nodes with only a label).

Related works

Metadata for Markdown, a Python extension to generate JSON-LD from a YAML section in a Markdown document.

EDIT : PanDoc divs and spans : https://pandoc.org/MANUAL.html#divs-and-spans

I like the syntax :

[This is *some text*]{.class key="val"}

This is close! But it still would not produce triples, unless one explicitly writes RDFa:

My name is [Thomas Francart]{typeof="schema:Person"}
