Mapping a graph to a tree data structure: a case study using the Core Vocabularies

1. Introduction

This blog post is part of a series of blog posts published by SEMIC.

The subject of this blog post is XML mapping. The intended audience is semantic engineers, knowledgeable in building ontologies, who would like to understand how to move from a conceptual model to a physical data model, in particular XML, and to explore how the two can be connected.

As RDF can take different serialisations, including RDF/XML, semantic engineers are expected to have a basic knowledge of XML syntax with its data types. 

For a quick introduction to XML schema, there are several tutorials online, such as the one from w3schools.com, to which this blog post refers from time to time. This blog post goes deeper into the considerations that arise when designing an XML schema.

Readers are invited to provide their feedback and opinion on the different sections, in particular where questions for the communities are raised.

2. Understanding the Roles in a Data Exchange

Public Administrations manage information systems to gather information that can then be publicly consulted, such as catalogues of public services or base registries such as population, environment, etc. 

A Public Administration interacts with other administrations on different levels (local, national, European). It is therefore inevitable that such information systems need to exchange data with each other. In this sense, Public Administrations act as:

  • Sender, to provide data to the public and to the other public administrations. As sender, they need to make sure that the data is of good quality and that the information system guarantees interoperability over time;
  • Receiver, to collect data from other public administrations. As receiver, they need to store the data but also make sure they understand it properly. Otherwise it is not possible to transform it to an internal data model or to publish it (regulations such as GDPR might need to be taken into consideration);
  • Proxy, to forward the data between two (or more) information systems. Most likely the two information systems have adopted different models. Therefore, it is crucial for a proxy to ensure maximal interoperability between the two.

In all these cases, data management is essential to perform data exchange, transformation and publication, ensuring that interoperability is reached.

When talking about interoperability, six layers need to be taken into account: Legal, Organisational, Semantic and Technical, plus two cross-cutting layers. This blog post focuses mainly on the interactions between semantic interoperability and technical interoperability.

Both layers are important. However, starting from the semantic layer gives a deep understanding of the data being managed and eases the move to technical interoperability, which can be reached in different forms using different serialisations and protocols, each bringing additional constraints.

3. Dealing with Conceptual and Physical Data Model

Within semantic interoperability, conceptual models need to be designed so that concepts, with their properties and relations, are defined with maximum accuracy and everybody understands their meaning.

This process might take some time, because reaching consensus on the meaning of the concepts requires setting up coordination (managing the expectations of working groups) and following up on policies that evolve over time, such as new legislation taking effect. This is the case for the SEMIC action, in which Core Vocabularies and Application Profiles are built around their respective communities.

While a conceptual model can be described, at a minimum, by a UML class diagram that is easily understood by business and data architects, it can also be expressed as a knowledge graph, which makes it possible to describe ontologies or vocabularies that can be interconnected. The most recent SEMIC style guide provides guidance on this approach, making it easy to understand the necessary steps to achieve this.

A well-recognised standard in this context is RDF, further enhanced by RDFS and OWL, which allows concepts to be uniquely identified. It is supported by the SHACL language, which describes the constraints that might apply. This is especially important for Application Profiles like DCAT-AP or CPSV-AP, which are maintained by the SEMIC action.

A conceptual model can therefore be realised as a set of interconnected artefacts, each used in a different context. For example, a change in the cardinality of a property might impact the related SHACL shapes (where cardinality restrictions apply and are useful for validation) but not the ontology (where only concepts are expected to be defined).

The move from semantic to technical interoperability, commonly referred to as lowering (the inverse of lifting), requires a different approach because of the restrictions imposed by the adopted serialisation. For more information, see section Approaches in Mapping/Transforming a Conceptual Model with a Physical Data Model.

In this blog post we take a deeper look at the XML serialisation and its XML schema. The latter provides a way to describe data that can reach a certain level of complexity.

The design and implementation of an XML schema is important for senders, receivers and proxies because it tells them how to parse and transform the XML data they receive.

In particular, receivers and proxies can leverage the XPath language to locate elements in the XML data. Those XPath expressions can then be used by XSLT to transform the XML data, by XQuery to perform queries and by Schematron to perform validation. Having an XML schema associated with the data therefore helps to create reliable transformations and queries.

Further, senders, receivers and proxies need to be aware that tools that convert an XML schema into programming language objects (for example a Java class) might impose limitations on the version of XML schema that can be used. For example, the commonly used JAXB library does not yet support the latest version, XSD 1.1.

4. The Importance of a Schema

In this section we want to raise the important points to consider when designing an XML schema referring back to a conceptual model. 

As detailed below, when designing a conceptual model, there are different considerations to keep in mind, such as validation, versioning, order and data types.

4.1 Validation

SHACL shapes can be used to validate data instances against the conceptual model. In this context, SEMIC provides SHACL shapes to support the data receiver, proxy and sender in making sure that data is valid. In particular it relies on the ITB service for CPSV-AP and DCAT-AP SHACL validation.

Likewise, an XML schema can be used to validate XML data against a determined structure. 

A sender can provide such an XML schema and associate it with the XML data via the xsi:schemaLocation attribute. This enables receivers to validate the received XML data against that version of the XML schema in a straightforward way.

For example:

[Figure: Validation]
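A minimal sketch of such an association could look as follows. The namespace http://example.org/core-person, the schema URL and the element names are hypothetical and only serve the illustration; they are not published SEMIC artefacts.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical namespace, schema URL and element names, for illustration only -->
<Person xmlns="http://example.org/core-person"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://example.org/core-person http://example.org/schemas/core-person-2.0.0.xsd">
  <fullName>Jane Doe</fullName>
</Person>
```

The value of xsi:schemaLocation is a list of pairs: a namespace URI followed by the location of the schema that describes that namespace.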

Question for the community: Currently SEMIC does not provide an XML schema validation service for the Core Vocabularies, as it does with the ITB SHACL validators for CPSV-AP and DCAT-AP. Is this type of service needed?

4.2 Versioning

Enriching ontologies with metadata (provenance, version, licence, etc.) is good practice for publishing and enables tracking of their evolution. 

Likewise, as an XML schema can change, it is important to define metadata about it. 

Among other considerations, versioning is important. It not only captures the structure of the XML schema at a certain point in time but also helps to associate the exchanged XML data with that particular version. There are different approaches to managing versioning:

  1. Change the (internal) schema version attribute;
  2. Create a schemaVersion attribute on the root element;
  3. Change the schema's targetNamespace;
  4. Change the name/location of the schema.

A Core Vocabulary XML schema could indicate its version via the schema version attribute, such as the example below:

[Figure: Versioning]
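As a sketch of the first approach, the version attribute of xs:schema could be set as follows. The target namespace and the version number are assumptions chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical target namespace and version number -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-person"
           xmlns="http://example.org/core-person"
           elementFormDefault="qualified"
           version="2.0.0">
  <!-- type and element declarations go here -->
</xs:schema>
```

Note that the version attribute is informative only: a schema validator does not check it against the XML data, which is why some of the other approaches in the list carry the version in the data or in the namespace instead.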

Question for the community:  Currently SEMIC does not provide an XML schema. What is the minimum metadata needed for an XML schema aside from versioning?

4.3 Order Matters

In a conceptual model, classes make it possible to group attributes logically. This does not necessarily mean that these attributes will appear in the same order in the XML schema. XML schema provides a way to establish the order of the properties via xs:sequence, so that receivers know what to expect when receiving XML data, or to leave the order open via xs:all when it is not important.

An example for the use of Agent in Core Person could be the following where the “name” and “type” properties are put in sequence:

[Figure: Order Matters]
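A possible sketch of such a type is shown below. The namespace, the type name AgentType and the chosen data types and cardinalities are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical namespace, type name and cardinalities -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-person"
           xmlns="http://example.org/core-person"
           elementFormDefault="qualified">
  <xs:complexType name="AgentType">
    <!-- xs:sequence fixes the order: "name" must come before "type" -->
    <xs:sequence>
      <xs:element name="name" type="xs:string" maxOccurs="unbounded"/>
      <xs:element name="type" type="xs:anyURI" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

Replacing xs:sequence with xs:all would accept the two elements in any order, but in XSD 1.0 xs:all does not allow maxOccurs greater than 1 on its children, so the repeatable "name" would have to be modelled differently.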

Using xs:all could be limiting for receivers that parse XML data with complex XPath expressions relying on the preceding-sibling or following-sibling axes to locate elements relative to each other in the XML tree. Also be aware that xs:all is discouraged in the NIEM and UBL guidelines; see section Following the Naming and Design Rules.

4.4 Data Types

As ontologists well know, XML schema provides a variety of built-in data types. It is helpful for receivers to know what value types to expect.

For example, within Core Location, the class Geometry has some properties with data type xsd:string, allowing interoperability with solutions conformant to the INSPIRE Directive.

However, in Core Person, a GenericDate data type is defined as a union of several XML schema date types (xs:date, xs:gYearMonth and xs:gYear) to identify the date of birth or death. This could be translated into:

[Figure: Data Types 1]
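A minimal sketch of such a union type, assuming a hypothetical target namespace and the type name GenericDateType used in the text:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical target namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-person"
           xmlns="http://example.org/core-person"
           elementFormDefault="qualified">
  <!-- Accepts a full date (1980-05-17), a year and month (1980-05) or a year (1980) -->
  <xs:simpleType name="GenericDateType">
    <xs:union memberTypes="xs:date xs:gYearMonth xs:gYear"/>
  </xs:simpleType>
</xs:schema>
```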

Be aware that xs:union accepts only values of the listed member types, so reusers of the XML schema might need to extend GenericDateType.

For other types used in RDF, such as Literal or langString, a mapping could be established like the following:

[Figure: Data Types 2]
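One way such a mapping might be sketched is with complex types that carry a string value plus attributes. The type names LangStringType and LiteralType, the datatype attribute and the target namespace are assumptions for illustration; only xml:lang and the W3C schema for the xml namespace are standard.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical target namespace and type names -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-vocabularies/common"
           xmlns="http://example.org/core-vocabularies/common"
           elementFormDefault="qualified">
  <!-- Makes the standard xml:lang attribute available -->
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>

  <!-- Counterpart of rdf:langString: a string that always has a language tag -->
  <xs:complexType name="LangStringType">
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute ref="xml:lang" use="required"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <!-- Counterpart of rdfs:Literal: a string with an optional language tag
       and an optional datatype IRI (assumed attribute name) -->
  <xs:complexType name="LiteralType">
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute ref="xml:lang" use="optional"/>
        <xs:attribute name="datatype" type="xs:anyURI" use="optional"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:schema>
```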

Question for the community: Concerning this latter example, is it important to include a mapping such as the one for Literal within the XML schema?

5. Design for Reuse

In this section we are going to explore points that can support XML schema implementers when designing an XML schema to be reusable.

5.1 Reusing XML Schema

A good practice for ontologists is to reuse concepts coming from different vocabularies so that ontologies are interconnected. This is possible by pointing to their URI or via the owl:imports statement. XML schema also allows external XML schemas to be imported, via the xs:import statement.

For example, Core Vocabularies are based on DCTERMS, which provides an XML schema, so Core Person could import the DCTERMS XML schema in order to use AgentType:

[Figure: Reusing XML Schema 1]
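A sketch of such an import is given below. The target namespace and the local file name dcterms.xsd (a local copy of the DCTERMS schema) are assumptions. The type name dcterms:AgentType follows the text above and has not been checked against the published DCTERMS schema, so it should be verified before use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical target namespace; dcterms.xsd is assumed to be a local copy
     of the DCTERMS XML schema -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcterms="http://purl.org/dc/terms/"
           targetNamespace="http://example.org/core-person"
           xmlns="http://example.org/core-person"
           elementFormDefault="qualified">
  <xs:import namespace="http://purl.org/dc/terms/" schemaLocation="dcterms.xsd"/>

  <!-- Type name taken from the text, to be checked against the imported schema -->
  <xs:element name="Agent" type="dcterms:AgentType"/>
</xs:schema>
```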

However, not all ontologies are also expressed as an XML schema. For example, Core Vocabularies are based on DCTERMS and FOAF. While DCTERMS has an XML schema that could be reused, FOAF does not. It is therefore up to senders to create the related XML schema from scratch, and the result might differ from organisation to organisation.

Below is an example of how an Agent could be expressed within the Core Vocabulary namespace as equivalent to the one declared in Core Public Event:

[Figure: Reusing XML Schema 2]
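One possible sketch declares the type locally and records its origin in an annotation. The target namespace, the type and element names and the single "name" property are assumptions for illustration; the source attribute simply points at the FOAF class IRI.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical target namespace, type and element names -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-vocabularies"
           xmlns="http://example.org/core-vocabularies"
           elementFormDefault="qualified">
  <xs:complexType name="AgentType">
    <xs:annotation>
      <!-- Documents which external concept this local type stands for -->
      <xs:documentation source="http://xmlns.com/foaf/0.1/Agent">
        Locally declared equivalent of foaf:Agent, for which no XML schema is published.
      </xs:documentation>
    </xs:annotation>
    <xs:sequence>
      <xs:element name="name" type="xs:string" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="Agent" type="AgentType"/>
</xs:schema>
```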

Question for the community: What kind of approach should be considered when integrating external namespaces that do not have a respective XML schema?

5.2 Intrinsic Modularity

Vocabularies and ontologies are usually composed of classes that are related to each other and have their own properties, as in the SEMIC Core Vocabularies and the Application Profiles DCAT-AP and CPSV-AP.

Classes aggregate a set of properties with their data types; such properties and data types could be reused. For example:

  • Text, Literal, or Code are the most common data types in the Core Vocabularies and
  • properties like “description” are commonly used within classes. In particular, in CPOV, the property “description” appears in Public Organisation, Change Event and Temporal Entity.

By dividing an XML schema into different files, one for each specification and others for common data types, reusers can decide to reuse the XML schema partially (just the data types or common properties) or in its entirety.

[Figure: Intrinsic Modularity]
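Such a split could be sketched with two files. All file names, namespaces and type names below are hypothetical. The first file, assumed to be called common-data-types.xsd, holds the shared data types:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- common-data-types.xsd (hypothetical file name and namespace) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-vocabularies/common"
           xmlns="http://example.org/core-vocabularies/common"
           elementFormDefault="qualified">
  <xs:simpleType name="TextType">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
  <xs:simpleType name="CodeType">
    <xs:restriction base="xs:token"/>
  </xs:simpleType>
</xs:schema>
```

The second file, one per specification, imports the common types and reuses them:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- core-public-organisation.xsd (hypothetical file name and namespace) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:common="http://example.org/core-vocabularies/common"
           targetNamespace="http://example.org/cpov"
           xmlns="http://example.org/cpov"
           elementFormDefault="qualified">
  <xs:import namespace="http://example.org/core-vocabularies/common"
             schemaLocation="common-data-types.xsd"/>

  <xs:complexType name="PublicOrganisationType">
    <xs:sequence>
      <!-- "description" reuses the shared TextType -->
      <xs:element name="description" type="common:TextType" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

A reuser interested only in the data types could then import the first file without pulling in any specification-specific declarations.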

Question for the community: Is the above proposed approach of structuring Core Vocabularies in different files useful? 

5.3 Extension Mechanisms

While ontologies are defined under the open-world assumption (OWA), which gives the freedom to add new concepts, SHACL shapes can still provide a level of flexibility when validating, through the closed constraint component.

In XML schema the principle is the opposite. This is already evident from the fact that the minimum and maximum cardinalities default to 1. It is up to the sender to provide a way to extend the schema. In XML schema this is possible in different ways, such as using xs:any to allow any type of element or using element substitution. Also be aware that guidelines like NIEM and UBL diverge in their approach; see section Following the Naming and Design Rules.

For example using xs:any, similarly to UBL, would be:

[Figure: Extension Mechanisms 1]
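A sketch of an xs:any based extension point is shown below. It is only loosely inspired by the UBL extension element and does not reproduce the actual UBL structure; the namespace and all names are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical namespace, type and element names -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-person"
           xmlns="http://example.org/core-person"
           elementFormDefault="qualified">
  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="fullName" type="xs:string"/>
      <!-- Optional container in which reusers can place their own elements -->
      <xs:element name="Extensions" type="ExtensionsType" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="ExtensionsType">
    <xs:sequence>
      <!-- Any element from another namespace; validated only if a schema for it is available -->
      <xs:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```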

While with a substitution group, similarly to NIEM, it would be:

[Figure: Extension Mechanisms 2]
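A sketch of a substitution-group based extension point follows. It borrows the idea of an abstract "augmentation point" element from NIEM without reproducing NIEM's actual rules; the namespace and all names are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical namespace, type and element names -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-person"
           xmlns="http://example.org/core-person"
           elementFormDefault="qualified">
  <!-- Abstract placeholder: it can never appear in XML data itself -->
  <xs:element name="PersonAugmentationPoint" abstract="true"/>

  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="fullName" type="xs:string"/>
      <xs:element ref="PersonAugmentationPoint" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <!-- An extension; in practice this would typically live in the reuser's own schema -->
  <xs:element name="nickname" type="xs:string" substitutionGroup="PersonAugmentationPoint"/>
</xs:schema>
```

In the XML data, a nickname element can then appear wherever the abstract placeholder is allowed, without any change to PersonType.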

Questions for the community:

  • Is the approach of providing an extension mechanism for each defined type useful for the SEMIC community?
  • If yes, what kind of approach should be taken? 

5.4 Multiple Inheritance and Multiple Instances

Depending on how a semantic model is created, an editor can decide to create a class as a subclass of multiple classes (for example, a "Host" class that is a subclass of both "Person" and "LegalEntity") or to create an instance of multiple classes (for example, "Person123" is an instance of both "Participant" and "Performer"). This is not possible in XML schema, as a complexType can extend or restrict only one complexType and an element can be associated with only one complexType. Therefore a strategy should be put in place to maintain the same meaning.

The idea of multiple inheritance or multiple instance is, in general, to reuse properties from different classes. A technique could be making use of groups, for example:

[Figure: Multiple Inheritance and Multiple Instances]
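A sketch of this technique, using the "Host" example above, could look as follows. The namespace, the group names and the properties are assumptions for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical namespace, group names and properties -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-public-event"
           xmlns="http://example.org/core-public-event"
           elementFormDefault="qualified">
  <!-- Properties that a Person contributes -->
  <xs:group name="PersonProperties">
    <xs:sequence>
      <xs:element name="fullName" type="xs:string"/>
    </xs:sequence>
  </xs:group>

  <!-- Properties that a LegalEntity contributes -->
  <xs:group name="LegalEntityProperties">
    <xs:sequence>
      <xs:element name="legalName" type="xs:string"/>
    </xs:sequence>
  </xs:group>

  <!-- HostType reuses both sets of properties without extending either type -->
  <xs:complexType name="HostType">
    <xs:sequence>
      <xs:group ref="PersonProperties"/>
      <xs:group ref="LegalEntityProperties"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

The trade-off is that the schema no longer states that a Host is a Person or a LegalEntity; that relation remains visible only in the conceptual model.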

Question for the community: Currently the SEMIC Core Vocabularies do not have classes inheriting from multiple classes, but reusers could. Is the approach above needed by reusers of the Core Vocabularies when expressing them as XML schema?

5.5 Choosing the Right Pattern

When designing an XML schema, different approaches can be taken, and each impacts reusability in its own way. Among them, Venetian Blind and Garden of Eden should be preferred because complex types can be reused by different elements, so the XML schema can be extended.

The difference between the two is that Venetian Blind allows for one global element, a sort of starting point of an XML message, while Garden of Eden allows multiple global elements and therefore multiple starting points.

As ontologies are represented as graphs, there is no real starting point. Sometimes certain classes are more connected than others, and not all the relations are bidirectional.

Choosing a starting point in the XML schema implies a direction that should be reflected in an ontology as a direct relation from one class to another.

If we look at the Core Business Vocabulary, choosing to start from the Legal Entity class is a way to reach all the other classes.

For example a possible implementation with Venetian Blind (LegalEntity is the only global element having elements) would be:

[Figure: Choosing the Right Pattern 1]
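A Venetian Blind sketch starting from Legal Entity might look like this. The namespace and the property names legalName, identifier and notation are assumptions for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical namespace and property names -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-business"
           xmlns="http://example.org/core-business"
           elementFormDefault="qualified">
  <!-- Single global element: the only possible root of a message -->
  <xs:element name="LegalEntity" type="LegalEntityType"/>

  <!-- Global, reusable types; their child elements are declared locally -->
  <xs:complexType name="LegalEntityType">
    <xs:sequence>
      <xs:element name="legalName" type="xs:string"/>
      <xs:element name="identifier" type="IdentifierType" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="IdentifierType">
    <xs:sequence>
      <xs:element name="notation" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```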

However, one could start from the Identifier class, which in the Core Business Vocabulary is related to the Legal Entity via the "identifies" relation:

[Figure: Choosing the Right Pattern 2]
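The same Venetian Blind pattern with Identifier as the starting point could be sketched as follows, again with a hypothetical namespace and property names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical namespace and property names -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-business"
           xmlns="http://example.org/core-business"
           elementFormDefault="qualified">
  <!-- Single global element: messages now start from the identifier -->
  <xs:element name="Identifier" type="IdentifierType"/>

  <xs:complexType name="IdentifierType">
    <xs:sequence>
      <xs:element name="notation" type="xs:string"/>
      <!-- The direction of the relation is reversed compared to starting from Legal Entity -->
      <xs:element name="identifies" type="LegalEntityType" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="LegalEntityType">
    <xs:sequence>
      <xs:element name="legalName" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```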

While with Garden of Eden this would be (notice the two global elements LegalEntity and Identifier):

[Figure: Choosing the Right Pattern 3]
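A Garden of Eden flavoured sketch is given below, with a hypothetical namespace and property names. For brevity the simple leaf properties are kept local; a strict Garden of Eden schema would declare those elements globally as well.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical namespace and property names -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/core-business"
           xmlns="http://example.org/core-business"
           elementFormDefault="qualified">
  <!-- Two global elements: either can be the root of a message -->
  <xs:element name="LegalEntity" type="LegalEntityType"/>
  <xs:element name="Identifier" type="IdentifierType"/>

  <xs:complexType name="LegalEntityType">
    <xs:sequence>
      <xs:element name="legalName" type="xs:string"/>
      <!-- Global elements are reused by reference -->
      <xs:element ref="Identifier" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="IdentifierType">
    <xs:sequence>
      <xs:element name="notation" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```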

Obliging end users to use a single starting point might constrain how information can be represented; it is therefore recommended to use the Garden of Eden pattern. Note that this approach is also used by NIEM and UBL; see section Following the Naming and Design Rules.

5.6 Following the Naming and Design Rules

Throughout the sections above we have seen different ways in which a sender could improve the reusability of the XML schema.

These ways are expressed by two well-known XML schemas, NIEM and UBL, each established in a different domain and scope. While NIEM focuses on a set of concepts that can be reused in different domains, UBL focuses on the business language. Alongside their XML schemas, both define strict Naming and Design Rules.

Some of these rules are in common, such as:

  • The XML schema must have a version, see NIEM, UBL;
  • All element declarations must be global, see NIEM and UBL;
  • All types must be global, see NIEM and UBL;
  • No use of xs:all, see NIEM and UBL;
  • No use of xs:any, see NIEM and UBL (NIEM speaks about augmentation point while UBL provides its extension element);
  • No use of xs:choice, see NIEM and UBL.

Other rules differ, such as:

  • On the use of substitution groups, see NIEM and UBL;
  • On the use of union, see NIEM and UBL.

From the linked data point of view, we can notice that:

Question for the community: Aside from the common rules that could already be adopted, what kind of approach should be followed?

6. Approaches in Mapping/Transforming a Conceptual Model with a Physical Data Model

This section illustrates the approaches that exist for mapping a conceptual model to a physical data model, together with existing standards and tools.

6.1 Approaches

Lowering and lifting are operations that require mapping the concepts of the conceptual model to those of the physical model. These mappings help to transform instances of the conceptual model into instances of the physical model (and vice versa), or can simply remain at the level of a conceptual mapping.

There are different approaches that one can take, which are discussed in the next subsections:

  1. Lowering, from RDF to XML schema
  2. Lifting, from XML schema to RDF
  3. Using a domain-specific language

6.1.1 Lowering, from RDF to XML Schema

As ontology designers know, RDF can be expressed in different serialisations such as Turtle, JSON-LD and RDF/XML. The latter uses XML syntax, which makes it suitable for a transformation. Indeed, one could think of applying XSLT to perform such a transformation.

When the RDF and the XML schema belong to different organisations, it could be wise to first map the concepts described in RDF to those in the XML schema, by pointing to their respective XPath in the XML document, in order to reach a common agreement. Such XPaths could then be used in a next step to perform an XSLT transformation.
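As a rough sketch of such an XSLT-based lowering, the stylesheet below turns foaf:Person nodes of an RDF/XML document into a plain XML structure. The output element names Persons, Person and fullName are hypothetical, and the stylesheet assumes that persons are serialised as typed node elements directly under rdf:RDF. Since RDF/XML allows several equivalent serialisations of the same graph (for example rdf:Description with an rdf:type property), a real transformation would need to normalise the input first.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Output element names are hypothetical; only one RDF/XML layout is handled -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                xmlns:foaf="http://xmlns.com/foaf/0.1/"
                exclude-result-prefixes="rdf foaf">
  <xsl:output method="xml" indent="yes"/>

  <!-- Root of the RDF/XML document becomes the root of the target document -->
  <xsl:template match="/rdf:RDF">
    <Persons>
      <xsl:apply-templates select="foaf:Person"/>
    </Persons>
  </xsl:template>

  <!-- Each foaf:Person node becomes a Person element; its IRI is kept as an attribute -->
  <xsl:template match="foaf:Person">
    <Person id="{@rdf:about}">
      <xsl:for-each select="foaf:name">
        <fullName><xsl:value-of select="."/></fullName>
      </xsl:for-each>
    </Person>
  </xsl:template>
</xsl:stylesheet>
```

The XPath expressions in the match and select attributes are exactly the kind of agreed mapping described above: each one ties an RDF concept to a location in the XML tree.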

There are also a couple of specifications that can be useful:

1) XSPARQL, a 2009 W3C submission that combines XQuery and SPARQL;

2) STTL, a 2017 Inria specification that is also based on SPARQL but uses templates, like XSLT.

For the latter, there are a couple of open source tools that can support STTL, such as Corese and SPARQL-Generate.

6.1.2 Lifting, from XML schema to RDF

There are different techniques that can help mapping an XML schema to RDF.

W3C proposes at least two approaches:

  1. GRDDL, a 2007 W3C recommendation, where each XML schema element is mapped to the equivalent RDF Concept via an XSLT transformation. In this way, XML instances can be converted to RDF instances indirectly via XSLT transformations. However, such mappings require a change in the XML schema, by adding them in the annotation section. 
  2. SAWSDL, a 2007 W3C recommendation, that allows either:
    1. mapping each XML schema type/element/attribute to a URI of the conceptual model via an XML schema attribute (model reference approach). This approach is used, for example, by the Finnish Data Portal;
    2. mapping each XML schema type/element/attribute to an XSLT that performs the transformation via an XML schema attribute (schema mapping approach). This second approach is similar to GRDDL.
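The two SAWSDL approaches can be sketched on a small schema as follows. The sawsdl:modelReference and sawsdl:liftingSchemaMapping attributes come from the SAWSDL recommendation; the target namespace, the type and element names, the choice of FOAF IRIs as model references and the stylesheet URL are assumptions for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical namespace, names and mapping URL -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sawsdl="http://www.w3.org/ns/sawsdl"
           targetNamespace="http://example.org/core-person"
           xmlns="http://example.org/core-person"
           elementFormDefault="qualified">
  <!-- Model reference approach: each component points to a concept IRI -->
  <xs:complexType name="PersonType"
                  sawsdl:modelReference="http://xmlns.com/foaf/0.1/Person">
    <xs:sequence>
      <xs:element name="fullName" type="xs:string"
                  sawsdl:modelReference="http://xmlns.com/foaf/0.1/name"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Schema mapping approach: the element points to a stylesheet that lifts it to RDF -->
  <xs:element name="Person" type="PersonType"
              sawsdl:liftingSchemaMapping="http://example.org/mappings/person-to-rdf.xsl"/>
</xs:schema>
```

Because the annotations are attributes from a foreign namespace, the schema remains valid for XML schema processors that know nothing about SAWSDL.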

Among existing open source tools there are different approaches:

  • Redefer (last update in 2017) uses XSLT to map each XML schema construct to an OWL construct;
  • Ontmalizer (last update in 2018) uses a programmatic approach (Java) to perform the transformation into OWL (by using the Jena Ontology API);
  • XMLSchema2Shex (last update in 2018) uses a programmatic approach (Scala) to perform the transformation into ShEx shapes;
  • XSD2SHACL (last update in 2023) uses a programmatic approach (Python) to perform the transformation into SHACL shapes.

6.1.3 Using a Domain Specific Language

An alternative to the approaches above is to start from a domain-specific language from which both RDF and XML schema can be generated.

Among the tools there is SHAX (last update in 2019, freely available on GitHub), which, starting from the SHAX language, can generate RDF and XML schema (and JSON schema). To do so, SHAX leverages a Semantic Map that associates an RDF IRI with the XPath of the XML elements.

7. Conclusions

RDF models and XML schema are similar in certain aspects but have many differences. A data model designer needs to take into account the constraints that will be encountered and look for strategies to enhance the reusability of the data model. When moving from one to the other, the designer can leverage existing standards and tools to perform the transformation.

Concerning Core Vocabularies, as seen throughout the blog post, there are different open questions that the community can help with, such as:

  1. The need for a validation service
  2. What kind of metadata is needed
  3. How to define data types
  4. Whether to reuse, and how to reimplement, external XML schemas
  5. What level of modularity the Core Vocabularies should have
  6. How to allow extensions
  7. How to allow multiple inheritance
  8. What kind of rules need to be followed

By reaching consensus on such questions, Core Vocabularies could find better expression in XML.

Please feel free to provide feedback on this blog post. This can be done in various ways. Leave a comment on this article, create an issue on the Style Guide GitHub repository, where we shared our models, tagged with the label Blog-XMLMapping, or share your thoughts at various SEMIC meetings or webinars organised on related topics, such as the upcoming SEMIC Style Guide Webinar.
