Current page last modified at: 07-Dec-2017 12:32:44

Linking Strategies

This page describes how the different graphs in the data store are linked together. There are two linking scenarios:

  • linking between the platform's internal graphs (such as provenance and generated datasets)
  • linking between datasets that are part of the working data, through workflow executions.

Provenance Data

The core content of the provenance graph (PG) consists of descriptions of workflows and their executions. A workflow description also contains links to its related steps, whereas a workflow execution description links to the dataset it generated.

URIs for resources come from external IDs assigned by the workflow system (e.g. workflow IDs) and from the configuration of the workflow itself (e.g. dataset URIs).

A dataset URI refers to the named graph in the working data that contains all the triples generated by the associated activity.

<attx:provenance> {
  <attx:wf/1>
    a attx:Workflow ;
    dct:title "Harvest UH's publication and infrastructure data + links between them" .

  <attx:wf/2>
    a attx:Workflow ;
    dct:title "Harvest national infrastructure bank's descriptions" .

  <attx:activity/1>
    a attx:Activity ;
    prov:qualifiedAssociation [
      prov:hadPlan <attx:wf/1>
    ] ;
    prov:generated <attx:ds/1> .

  <attx:activity/2>
    a attx:Activity ;
    prov:qualifiedAssociation [
      prov:hadPlan <attx:wf/2>
    ] ;
    prov:generated <attx:ds/2> .

  <attx:ds/1>
    a attx:Dataset .
    # we can add more interesting attributes here such as owner of the harvested data...

  <attx:ds/2>
    a attx:Dataset .

}

Working Data

Working data is partitioned into named graphs that each contain the triples generated by a certain activity (i.e. workflow execution).

The platform generates "brokered" URIs for every resource in the working data. These URIs should be built from a common base name, some workflow-related metadata, and original metadata such as external identifier(s).
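The exact URI scheme is not specified here, so the following is only a minimal sketch. It assumes the pattern suggested by URIs such as <attx:ds/1/work/pub/1> in this page's examples: base name, dataset ID, a fixed "work" segment, then the external identifier. The function name `brokered_uri` and its parameters are hypothetical.

```python
from urllib.parse import quote


def brokered_uri(base: str, dataset_id: str, external_id: str) -> str:
    """Build a platform-internal URI from a base name, workflow metadata and an external ID.

    Assumed layout (not confirmed by the platform): <base><dataset_id>/work/<external_id>
    """
    # Percent-encode characters that are unsafe in a URI, keeping "/" and ":" readable.
    safe_id = quote(external_id, safe="/:")
    return f"{base}{dataset_id}/work/{safe_id}"


uri = brokered_uri("attx:", "ds/1", "pub/1")
print(uri)  # attx:ds/1/work/pub/1
```

Because the URI is derived only from its inputs, running the same workflow on the same source record yields the same URI again, which keeps repeated harvests from minting duplicates.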

Examples of working data generated by data acquisition workflows

# Using TriG - https://en.wikipedia.org/wiki/TriG_(syntax)
<attx:ds/1> {
  <attx:ds/1/work/pub/1>
    dct:identifier "pub/1" ;
    dct:identifier <doi:1> ;
    dct:title "test" ;
    dct:publisher "HY" .

  <attx:ds/1/work/pub/2>
    dct:identifier "pub/2" ;
    dct:title "test 2" ;
    dct:publisher "HY" ;
    custom:relatedInfra <urn:2> .

  <attx:ds/1/work/infra/1>
    dct:identifier "infra/1" ;
    attx:otherID <urn:2> ;
    dct:title "Infra 1" .

  <attx:ds/1/work/pub/1> attx:hasRelatedInfra <attx:ds/1/work/infra/1> .
}

<attx:ds/2> {
  <attx:ds/2/work/infra/1>
    dct:identifier <urn:2> ;
    dct:title "Infra 1" ;
    infra:service [
      rdfs:label "Service 1"
    ] ;
    infra:service [
      rdfs:label "Service 2"
    ] .
}

<attx:ds/3> {
  <attx:etsin/urn:1>
    dct:source <etsin:urn:1> ;
    dct:title "Test dataset" ;
    dct:publisher "HY" .
}

Linking Datasets

Linking Provenance and Working Data Graphs

The Provenance and Working Data graphs can be linked by using the dataset IDs from the provenance graph as the names of the named graphs in the working data. In the examples above, <attx:ds/1> and <attx:ds/2> illustrate such a linkage. This process is under the control of the user, who specifies the inputGraphURI and the outputGraphURI for the given Working Data graphs.

Linking Between Working Data Graphs

Linking between datasets in the working data should always happen through some kind of processing workflow. This has one drawback and two benefits:

  • It makes it harder to use existing links in the source data, because processing is always required, even if the source data already contains external identifiers.
  • It makes it easy to manage and run multiple linking implementations on the same datasets. For example: adding skos:exactMatch links based on identifiers, identifier-based linking on object properties, NER-based linking using software X, NER-based linking using software Y, etc.
  • It allows one to attach metadata, such as different kinds of quality metrics, to the graphs that contain the generated links.

Resources in the graph store can have more than one identifier, and each of them can be used to link resources together. For example, <attx:ds/1/work/infra/1> has two identifiers, and the secondary identifier links it to the description in the graph <attx:ds/2>.

Link processing happens against the working data in two stages:

  • In the first stage the identifiers in the selected datasets are clustered, so that we have a mapping from each of the platform's internal resource URLs to all of its identifiers.

Example of platform URL to identifier mappings in Turtle, using the working data example above with <attx:ds/1> and <attx:ds/2> as the input graphs:

<attx:ds/1/work/pub/1>
  <attx:id> "pub/1" ;
  <attx:id> <doi:1> .

<attx:ds/1/work/pub/2>
  <attx:id> "pub/2" .

<attx:ds/1/work/infra/1>
  <attx:id> "infra/1" ;
  <attx:id> <urn:2> .

<attx:ds/2/work/infra/1>
  <attx:id> <urn:2> .

The <attx:id> property stands for a set of properties that denote identifiers. Such a set could be implemented with a super-property that denotes a general identifier and multiple sub-properties that characterise specific identifiers, either by class type (classes) or by data type (values).
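The clustering stage can be sketched as follows. This is an illustration only: triples are modelled as plain Python tuples, and the set `ID_PROPERTIES` stands in for the identifier super-property and its sub-properties. Which properties belong in that set is an assumption made for this example.

```python
from collections import defaultdict

# Properties treated as identifiers (assumed for illustration, not a platform list).
ID_PROPERTIES = {"dct:identifier", "attx:otherID"}


def cluster_identifiers(triples):
    """Map each platform resource URI to the set of all its identifier values."""
    clusters = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in ID_PROPERTIES:
            clusters[subject].add(obj)
    return dict(clusters)


# Identifier-bearing triples taken from the <attx:ds/1> and <attx:ds/2> examples.
triples = [
    ("attx:ds/1/work/pub/1", "dct:identifier", "pub/1"),
    ("attx:ds/1/work/pub/1", "dct:identifier", "doi:1"),
    ("attx:ds/1/work/pub/1", "dct:title", "test"),
    ("attx:ds/1/work/pub/2", "dct:identifier", "pub/2"),
    ("attx:ds/1/work/infra/1", "dct:identifier", "infra/1"),
    ("attx:ds/1/work/infra/1", "attx:otherID", "urn:2"),
    ("attx:ds/2/work/infra/1", "dct:identifier", "urn:2"),
]

clusters = cluster_identifiers(triples)
for resource, ids in sorted(clusters.items()):
    print(resource, sorted(ids))
```

Non-identifier triples such as dct:title are skipped, so the result contains the same resource-to-identifier mapping as the Turtle listing above.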

  • In the second stage we generate new triples that link platform URLs, using the clustered IDs and the values of selected properties in the original graph. For example, generating skos:exactMatch links between the infrastructure descriptions in graphs <attx:ds/1> and <attx:ds/2>:

construct {
  ?r1 skos:exactMatch ?r2
}
from <clustered_ids.ttl>
where {
  ?r1 attx:id ?id .
  ?r2 attx:id ?id .
  filter(?r1 != ?r2) # filtering out trivial <x> skos:exactMatch <x> triples
}

Results of identifier-based linking:

<attx:links/1> {
  <attx:ds/1/work/infra/1> skos:exactMatch <attx:ds/2/work/infra/1>
}
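The same identifier-based matching can be sketched procedurally, which may help when prototyping outside a SPARQL engine. The sketch assumes the clustered IDs are available as a dictionary from resource URI to identifier set; the function name is hypothetical. Since skos:exactMatch is symmetric, each matching pair is emitted once rather than in both directions.

```python
from collections import defaultdict
from itertools import combinations


def exact_match_links(clusters):
    """Emit one skos:exactMatch triple per pair of distinct resources sharing an identifier."""
    # Invert the mapping: identifier -> resources that carry it.
    by_id = defaultdict(set)
    for resource, ids in clusters.items():
        for identifier in ids:
            by_id[identifier].add(resource)

    links = set()
    for resources in by_id.values():
        # combinations() never pairs a resource with itself, so trivial links are excluded.
        for r1, r2 in combinations(sorted(resources), 2):
            links.add((r1, "skos:exactMatch", r2))
    return sorted(links)


# Clustered IDs from the example above.
clusters = {
    "attx:ds/1/work/pub/1": {"pub/1", "doi:1"},
    "attx:ds/1/work/pub/2": {"pub/2"},
    "attx:ds/1/work/infra/1": {"infra/1", "urn:2"},
    "attx:ds/2/work/infra/1": {"urn:2"},
}

links = exact_match_links(clusters)
for link in links:
    print(link)
```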

Another scenario is to use clustered IDs with the original graph data to generate new links based on object properties.

For example, linking pub/2 with the infra <urn:2> based on the value of the publication's custom:relatedInfra property:

construct {
  ?r1 attx:hasRelatedInfra ?r2
}
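from <clustered_ids.ttl>
from named <attx:ds/1>
from named <attx:ds/2>
where {
  # the publication data lives in a named graph, so it is matched inside GRAPH
  graph ?g { ?r1 custom:relatedInfra ?infra }
  # the clustered IDs form the default graph
  ?r2 attx:id ?infra
}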

Results of identifier-based linking on an object property (showing the link to the description in graph <attx:ds/2>):

<attx:links/1> {
  <attx:ds/1/work/pub/2> attx:hasRelatedInfra <attx:ds/2/work/infra/1>
}
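A procedural sketch of the object-property case is given below, under the same assumptions as before: triples are plain tuples, the clustered IDs are a dictionary, and the function name is hypothetical. With the example data both infra descriptions carry the identifier <urn:2>, so this sketch links pub/2 to each of them. Restricting the targets to a single graph would need an additional filter.

```python
def link_by_object_property(triples, clusters, source_property, link_property):
    """Resolve identifier-valued properties to platform URIs and emit link triples."""
    links = set()
    for subject, predicate, value in triples:
        if predicate != source_property:
            continue
        # Every resource whose identifier cluster contains the value is a link target.
        for resource, ids in clusters.items():
            if value in ids:
                links.add((subject, link_property, resource))
    return sorted(links)


# Clustered IDs and the relevant triple from the <attx:ds/1> example.
clusters = {
    "attx:ds/1/work/pub/2": {"pub/2"},
    "attx:ds/1/work/infra/1": {"infra/1", "urn:2"},
    "attx:ds/2/work/infra/1": {"urn:2"},
}
triples = [("attx:ds/1/work/pub/2", "custom:relatedInfra", "urn:2")]

links = link_by_object_property(
    triples, clusters, "custom:relatedInfra", "attx:hasRelatedInfra"
)
for link in links:
    print(link)
```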
