To Be Refactored for version 2
ATTX Semantic Broker - Data Administrator's User Guide
Introduction
This user guide is for people who have an ATTX Semantic Broker platform instance up and running and want to start working with data. ATTX uses the UnifiedViews ETL tool to define, execute, monitor, debug and schedule data-related processes. In UnifiedViews these processes are defined as pipelines, which consist of steps implemented using DPUs (Data Processing Units).
The traditional way of using ETL tools is to create potentially complex pipelines that Extract data from one or more data sources, run multiple Transformations, and finally Load the data into another system. In these scenarios the ETL tool sits in the middle, handling data flows from A to B. In the ATTX Semantic Broker, ETL is part of the system itself and the system can use itself as the data source. From the ETL point of view, it is an opinionated framework for building pipelines.
ATTX Semantic Broker has three types of pipelines: Ingestion, Processing and Distribution. Each type has its own characteristics, but the main ideas are as follows:
- Ingestion pipelines are the only ones that work with external data;
- Processing pipelines only work with internal data;
- Distribution pipelines are the only ones that are used to expose/distribute/publish data to the public.
Pipeline Types
Every pipeline essentially represents one or more datasets.
Ingestion Pipelines
Simple download and replace
The simplest ingestion pipeline consists of three DPUs, as depicted in figure X. In this example the downloaded data is already in RDF format and requires no transformations. The Replace Data Set loader means that the old version of the data is always completely replaced with the new version.
This kind of pipeline can be used to download ontologies or vocabularies into the Semantic Broker.
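Outside UnifiedViews, the same download-and-replace pattern can be sketched in a few lines of Python. The source URL, SPARQL update endpoint and target graph below are hypothetical placeholders, not part of the platform:

```python
# Minimal sketch of the "download and replace" pattern; the real pipeline runs
# as UnifiedViews DPUs, this is only an illustration with hypothetical URLs.
import requests
from rdflib import Graph

SOURCE_URL = "https://example.org/ontology.ttl"         # hypothetical source
UPDATE_ENDPOINT = "http://localhost:3030/attx/update"   # hypothetical endpoint
TARGET_GRAPH = "http://example.org/graph/ontology"      # hypothetical target graph

# "File Download" step: the source is already RDF, so no transformation is needed.
response = requests.get(SOURCE_URL, timeout=60)
response.raise_for_status()

# "Replace Data Set" step: drop the old graph completely and load the new version.
graph = Graph().parse(data=response.text, format="turtle")
update = (
    f"DROP SILENT GRAPH <{TARGET_GRAPH}> ;\n"
    f"INSERT DATA {{ GRAPH <{TARGET_GRAPH}> {{\n{graph.serialize(format='nt')}\n}} }}"
)
requests.post(UPDATE_ENDPOINT, data={"update": update}, timeout=60).raise_for_status()
```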
Download and transform
It is more common that the source data is not in RDF format or that the structure of the data is not exactly what you want to store internally. Figure X shows an example where the RMLTransformer is used to transform an example CSV file into RDF before adding it to the Semantic Broker's data storage.
Both the replace and update data set DPUs require RDF input, so a transformation step is required for all input data that is not available in RDF format.
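To illustrate what such a transformation step produces, the following sketch converts a small CSV table into RDF with plain Python and rdflib instead of RML; the column names and namespace are made up for the example:

```python
# Illustration of a CSV-to-RDF transformation; plain Python and rdflib are used
# here instead of RML, and all names are hypothetical.
import csv
import io
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/attx/")   # hypothetical namespace
csv_data = "id,title\n1,First publication\n2,Second publication\n"

graph = Graph()
for row in csv.DictReader(io.StringIO(csv_data)):
    subject = EX[f"publication/{row['id']}"]
    graph.add((subject, EX["title"], Literal(row["title"])))

print(graph.serialize(format="turtle"))
```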
Incremental harvesting
The OAI-PMH harvesting interface allows one to harvest new records incrementally based on the timestamp they were last modified. Incremental harvesting requires the UpdateDataSet DPU to be used; otherwise the target dataset will only contain the latest changes.
The example workflow has two loader DPUs, because we want to store the clustered IDs in a separate dataset so that we can update the clustering configuration more easily.
Processing Pipelines
Link by ID
This example uses the clustered IDs and harvested datasets as the source data and generates new explicit links between resources. For example:
Clustered IDs:

    :pub1 attx:id "urn:1" .
    :org1 attx:id "orgID1" .

Working data 1:

    :pub1 a Publication ;
        urn "urn:1" ;
        org "orgID1" .

Link by ID data set:

    :pub1 orgLink :org1 .
Simple reasoning
The OWNReasoner can be configured to use either a pipeline-specific ontology or an ontology data set as its configuration. This example shows the latter case. The SelectExistingDataSets DPU connected to the Ontology URI port selects the data set that contains the ontology triples, whereas the other SelectExistingDataSets DPU refers to the data set that contains the triples that will be the target of the reasoning process.
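The reasoning itself is performed by the reasoner DPU; as a rough stand-in, the idea can be sketched with the owlrl package for rdflib. The ontology and instance data below are hypothetical:

```python
# Sketch of ontology-driven inference using owlrl (OWL-RL for rdflib) as a
# stand-in for the reasoner DPU; the ontology and instance data are hypothetical.
from rdflib import Graph
from owlrl import DeductiveClosure, OWLRL_Semantics

ontology = Graph().parse(data="""
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .
ex:Publication rdfs:subClassOf ex:Document .
""", format="turtle")

data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:pub1 a ex:Publication .
""", format="turtle")

# Combine the ontology graph (Ontology URI port) with the target data set and
# expand the deductive closure; ex:pub1 is then also typed as ex:Document.
combined = ontology + data
DeductiveClosure(OWLRL_Semantics).expand(combined)
print(combined.serialize(format="turtle"))
```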
Distribution Pipelines
A basic distribution pipeline example selects all the required source data and uses the custom RDF to JSON mapper to transform the graph into documents. The resulting documents are added to the REST API using the PublishToAPI DPU.
ATTX DPUs
ATTX Semantic Broker ships with custom UnifiedViews DPUs that must be used when designing pipelines. DPUs have been categorized into extract, transform/generate and load classes and in most cases every pipeline contains at least one DPU from each category.
Extractors
ATTX extractor DPUs are used to configure the ingestion of an external data source into the Semantic Broker, or the selection of existing dataset(s) for processing or dissemination.
File Download
This DPU simply downloads the file from the given URL, stores it, and passes it on for further processing. It can be used to update datasets that are based on, for example, CSV exports.
Configuration:
- URL *
- username
- password
- method
- payload
- ignore SSL errors
- timeout
Output:
- files
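As a rough illustration (not the DPU's internals), the configuration options above map onto an ordinary HTTP request; the URL and credentials are placeholders:

```python
# Rough mapping of the File Download options onto an HTTP request;
# the URL and credentials are hypothetical.
import requests

response = requests.request(
    method="GET",                          # method
    url="https://example.org/export.csv",  # URL (required)
    auth=("username", "password"),         # username / password
    data=None,                             # payload (for POST/PUT requests)
    verify=True,                           # set to False to ignore SSL errors
    timeout=60,                            # timeout in seconds
)
response.raise_for_status()

with open("export.csv", "wb") as f:        # pass the file on as the DPU output
    f.write(response.content)
```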
OAI-PMH Harvester
This is a more complicated downloader that works with data sources that comply with the OAI-PMH 2.0 metadata harvesting protocol. The harvester can be used for selective and incremental harvesting via the set, from and until parameters. Set defines a named set of records, e.g. "Openly_available_theses", to harvest. From and until are YYYY-MM-DD (date) formatted strings that can be used to filter harvested records based on their timestamps. Another way to filter records by timestamp is to harvest only records that were added/modified/deleted since the previous successful execution of the pipeline.
If the data source supports tracking of deleted records, those are outputted as a separate set of OAI-PMH identifiers.
Configuration:
- base URL *
- metadata format *
- set
- Filter by date (bool)
- from date
- until date
- since previous successful harvesting
Output:
- deleted ids (OAI-PMH identifiers of deleted records)
- modified records
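The configuration options correspond to standard OAI-PMH request parameters. A minimal sketch of a selective, incremental ListRecords request, with a hypothetical base URL and set name:

```python
# Minimal OAI-PMH ListRecords request showing how set/from/until map onto the
# protocol's query parameters; the base URL and set name are hypothetical.
import requests

params = {
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",           # metadata format (required)
    "set": "Openly_available_theses",     # optional selective harvesting
    "from": "2017-01-01",                 # optional incremental harvesting
    "until": "2017-12-31",
}
response = requests.get("https://example.org/oai", params=params, timeout=60)
response.raise_for_status()
print(response.text)  # OAI-PMH XML; deleted records carry status="deleted" in their header
```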
Select Existing Datasets
This DPU provides a search interface to the existing dataset metadata and allows users to reuse the datasets' content as the data source for pipelines. This DPU must be used as the data source for both processing and distribution pipelines.
Configuration/Output:
- Selected dataset URIs
Transformers
ATTX transformers create new data. New data can be metadata about the datasets or data that is generated through some type of processing, such as linking, reasoning or any kind of transformation.
Describe External Data Source
This DPU is used to input simple metadata about the data source that is being used as the basis of an internal data set. The main difference between an internal data set and an external data source is that the latter must contain some indication of its license. The same licensing information is then attached to any dataset derived from this data source. If two external data sources with distinct licenses are used in a Processing pipeline, the license associated with the derived dataset will be the most restrictive one.
Configuration:
- Name *
- Description
- License *
Output:
- Data source metadata
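As an illustration only, the resulting data source metadata could look roughly like the triples below; the vocabulary and URIs are hypothetical and not necessarily what the DPU emits internally:

```python
# Illustrative sketch of data source metadata as RDF; the vocabulary and URIs
# are hypothetical, not necessarily what the DPU produces.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

source = URIRef("http://example.org/attx/datasource/helda")
graph = Graph()
graph.add((source, DCTERMS.title, Literal("Helda publication data")))
graph.add((source, DCTERMS.description, Literal("Publication metadata harvested from Helda")))
graph.add((source, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))
print(graph.serialize(format="turtle"))
```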
Describe Data Set
Provides simple metadata for the internal dataset, which is used, for example, by the "Select Existing Datasets" DPU.
Configuration:
- Name *
- Description
Output:
- Data set metadata
Cluster IDs
This DPU generates triples that cluster all the different resource identifiers under one common property.
When used as part of an ingestion pipeline, one must use the RDF input connector. This provides the currently ingested RDF as the source for clustering.
This DPU can also be used as part of a processing pipeline when paired with the "Select Existing Datasets" DPU. This way the source data is the full content of one or more working graphs.
How about a case where data about the same Thing is coming in from two different sources?
Configuration:
- list of properties to use for clustering as URIs
Output:
- Clustered IDs RDF
Example:
Configuration:
- dc:identifier, custom:urn
Input:
<attx:ds/1/work/infra/1>
<dc:identifier> "infra/1" ;
<custom:urn> <urn:2> .
Output:
<attx:ds/1/work/infra/1>
<attx:id> "infra/1" ;
<attx:id> <urn:2> .
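A rough sketch of the clustering logic behind this example; the prefix URIs are hypothetical stand-ins for the real ones:

```python
# Sketch of the ID clustering step: every value of a configured identifier
# property is restated under the common attx:id property.
from rdflib import Graph, Namespace, URIRef

ATTX = Namespace("http://example.org/attx/")                    # hypothetical prefix
CLUSTER_PROPERTIES = [URIRef("http://purl.org/dc/terms/identifier"),
                      URIRef("http://example.org/custom/urn")]  # configured properties

def cluster_ids(source: Graph) -> Graph:
    """Restate each configured identifier property as the common attx:id."""
    clustered = Graph()
    for prop in CLUSTER_PROPERTIES:
        for subject, value in source.subject_objects(predicate=prop):
            clustered.add((subject, ATTX.id, value))
    return clustered
```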
RML Transformer
Uses an RML processor to transform CSV, XML and JSON files to RDF.
Configuration:
- RML configuration
Output:
- RDF
Link by ID
Creates new explicit links from identifiers using the clustered IDs information.
Configuration:
- Source type
- Source property
- Target property
Output:
- RDF
Example:
Working data 1:

    :pub1 a Publication ;
        urn "urn:1" ;
        org "orgID1" .
    :org1 extID "orgID1" .

IDs graph:

    :pub1 attx:id "urn:1" .
    :org1 attx:id "orgID1" .

Configuration:

- source type: Publication
- source property: org
- target property: orgLink

New data:

    :pub1 orgLink :org1 .
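A rough sketch of the linking logic behind this example; the namespaces are hypothetical stand-ins:

```python
# Sketch of the Link by ID step: resources of the source type are linked to
# whichever resource carries the same clustered attx:id value.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")          # hypothetical data namespace
ATTX = Namespace("http://example.org/attx/")   # hypothetical attx namespace

def link_by_id(working: Graph, ids: Graph,
               source_type=EX.Publication,     # source type
               source_prop=EX.org,             # source property
               target_prop=EX.orgLink):        # target property
    links = Graph()
    for subject in working.subjects(RDF.type, source_type):
        for value in working.objects(subject, source_prop):
            # Find the resource whose clustered attx:id equals the value.
            for target in ids.subjects(ATTX.id, value):
                links.add((subject, target_prop, target))
    return links
```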
Custom RDF to JSON Mapper
Uses a custom configuration file to traverse and transform RDF into nested JSON documents.
TODO: Add more detailed description of mapping file.
The RDF to JSON mapper can only work on existing data sets and therefore cannot be part of an ingestion pipeline.
Input:
- dataset URIs
Configuration:
- Custom mapping configuration
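The mapping file format is not documented here yet, but the general idea of traversing a graph into nested documents can be sketched roughly as follows. The properties and URIs are made up, and this is not the DPU's actual mapping logic:

```python
# Rough sketch of the graph-to-nested-JSON idea: literal values become fields
# and linked resources become nested objects. No cycle handling; illustration only.
import json
from rdflib import Graph, Literal, URIRef

def resource_to_document(graph: Graph, subject) -> dict:
    doc = {}
    for _, prop, value in graph.triples((subject, None, None)):
        key = prop.split("/")[-1].split("#")[-1]   # crude local-name extraction
        if isinstance(value, Literal):
            doc[key] = str(value)
        else:
            doc[key] = resource_to_document(graph, value)  # nest the linked resource
    return doc

g = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:pub1 ex:title "Testing" ; ex:org ex:org1 .
ex:org1 ex:name "Helsinki University Library" .
""", format="turtle")

print(json.dumps(resource_to_document(g, URIRef("http://example.org/pub1")), indent=2))
```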
JSON-LD Framing Based RDF to JSON Mapper
Uses JSON-LD framing to create deterministic serialization of a graph to JSON document.
The RDF to JSON mapper can only work on existing data sets and therefore cannot be part of an ingestion pipeline.
Input:
- dataset URIs
Configuration:
- Framing configuration
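A minimal sketch of JSON-LD framing using the pyld library; the context, frame and data are hypothetical:

```python
# Minimal JSON-LD framing sketch using pyld; the context, frame and data are
# hypothetical and only illustrate the deterministic nesting framing gives.
from pyld import jsonld

doc = {
    "@context": {"title": "http://purl.org/dc/terms/title",
                 "Publication": "http://example.org/Publication"},
    "@id": "http://example.org/pub1",
    "@type": "Publication",
    "title": "Testing",
}
frame = {
    "@context": doc["@context"],
    "@type": "Publication",    # frame by type to get one document per Publication
}
print(jsonld.frame(doc, frame))
```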
Loaders
ATTX loaders are components that handle generated data by either storing it internally or handing it over to a distribution component for public consumption. Loader DPUs only have inputs.
Replace Data Set
A DPU that replaces all the old content with new data. This DPU should be used when the source data comes from a data dump.
Input data can be configured to be RDF passed down from previous DPUs or a file generated by some other DPU.
Inputs:
- Data set metadata
- New data
- New data files
Update Data Set
The Update DPU uses the input data to create a SPARQL update request, which preserves any existing data that is not part of the update. It also handles situations where the value of a certain property has changed, without creating duplicate property values. For example:
Old data:
<subject> <dct:title> "Test"
SPARQL update:
DELETE { <subject> <dct:title> "Test" }
INSERT { <subject> <dct:title> "Testing" }
New data:
<subject> <dct:title> "Testing"
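A minimal sketch of building such an update request from old and new property values, using plain string templating and the WHERE clause a complete SPARQL update needs; this is not the DPU's actual implementation:

```python
# Sketch of building a DELETE/INSERT update like the example above; subject and
# property handling are simplified for illustration.
def build_update(subject: str, prop: str, old_value: str, new_value: str) -> str:
    return (
        f"DELETE {{ <{subject}> <{prop}> \"{old_value}\" }}\n"
        f"INSERT {{ <{subject}> <{prop}> \"{new_value}\" }}\n"
        f"WHERE  {{ <{subject}> <{prop}> \"{old_value}\" }}"
    )

print(build_update("http://example.org/subject",
                   "http://purl.org/dc/terms/title", "Test", "Testing"))
```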
Inputs:
- Data set metadata
- Deleted IDs
- Updated data
The target graph is generated based on contextual information.
Publish to API
This DPU can be used to distribute generated JSON documents through a REST API.
Input:
- dataset metadata
- configuration
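Conceptually this corresponds to pushing each generated JSON document to a REST endpoint, roughly as follows; the endpoint URL and document are placeholders:

```python
# Rough illustration of publishing a generated JSON document to a REST API;
# the endpoint URL and document content are hypothetical.
import requests

document = {"title": "Testing", "subjects": []}
response = requests.post("http://localhost:8080/api/documents", json=document, timeout=30)
response.raise_for_status()
```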
Publish to File
This DPU copies a file to a web server with the given path and file name.
Configuration:
- path
- File name
Semantic Broker End to End Usage Scenario
This section provides an end-to-end example that uses the ATTX Semantic Broker platform to harvest, link and distribute data maintained by three different sources.
Data sources are:
- Helda publication data
- Finto vocabularies
- HY organization registry
The first step consists of linking publication metadata with Finto vocabularies and organizational units using the subject and contributor fields respectively. The following step uses simple OWL reasoning on the SKOS vocabulary to add links from the linked subjects' parent concepts to the publication. This allows users, for example, to search for "Beverage" and find things described as "Tea". Finally, the distribution pipeline adds multilingual labels for the linked subjects.
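The parent-concept step can be illustrated with a small hand-rolled sketch; it stands in for the actual reasoning step, and all URIs except the SKOS namespace are hypothetical:

```python
# Sketch of the "search for Beverage, find Tea" idea: a publication linked to a
# narrow concept also gets links to that concept's skos:broader parents.
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/")   # hypothetical namespace

g = Graph().parse(data="""
@prefix ex:   <http://example.org/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
ex:tea skos:broader ex:beverages .
ex:pub1 ex:subject ex:tea .
""", format="turtle")

for pub, concept in list(g.subject_objects(EX.subject)):
    for parent in g.objects(concept, SKOS.broader):
        g.add((pub, EX.subject, parent))   # also link the broader (parent) concept

print(g.serialize(format="turtle"))
```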
Example of result document:
{
"title": "Testing",
"org": {
"name": "Helsinki University Library",
"orgUnitCode": "Test"
},
"subjects": [
{
"id": "http://www.yso.fi/onto/yso/p5127",
"label": {
"fi": "juomat",
"en": "beverages",
"sv": "drycker"
}
}
]
}
To be continued...