Last release: 0.5 (2012/03/28)
License: Apache License, Version 2.0
Affiliation: Freie Universität Berlin
LDIF - Linked Data Integration Framework
The Web of Linked Data grows rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain data sets such as DBpedia or Freebase. The use of different vocabularies and of URI aliases makes it hard to pose SPARQL queries against Web data that originates from multiple sources. To ease the use of Web data in an application context, it is therefore advisable to translate the data to a single target vocabulary (vocabulary mapping) and to replace URI aliases with a single target URI on the client side (identity resolution) before querying the data with SPARQL. LDIF translates heterogeneous Linked Data from the Web into a clean, local target representation while keeping track of data provenance.
The LDIF integration pipeline consists of the following steps:
- Collect Data: Access modules locally replicate data sets via file download, crawling or SPARQL.
- Map to Schema: An expressive mapping language allows for translating data from the various vocabularies that are used on the Web into a consistent, local target vocabulary.
- Resolve Identities: An identity resolution component discovers URI aliases in the input data and replaces them with a single target URI based on user-provided matching heuristics.
- Quality Assessment and Data Fusion: A data cleansing component filters data according to different quality assessment policies and fuses data according to different conflict resolution methods.
- Output: LDIF outputs the integrated data, which can be written to a file or to a QuadStore. For provenance tracking, LDIF employs the Named Graphs data model.
The LDIF Framework consists of a Scheduler, a Data Import component and an Integration component with a set of pluggable modules. These modules are organized as data input, data transformation and data output modules.
The Scheduler is used for triggering pending data import jobs or integration jobs. It is configured with an XML document and offers several ways to express when and how often a certain job should be executed.
This component is useful when you want to load external data or run the integration periodically; otherwise, you can simply run the integration component directly.
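The scheduler configuration is an XML document shipped with the LDIF distribution. Purely as an illustration of the idea (the element names below are assumptions, not the actual schema, so check the LDIF documentation for the real format), such a document points the scheduler at the import job configurations, the integration job to run and the local dump location:

    <!-- Illustrative sketch only; element names are assumptions, not LDIF's actual schema. -->
    <scheduler>
      <!-- directory containing the import job configurations (dump, crawl, SPARQL) -->
      <importJobs>importJobs/</importJobs>
      <!-- integration job to run after the local caches have been refreshed -->
      <integrationJob>integration-config.xml</integrationJob>
      <!-- where downloaded data and intermediate results are cached locally -->
      <dumpLocation>dumps/</dumpLocation>
    </scheduler>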
LDIF provides access modules for replicating data sets locally via file download, crawling or SPARQL. These different types of import jobs generate provenance metadata, which is tracked throughout the integration process. Import jobs are managed by a scheduler that can be configured to refresh the local cache for each source at a given interval (hourly, daily, etc.).
- Triple/Quad Dump Import
The simplest way to obtain a local replica of a data set from the Web of Data is to download a file containing the data set. The triple/quad dump import does exactly this, with the difference that LDIF generates a provenance graph for a triple dump import, whereas it takes the graphs given in a quad dump import as provenance graphs. Currently supported formats are RDF/XML, N-Triples, N-Quads and Turtle.
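To illustrate the difference with a made-up example in N-Quads syntax (all URIs are invented): in a quad dump, every statement already names its graph in the fourth position, and LDIF keeps these graphs as provenance graphs, whereas for a triple dump LDIF assigns a generated provenance graph to the imported statements.

    # Quad dump: the source already supplies the graph URI (kept as provenance graph).
    <http://example.org/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin" <http://example.org/graphs/cityDump> .

    # Triple dump: the same statement arrives without a graph; LDIF assigns a generated one, e.g.
    <http://example.org/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin" <http://example.org/ldif/provenanceGraph1> .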
- Crawler Import
Data sets that can only be accessed via dereferenceable URIs are good candidates for a crawler. LDIF therefore integrates LDSpider for crawl import jobs. The configuration files for crawl import jobs are specified in the configuration section. Each crawled URI is put into a separate named graph for provenance tracking.
- SPARQL Import
Data sources that can be accessed via SPARQL are replicated by LDIF's SPARQL access module. The relevant data to be queried can be further restricted in the configuration file for a SPARQL import job. Data from each SPARQL import job is tracked in its own named graph.
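The restriction itself is expressed against the remote endpoint with ordinary SPARQL graph patterns. A minimal sketch of such a scoping query (class and property URIs are invented for the example; the exact configuration syntax is described in the LDIF documentation) could look like this:

    # Hypothetical scoping of a SPARQL import job: replicate only drug resources
    # and the statements attached to them.
    CONSTRUCT { ?s ?p ?o }
    WHERE {
      ?s a <http://example.org/schema/Drug> .
      ?s ?p ?o .
    }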
Integration Runtime Environment
The integration component manages the data flow between the various stages/modules, the caching of the intermediate results and the execution of the different modules for each stage.
The integration component expects input data to be represented as Named Graphs and stored locally in N-Quads format; the Web access modules convert any imported data into N-Quads.
LDIF provides transformation modules for vocabulary mapping and identity resolution.
- R2R Data Translation
LDIF employs the R2R Framework to translate Web data that is represented using terms from different vocabularies into a single target vocabulary. Vocabulary mappings are expressed using the R2R Mapping Language. The language provides for simple transformations as well as for more complex structural transformations and property value transformations such as normalizing different units of measurement or complex string manipulations. The syntax of the R2R Mapping Language is very similar to the SPARQL query language, which flattens the learning curve. The expressivity of the language has enabled us to deal with all requirements that we have encountered so far when translating Linked Data from the Web into a target representation. Simple class/property-renaming mappings, which often form the majority in an integration use case, can also be expressed in OWL/RDFS (e.g. ns1:class rdfs:subClassOf ns2:clazz).
An overview and examples for mappings are given on the R2R website. The specification and user manual are provided as a separate document.
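For instance, the class renaming mentioned above can be stated directly in RDFS, and an R2R mapping for the same case is itself written as RDF. The following Turtle sketch uses invented example namespaces, and the r2r: property names are as remembered from the R2R documentation, so verify them against the specification before use.

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix r2r:  <http://www4.wiwiss.fu-berlin.de/bizer/r2r/> .
    @prefix ns1:  <http://example.org/vocab1/> .
    @prefix ns2:  <http://example.org/vocab2/> .
    @prefix mp:   <http://example.org/mappings/> .

    # Simple renaming, expressible in plain RDFS:
    ns1:class rdfs:subClassOf ns2:clazz .

    # Sketch of the corresponding R2R mapping (property names to be checked against the R2R spec):
    mp:classMapping
        a r2r:Mapping ;
        r2r:prefixDefinitions "ns1: <http://example.org/vocab1/> . ns2: <http://example.org/vocab2/>" ;
        r2r:sourcePattern "?SUBJ a ns1:class" ;
        r2r:targetPattern "?SUBJ a ns2:clazz" .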
- Silk Identity Resolution
LDIF employs the Silk Link Discovery Framework to find different URIs that are used within different data sources to identify the same real-world entity. For each set of duplicates which have been identified by Silk, LDIF replaces all URI aliases with a single target URI within the output data. In addition, it adds owl:sameAs links pointing at the original URIs, which makes it possible for applications to refer back to the data sources on the Web. If the LDIF input data already contains owl:sameAs links, the referenced URIs are normalized accordingly. Silk is a flexible identity resolution framework that allows the user to specify identity resolution heuristics which combine different types of matchers using the declarative Silk Link Specification Language.
An overview and examples can be found on the Silk website.
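As a made-up illustration of the effect on the output data (all URIs are invented): if Silk decides that two input URIs denote the same entity, the statements about that entity are rewritten to the chosen target URI and one owl:sameAs back-link is added for each replaced alias.

    # Statements rewritten to the single target URI:
    <http://integrated.example.org/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin" .

    # Back-links to the original URI aliases found in the sources:
    <http://integrated.example.org/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <http://source-a.example.org/resource/Berlin> .
    <http://integrated.example.org/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <http://source-b.example.org/city/berlin> .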
- Data Quality Assessment and Fusion
LDIF employs Sieve to provide data quality evaluation and cleansing. The procedure consists of two separate steps. First, the Data Quality Assessment module assigns each Named Graph within the processed data one or several quality scores based on user-configurable quality assessment policies. These policies combine an assessment function with the definition of the quality-related meta-information that should be used in the assessment process. Then the Data Fusion module takes the quality scores as input and resolves data conflicts based on the assessment scores. The applied fusion functions can be configured at the property level. Sieve provides a basic set of quality assessment functions and fusion functions as well as an open interface for the implementation of additional domain-specific functions.
An overview and examples can be found on the Sieve website.
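To make the two steps concrete, the following sketch shows what such a policy roughly looks like; the element and function names follow the Sieve documentation as remembered (and the vocabulary URIs are only examples), so verify them on the Sieve website before use.

    <!-- Hedged sketch of a Sieve configuration; element and function names should be verified. -->
    <Sieve>
      <QualityAssessment>
        <!-- score each named graph by how recently its source was updated -->
        <AssessmentMetric id="sieve:recency">
          <ScoringFunction class="TimeCloseness">
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
      </QualityAssessment>
      <Fusion>
        <Class name="dbpedia:Settlement">
          <!-- for conflicting population values, keep the value from the highest-scored graph -->
          <Property name="dbpedia:populationTotal">
            <FusionFunction class="KeepFirst" metric="sieve:recency"/>
          </Property>
        </Class>
      </Fusion>
    </Sieve>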
LDIF's final and intermediate results can be written to a file or to a QuadStore.
- File Output
Two file output formats are currently supported by LDIF:
N-Quads – dumps the data into a single N-Quads file. This file contains the translated versions of all graphs from the input graph set as well as the content of the provenance graph and the sameAs links;
N-Triples – dumps the data into a single N-Triples file. Since the connection to the provenance data is lost once the data is output as N-Triples, the provenance data is discarded instead of being output.
- QuadStore Output
Data is written to a QuadStore as a SPARQL/Update stream. Here is a list of the supported stores.
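The requests in that stream are standard SPARQL 1.1 Update operations; conceptually, each output quad ends up in something like the following (graph and resource URIs are invented, and the exact batching is an implementation detail of the output module):

    # Sketch of the kind of SPARQL/Update request sent to the QuadStore.
    INSERT DATA {
      GRAPH <http://example.org/ldif/provenanceGraph1> {
        <http://integrated.example.org/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin" .
      }
    }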
The Runtime Environment for the integration component manages the data flow between the various stages/modules and the caching of the intermediate results. In order to parallelize the data processing, the data is partitioned into entities prior to supplying it to a transformation module. An entity represents a Web resource together with all data that is required by a transformation module to process this resource. Entities consist of one or more graph paths and include a graph URI for each node. Each transformation module specifies which paths should be included in the entities it processes. Splitting the work into such fine-granular entities allows LDIF to parallelize it. LDIF provides three implementations of the Runtime Environment: the in-memory version, the RDF store version and the Hadoop version. Depending on the size of your data set and the available computing resources, you can choose the runtime environment that best fits your use case.
- Single machine / In-memory
The in-memory implementation keeps all intermediate results in memory. It is fast, but its scalability is limited by the amount of available memory. For instance, in one of our experiments, integrating 25 million triples required 5 GB of memory. Parallelization is achieved by distributing the work (entities) to multiple threads.
- Single machine / RDF Store
This implementation of the runtime environment uses a Jena TDB RDF store to hold intermediate results. The communication between the RDF store and the runtime environment is realized in the form of SPARQL queries. This runtime environment allows you to process data sets that don't fit into memory. The downside is that the RDF store implementation is slower than the in-memory implementation.
- Cluster / Hadoop
This implementation of the runtime environment allows you to parallelize the work across multiple machines using Hadoop. Each phase in the integration flow has been ported to be executable on a Hadoop cluster. Some initial performance figures comparing the run times of the in-memory, quad store and Hadoop versions on different data set sizes are provided on the Benchmark Wiki page.