Introduction

Exchanging repository contents

Most sites on the Internet are oriented towards human consumption. While HTML may be a good format for creating websites, it is not a good format for exporting data in a way a computer can work with. Like most repository software, DSpace supports OAI-PMH as an interface to export the stored data. While OAI-PMH is well known in the field of repositories, it is rarely known elsewhere (e.g. Google retired its support for OAI-PMH in 2008). The Semantic Web is a generic approach to publishing data on the Internet together with information about its semantics. The W3C released standards like RDF and SPARQL for publishing structured data on the Web in a way computers can easily work with. The data stored in repositories is particularly suited to be used in the Semantic Web, as metadata is already available. It doesn’t have to be generated or entered manually for publication as Linked Data. For most repositories, at least for Open Access repositories, it is quite important to share their stored content. Linked Data is a big opportunity for repositories to present their content in a way that it can easily be accessed, interlinked and (re)used.

Terminology

We don't want to give a full introduction to the Semantic Web and its technologies here, as many can be found on the web. Nevertheless, we want to give a short glossary of the terms used most often in this context to make the following documentation more readable.

Semantic Web	The term "Semantic Web" refers to the part of the Internet containing Linked Data. As in the World Wide Web, the Semantic Web is created by links between the data.
Linked Data, Linked Open Data	Linked Data is used for data in RDF, following the Linked Data Principles. The Linked Data Principles describe the behavior expected of data publishers to ensure that the published data is easy to find, easy to retrieve, can be linked easily and links to other data as well. Linked Open Data is Linked Data published using an open license. Technically there is no difference between Linked Data and Linked Open Data (often abbreviated as LOD); it is only a question of the license used to publish.
RDF, RDF/XML, Turtle, N-Triples, N3-Notation	RDF is an acronym for Resource Description Framework, a metadata model. Don't think of RDF as a format, as it is a model. Nevertheless, there are different formats to serialize data following RDF. RDF/XML, Turtle, N-Triples and N3-Notation are probably the best-known formats to serialize data in RDF.
Triple Store	A triple store is a database to natively store data following the RDF approach.
SPARQL	The SPARQL Protocol and RDF Query Language is a protocol to query triple stores. Since SPARQL version 1.1 it can be used to manipulate triple stores as well, i.e. to store, delete or update data in triple stores.
SPARQL endpoint	A SPARQL endpoint is a SPARQL interface of a triple store. Since SPARQL 1.1, a SPARQL endpoint can be either read-only, allowing only queries on the stored data, or readable and writable, allowing the stored data to be modified as well.
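
To illustrate the difference between the RDF model and its serializations, the following minimal sketch prints one statement (subject, predicate, object) as an N-Triples line. The helper name ntriple_literal, the subject URI and the title are made up for illustration; only the predicate is the real Dublin Core Terms title property. The sketch does not escape special characters in the literal, which a real serializer would have to do.

```shell
# Hypothetical helper, not part of DSpace: formats one RDF statement with a
# plain literal object as an N-Triples line.
# Limitation: quotes and backslashes in the literal are NOT escaped.
ntriple_literal() {
  # $1 = subject URI, $2 = predicate URI, $3 = literal value
  printf '<%s> <%s> "%s" .\n' "$1" "$2" "$3"
}

# Example statement: "the item identified by this (made-up) handle URI has
# the title 'Example item'".
ntriple_literal \
  'http://hd.handle.net/123456789/42' \
  'http://purl.org/dc/terms/title' \
  'Example item'
```

The same statement could be written in Turtle or RDF/XML without changing its meaning, which is why RDF is described as a model and not as a format.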

Linked (Open) Data Support within DSpace

Starting with version 5.0, DSpace can provide stored contents as Linked (Open) Data.

Architecture / Concept

To publish content stored in DSpace as Linked (Open) Data, the data has to be converted into RDF. The conversion into RDF has to be configurable, as different DSpace instances may use different metadata schemata, different persistent identifiers (DOI, Handle, ...) and so on. Depending on the content to convert, the configuration and other parameters, the conversion may be time and performance intensive. Repository content is read much more often than it is created, deleted or changed, as the main purpose of repositories is to safely store their contents. For these reasons, content stored within DSpace is stored in a triple store after conversion. The triple store serves as a cache and provides a SPARQL endpoint to make the converted data accessible using SPARQL. The conversion is triggered by a consumer of the DSpace event system and can be started manually using a command line interface (both are documented below). The triple store can be deleted at any time, as all data stored in the triple store can be restored from the contents stored elsewhere in DSpace (in the assetstore(s) and the database).

Besides the SPARQL endpoint, the data should be published as RDF serializations as well. With dspace-rdf, DSpace offers a module that loads converted data from the triple store and provides it as an RDF serialization (it currently supports RDF/XML, Turtle and N-Triples). Repositories use Persistent Identifiers to make content citable and to address contents. Following the Linked Data Principles, DSpace uses Persistent Identifiers in the form of HTTP(S) URIs, converting a handle to http://hd.handle.net/<handle> and a DOI to http://dx.doi.org/<doi>. Bringing it all together, the Linked Data support of DSpace extends all three layers: the storage layer with a triple store, the business logic with classes to convert stored contents into RDF, and the application layer with a module to publish RDF serializations. Just as you can use DSpace with either Oracle or PostgreSQL, you may choose between different triple stores. There are only two requirements: the triple store must support the SPARQL 1.1 Query Language and the SPARQL 1.1 Graph Store HTTP Protocol, as DSpace uses them to store, update and delete converted data in the triple store, and the triple store should provide a public read-only SPARQL endpoint.
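
The mapping from persistent identifiers to HTTP URIs described above can be sketched in a few lines of shell. This is an illustration only and not how DSpace implements it internally; the function name pid_to_uri and the sample handle and DOI are hypothetical.

```shell
# Hypothetical helper, not part of DSpace: maps a persistent identifier to
# the HTTP URI form described in the text.
pid_to_uri() {
  # $1 = identifier type ("handle" or "doi"), $2 = the identifier itself
  case "$1" in
    handle) printf 'http://hd.handle.net/%s\n' "$2" ;;
    doi)    printf 'http://dx.doi.org/%s\n' "$2" ;;
    *)      printf 'unknown identifier type: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Sample identifiers (made up for illustration).
pid_to_uri handle 123456789/42   # prints http://hd.handle.net/123456789/42
pid_to_uri doi 10.1000/182       # prints http://dx.doi.org/10.1000/182
```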

Store public data only in the triple store!

The triple store should contain only public data, as the access restrictions of DSpace do not apply to the SPARQL endpoint. For this reason DSpace converts only archived, discoverable (non-private) Items, Collections and Communities that are readable for anonymous users. Please consider this while configuring and/or extending DSpace's Linked Data support.

The package org.dspace.rdf.conversion contains the classes used to convert the repository's content to RDF. The conversion itself is done by plugins. The interface org.dspace.rdf.conversion.ConverterPlugin is quite simple, so take a look at it if you know Java and want to extend the conversion. The one important requirement is that plugins must only create RDF that can be made publicly available, because the triple store exposes it through a SPARQL endpoint to which DSpace's access restrictions do not apply. Plugins converting metadata should check whether a specific metadata field needs to be protected or not (see org.dspace.app.util.MetadataExposure on how to check that). The MetadataConverterPlugin is heavily configurable (see below) and is used to convert metadata of Items. The StaticDSOConverterPlugin can be used to add static RDF triples (see below). The SimpleDSORelationsConverterPlugin creates links between items and collections, collections and communities, subcommunities and their parents, and between top-level communities and the information representing the repository itself.

As different repositories use different persistent identifiers to address their content, different algorithms to create the URIs used within the converted data can be implemented. Currently HTTP(S) URIs of the repository (called local URIs), handles and DOIs can be used. See the configuration part of this document for further information. If you want to add another algorithm, take a look at the interface org.dspace.rdf.storage.URIGenerator.

Install a Triple Store

In addition to a normal DSpace installation you have to install a triple store. You can use any triple store that supports the SPARQL 1.1 Query Language and the SPARQL 1.1 Graph Store HTTP Protocol. If you do not have one yet, you can use Apache Fuseki. Download Fuseki from its official download page and unpack the downloaded archive. The archive contains several scripts to start Fuseki. Use the start script appropriate for your operating system with the options '--localhost --config=[dspace-install]/config/modules/rdf/fuseki-assembler.ttl'. Instead of changing into the directory you unpacked Fuseki to, you may set the variable FUSEKI_HOME. If you're using Linux, unpacked Fuseki to /usr/local/jena-fuseki-1.0.1 and installed DSpace to /var/dspace, this would look like this:

 export FUSEKI_HOME=/usr/local/jena-fuseki-1.0.1 ; $FUSEKI_HOME/fuseki-server --localhost --config /var/dspace/config/modules/rdf/fuseki-assembler.ttl

Fuseki's archive also contains a script to start Fuseki automatically at system startup.

The configuration provided with DSpace makes Fuseki store the files for the triple store under [dspace-install]/triplestore. Using this configuration, Fuseki provides three SPARQL endpoints: two read-only endpoints and one that can be used to change the data of the triple store. You should not expose Fuseki directly to the internet with this configuration, as this would allow anyone to delete, change or add information in the triple store. The option --localhost tells Fuseki to listen only on the loopback device. You can use Apache mod_proxy or any other web or proxy server to make the read-only SPARQL endpoint accessible from the internet. With the configuration described, Fuseki listens on port 3030 using HTTP. Using the address http://localhost:3030/ you can connect to the Fuseki Web UI. http://localhost:3030/data addresses a writable SPARQL 1.1 Graph Store HTTP Protocol endpoint, and http://localhost:3030/get a read-only one. Under http://localhost:3030/sparql a read-only SPARQL 1.1 Query Language endpoint can be found. The writable endpoint (http://localhost:3030/data) must not be accessible from the internet, while the read-only query endpoint (http://localhost:3030/sparql) should be publicly accessible.
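
Once Fuseki is running, the read-only query endpoint can be tried out from the command line. The following is a minimal sketch that assumes curl is installed and that the endpoint is reachable at the address given above; the function name sparql_select is hypothetical. It sends the query as an HTTP GET request with a URL-encoded query parameter, as defined by the SPARQL 1.1 Protocol, and asks for the result set as JSON.

```shell
# Assumption: Fuseki runs with the configuration described above, so the
# read-only query endpoint is reachable at this address.
SPARQL_ENDPOINT="http://localhost:3030/sparql"

# Hypothetical helper: sends the SPARQL query given as $1 via HTTP GET and
# prints the JSON result set returned by the endpoint.
sparql_select() {
  curl --silent --fail --get \
    --header 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$1" \
    "$SPARQL_ENDPOINT"
}

# Example call (requires a running triple store):
# sparql_select 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10'
```

Depending on how the triple store organizes its graphs, the example query may return no rows unless it is wrapped in a GRAPH clause, for example SELECT ?g ?s ?p ?o WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 10.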
