* DRAFT, currently being edited!*

The REST API, Fedora resource identifiers, The Resource index, Named Graphs and the semantic web.




Fedora in the context of the Semantic web and Linked Data

The basic premise of this proposal is to support exposing Fedora resources and their relationships in a way that is friendly to the Semantic Web and Linked Data. This means:

  • publishing dereferenceable http URIs for resources
  • publishing relationships between resources using these identifiers


The new REST API is a step forward in supporting these requirements, as we now have dereferenceable http URIs for Fedora resources.

What this proposal is not about

  • Implementation of a semweb publishing mechanism
    • "What" a Fedora object/datastream URI identifies depends on how an individual repository is built
      • Compound vs Complex objects
      • Aggregations, collections
  • An implementation of OAI-ORE
  • However, the proposal does provide mechanisms to support both of these

Current situation

  • Identifiers
    • Fedora resources have identifiers such as namespace:pid and namespace:pid/datastream, and their info:fedora/ URI forms (and similarly for disseminations).
      • These identifiers are effectively scoped to a repository installation
    • The new REST API provides globally dereferenceable http URIs for resources, but these are not "defined" as (canonical) identifiers for resources.
    • The existing "LITE" APIs also provide resolvable URI resource identifiers
  • Relationships
    • The resource index is a single "graph" containing relationships for all objects
    • Relationships must have either a Fedora object or datastream as the subject
      • Limits metadata expression to "flat" schemes such as DC
    • No support for "arbitrary" RDF datastreams in the resource index (eg for implementing additional RDF metadata schemes)
    • Resource identifiers used in relationships are of the info:fedora/ form
      • Difficult to "interpret" relationships outside of the scope of the repository
    • The "specification" of what relationships exist for an object is defined in imperative code

Some Principles

Some basic principles that should be followed by the recommendations below:

  • The definitive source of all information should be in Fedora objects
    • No direct manipulation of the triple store - the triple store is a cache/index
  • The info:fedora URI scheme should be retained
    • supports re-use of objects across different repositories
  • As far as possible Fedora should "work" without using a resource index.

Proposals

Deprecate the "LITE" APIs.

  • In the next release, respond to LITE API requests with HTTP status code 301 (Moved Permanently), redirecting to the equivalent REST API URI
  • Remove in subsequent release
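
A minimal sketch of how a LITE request path could be mapped to a REST location for the 301 response. The URL layouts shown (`/fedora/get/...` for LITE, `/fedora/objects/...` for REST) and the function name `redirect_lite_request` are assumptions made for illustration; the exact mapping would need to be agreed as part of this proposal.

```python
from typing import Optional, Tuple

# Assumed URL layouts, for illustration only:
#   LITE:  /fedora/get/{pid}      and  /fedora/get/{pid}/{dsid}
#   REST:  /fedora/objects/{pid}  and  /fedora/objects/{pid}/datastreams/{dsid}/content
LITE_PREFIX = "/fedora/get/"
REST_PREFIX = "/fedora/objects/"


def redirect_lite_request(path: str) -> Optional[Tuple[int, str]]:
    """Return (301, new_location) for a LITE path, or None if it is not handled."""
    if not path.startswith(LITE_PREFIX):
        return None
    parts = [p for p in path[len(LITE_PREFIX):].split("/") if p]
    if len(parts) == 1:
        # Object-level request
        return 301, REST_PREFIX + parts[0]
    if len(parts) == 2:
        # Datastream-level request
        pid, dsid = parts
        return 301, f"{REST_PREFIX}{pid}/datastreams/{dsid}/content"
    # Longer paths (e.g. dissemination calls) are left out of this sketch
    return None
```

A servlet filter or front-end proxy could apply such a mapping and set the `Location` header from the second element of the returned tuple.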

Define canonical dereferenceable URIs for Fedora resources

  • Using the new REST API URIs

Restructure the resource index as named graphs

A named graph is a set of triples named by a URI.
For instance, the relationships contained in the object myns:somepid could be identified as the graph <#myns:somepid>. Similarly the relationships expressed by the datastream myns:somepid/RELS-EXT could be identified as <#myns:somepid/RELS-EXT>.
Triple query languages such as SPARQL and iTQL support queries across multiple graphs. Using this to query relationships over the repository as a whole would be complex - it would be painful to have to assemble a list of named graphs to query against.
Mulgara is a quad store: relationships are effectively stored as <graph> <subject> <predicate> <object>. Currently all triples are stored in a single <#ri> graph.
Mulgara supports creating models (graphs) that are views of other models (graphs), eg

  • <#myns:somepid/RELS-EXT> containing RELS-EXT relationships
  • <#myns:somepid/DC> containing Dublin Core relationships
  • <#myns:somepid/properties> containing relationships created from object properties
  • <#myns:somepid> as a view defined as the union of the above - this view "contains" all of the relationships for the object
  • A view could be created for all relationships in the repository, as a union of all individual object views. This would be equivalent to the current <#ri> graph.


Thus, a hierarchy of named graphs could be created, for example:

  • <#ri> - a view containing:
    • <#some:pid> - object graph for some:pid, a view containing:
      • <#some:pid/properties> - graph containing object properties
      • <#some:pid/datastreams> - a view containing:
        • <#some:pid/datastreams/rels-ext> - graph containing triples from rels-ext
        • <#some:pid/datastreams/rels-int> - graph containing triples from rels-int
        • <#some:pid/datastreams/dc> - graph containing triples from DC
        • <#some:pid/datastreams/{rdf datastream}> - graph containing triples from some other rdf datastream
        • <#some:pid/datastreams/{dsid}/properties> - graph containing properties of datastream (state, last modified, etc)
    • <#some:otherpid> - object graph for some:otherpid, a view containing:
      • <#some:otherpid/properties> - etc
      • <#some:otherpid/datastreams> - etc

All graphs should be "rooted" in the above structure, there should be no means of creating graphs other than by creating objects and datastreams.
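
The hierarchy above can be sketched as an in-memory model in which a graph is a set of triples and a view is a list of the graphs it unions. This is an illustration of the structure only, not of how Mulgara implements views; the names `graphs`, `add_object` and `resolve` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple, Union

Triple = Tuple[str, str, str]

# Each entry is either a concrete graph (a set of triples)
# or a view (a list naming the graphs it is the union of).
graphs: Dict[str, Union[Set[Triple], List[str]]] = {"#ri": []}


def add_object(pid: str, properties: Iterable[Triple],
               datastreams: Dict[str, Iterable[Triple]]) -> None:
    """Create the graphs and views for one object and root them under #ri."""
    obj = f"#{pid}"
    graphs[f"{obj}/properties"] = set(properties)
    members = []
    for dsid, triples in datastreams.items():
        name = f"{obj}/datastreams/{dsid.lower()}"
        graphs[name] = set(triples)
        members.append(name)
    graphs[f"{obj}/datastreams"] = members
    graphs[obj] = [f"{obj}/properties", f"{obj}/datastreams"]
    ri = graphs["#ri"]
    if obj not in ri:
        ri.append(obj)


def resolve(name: str) -> Set[Triple]:
    """Return the triples visible through a graph, expanding views recursively."""
    entry = graphs[name]
    if isinstance(entry, set):
        return set(entry)
    result: Set[Triple] = set()
    for member in entry:
        result |= resolve(member)
    return result
```

Because graphs are only ever created through `add_object`, every graph in this model is rooted under `#ri`, and `resolve("#ri")` plays the role of the current single resource index graph.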

Why do this?

When an object is created (updated, deleted), the object's relationships are propagated to the triple store. If two objects are created expressing an identical relationship, a single triple will be created in the resource index. If one of those objects is then deleted, the triple will be deleted from the triple store even though it is still being asserted by the other object, and the resource index will no longer be an accurate reflection of the triples in the repository. This is the reason for the current restriction on RELS-EXT and RELS-INT that subjects must be the containing Fedora object or its datastreams: it prevents two objects from asserting the same relationship.
With named graphs, relationships created by different objects would be in different graphs. Deleting one object would remove the graph for that object, but the graph for a different object asserting the same relationship would remain, so the resource index would still be an accurate reflection of the triples in the repository.
This would therefore support indexing of arbitrary RDF metadata datastreams in the resource index, for instance to support metadata schemes that are not "flat".
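
The deletion problem described above can be reproduced with a toy model. The sketch below compares a single-graph index with a per-object named-graph index; the class names and the example triple are illustrative assumptions and do not correspond to actual Fedora or Mulgara classes.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


class SingleGraphIndex:
    """All triples live in one graph, so duplicate assertions collapse into one triple."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def add_object(self, pid: str, triples: Iterable[Triple]) -> None:
        self.triples |= set(triples)

    def delete_object(self, pid: str, triples: Iterable[Triple]) -> None:
        # Removes the triples even if another object still asserts them.
        self.triples -= set(triples)


class NamedGraphIndex:
    """Each object's triples live in their own graph, named after the object."""

    def __init__(self) -> None:
        self.graphs: Dict[str, Set[Triple]] = {}

    def add_object(self, pid: str, triples: Iterable[Triple]) -> None:
        self.graphs[f"#{pid}"] = set(triples)

    def delete_object(self, pid: str) -> None:
        self.graphs.pop(f"#{pid}", None)

    def all_triples(self) -> Set[Triple]:
        # Equivalent to a repository-wide view: the union of all object graphs.
        return set().union(*self.graphs.values())


# Hypothetical example: two objects assert the same relationship.
shared = ("info:fedora/demo:3",
          "info:fedora/fedora-system:def/relations-external#isMemberOf",
          "info:fedora/demo:collection")

single = SingleGraphIndex()
single.add_object("demo:1", [shared])
single.add_object("demo:2", [shared])
single.delete_object("demo:1", [shared])
lost_in_single_graph = shared not in single.triples

named = NamedGraphIndex()
named.add_object("demo:1", [shared])
named.add_object("demo:2", [shared])
named.delete_object("demo:1")
kept_in_named_graphs = shared in named.all_triples()
```

In the single-graph case the triple disappears although demo:2 still asserts it; in the named-graph case it survives in the graph for demo:2.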

Questions and issues

  • The graph hierarchy to use - how granular? Start with something simple?
  • Mapping between resource identifiers and graph names
  • Separation of "core" relationships from "user-defined" relationships into different overall views? If the intention of the resource index is to store relationships between objects, we may not want to pollute that with other relationships, eg from arbitrary RDF datastreams
    • Relationships about the object and its datastreams - in <#ri>
    • Relationships from RELS-EXT, RELS-INT, DC - in <#ri>
    • Relationships from arbitrary RDF datastreams/disseminators - in <#riUser>
    • <#riFull> as a union of <#ri> and <#riUser>
  • Performance. Need to evaluate query performance over a network of named graphs vs storing all relationships in one single graph
  • Triple store support: Mulgara supports named graphs and views, what about other triple stores? MPTStore?
  • Impact on Mulgara's free-text index, do we create a parallel structure of free text graphs? Does Mulgara even support this?

Declarative specification of triples to create in the resource index

Triples are currently created for

  • object properties
  • datastream properties
  • reserved datastreams that contain RDF (RELS-*)
  • reserved datastreams that are translated to RDF (DC)
  • relationships between objects and their datastreams and disseminators
  • relationships between objects and their content models


The "specification" of what triples get generated is largely in imperative Java code, both in terms of the individual triples and which datastreams generate triples.
In the future we may wish to allow creation of triples from

  • arbitrary RDF datastreams
  • arbitrary XML datastreams to be "lifted" to triples
  • disseminators serving RDF


To support a flexible and extensible approach, we could define the generation of triples using content models (system and user) and a declarative approach for specifying triples (XSLT, GRDDL[1]).

  • System content model disseminators for generating RDF for
    • Object and datastream properties triples (from the object's serialisation/FOXML)
    • Relationships between objects, datastreams and disseminators (from the object's serialisation/FOXML)
    • XML datastreams (DC)
  • User content models specifying
    • Additional arbitrary RDF datastreams to index
    • RDF disseminators to index
    • Conversion patterns for other XML datastreams and disseminators


Updating of the resource index could then take place by querying the disseminations and datastreams specified by the system and user content models when an object is created, updated or deleted.
[1] GRDDL is a mechanism for Gleaning Resource Descriptions from Dialects of Languages. It is a technique for obtaining RDF data from XML documents and in particular XHTML pages: GRDDL Primer http://www.w3.org/TR/grddl-primer/
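
To illustrate what a declarative specification might look like, the sketch below "lifts" a Dublin Core XML datastream to triples using a rule table instead of imperative per-field code. A plain Python dictionary stands in here for an XSLT or GRDDL transformation to keep the example dependency-free; the rule table `DC_RULES`, the function `lift_to_triples` and the sample record are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

DC_NS = "http://purl.org/dc/elements/1.1/"

# Declarative part: which XML element yields which predicate.
# A content model could carry a table like this (or an XSLT) as a datastream.
DC_RULES: Dict[str, str] = {
    f"{{{DC_NS}}}title": DC_NS + "title",
    f"{{{DC_NS}}}creator": DC_NS + "creator",
    f"{{{DC_NS}}}identifier": DC_NS + "identifier",
}


def lift_to_triples(subject_uri: str, xml_text: str,
                    rules: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Generic engine: emit one triple per element that matches a rule."""
    root = ET.fromstring(xml_text)
    triples = []
    for element in root.iter():
        predicate = rules.get(element.tag)
        if predicate and element.text and element.text.strip():
            triples.append((subject_uri, predicate, element.text.strip()))
    return triples


# Hypothetical DC datastream content
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example object</dc:title>
  <dc:identifier>demo:1</dc:identifier>
</oai_dc:dc>"""

dc_triples = lift_to_triples("info:fedora/demo:1", SAMPLE_DC, DC_RULES)
```

Supporting a new XML datastream would then mean supplying another rule table (or stylesheet) rather than changing the indexing code.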

Questions and Issues

  • How to define the above using system and user content models
  • How to specify the mapping between XML and RDF

Extend the REST API to incorporate relationships

The REST API does not currently implement methods for disseminating and managing relationships.
API methods should be implemented for querying and managing relationships.
For example

  • GET /objects/{pid}/relationships - return RDF for all relationships for the object
  • GET /objects/{pid}/datastreams/DC/relationships - return RDF for the DC datastream


Alternatives to explicit "relationships" URIs could be

  • Use content negotiation, eg Accept application/rdf+xml and use the existing REST URIs
  • Use a "format" URL query string, eg format=rdf
  • Or both...
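
The three styles of request discussed above can be compared side by side. The sketch below only builds the requests and does not send them; the base URL, the `format=rdf` parameter value and the function name `relationships_request` are assumptions, since none of these endpoints exist yet.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:8080/fedora"  # hypothetical repository base URL


def relationships_request(pid: str, style: str = "explicit") -> Request:
    """Build (but do not send) a GET request for an object's relationships."""
    object_url = f"{BASE}/objects/{pid}"
    if style == "explicit":
        # Dedicated "relationships" URI
        return Request(f"{object_url}/relationships", method="GET")
    if style == "conneg":
        # Existing object URI, RDF selected by content negotiation
        return Request(object_url,
                       headers={"Accept": "application/rdf+xml"},
                       method="GET")
    if style == "query":
        # Existing object URI, RDF selected by a query string parameter
        return Request(f"{object_url}?{urlencode({'format': 'rdf'})}",
                       method="GET")
    raise ValueError(f"unknown style: {style}")
```

With content negotiation the resource URI stays the same for all representations, which fits the goal of a single canonical http URI per resource; the explicit and query string forms are easier to use from a browser.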


Modifications could be specified by

  • POST a set of triples to create new ones
  • DELETE a set of triples to be deleted
  • PUT a set of modifications to perform, eg using (a subset of) SPARQL Update [1]


Additionally, or alternatively, "writeable disseminators" could be provided as a generic mechanism to implement this, eg PUT a SPARQL Update to /objects/{pid}/methods/{sDefPid}/relationships?datastream=RELS-EXT
All of the relationship API methods should operate directly on Fedora objects to remove dependency on the resource index - relationship GET methods should query the object directly rather than issuing RI queries.
[1] SPARQL Update - A language for updating RDF graphs: http://www.w3.org/Submission/SPARQL-Update/

Questions and issues

  • REST endpoints to use - explicit relationships URIs vs content negotiation vs URL query string
  • Relationships update specification (SPARQL Update, or ...)
  • Supporting "generic" updates, eg repository-wide relationships methods and methods operating on an object as a whole
    • Subject and predicate can be used to determine what to update for object properties, datastream properties, Dublin Core
    • RELS-EXT, RELS-INT and arbitrary datastreams present a challenge. A triple with a Fedora object as a subject could be stored in RELS-EXT or in an arbitrary RDF datastream. Do we restrict fedora-model and fedora-system predicates to RELS-EXT and RELS-INT?
  • Supporting updates to XML datastreams that get converted to RDF
    • eg updating DC through relationship API methods

Support for dereferenceable http URI resource identifiers in relationships

Fedora resources are currently identified using the info:fedora namespace. If resource identifiers are exposed as dereferenceable http URIs using the REST API URIs, it would be useful to support these identifiers in relationships. That is, it should be possible to query and manipulate relationships using both the info:fedora URIs and the http REST URIs for Fedora resources.

REST API

  • Provide the ability to query and manipulate relationships using the REST API http URIs
    • Maybe a URL query string parameter? scope=local for info:fedora, scope=global for http URIs?

RISearch

  • Query using either info:fedora URIs or the REST API http URIs
  • Return results using either info:fedora URIs or the REST API http URIs
  • Some form of query re-writing and result set rewriting?
  • RISearch query string parameter to determine the form of identifiers to use?
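
One way to approach query and result set rewriting is a pair of functions that translate between the two identifier forms. The sketch below assumes a REST base URL and a datastream URL layout that are not defined by this proposal; the function names are hypothetical. Note that vocabulary URIs in the info:fedora/fedora-system:def/ namespace are predicates rather than repository resources, so they are left untouched.

```python
from typing import Callable, Iterable, List, Tuple

INFO_PREFIX = "info:fedora/"
VOCAB_PREFIX = "info:fedora/fedora-system:def/"         # predicates, not resources
REST_BASE = "http://localhost:8080/fedora/objects/"     # hypothetical base URL


def to_http(uri: str) -> str:
    """Rewrite an info:fedora resource URI to its (assumed) REST http URI."""
    if not uri.startswith(INFO_PREFIX) or uri.startswith(VOCAB_PREFIX):
        return uri
    pid, _, dsid = uri[len(INFO_PREFIX):].partition("/")
    if dsid:
        return f"{REST_BASE}{pid}/datastreams/{dsid}"
    return REST_BASE + pid


def to_info(uri: str) -> str:
    """Rewrite a REST http URI back to its info:fedora form."""
    if not uri.startswith(REST_BASE):
        return uri
    pid, sep, dsid = uri[len(REST_BASE):].partition("/datastreams/")
    return INFO_PREFIX + pid + ("/" + dsid if sep else "")


def rewrite_results(rows: Iterable[Tuple[str, ...]],
                    convert: Callable[[str], str] = to_http) -> List[Tuple[str, ...]]:
    """Apply an identifier conversion to every value in a result set."""
    return [tuple(convert(value) for value in row) for row in rows]
```

An RISearch query string parameter could then select which conversion, if any, is applied to incoming query terms and outgoing results.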


A Spanner (Wrench) in the works...

Fedora repositories generally sit behind some form of user interface application.
These applications will (in some cases) expose their own URLs for accessing Fedora resources.
Should we instead provide mechanisms that allow these application URLs to be exposed as the canonical http URIs for Fedora resources?
