Title: DSpace2 storage-fedora module implementation (Initially: Fedora DAO implementation for DSpace, beta release)

Student: Andrius Blažinskas

Mentor: Richard Rodgers

About

The DSpace2 storage-fedora module is a storage module that allows DSpace to store its data in a Fedora repository. The targeted versions are DSpace 2.x and Fedora 3.x (Fedora 3.2.1 was used during development).

After discussion with community members, it was decided to abandon the GSOC2008 work on DSpace 1.x (DSpace & Fedora Integration) and continue this work on DSpace 2.x. The data model in DSpace 2.x is different, so the mapping part was redone. Likewise, the code was heavily reorganized to reflect these changes and to prepare it as a DSpace 2 module.

Development plan/progress

DSpace 2 data model

Figure 1: General DSpace 2 data model (http://smartech.gatech.edu/dspace/bitstream/1853/28078/5/214-578-1-PB.pdf)

Figure 2: Example DSpace 2 data model implementation (http://smartech.gatech.edu/dspace/bitstream/1853/28078/5/214-578-1-PB.pdf)

Model mapping

Figure 3: Proposed model mapping

Mapping notes:

Other potentially useful Fedora predicates to be implemented:

Entities

DSpace 2 data model entities "marked" with the property http://www.w3.org/1999/02/22-rdf-syntax-ns#type = info:fedora/fedora-system:def/model#FedoraObject are mapped to Fedora objects. Entities having the property http://www.w3.org/1999/02/22-rdf-syntax-ns#type = FedoraObjectDatastream are indirectly mapped to Fedora object datastreams (the binary property of such an entity maps directly to the datastream). Entities having no #type property are mapped to Fedora objects by default. The object a datastream belongs to is indicated using the info:fedora/fedora-system:def/recovery#pid property.
All necessary administrative Fedora object and datastream properties are taken from the corresponding entity properties. If multiple properties with the same name exist and only one is needed, the first one is taken.
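
The entity rules above can be sketched in Java roughly as follows. This is only an illustration, not the storage-fedora code: the class, enum and method names are hypothetical, the datastream type value is used exactly as written above (its full URI is not given here), and treating an unrecognised #type value as a Fedora object is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names come from the storage-fedora module.
final class EntityTargetSketch {
    static final String FEDORA_OBJECT = "info:fedora/fedora-system:def/model#FedoraObject";
    // Written as it appears in the text above; the full URI is not stated there.
    static final String FEDORA_DATASTREAM = "FedoraObjectDatastream";

    enum Target { OBJECT, DATASTREAM }

    /** Decides the Fedora target from the values of the entity's rdf:type property. */
    static Target targetFor(List<String> rdfTypeValues) {
        // No #type property: the entity is mapped to a Fedora object by default.
        if (rdfTypeValues.isEmpty()) {
            return Target.OBJECT;
        }
        // Several properties with the same name: only the first one is taken.
        String first = rdfTypeValues.get(0);
        if (FEDORA_DATASTREAM.equals(first)) {
            return Target.DATASTREAM;
        }
        // FedoraObject, or (assumption) any other value, maps to an object.
        return Target.OBJECT;
    }

    public static void main(String[] args) {
        System.out.println(targetFor(Collections.<String>emptyList()));
        System.out.println(targetFor(Arrays.asList(FEDORA_DATASTREAM)));
        System.out.println(targetFor(Arrays.asList(FEDORA_OBJECT)));
    }
}
```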

<!--
Datastream dependence to object is indicated using info:fedora/fedora-system:def/view#hasDatastream relation. Datastream entites must have exactly one file (binary type) property (datastream itself).

Format type entities having http://www.w3.org/1999/02/22-rdf-syntax-ns#type = http://purl.org/dspace/model#Format property are mapped to Fedora objects. Its RELS-EXT is supplemented with later property for fast supported formats listing (possibly in DSpace UI, when user needs to select mimetype for file).
-->

Properties

Properties of DSpace 2 entities are mapped to Fedora RELS-EXT, RELS-INT and DC datastream entries, and to separate datastreams. If a property has the name http://purl.org/dspace/model#ContentFile, is of binary type (Java InputStream class) and is located in a FedoraObjectDatastream entity, it directly results in a datastream. Only one http://purl.org/dspace/model#ContentFile property is allowed per FedoraObjectDatastream entity. Any string property starting with http://purl.org/dc/elements or http://www.openarchives.org/OAI/2.0/oai_dc/ ends up in the DC datastream. Any other string property that is neither DC nor administrative (administrative property names start with info:fedora) goes into RELS-EXT for FedoraObject entities and into RELS-INT for FedoraObjectDatastream entities.
String properties can be freely defined by the user, who may not provide a namespace; in such cases the "local" namespace http://localhost/model# is forced.
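
A minimal Java sketch of the string property routing described above is given below. All names are hypothetical, and the test used to decide whether a name already carries a namespace (presence of a ':' character) is an assumption made for illustration; the module may detect this differently. Binary ContentFile properties are not covered.

```java
// Hypothetical sketch of the routing rules; not the storage-fedora implementation.
final class PropertyRoutingSketch {
    enum Destination { DC, ADMINISTRATIVE, RELS_EXT, RELS_INT }

    static final String LOCAL_NS = "http://localhost/model#";

    /** Forces the local namespace onto names that carry none (assumed check: no ':'). */
    static String qualify(String name) {
        return name.contains(":") ? name : LOCAL_NS + name;
    }

    /** Routes a string property; inDatastreamEntity is true for FedoraObjectDatastream entities. */
    static Destination route(String name, boolean inDatastreamEntity) {
        String qualified = qualify(name);
        if (qualified.startsWith("http://purl.org/dc/elements")
                || qualified.startsWith("http://www.openarchives.org/OAI/2.0/oai_dc/")) {
            return Destination.DC;
        }
        if (qualified.startsWith("info:fedora")) {
            // Administrative properties become Fedora object or datastream properties.
            return Destination.ADMINISTRATIVE;
        }
        return inDatastreamEntity ? Destination.RELS_INT : Destination.RELS_EXT;
    }

    public static void main(String[] args) {
        System.out.println(route("http://purl.org/dc/elements/1.1/title", false));
        System.out.println(qualify("locatedIn") + " -> " + route("locatedIn", true));
    }
}
```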

Relations

Relations between DSpace2 FedoraObject entities are directly mapped to Fedora relations between objects, which in turn are put in the RELS-EXT datastream. Relations pointing from datastreams are defined in RELS-INT. In the diagram, the relation info:fedora/fedora-system:def/relations-external#hasDatastream has no direct mapping and currently does not participate in any way. With the current mapping, DSpace2 relations can generally result in any combination in Fedora: object-to-object and object-to-other-object-datastream (in RELS-EXT), datastream-to-datastream and datastream-to-object (in RELS-INT), etc. While relations between datastreams in different objects may not be entirely correct, it is left up to the user to choose the specifics of the resulting model implementation, including relation types.

There are a lot of relation types already defined elsewhere, but in the storage-fedora module they can also be freely defined by the user. If a namespace is not provided for a particular relation type, the local namespace http://localhost/model# is forced.

Example RELS-EXT content fragments of child objects:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/dspace:Book~1">
    <locatedIn xmlns="http://localhost/model#" rdf:resource="info:fedora/dspace:Library~1"/>
  </rdf:Description>
</rdf:RDF>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/dspace:Book~2">
    <locatedIn xmlns="http://localhost/model#" rdf:resource="info:fedora/dspace:Library~1"/>
  </rdf:Description>
</rdf:RDF>
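
Fragments like the two above could be produced from a single relation roughly as in the following Java sketch. It is illustrative only: the class and method names are hypothetical, the predicate URI is split at its last '#' or '/' (an assumption), and the inputs are assumed to need no XML escaping.

```java
// Hypothetical sketch; not how storage-fedora necessarily writes RELS-EXT.
final class RelsExtSketch {
    /** Builds a one-relation RDF/XML fragment in the shape of the examples above. */
    static String relsExt(String subjectUri, String predicateUri, String objectUri) {
        // Assumption: the namespace ends at the last '#' or '/' of the predicate URI.
        int cut = Math.max(predicateUri.lastIndexOf('#'), predicateUri.lastIndexOf('/')) + 1;
        String namespace = predicateUri.substring(0, cut);
        String localName = predicateUri.substring(cut);
        // Assumption: the URIs contain no characters that need XML escaping.
        return "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
             + "  <rdf:Description rdf:about=\"" + subjectUri + "\">\n"
             + "    <" + localName + " xmlns=\"" + namespace + "\" rdf:resource=\"" + objectUri + "\"/>\n"
             + "  </rdf:Description>\n"
             + "</rdf:RDF>";
    }

    public static void main(String[] args) {
        System.out.println(relsExt("info:fedora/dspace:Book~1",
                "http://localhost/model#locatedIn",
                "info:fedora/dspace:Library~1"));
    }
}
```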

Example ITQL query for fast child selection (Fedora resource index must be turned on):

select $subject from <#ri>
where  $subject <http://localhost/model#locatedIn>
   <info:fedora/dspace:Library~1>

Example CSV response to it:

"subject"
info:fedora/dspace:Book~1
info:fedora/dspace:Book~2
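
On the client side, the query text and the handling of such a CSV response might look like the Java sketch below. It makes no network call; the class and method names are hypothetical, and the parser assumes the simple one-column, unquoted-value layout shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; builds the query string and parses a response like the one above.
final class ChildQuerySketch {
    /** Builds an iTQL query selecting all subjects related to parentUri by predicateUri. */
    static String childrenQuery(String predicateUri, String parentUri) {
        return "select $subject from <#ri> where $subject <" + predicateUri + "> <" + parentUri + ">";
    }

    /** Parses a one-column CSV response: a header line followed by one URI per line. */
    static List<String> parseSubjects(String csv) {
        List<String> subjects = new ArrayList<>();
        String[] lines = csv.split("\\r?\\n");
        for (int i = 1; i < lines.length; i++) { // index 0 is the "subject" header
            String line = lines[i].trim();
            if (!line.isEmpty()) {
                subjects.add(line);
            }
        }
        return subjects;
    }

    public static void main(String[] args) {
        System.out.println(childrenQuery("http://localhost/model#locatedIn", "info:fedora/dspace:Library~1"));
        System.out.println(parseSubjects("\"subject\"\ninfo:fedora/dspace:Book~1\ninfo:fedora/dspace:Book~2\n"));
    }
}
```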

When designing a DSpace2 model implementation, the designer (user) should also keep in mind that entity relations pointing from parent to child can be inefficient, since parent entities usually tend to have many child entities (consider the example of the parent Library and child Book above). If a parent references all of its children, the parent Fedora object may end up with a large, rapidly changing and growing number of RELS-EXT entries. This problem does not arise with child-to-parent referencing.

<!--
There are some things to note, which user must keep in mind creating relations in DSpace2 model implementation. DSpace 2 model may have various relation types between entities, for example: "hasBook", "hasFile", "isResearcherAt", "scannedBy". In general, if parent entity has relation to child entity, then this relation can be called "hasChild" and from child perspective it may be "isChildOf". So basically child can have reference in its RELS-EXT to parent the same way parent may have reference in its RELS-EXT to child. Problematic is the second case, because parent entities usually tend to have a lot of child entities (consider the example of parent Library and child Book above), thus if it references all of its children, parent object will possibly have rapidly changing and growing number of RELS-EXT entries, which may be inefficient. This problem does not arise in child to parent referencing.

In this DSpace2-Fedora3 model mapping, it is proposed that if not defined separately by user, Fedora objects (represented entities) by default will be related with directional child-to-parent relation, despite relation name.
-->

Identifiers

Organizations using Fedora are very likely to prefer their own custom Fedora object PIDs and DSIDs (datastream IDs), so the storage-fedora module allows this. Users must ensure the uniqueness of custom identifiers themselves. A DSpace entity identifier must have the form info:fedora/PID for objects and info:fedora/PID/DSID for datastreams, so that it can be interpreted correctly by the storage-fedora module. An incorrect entity identifier (one incompatible with a Fedora resource URI) will result in an error. If a Fedora object or datastream identifier is not provided, one will be generated automatically.
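
The two accepted identifier forms can be checked with a small Java sketch such as the one below. The names are hypothetical and the validation is deliberately simplified (it only requires a ':' in the PID and non-empty parts); it does not enforce Fedora's full PID and DSID syntax.

```java
// Hypothetical sketch; simplified validation of info:fedora/PID and info:fedora/PID/DSID.
final class FedoraIdSketch {
    static final String PREFIX = "info:fedora/";

    /** Returns {PID} for an object identifier or {PID, DSID} for a datastream identifier. */
    static String[] parse(String entityId) {
        if (entityId == null || !entityId.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a Fedora resource URI: " + entityId);
        }
        // Limit -1 keeps trailing empty parts so that "info:fedora/a:b/" is rejected.
        String[] parts = entityId.substring(PREFIX.length()).split("/", -1);
        boolean valid = (parts.length == 1 || parts.length == 2) && parts[0].contains(":");
        for (String part : parts) {
            valid = valid && !part.isEmpty();
        }
        if (!valid) {
            throw new IllegalArgumentException("Expected info:fedora/PID or info:fedora/PID/DSID: " + entityId);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", parse("info:fedora/dspace:Book~1")));
        System.out.println(String.join(" | ", parse("info:fedora/dspace:Book~1/DS1")));
    }
}
```

Here DS1 is just an example datastream ID; an identifier that does not match either form raises IllegalArgumentException, mirroring the error behaviour described above.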

<!--
It is very likely, that organizations using Fedora, may prefer using their custom Fedora objects PIDs and DSIDs (datastream IDs), so it is proposed that in storage-fedora module Fedora objects (mapped DSpace2 entities) identifiers can be configurable by user. In this case, user himself must ensure uniqueness of custom identifiers. Also there will be a mechanism allowing generating default PIDs and DSIDs without user intervention.
-->

The Fedora PID namespace used for automatic PID generation is configurable and predefined in the storage-fedora module configuration file.

Concerned about having pids contain any semantic meaning; discussions to date concerned having pids always be opaque to the application. The best example to support this would be the usage of uuids or fedora ids out of the box. Please be cautious about the proposed usage above. Use of other properties (rdf:type or dc:type, for instance) will be more appropriate to determine the object type from. --Mark Diggory 22:37, 12 July 2009 (EDT)

Identifiers of the form <namespace>:<Entity name>~<UUID> and <namespace>:<UUID> were decided against and have therefore been removed from the wiki. UUIDs are quite attractive, though, and will possibly get more attention in the future. --Andrius Blažinskas 00:46, 30 July 2009 (GMT+2)

Versioning

Datastream versioning is an important Fedora feature that DSpace 2 could take advantage of. Fedora can version all datastreams, so both binary files and RELS-EXT & RELS-INT (DSpace metadata and relations) can be versioned. The problem is that many changes scattered over time in one datastream will result in many copies of it, because Fedora simply keeps every changed version. This can become a problem when datastreams are relatively big and change rapidly.

Work on versioning for storage-fedora is currently in progress.

If RELS-EXT supports versioning, then the majority of encoded DSpace metadata and relationships would be versioned as a unit for each DSpace Object. --Mark Diggory 22:41, 12 July 2009 (EDT)

Implementation details

The storage-fedora module is implemented in a similar way to storage-jackrabbit. Currently the module implements org.dspace.providers.StorageProvider, org.dspace.services.mixins.StorageWriteable/StorageVersionable and org.dspace.kernel.mixins.ShutdownService.
The most recent code of storage-fedora will be available at http://scm.dspace.org/svn/repo/modules/storage-fedora/.

Comments

DSpace+2.0 Developer Recommendations

We propose using RELS-EXT to store the majority of DSpace Properties and Relations for a DSpace+2.0 Entity. The goal we hope to see attained is to have DSpace 2.0 act as a management tool on existing Fedora repository content that may not have come from DSpace in the first place. This means:

  1. No DSpace centric metadata formats stored in separate bitstreams
  2. Use of RELS-EXT for all relations in DSpace+2.0
  3. Use of dc metadata datastream for any Dublin Core Elements
  4. Use of RELS-EXT for any other metadata properties
  5. Use of RELS-INT to identify relationships that are data files

Consider that there are efforts to map Fedora to JCR, and we should consider these in the appropriate mappings to DSpace 2.0 / JCR and Fedora (I will try to add more detail on this shortly). --Mark Diggory 16:16, 12 July 2009 (EDT)

Caution against the use of the following expressed namespace: "http://purl.org/dspace2/model/relations/local". The relations already have their own appropriate namespaces (FoaF, ORE, DCMI, etc). The only place that a "dspace" specific namespace will probably be employed in DSpace+2.0 is to capture cases where the legacy DSpace data model cannot be mapped explicitly to an already existing ontology from one of the various communities. --Mark Diggory 22:35, 12 July 2009 (EDT)

References

DSpace2 model and demo by Ben Bosman: http://smartech.gatech.edu/dspace/handle/1853/28078, http://presentations.dlpe.gatech.edu/or09/or09_052009_3/index.html

DSpace2 RDF: http://wiki.dspace.org/index.php/DSpace+2.0/Expressing_DSpace_Domain_Model_In_RDF

JCR for Fedora mappings: http://jcr-connect.at.northwestern.edu/en/JCR_for_Fedora_-_Discussion

Project code is available at: http://scm.dspace.org/svn/repo/modules/storage-fedora