
Overview

This section is for discussion and description of a proposed "high-level" storage interface. The notion of a new storage interface originated with use cases that were hard to satisfy using the current implementation in Fedora. Subsequent discussions led to the realization that "storage" is a much larger issue, with great potential both within Fedora and within a wider infrastructure. In this section you will find references to the earliest work with respect to Fedora and draft proposals aimed at concrete implementations for the Fedora Repository. You can also find and participate in the wider discussion about "storage" that resulted from our initial attempts to design the Fedora-specific implementation and led us to understand that this subject extends far beyond Fedora.

In a nutshell, this forum aims to define an architecture for storage (a.k.a. persistence) which is of general use and is suitable to support short- and long-term access and persistence of digital assets regardless of the underlying physical mechanism. At the same time, there is a need for immediate implementation of new persistence components to meet current needs, so this forum will also be used to design and facilitate the implementation of usable components.

To keep this page short, it will mostly consist of organizing links to other pages. Please note that this work will cross-link to other participants, notably Fedora Create, the Data Conservancy, Policy Driven Repository Interoperability (PoDRI) in conjunction with iRODS, and others who will be named shortly.

Use Cases

This section will lead to use case pages which inform the discussion, covering both Fedora-directed use cases and use cases which are beyond Fedora's scope.

Glossary

A common nomenclature is needed to facilitate understanding. It's best to use common terms, but consistency within this discussion is more important when a term has multiple definitions in common use.

Documents

  • Draft proposal (pdf) - Initial proposal for HighlevelStorage layer. Discusses motivation, separation of concerns, and possible use cases.
  • Aaron's Presentation - Presented at the Feb. 2010 London meeting. Summarizes the proposal and introduces a possible configuration of internal modules connected by chaining.
  • Asger's Presentation - Presented at Feb. 2010 London meeting. Introduces Writable/Readable store interfaces used for plugging in indexing, caching, etc.
  • DEV:OR '10 extended abstract - Extended abstract submitted to Open Repositories 2010.
  • DEV:OR '10 presentation slides - Slides used for OR'10 Fedora user group presentation

Issues for Discussion

In a nutshell, this proposal aims to remove certain hard-coded storage assumptions in Fedora, and present a storage layer/interface that would allow a place for extensions to Fedora that implement multiplexing, non-blob storage, lock-free updates, cloud storage, etc.

  • Interface to DOManager layer
    • Do we present a single HighLevelStorage interface to the DOManager for reads and writes, or
      do we present Readable and Writable?
  • Datastream versioning.
    • There is a proposal to get rid of datastream versions in the model, and present versioning as a storage layer concern (TODO: get proposal and link to it)
    • If versioning becomes a concern of the storage layer, how does this affect the proposed interface?
      • Add get(PID pid, String version)? Is this sufficient?
  • DigitalObject representation
    • Is the existing DigitalObject interface sufficient? Can it be improved? Should it be replaced?
    • If versioning moves from the object model (datastream versions) to the storage layer, then presumably DigitalObject would have to be updated so that it no longer exposes versions
    • Is there anything that would obviously present a problem to performance or scalability with corner-case objects (lots of datastreams or versions)?
  • Return values
    • Operation or transaction ID?
    • A generic map that any module in the storage layer can populate?
  • Chaining of modules
    • Should HighLevelStorage modules be assembled into a chain, a tree, both?
    • Tree configuration implies a list of Writable modules, all seeing the same input.
    • Chain configuration implies modules that implement HighLevelStorage, but that may also have a delegate HighLevelStorage configured beneath them (see the sketch after this list)
      • May need to pay attention to InputStreams passed up/down chain
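
As a concrete point of reference for the questions above, the following is a minimal sketch of what a combined read/write HighLevelStorage interface and a chained module might look like. All names and signatures here are illustrative assumptions, not settled API; DigitalObject refers to the existing Fedora interface, and the Map return type stands in for the open "return values" question.

    import java.util.Map;

    // Illustrative sketch only -- names and signatures are assumptions, not settled API.
    interface HighLevelStorage {

      /** Read the current version of an object. */
      DigitalObject get(String pid);

      /** Possible addition if versioning becomes a storage-layer concern. */
      DigitalObject get(String pid, String version);

      /** Write an object. The return value is an open question: an operation or
          transaction ID, or a generic map that any module in the chain can populate. */
      Map<String, Object> write(DigitalObject obj);
    }

    // Chain configuration: a module implements HighLevelStorage but may also have a
    // delegate HighLevelStorage configured beneath it.
    class VersioningModule implements HighLevelStorage {

      private final HighLevelStorage delegate;

      VersioningModule(HighLevelStorage delegate) {
        this.delegate = delegate;
      }

      public DigitalObject get(String pid) {
        return delegate.get(pid);
      }

      public DigitalObject get(String pid, String version) {
        // resolve the requested version before, or instead of, delegating
        return delegate.get(pid, version);
      }

      public Map<String, Object> write(DigitalObject obj) {
        // record a new version, then pass the same input down the chain,
        // paying attention to any InputStreams passed along
        return delegate.write(obj);
      }
    }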

Implementation plan

Since the storage layer resides beneath the object management (DOManager) layer in Fedora, adopting HighLevelStorage implies creating an alternate DOManager instance that interacts with HighLevelStorage rather than ILowLevelStorage. Ideally, this alternate DOManager would be a simple drop-in replacement for the existing DefaultDOManager. Initial development of HighLevelStorage could then be largely independent of the core Fedora code, and deployment would be enabled by a simple configuration change. Unfortunately, this is not easily possible today due to unnecessary coupling between certain Fedora components and an abundance of unrelated functionality in DOManager that can/should exist elsewhere. These issues would need to be addressed in order to create a truly pluggable DOManager.

While HighLevelStorage is not scheduled to be a feature of Fedora 3.4, it may be developed concurrently with, or slightly after, the 3.4 release. As many of the prerequisites for drop-in replacement of DOManager are general improvements to the Fedora code base that are not storage-specific, there is distinct appeal to incorporating these basic improvements into the core in time for Fedora 3.4. With these prerequisites in place, work on HighLevelStorage could proceed entirely as an add-on/replacement module, hopefully without further changes to the core. Combined with Fedora's enhanced modular architecture, this would potentially allow HighLevelStorage to be distributed as an add-on bundle to Fedora 3.4 for evaluation or testing before it becomes a core feature.

Relevant tracker items

Need to figure out how to link to the issues using the new Jira version!


9 Comments

  1. This has some overlap with the work at Library of Congress on an "Inventory Service".  See their paper A Set of Transfer-Related Services in DLib Magazine Jan/Feb 2009.

    Also - keep an eye on ways in which this might allow us to support transactionality -- at least for the content that we want to store in a transactional system.

  2. This is my suggestion for the java interface of the Fedora Object Model

    Outstanding questions:
    • Relations: should they be properties, or still live as datastreams? I feel they should be in a datastream, but RELS-INT and RELS-EXT should be combined. Still, they are used so often that there should be support functions.
    • DataSource: is this the correct way to represent the contents of a datastream?
    • Are lists of properties the correct way to go, or should we name the specific properties, like SIZE, FORMAT_URI, STATE, LABEL and so on?

    class DigitalObject{

      String pid;
      Date created;
      Date lastModified;

      List<ObjectProperties> objectProperties;

      Map<String,Datastream> datastreams;   // keyed by datastream ID

    }

    class Datastream{
      String ID;
      Date created;
      Date lastModified;

      List<ObjectProperties> properties;

      DataSource contents;                  // the datastream's content

    }

    class ObjectProperties{
      String name;
      String value;

    }
    
    1. I like the idea of combining RELS-EXT / RELS-INT... but wonder if this is really specific to just "RELS", or whether it applies to any RDF in general.

      Per DigitalObject <-> Datastream <-> DataSource:

      We ended up with the DSpace StorageService as:

      StorageEntity = an Identifier

      StorageProperty = a triple of <StorageEntity> <StoragePropertyName> <Object>

      The PropertyStorageService is property-centric, the BinaryStorageService is content-centric (a rough illustration follows the link below).

      http://scm.dspace.org/svn/repo/modules/dspace-storage/trunk/api/src/main/java/org/dspace/services/PropertyStorageService.java
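
      A rough illustration of that shape (these are not the actual DSpace classes, just the idea of entity/property triples):

      class StorageEntity{
        String identifier;          // a StorageEntity is essentially an identifier
      }

      class StorageProperty{
        StorageEntity subject;      // <StorageEntity>
        String name;                // <StoragePropertyName>
        Object value;               // <Object>: a literal or another entity
      }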

  3. This is my proposal for the storage interfaces.

    interface ReadableStore{
       DigitalObject read(String pid);
    }
    

    Nothing new here, just restating the ReadableStore interface

    Here the fun begins

    interface WritableStore{
    
      Result create(DigitalObject newObj);
      
      Result purge(DigitalObject obj);
    
      Result update(DigitalObject obj);
    }
    

    Issues to note:

    • Why does update not take an old and a new DigitalObject as parameters? Easy: the DigitalObject class should store the list of changes, and possibly the version number of the old version. The DigitalObject instance contains a changelog of the changes made since it was retrieved by the application.
    • Update and Create could be merged
    • What is Result?

    Result: Information about what has happened and where the digital object is stored.
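
    For discussion, Result could be as simple as the following sketch; the fields are assumptions drawn from the description above, not settled API.

    import java.util.Map;

    class Result{
      String operationId;                     // or a transaction ID; see "Return values" above
      Map<String, String> storageLocations;   // where the object / its datastreams ended up
      Map<String, Object> details;            // a generic map any storage module can populate
    }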

  4. Transactions seemed to be a big thing to the community. These are my current thoughts on that subject

    interface RevertableWritableStore extends WritableStore{
      Result undo(DigitalObject obj);
    }
    

    The purpose of this interface is to address stores which can undo changes. Undo differs from update in that undo should leave the repository in the same state as if the change had never been made.

    interface TransactionsStore{
      
      Token startTransaction();
    
      Result commit(Token token);
    
      Result rollback(Token token);
    }
    interface TransactionalWritableStore extends TransactionsStore, WritableStore{
      
      Result create(Token token, DigitalObject obj);
    
      Result update(Token token, DigitalObject obj);
    
      Result delete(Token token, DigitalObject obj);
    }
    

    One or two interfaces; I have not decided yet. The point is that one can begin a transaction, make a number of changes as part of that transaction, and then commit or roll back those changes.

    interface AsynchWritableStore extends WritableStore{
    
      Result status(Result result);
    
    }
    

    An asynch store should implement this interface. The three modifying operations will return a Result indicating that the writes are postponed. The status method allows you to use this Result object to look up what the current state is.
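
    To illustrate how the pieces above might fit together, here is a usage sketch; the wrapper method and object variables are assumptions, not part of the proposal.

    // Usage sketch only: a caller working against the transactional interface above.
    void ingestAtomically(TransactionalWritableStore store,
                          DigitalObject newObject, DigitalObject changedObject) {
      Token token = store.startTransaction();
      try {
        store.create(token, newObject);
        store.update(token, changedObject);
        store.commit(token);     // all changes become visible together
      } catch (RuntimeException e) {
        store.rollback(token);   // leave the repository as if the changes had never happened
        throw e;
      }
    }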

    1. In the DSpace Service model, there is a certain degree of transactionality captured in the system. In DSpace a request equals a transaction window; services can bind a listener to the request and, when their tasks finish, complete or roll back the transaction.  All "services" in DSpace operate within this transactional window.

      DSpace 2.0 Core Services 

      RequestService

      In DS2 a request is the concept of a request (HTTP) or an atomic transaction in the system. It is likely to be an HTTP request in many cases, but it does not have to be. This service provides the core services with a way to manage atomic transactions so that when a request comes in which requires multiple things to happen, they can either all succeed or all fail without each service attempting to manage this independently. In a nutshell, this simply allows identification of the current request and the ability to discover whether it succeeded or failed when it ends. Nothing in the system will enforce usage of the service, but we encourage developers who are interacting with the system to make use of it so they know whether the request they are participating in has succeeded or failed and can take appropriate action.

      http://scm.dspace.org/svn/repo/dspace2/core/trunk/api/src/main/java/org/dspace/services/RequestService.java

  5. Context

    If the store is to perform messaging, authorization, or audit trail logging, it needs to know about the context of the changes. For this reason, EVERY method should take an additional parameter, Context, probably equivalent to the current Fedora class of the same name.
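
    For example, the WritableStore methods above might then look like this (sketch only; Context stands for the existing Fedora class of the same name):

    interface WritableStore{

      Result create(Context context, DigitalObject newObj);

      Result update(Context context, DigitalObject obj);

      Result purge(Context context, DigitalObject obj);
    }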

  6. I'm always going to step in and talk about how far we got with the StorageService in DSpace 2.0 and the backporting... just to get some perspective out there on similar work on that side...

    Original DSpace 2.0 Storage API by Aaron Zeckoski

    Read/Write interfaces, versioning, searching, etc.

    http://scm.dspace.org/svn/repo/dspace2/core/trunk/api/src/main/java/org/dspace/services/mixins/

    DSpace 2.0 Modelling Services DSpace 2.0 Expressing DSpace Domain Model In RDF

    GSoC Summer of Code project that led to a simpler API that is more triplestore-like. Services focus on Metadata vs. Binary Data, not Read vs. Write.

    http://scm.dspace.org/svn/repo/modules/dspace-storage/trunk/api/src/main/java/org/dspace/services/

    GSOC10 - Backport of DSpace 2 Storage Services API for DSpace 1.x

    GSOC10 - Storage Service Implementations Based on Semantic Content Repository 

    These represent the storage backend we would eventually want to see for DSpace 2.x to wire onto a storage system.  It moves DSpace away from the rigid, hardcoded DSpaceObject data model and allows DSpace applications to define any graph of objects with properties that represent content; this could apply to JCR, Fedora, triplestores, etc.

    How do we learn from this body of work and bring it into the Fedora HLS work so it can be informed by what is happening in the community at large?

  7. I'd like to be able to provide alternate strategies for copying datastreams from external locations to managed locations, or just between locations more generally. It would be nice if we could achieve something like that without modifications to the high-level storage module, only by configuring some sort of custom transfer strategy class.

    Perhaps let DataSources be POJOs, then allow various copy strategies to intercept the copy function for DataSources of choice. This would allow more efficient transfers in many cases where streaming through Fedora is not optimal.

    In my particular case I'm thinking of a special sort of strategy for staged files: a rename followed by grid-based replication. We ingest most of our data through a staging server within a grid, so the effect for us can be ingest without having to physically move data at ingest time. We would intercept the copy function whenever both DataSources were locations within the same grid; our strategy would perform a logical move to the managed location, set permissions, then trigger post-runtime replication.
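
    One possible shape for such a pluggable strategy (the names are hypothetical, not existing Fedora API):

    // Hypothetical sketch: a transfer strategy that can intercept copies between DataSources.
    interface TransferStrategy{

      /** True if this strategy can handle the pair, e.g. both are locations within the same grid. */
      boolean canTransfer(DataSource from, DataSource to);

      /** Perform the copy (or a logical move/rename) and return the resulting managed location. */
      DataSource transfer(DataSource from, DataSource to) throws java.io.IOException;
    }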