Asset Store Prototype

Installation

Download: attachment:assetstoreproto-20041222.tar.gz

Adds an isolated-read mechanism (AssetStore.retrieveDataObject()), package refactoring, checksum calculation and more testing.

Old files: attachment:dspace2-asl.tar.gz

You'll need to install Maven to build the prototype.

Then run 'maven test' to download the relevant libraries and run the unit tests. 'maven eclipse' will generate project and classpath files for Eclipse.

Thoughts and Parameters

Lowest storage layer functionality

Allow atomic CRUD of a stream of metadata and optionally a stream of data, keyed by a string identifier

The storage layer need not be OAIS-specific (N.B. it must remain OAIS-compliant). To this end the API deals simply in terms of streams of data associated with streams of metadata. Containers (e.g. an OAIS collection or community) would be a metadata stream describing the collection plus an empty data stream. Bitstreams will contain the data and some bitstream-specific metadata, and have an identifier. In OAIS terms this means that the smallest entity in this asset store model corresponds to an OAIS Data Object.
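
To make this concrete, here is a minimal sketch of the kind of stream-oriented interface described above. Apart from retrieveDataObject(), which is mentioned in the download notes, the names and signatures are illustrative assumptions rather than the prototype's actual API.

    import java.io.InputStream;

    /** Illustrative pairing of a metadata stream with an optional data stream. */
    interface DataObject {
        String getIdentifier();
        InputStream getMetadataStream();
        InputStream getDataStream();   // null for containers (e.g. a collection or community)
    }

    /** Hypothetical CRUD interface, keyed by a string identifier. */
    interface AssetStore {
        /** Atomically store a metadata stream plus an optional data stream. */
        void createDataObject(String id, InputStream metadata, InputStream data) throws Exception;

        /** Isolated read of the stored streams for an identifier. */
        DataObject retrieveDataObject(String id) throws Exception;

        /** Atomically replace the streams stored under the identifier. */
        void updateDataObject(String id, InputStream metadata, InputStream data) throws Exception;

        void deleteDataObject(String id) throws Exception;
    }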

I'm aiming to have an API that can be implemented by a range of storage technologies, to allow implementations based on file systems, distributed storage, etc. Placing high requirements on transaction support is one of the things that could prohibitively raise the bar, so I've gone for the simplest atom of transaction as the base level of support. It may be that we need more transaction support as we go through the design, but it is better to under-engineer now.

The atomic transactions are described as Command objects, which I think is a good way of defining and limiting the transaction support an asset store must implement.
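
As a rough sketch of that idea, assuming the hypothetical AssetStore interface above, each transaction could be represented as a small command object that the store executes as a unit; the names here are illustrative, not the prototype's.

    import java.io.InputStream;

    /** A single atomic change that an asset store must apply as a unit. */
    interface AssetStoreCommand {
        void execute(AssetStore store) throws Exception;
    }

    /** Example command: create one data object in a single transaction. */
    class CreateCommand implements AssetStoreCommand {
        private final String id;
        private final InputStream metadata;
        private final InputStream data;

        CreateCommand(String id, InputStream metadata, InputStream data) {
            this.id = id;
            this.metadata = metadata;
            this.data = data;
        }

        public void execute(AssetStore store) throws Exception {
            store.createDataObject(id, metadata, data);
        }
    }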

Data format / serialization agnostic

The storage layer should be completely agnostic of the data structures and serialization of the metadata stored. We (the digital preservation community) will change our minds about metadata standards in time, as our thinking progresses. This is entirely predictable, and we should design tools that don't need a ground-up rewrite when it happens.

Versioning done in metadata (and hence not by lowest level)

Versioning is another area we might well change policy on in the future. The only reason I can see for holding the versioning data at the storage level is to achieve diff-based storage, which I don't feel is appropriate for a digital preservation system.

Shouldn't assume any semantics in the identifier other than uniqueness

N.B. this doesn't exclude semantic identifiers. The whole debate on identifiers is far from finished. I don't see a reason for the asset storage layer to care as long as uniqueness is guaranteed.

Notify higher layers of changes

N.B. that the asset layer must regard everything else as a 'higher layer' - it must not depend on any of them.

I've implemented this using event listeners (the Observer pattern). This can lead to poor round-trip performance when compared to a polling solution (what Rob calls 'pull'), although the throughput is generally comparable. To remove this performance bottleneck you can make the notifications asynchronous. I've achieved this using the excellent ActiveMQ framework, an extremely lightweight JMS (Java Message Service) framework. The implementation takes less than 300 lines of code.
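
The prototype does this with ActiveMQ, but the shape of the idea can be sketched in plain Java, assuming illustrative listener and notifier names: writers enqueue an event and return immediately, and a background thread delivers it to the registered listeners.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.LinkedBlockingQueue;

    /** Callback implemented by any higher layer interested in changes. */
    interface AssetStoreListener {
        void assetChanged(String identifier, String changeType);
    }

    /** Asynchronous dispatcher: notifyChange() returns immediately. */
    class AsyncNotifier {
        private final List<AssetStoreListener> listeners = new CopyOnWriteArrayList<>();
        private final BlockingQueue<String[]> queue = new LinkedBlockingQueue<>();

        AsyncNotifier() {
            Thread dispatcher = new Thread(() -> {
                try {
                    while (true) {
                        String[] event = queue.take();
                        for (AssetStoreListener l : listeners) {
                            l.assetChanged(event[0], event[1]);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "asset-store-notifier");
            dispatcher.setDaemon(true);
            dispatcher.start();
        }

        void addListener(AssetStoreListener listener) {
            listeners.add(listener);
        }

        /** Called by the asset store after a successful transaction. */
        void notifyChange(String identifier, String changeType) {
            queue.offer(new String[] { identifier, changeType });
        }
    }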

The alternative to this is for every asset store implementation to support an index of items against time, and probably a scheduling manager in the application layer to choreograph access to that index to prevent spike loads.

In case anyone missed the discussion on this subject in August: compared to an asynchronous Observer (push) solution, I think the query (pull) solution will:

  • be harder to implement
  • be less efficient (a locked update of a large index vs. serializing a message)
  • handle heavy load less well (blocking threads are more expensive than maintaining a queue)
  • handle large asset stores less easily (the central index is a size bottleneck)

That's why I favour the use of the Observer pattern with asynchronous event delivery in this situation.

Checksumming

Because the whole data stream needs to be read to calculate a checksum, it makes sense to have the asset store do it on create, so the rest of the application can simply pass the InputStream reference through.
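
A minimal sketch of that approach, using the standard DigestInputStream so the digest is accumulated while the bytes flow into storage; the method name and the choice of MD5 are assumptions for illustration, not the prototype's code.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;

    class ChecksumOnCreate {

        /** Copies the data into storage and returns its hex-encoded checksum. */
        static String storeWithChecksum(InputStream data, OutputStream storage) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            try (DigestInputStream in = new DigestInputStream(data, md5)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    storage.write(buffer, 0, read);
                }
            }
            // Hex-encode the digest so it can be recorded in the bitstream metadata.
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }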

Enabling layout strategies

Suppose you want your archive laid out on a file system such that there is a directory per community, one per collection, one per AIP, and so on. Or suppose you want to insert a meta-storage layer that stores one community's assets on a LOCKSS system and another's on an SRB. You'll need "some" semantics about the item stored to enable this, so a data structure of the relevant information is built just before the metadata is serialised. At the moment I have this as a Properties object, but I think a defined structure would be more appropriate.
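
As a sketch of what those layout hints might look like (the keys and the helper name are made up for illustration, not a defined part of the prototype):

    import java.util.Properties;

    /** Contextual hints handed to the asset store alongside the streams. */
    class LayoutHints {

        static Properties forItem(String community, String collection, String item) {
            Properties hints = new Properties();
            hints.setProperty("community", community);    // e.g. pick a directory, or route to LOCKSS vs. SRB
            hints.setProperty("collection", collection);
            hints.setProperty("item", item);
            return hints;
        }
    }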

Replication / Federation / Mirroring

This could be achieved either behind the asset store API, by executing transactions on two asset stores (at least one remote), or at the application layer, by adding the mirroring asset store as a listener to changes in the first. My gut feeling is that the latter is superior, as it could force the remote system through a permissions layer.
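
A sketch of the listener-based option, reusing the illustrative notification types from the earlier sketch; a real implementation would replay each change against the mirror only after an authorisation check.

    /** Hypothetical mirror that subscribes to the primary store's change events. */
    class MirroringListener implements AssetStoreListener {
        private final AssetStore mirror;

        MirroringListener(AssetStore mirror) {
            this.mirror = mirror;
        }

        public void assetChanged(String identifier, String changeType) {
            // In a full implementation: check permissions, fetch the changed object
            // from the primary store, and replay the change on 'mirror'.
            System.out.println("mirroring " + changeType + " of " + identifier);
        }
    }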
