Feature Community

Feature Steward: TBD
Knowledge Gardener: Daniel Davis
Feature Evangelist: TBD

Member List:

Support for Hierarchical Storage

There has been a long-outstanding need for support of hierarchical storage in the Fedora Repository and related components. This need has again moved to the forefront with the increasing use of the Fedora Repository in research applications. Examples include UPEI's Virtual Research Environment (VRE) deployments using Islandora, the Max Planck Institute's eSciDoc (FIZ Karlsruhe), and the emerging NSF Cyberinfrastructure program. Other products, such as DICE's iRODS and the SDSC SRB, have provided a means to virtualize storage, including support for hierarchical storage. A number of institutions have reported increased interest in using Fedora to store high performance computing (HPC) data, and there have also been requests for this feature from institutional repositories and humanities researchers with large collections.

In this space, we will be exploring the requirements related to supporting hierarchical storage and enabling the community to add support to Fedora Commons components. The goal of this work is to produce a Hierarchical Storage Manager (HSM) integration for Fedora Commons.

Characteristics of Hierarchical Storage

Also known as tiered storage, hierarchical storage is driven by the notion that costs can be reduced by storing all or part of a collection of files (bitstreams) on lower-performing, less expensive storage technologies while keeping a copy of some part of the collection on a higher-performing, more expensive storage technology for immediate use. There may be any number of tiers in the hierarchy, but the most common implementations consist of one or two tiers of random access disk followed by tape storage as the lowest tier. The hierarchical storage manager implements a policy that determines on which tier files are stored to meet system goals and moves files between tiers as the policy dictates. In real-world implementations, hierarchical storage may fulfill many other needs, such as backup, replication, high availability, and disaster recovery. However, it is cost that underlies any decision to deploy hierarchical storage, and at this time magnetic tape provides the lowest cost in most deployment scenarios.
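
To make the policy notion concrete, the sketch below shows one way a simple age-based demotion pass might look in Java. The tier directories, idle threshold, and class name are illustrative assumptions, not part of any existing HSM or Fedora component; a real HSM would also record a stub or catalog entry so that demoted files can be recalled transparently.

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Duration;
import java.time.Instant;
import java.util.stream.Stream;

/**
 * Illustrative age-based tiering policy: files in the fast tier that have
 * not been accessed for longer than a threshold are moved to the slow tier.
 */
public class AgeBasedTieringPolicy {

    private final Path fastTier;
    private final Path slowTier;
    private final Duration maxIdle;

    public AgeBasedTieringPolicy(Path fastTier, Path slowTier, Duration maxIdle) {
        this.fastTier = fastTier;
        this.slowTier = slowTier;
        this.maxIdle = maxIdle;
    }

    /** Scan the fast tier and demote files that exceed the idle threshold. */
    public void migrateIdleFiles() throws IOException {
        Instant cutoff = Instant.now().minus(maxIdle);
        try (Stream<Path> files = Files.walk(fastTier)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> lastAccess(p).isBefore(cutoff))
                 .forEach(this::demote);
        }
    }

    private Instant lastAccess(Path p) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
            return attrs.lastAccessTime().toInstant();
        } catch (IOException e) {
            return Instant.MAX; // unreadable attributes: treat as recently used
        }
    }

    private void demote(Path p) {
        try {
            // Preserve the relative layout in the slow tier.
            Path target = slowTier.resolve(fastTier.relativize(p));
            Files.createDirectories(target.getParent());
            Files.move(p, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // A production HSM would log and retry rather than swallow the failure.
        }
    }
}
```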

The Fedora architecture presents some unique problems and opportunities for supporting hierarchical storage. We hope that this forum can inform the design of a hierarchical storage manager integration that fits well within the overall Fedora architecture.

Special Aspects of the Fedora Commons Architecture

The Fedora Commons Architecture is most strongly represented by the Fedora Repository. The Repository acts as a spanning (or mediation) layer that encapsulates the way content is accessed. The architecture does not depend on the common "directory of files" notion that dominates thinking about how content is managed, including the de facto Web architecture. While applications can still use the Fedora Repository as if it were based on a "directory of files", access to content is virtual and uses dissemination services as the access endpoints. There is no guarantee of a one-to-one relationship between a file and a dissemination; dissemination services may be quite complicated. Applications should not normally circumvent Fedora to access the file system directly.
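
As a concrete illustration of this access model, the sketch below fetches datastream content through a repository dissemination endpoint rather than by reading a file path. The base URL, PID, and datastream ID follow the Fedora 3.x REST API-A pattern but are placeholder values; the point is that the client never learns where, or even whether, a file exists on disk.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Fetches content through a Fedora dissemination endpoint instead of a file path.
 * The repository URL, PID, and datastream ID below are placeholders.
 */
public class DisseminationClient {

    public static void main(String[] args) throws Exception {
        // The client addresses an object and datastream, not a location on
        // the repository's file system.
        URL url = new URL(
            "http://localhost:8080/fedora/objects/demo:1/datastreams/OBJ/content");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
            try (InputStream in = conn.getInputStream()) {
                // Where the bytes actually live (disk, tape, another tier)
                // is the repository's concern, not the client's.
                Files.copy(in, Path.of("content.bin"), StandardCopyOption.REPLACE_EXISTING);
            }
        }
        conn.disconnect();
    }
}
```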

Existing HSMs are largely "file-oriented" and firmly rooted in the "directory of files" convention, though many provide a virtual view of physical storage. Under the hood, the Fedora Repository itself uses the "directory of files" convention for physical storage of managed content and metadata through the Akubra plug-in module; in fact, a Fedora Repository can be entirely rebuilt from its managed files. It therefore appears feasible for the Fedora Repository to delegate physical storage to an HSM. However, the Fedora Repository will likely require modification to integrate robustly with HSMs, since applications should not depend on knowledge of how disseminations are accomplished.
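
One possible shape for such an integration is sketched below: a thin storage layer that checks whether managed content has been migrated to a lower tier and triggers a recall before handing back a stream. In Fedora the natural place to hook this in would be an Akubra plug-in; the HsmClient interface and recall semantics shown here are assumptions for illustration and do not reflect any existing API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical HSM-aware read path. A real integration would implement the
 * Akubra blob-store interfaces so that Fedora's low-level storage calls
 * delegate here; the HsmClient interface below is assumed for illustration.
 */
public class HsmAwareStorage {

    /** Minimal view of an HSM: report whether a file is online, and recall it. */
    public interface HsmClient {
        boolean isOnline(Path path);                // true if the file is on a fast tier
        void recall(Path path) throws IOException;  // stage the file back from a lower tier
    }

    private final HsmClient hsm;

    public HsmAwareStorage(HsmClient hsm) {
        this.hsm = hsm;
    }

    /**
     * Opens managed content for reading, recalling it from a lower tier first
     * if necessary. Recall from tape may take minutes, so a robust integration
     * would surface that latency to Fedora rather than blocking silently.
     */
    public InputStream openManagedContent(Path path) throws IOException {
        if (!hsm.isOnline(path)) {
            hsm.recall(path); // blocks until the file is staged back
        }
        return Files.newInputStream(path);
    }
}
```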
