This Document is a Work In Progress

NOTICE: This is a COPY of the ReplicationTaskSuite document, currently being used to document a change that is not yet available.

Replication Task Suite

The Replication Task Suite is a DSpace Add-On which provides a set of curation system tasks to assist in performing replication (backup/restore/audit) of DSpace contents to other locations. The DSpace content is packaged in containers known as AIPs (OAIS speak: 'archival information packages'). By default, AIPs are generated in the default DSpace AIP Format (the same format used by the AIP Backup and Restore tool). If desired, there is an option to generate BagIt-based AIPs instead of using the default DSpace AIP format.

...

The Replication Task Suite currently supports the following versions of DSpace software:

| Replication Task Suite Version | Supported DSpace Version(s) | Supported Java Version | Supported Interfaces | Notes |
|---|---|---|---|---|
| 6.0 | DSpace version 6.x or higher | Java 7 or above | XMLUI and/or commandline | The 6.0 stable version of the Replication Task Suite offers no new functionality over the previous versions. It is simply a refactor of the code to ensure that the Replication Task Suite works with DSpace 6.x and later versions - see DS-3389 |
| 3.4 | DSpace version 3.x, 4.x or 5.x | Java 7 or above | XMLUI and/or commandline | The 3.4 stable version of the Replication Task Suite is nearly identical to the 1.x stable version. It just includes minor bug fixes to ensure the Replication Task Suite is compatible with the newer DSpace APIs. |
| 1.3 | DSpace version 1.8.x | Java 6 or above | XMLUI and/or commandline | DSpace 1.8.1 or above is highly recommended. DSpace 1.8.0 has a known bug where running a Replication Task will always return a NullPointerException - see DS-1077 |

Installation instructions for each version are included below:

...

These two AIP formats are not identical. The table below describes some of the differences.

 

|  | DSpace AIP Format (METS-based AIPs) | BagIt AIP Format |
|---|---|---|
| Supported Backup/Restore Types |  |  |
| Can Backup & Restore all DSpace Content easily | Yes | Yes |
| Can Backup & Restore a Single Community/Collection/Item easily | Yes | Yes |
| Backups can be used to move one or more Community/Collection/Items to another DSpace system easily. | Yes (Using the Replication Task Suite or using the command line AIP Backup and Restore tools) | Yes (though the Replication Task Suite add-on must be installed on both systems) |
| Can Backup & Restore Item Versions (added in DSpace 3.x) | No (Item Versioning not yet compatible with AIP format. Only the most recent version of an Item is described in the AIP.) | No (Item Versioning not yet compatible with AIP format. Only the most recent version of an Item is described in the AIP.) |
| Supported DSpace Object Types |  |  |
| Supports backup/restore of all Communities/Collections/Items (including metadata, files, logos, etc.) | Yes | Yes |
| Supports backup/restore of all People/Groups/Permissions | Yes | No (Not yet supported) |
| Supports backup/restore of all Collection-specific Item Templates | Yes | No (Not yet supported) |
| Supports backup/restore of all Collection Harvesting settings (only for Collections which pull in all Items via OAI-PMH or OAI-ORE) | No (The harvest settings are not preserved, but previously harvested items are preserved in their own AIPs) | No (The harvest settings are not preserved, but previously harvested items are preserved in their own AIPs) |
| Supports backup/restore of all Withdrawn (but not deleted) Items | Yes | Yes |
| Supports backup/restore of Item Mappings between Collections | Yes | Yes |
| Supports backup/restore of all in-process, uncompleted Submissions (or those currently in an approval workflow) | No (AIPs are only generated for objects which are completed and considered "in archive") | No (AIPs are only generated for objects which are completed and considered "in archive") |
| Supports backup/restore of Items using custom Metadata Schemas & Fields | Yes | Yes |
| Supports backup/restore of all local DSpace Configurations and Customizations | No (You are expected to back up your DSpace configurations and customizations separately. AIPs only back up content held within DSpace.) | No (You are expected to back up your DSpace configurations and customizations separately. AIPs only back up content held within DSpace.) |

 

For more information on the tasks available based on your AIP format choice, please see the Problem Statement and Usage Examples section below. This section also provides good examples of how to use each of the tasks available to you in the Replication Task Suite.

...

We can suppose our data curator has identified a collection of items in her DSpace repository consisting of high-value, born-digital, and unique/irreplaceable (not held elsewhere) content (called the 'Amazing Images' collection). She prudently wishes to insure against catastrophic local loss of this content by keeping a copy or replica of this collection elsewhere (e.g. either on a backup drive, or even in the cloud via a service like DuraCloud). She'd prefer to replicate all her DSpace content, but realizes that storage costs over long periods have made her administration wary, so she decides to begin with this collection.

First Steps - Estimation

Replication Task Used:

Estimate Storage Space for AIP(s)

Task ID: estaipsize

In order to budget for replication storage, she needs to know the 'size' of the collection. When she asks her sysadmin, he replies that it is easy to give her figures for the whole DSpace asset store, but since collections aren't stored separately, she would have to add up each item's bitstreams in the collection, a rather tedious process. Thus the first task: a reporting tool which operates on natural DSpace objects, rather than storage volumes. The "Estimate Storage Space for AIP(s)" (estaipsize) task will give her this ability.

...

We should warn that the estimates from this task are rather crude, in that they do not measure the actual size of all AIPs. Rather they just total up the bitstream (file) sizes (and do not include metadata files). However, even this crude estimate should provide a decent idea of overall storage needs.
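To make the nature of this estimate concrete, here is a minimal sketch of the calculation as described above: a sum of bitstream sizes with no allowance for metadata files. The `Item` class, the handles, and the byte counts are hypothetical stand-ins for illustration only; this is not the task's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical stand-in for a DSpace Item: only its bitstream sizes, in bytes."""
    handle: str
    bitstream_sizes: list[int] = field(default_factory=list)


def estimate_collection_size(items: list[Item]) -> int:
    """Total the bitstream bytes of every item; metadata files are not counted."""
    return sum(sum(item.bitstream_sizes) for item in items)


# Illustrative data only: two items in the 'Amazing Images' collection.
amazing_images = [
    Item("123456789/11", [2_000_000, 350_000]),
    Item("123456789/12", [5_500_000]),
]
total = estimate_collection_size(amazing_images)
print(f"Estimated storage: {total} bytes")
```

Because metadata files are left out, the real AIPs will differ somewhat in size from this figure, which is why the estimate should be treated as a budgeting aid rather than an exact measurement.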

Replicating

Replication Task Used:

Transmit AIP(s) to Storage

Task ID: transmitaip

Having secured approval to replicate the 'Amazing Images' collection, our curator needs a task to generate the AIP representations of each item in the collection, and transmit these archive files to the replication storage site (which may be service-backed, local, in the cloud, etc., as will be explored below). This task is the "Transmit AIP(s) to Storage" (transmitaip) task.

...

Our data curator may elect to perform this task in the DSpace Admin UI, or, if the collection is rather large, she may instead 'queue' the task for later execution by using the queueing facility available in the curation system. We should note that the 'transmitaip' task, like all other replication tasks, operates on whatever DSpace object(s) it is given. Thus, if the object is a collection, the task creates (and transmits, of course) an AIP for the collection object itself (metadata and logo), as well as AIPs for each item in the collection. If the task is given an identifier for a single Item, then only one AIP will be created and transmitted.
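The way a task fans out from the object it is given can be sketched as a simple walk over the object hierarchy. The `DSpaceObject` class and the handles below are hypothetical and only model the behavior described above (one AIP for the container, plus one for each member); they are not part of the DSpace API.

```python
from dataclasses import dataclass, field


@dataclass
class DSpaceObject:
    """Hypothetical stand-in for a Community, Collection, or Item."""
    handle: str
    kind: str  # "community", "collection", or "item"
    children: list["DSpaceObject"] = field(default_factory=list)


def objects_to_package(obj: DSpaceObject) -> list[str]:
    """Return the handle of every object that would get its own AIP, container first."""
    handles = [obj.handle]
    for child in obj.children:
        handles.extend(objects_to_package(child))
    return handles


# Illustrative data only: a collection holding two items.
item_a = DSpaceObject("123456789/11", "item")
item_b = DSpaceObject("123456789/12", "item")
collection = DSpaceObject("123456789/10", "collection", [item_a, item_b])

collection_aips = objects_to_package(collection)  # the collection plus both items
single_item_aips = objects_to_package(item_a)     # just the one item
```

Given the collection, the sketch yields three handles (the collection and its two items); given a single item, it yields one.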

Verifying Replication

Replication Task Used:

Verify AIP(s) exist in Storage

Task ID: verifyaip

While the 'transmitaip' task will report on whether or not it was successful in generating and transmitting AIP(s) to the replication service, our data curator wants the ability (within DSpace) to check whenever she likes that the AIP(s) which were transmitted are still there. A simple task "Verify AIP(s) exist in Storage" (verifyaip) can perform this function.

Ensuring Replica Integrity and Accuracy over time

Replication Task Used:

Audit against AIP(s)

Task ID: auditaip

The 'Amazing Images' collection is comparatively static, meaning that few new items are likely to be added, and most of the metadata in each item is not routinely changed. However, over longer periods of time, cataloging errors are discovered and corrected, and perhaps formats become obsolete and new bitstreams are added. If the curator is fastidious about each change, and performs the 'transmitaip' task on each item that has changed, then in general the set of AIP replicas will always be 'in sync' with the repository. However, it is useful to have the means to ensure that the replicas agree with the repository without having to create and transmit entirely new ones. Thus the task: "Audit against AIP(s)" (auditaip), which can also be thought of as a simple, quick auditing task. When performed on an Item, the task does the following:

...

A set of replication tasks perform these functions, as described below.

Restoring Object(s)

Replication Tasks Used:

Restore Missing Object(s) from AIP(s)

Task ID: restorefromaip

 

Restore Missing Object(s) but Keep Existing Objects (*METS-AIP Only)

Task ID: restorekeepexisting

 

Restore Single Object from AIP (*METS-AIP Only)

Task ID: restoresinglefromaip

If the curator should ever find the need to restore a deleted object, a variety of restoration-based tasks are available. The base task is the "Restore Missing Object(s) from AIP(s)" (restorefromaip) task.

...

  • Restore Single Object from AIP (restoresinglefromaip)
    • This task acts the same as the default "restorefromaip" task, but it does NOT restore any child objects. So, if it is run on a collection, just the collection itself will be restored (items in that collection will not be restored).
  • Restore Missing Object(s) but Keep Existing Objects (restorekeepexisting)
    • This task acts similarly to the default "restorefromaip" task, but it attempts to skip over any objects which already exist in the repository. In other words, an error is not thrown if an object already exists; rather, that entire object (and all its child objects) is skipped over during processing and left unchanged. This mode is identical to the "Keep Existing" mode of the DSpace AIP Backup and Restore tool.
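The difference between the three restore tasks can be sketched as follows. This is an illustration under stated assumptions, not the tasks' real code: the `Aip` class is hypothetical, the repository is modeled as a plain set of handles, and the default task raising an error for an existing object is inferred from the "Keep Existing" description above.

```python
from dataclasses import dataclass, field


@dataclass
class Aip:
    """Hypothetical stand-in for a stored AIP and the AIPs of its child objects."""
    handle: str
    children: list["Aip"] = field(default_factory=list)


def restore(aip: Aip, repository: set, mode: str = "restorefromaip") -> None:
    """Add handles from stored AIPs to `repository` (a set standing in for DSpace)."""
    if aip.handle in repository:
        if mode == "restorekeepexisting":
            return  # skip this object and all of its children, leaving them unchanged
        raise ValueError(f"{aip.handle} already exists")  # assumed default behavior
    repository.add(aip.handle)
    if mode == "restoresinglefromaip":
        return  # child objects are deliberately not restored
    for child in aip.children:
        restore(child, repository, mode)


# Illustrative data only: a collection AIP with two item AIPs.
stored = Aip("123456789/10", [Aip("123456789/11"), Aip("123456789/12")])

full_restore = set()
restore(stored, full_restore)  # restores the collection and both items

single_restore = set()
restore(stored, single_restore, "restoresinglefromaip")  # restores the collection only

keep_existing = {"123456789/10"}
restore(stored, keep_existing, "restorekeepexisting")  # collection exists, so nothing changes
```

Note that in the "keep existing" case the items are not restored either, because the existing collection and everything beneath it are skipped as a unit.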

Replacing Object(s)

Replication Tasks Used:

Replace Existing Object(s) with AIP(s)

Task ID: replacewithaip

 

Replace Single Object with AIP (*METS-AIP Only)

Task ID: replacesinglewithaip

If the curator should ever find a need to replace a corrupted object or revert an existing object back to the version in remote storage, a variety of replacement tasks are available. The base task is the "Replace Existing Object(s) with AIP(s)" (replacewithaip) task.

...

  • Replace Single Object with AIP (replacesinglewithaip)
    • This task acts the same as the default "replacewithaip" task, but it does NOT replace any child objects. So, if it is run on a collection, just the collection metadata will be replaced (items existing in that collection will not be replaced).

Cleanup

Replication Task Used:

Remove AIP(s) from Storage

Task ID: removeaip

Ordinarily, a replication arrangement is long standing: the preservation function cannot be fulfilled unless the replicas (here, the AIPs) are always kept and available. However, some collections (or items within them) may be removed for a variety of reasons: legal challenge, de-accession, etc. When the repository no longer wants to hold the object locally, the replica AIP ceases to have value. The task 'Remove AIP(s) from Storage' (removeaip) will permanently delete the replica store AIP for its identifier. As with other replication tasks, if the identifier points to a collection or community, all the AIPs of all the members will also be permanently deleted.

Keeping Score

Replication Task Used:

Read Odometer

Task ID: readodometer

Many storage providers have cost structures that are more complex than simple functions of the total stored bytes: cloud providers in particular have costs associated with the use of the network to upload and download the stored object. An object that occupies 2 megabytes might cost far more over time than a 1 gigabyte object, if the former is downloaded 1000 times for every time the latter is. The replication system provides a very rudimentary task to help manage and track these factors: 'Read Odometer' (readodometer). This task simply displays the readings from the replication system that records cumulative use. The statistics are:

...
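To make the download-cost comparison in this section concrete, the following sketch works through the arithmetic for the two example objects. It assumes decimal units (1 megabyte = 10^6 bytes, 1 gigabyte = 10^9 bytes) and considers network traffic only; actual provider pricing is not modeled.

```python
MB = 10**6  # bytes per megabyte (decimal units, an assumption)
GB = 10**9  # bytes per gigabyte (decimal units, an assumption)

# The 2 MB object is downloaded 1000 times for each download of the 1 GB object.
small_traffic = 2 * MB * 1000
large_traffic = 1 * GB * 1

ratio = small_traffic / large_traffic
print(f"Small object traffic: {small_traffic} bytes")
print(f"Large object traffic: {large_traffic} bytes")
print(f"Ratio: {ratio}")
```

Under these assumptions the small object generates twice the download traffic of the large one, even though it occupies a small fraction of the storage.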