Curation System for DSpace 1.7

This document is a high-level - but developer-focused - introduction to the curation system being proposed for DSpace 1.7. It presumes knowledge of Java and DSpace internals.

Tasks

The goal of the curation system ('CS') is to provide a simple, extensible way to manage
routine content operations on a repository. These operations are known to CS as 'tasks', and they
can operate on any DSpaceObject (i.e. any subclass of DSpaceObject) - although
the first incarnation will only understand Communities, Collections, and Items, viz. the core
data model objects. A task may be meaningful for only one type of DSpace object - typically
an item - and in this case it may simply ignore other data types (tasks have the ability to
'skip' objects for any reason). The DSpace core distribution ought to provide a number of useful
tasks, but the system is designed to encourage local extension: tasks can be written
for any purpose, and placed in any Java package. What sorts of things are appropriate tasks?
Some examples:

A task can be arbitrary code, but the class implementing it must have two properties:

  1. it must provide a no-arg constructor, so it can be loaded by the PluginManager

Thus, all tasks are 'named' plugins, meaning that each must be configured in dspace.cfg as:

plugin.named.org.dspace.curate.CurationTask = \
org.dspace.curate.ProfileFormats = format-profile, \
org.dspace.curate.RequiredMetadata = req-metadata, \
org.dspace.ctask.replicate.Audit = audit, \
org.dspace.ctask.replicate.Estimate = estimate, \
org.dspace.ctask.replicate.Generate = generate, \
org.dspace.ctask.integrity.Checksum = checksum, \
org.dspace.ctask.integrity.ClamScan = vscan

The 'plugin name' (audit, estimate, etc.) is called the task name, and is used instead of the qualified class name
wherever a task must be identified (on the command line, etc.) - the CS always dereferences it to the class.
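
The dereferencing of a task name to a class, and the reason a no-arg constructor is required, can be sketched as follows. This is only an illustration under stated assumptions, not the PluginManager's actual code: TaskStandIn, AuditStandIn and TaskNameResolver are hypothetical names invented for this sketch, and the lookup table simply mirrors the 'class = name' pairs shown in the configuration above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the task interface; not the real DSpace type.
interface TaskStandIn {
    String describe();
}

// Hypothetical task class. The public no-arg constructor is what allows
// it to be instantiated reflectively from nothing but its class name.
class AuditStandIn implements TaskStandIn {
    public AuditStandIn() {
    }

    @Override
    public String describe() {
        return "audit task";
    }
}

// Hypothetical resolver: maps a task name to a qualified class name,
// then creates an instance through the no-arg constructor.
class TaskNameResolver {
    private final Map<String, String> classByTaskName = new LinkedHashMap<>();

    // Record one "qualified class name = task name" pair.
    void register(String className, String taskName) {
        classByTaskName.put(taskName, className);
    }

    // Dereference the task name and instantiate the configured class.
    TaskStandIn resolve(String taskName) {
        String className = classByTaskName.get(taskName);
        if (className == null) {
            throw new IllegalArgumentException("Unknown task name: " + taskName);
        }
        try {
            Object instance = Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
            return (TaskStandIn) instance;
        } catch (ReflectiveOperationException e) {
            // Covers a missing class as well as a missing or inaccessible no-arg constructor.
            throw new IllegalStateException("Cannot load task: " + className, e);
        }
    }
}

class ResolverDemo {
    public static void main(String[] args) {
        TaskNameResolver resolver = new TaskNameResolver();
        resolver.register(AuditStandIn.class.getName(), "audit");
        System.out.println(resolver.resolve("audit").describe());
    }
}
```

A class without an accessible no-arg constructor would fail at the newInstance step in this sketch, which is the practical consequence of the first property.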

  2. it must implement 'org.dspace.curate.CurationTask'

The CurationTask interface is almost a 'tagging' interface, and requires only a few very high-level methods to be implemented. The most significant is:

int perform(DSpaceObject dso);

The return value should be a code describing one of 4 conditions:

If a task extends the AbstractCurationTask class, that is the only method it needs to define.
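
As a rough illustration of the shape of a task, the sketch below implements a perform-style method that acts only on items and skips everything else. It is self-contained and does not use the real DSpace classes: DsoStandIn, ItemStandIn, CollectionStandIn, CurationTaskStandIn and RequireTitleTask are hypothetical stand-ins, and the three status constants are illustrative placeholders rather than the codes the CS actually defines.

```java
// Hypothetical stand-ins for the DSpace object hierarchy named in the text.
abstract class DsoStandIn {
    final String handle;

    DsoStandIn(String handle) {
        this.handle = handle;
    }
}

class ItemStandIn extends DsoStandIn {
    final String title;

    ItemStandIn(String handle, String title) {
        super(handle);
        this.title = title;
    }
}

class CollectionStandIn extends DsoStandIn {
    CollectionStandIn(String handle) {
        super(handle);
    }
}

// Hypothetical stand-in for the task interface, reduced to one method.
interface CurationTaskStandIn {
    // Illustrative status codes only; assumed for this sketch.
    int STATUS_SUCCESS = 0;
    int STATUS_FAIL = 1;
    int STATUS_SKIP = 2;

    int perform(DsoStandIn dso);
}

// Hypothetical task: checks that an item carries a non-blank title.
class RequireTitleTask implements CurationTaskStandIn {
    // No-arg constructor, so the class could be loaded as a named plugin.
    public RequireTitleTask() {
    }

    @Override
    public int perform(DsoStandIn dso) {
        // The task only understands items; any other object type is skipped.
        if (!(dso instanceof ItemStandIn)) {
            return STATUS_SKIP;
        }
        ItemStandIn item = (ItemStandIn) dso;
        boolean hasTitle = item.title != null && !item.title.trim().isEmpty();
        return hasTitle ? STATUS_SUCCESS : STATUS_FAIL;
    }
}

class TaskDemo {
    public static void main(String[] args) {
        CurationTaskStandIn task = new RequireTitleTask();
        System.out.println(task.perform(new ItemStandIn("123/1", "A title")));
        System.out.println(task.perform(new CollectionStandIn("123/2")));
    }
}
```

Returning a status code instead of throwing for unsupported object types lets the same task be applied to a whole community or collection without special handling by the caller.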