Curation System for DSpace 1.7

This document is a high-level, but developer-focused, introduction to the curation system proposed for DSpace 1.7. It presumes knowledge of Java and DSpace internals.

Tasks

The goal of the curation system ('CS') is to provide a simple, extensible way to manage
routine content operations on a repository. These operations are known to CS as 'tasks', and they
can operate on any DSpaceObject (i.e. any subclass of DSpaceObject), although
the first incarnation will only understand Communities, Collections, and Items, viz. the core
data model objects. A task may be meaningful for only one type of DSpace object, typically
an item, and in this case it may simply ignore other types (tasks have the ability to
'skip' objects for any reason). The DSpace core distribution ought to provide a number of useful
tasks, but the system is designed to encourage local extension: tasks can be written
for any purpose and placed in any Java package. What sorts of things are appropriate tasks?
Some examples:

  • apply a virus scan to item bitstreams (this will be our example below)
  • profile a collection based on format types - good for identifying format migrations
  • ensure a given set of metadata fields is present in every item, or even that the fields have particular values
  • call a network service to enhance/replace/normalize an item's metadata or content
  • ensure all item bitstreams are readable and their checksums agree with the ingest values

A task can be arbitrary code, but the class implementing it must have two properties:

  1. it must provide a no-arg constructor, so it can be loaded by the PluginManager

Thus, all tasks are 'named' plugins, meaning that each must be configured in dspace.cfg as:

plugin.named.org.dspace.curate.CurationTask = \
org.dspace.curate.ProfileFormats = format-profile \
org.dspace.curate.RequiredMetadata = req-metadata \
org.dspace.ctask.replicate.Audit = audit \
org.dspace.ctask.replicate.Estimate = estimate \
org.dspace.ctask.replicate.Generate = generate \
org.dspace.ctask.integrity.Checksum = checksum \
org.dspace.ctask.integrity.ClamScan = vscan

The 'plugin name' (audit, estimate, etc.) is called the task name, and is used in place of the qualified class name
wherever a task must be identified (on the command line, etc.); the CS always resolves it to the class.
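To illustrate why the no-arg constructor matters, here is a minimal sketch of resolving a task name to an instance by reflection. It is not the DSpace PluginManager: the registry class, the stand-in interface and the two toy task classes are all hypothetical names invented for this example, and the real lookup may work differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the task interface; NOT the real org.dspace.curate.CurationTask.
interface SketchTask {
    String label();
}

// Toy task classes (hypothetical). Each exposes a public no-arg constructor,
// which is what allows an instance to be created from a class name alone.
class ChecksumSketch implements SketchTask {
    public ChecksumSketch() { }

    @Override
    public String label() { return "checksum task"; }
}

class VirusScanSketch implements SketchTask {
    public VirusScanSketch() { }

    @Override
    public String label() { return "virus scan task"; }
}

// Hypothetical registry mimicking a named-plugin lookup; not the DSpace PluginManager.
class TaskNameRegistry {
    // Maps task name -> qualified class name.
    private final Map<String, String> classByName = new LinkedHashMap<>();

    // Argument order mirrors the config entries: class name first, then task name.
    void register(String className, String taskName) {
        classByName.put(taskName, className);
    }

    // Resolve a task name to a fresh instance via the class's no-arg constructor.
    SketchTask resolve(String taskName) {
        String className = classByName.get(taskName);
        if (className == null) {
            throw new IllegalArgumentException("Unknown task name: " + taskName);
        }
        try {
            Class<?> clazz = Class.forName(className);
            return (SketchTask) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load task: " + taskName, e);
        }
    }

    // Builds a small registry with two made-up entries for demonstration.
    static TaskNameRegistry demoRegistry() {
        TaskNameRegistry registry = new TaskNameRegistry();
        registry.register(ChecksumSketch.class.getName(), "checksum");
        registry.register(VirusScanSketch.class.getName(), "vscan");
        return registry;
    }
}

class TaskNameDemo {
    public static void main(String[] args) {
        TaskNameRegistry registry = TaskNameRegistry.demoRegistry();
        System.out.println(registry.resolve("vscan").label());
    }
}
```

A class without a no-arg constructor would make getDeclaredConstructor() fail in this sketch, which is the kind of problem the first property is meant to rule out.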

  2. it must implement 'org.dspace.curate.CurationTask'

The CurationTask interface is almost a 'tagging' interface, and only requires a few very high-level methods be implemented. The most significant is:

int perform(DSpaceObject dso);

The return value should be a code describing one of four conditions:

  • 0 : SUCCESS - the task completed successfully
  • 1 : FAIL - the task failed (it is up to the task to decide what 'counts' as failure; an example might be that the virus scan finds an infected file)
  • 2 : SKIPPED - the task could not be performed on the object, perhaps because it was not applicable
  • -1 : ERROR - the task could not be completed due to an error

If a task extends the AbstractCurationTask class, perform is the only method it needs to define.
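As a rough illustration of the perform contract, the following self-contained sketch shows a toy virus-scan task returning each of the four codes. It does not use the DSpace classes: StatusCodes, StubObject, StubCurationTask and ToyVirusScan are hypothetical stand-ins, the constant names simply echo the labels in the list above, and the "signature" check is a placeholder rather than a real scanner.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical constants echoing the four documented return values.
final class StatusCodes {
    static final int SUCCESS = 0;
    static final int FAIL = 1;
    static final int SKIPPED = 2;
    static final int ERROR = -1;

    private StatusCodes() { }
}

// Hypothetical stand-in for a DSpaceObject: a kind plus bitstream contents as text.
class StubObject {
    final String kind;               // "item", "collection" or "community"
    final List<String> bitstreams;   // a null entry simulates an unreadable bitstream

    StubObject(String kind, List<String> bitstreams) {
        this.kind = kind;
        this.bitstreams = bitstreams;
    }
}

// Hypothetical stand-in for the task interface, reduced to the perform method.
interface StubCurationTask {
    int perform(StubObject dso);
}

// Toy scanner: treats any bitstream containing a made-up marker as infected.
class ToyVirusScan implements StubCurationTask {
    static final String MARKER = "INFECTED"; // placeholder, not a real virus signature

    public ToyVirusScan() { }

    @Override
    public int perform(StubObject dso) {
        if (!"item".equals(dso.kind)) {
            return StatusCodes.SKIPPED;      // task only applies to items
        }
        for (String content : dso.bitstreams) {
            if (content == null) {
                return StatusCodes.ERROR;    // could not read, so scan is incomplete
            }
            if (content.contains(MARKER)) {
                return StatusCodes.FAIL;     // the task ran and found a problem
            }
        }
        return StatusCodes.SUCCESS;
    }
}

class PerformDemo {
    public static void main(String[] args) {
        ToyVirusScan scan = new ToyVirusScan();
        StubObject clean = new StubObject("item", Arrays.asList("plain text", "more text"));
        StubObject collection = new StubObject("collection", Arrays.asList());
        System.out.println("clean item: " + scan.perform(clean));
        System.out.println("collection: " + scan.perform(collection));
    }
}
```

Note the distinction the sketch tries to preserve: FAIL means the task ran to completion and reached a negative verdict, while ERROR means it could not finish at all.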
