New Features for the Curation System

Introduced in DSpace 1.7, and expanded in 1.8, the Curation System (CS) is still a comparatively new denizen in the DSpace ecosystem. As more tasks and 'suites' are produced, we are learning a lot about what additional functionality the framework could offer to support more powerful, flexible, and easily implemented tasks. This page is intended to be a place to collect these insights, as well as designs that address these needs. Many new features are already being developed, and we welcome participation in their evolution.

Queue Filtering

CS supports asynchronous operation by allowing curation requests to be written to a persistent queue for later processing. Currently, CS simply empties the queue on demand and processes each request. While formally correct - in the sense that every queued request is processed - various optimizations and efficiency gains may be possible through more active management of the queue. To take one simple case: suppose an expensive operation that needs to be performed only once appears twice or more on a queue. Could we not 'weed' the queue of such duplicates, and still achieve the desired result? To support such 'intelligent' queue management, one must realize that not all strategies will work for all types of queues in all circumstances. Any solution must therefore be extensible (in terms of the logic used to manage the queue) and optional (in terms of whether and where it is invoked).

Therefore we propose a new interface, with a single method:

public interface TaskQueueFilter {
   Iterator<TaskQueueEntry> filter(Set<TaskQueueEntry> entries);
}

The filter() method is designed to accept a set of TaskQueueEntries - which is what the TaskQueue 'dequeue()' method returns - and return a (possibly) modified set retrievable through an iterator. The iterator is important (as opposed to just a new set), since it allows (but does not require) the filter to impose an order on the entries. Filters will be applied when 'CurationCli' is invoked (we can add a new, optional, '-f filter' command-line switch) on a particular queue, so flexibility is secured by the ability to set different (or no) filters on different queues. It may be possible to 'chain' filters, but these use-cases would need further definition.
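As an illustration of the 'weeding' case, a duplicate-removing filter might look like the sketch below. The TaskQueueEntry class here is a simplified stand-in with a single task name per entry (an assumption made to keep the example self-contained; it is not the real org.dspace.curate class), and DuplicateFilter is a hypothetical name. The interface is repeated so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified stand-in for a queue entry: one task, one object, one submit time. */
final class TaskQueueEntry {
    private final long submitTime;
    private final String taskName;
    private final String objectId;

    TaskQueueEntry(long submitTime, String taskName, String objectId) {
        this.submitTime = submitTime;
        this.taskName = taskName;
        this.objectId = objectId;
    }

    long getSubmitTime() { return submitTime; }
    String getTaskName() { return taskName; }
    String getObjectId() { return objectId; }
}

/** The proposed interface, repeated here so the sketch is self-contained. */
interface TaskQueueFilter {
    Iterator<TaskQueueEntry> filter(Set<TaskQueueEntry> entries);
}

/** Keeps only the earliest request for each (task, object) pair. */
final class DuplicateFilter implements TaskQueueFilter {
    @Override
    public Iterator<TaskQueueEntry> filter(Set<TaskQueueEntry> entries) {
        // Sort by submit time so that the earliest duplicate is the one retained
        List<TaskQueueEntry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingLong(TaskQueueEntry::getSubmitTime));

        // LinkedHashMap preserves insertion order, so the iterator is time-ordered
        Map<String, TaskQueueEntry> unique = new LinkedHashMap<>();
        for (TaskQueueEntry entry : sorted) {
            unique.putIfAbsent(entry.getTaskName() + "|" + entry.getObjectId(), entry);
        }
        return unique.values().iterator();
    }
}
```

Because the result is exposed as an iterator, this filter both removes duplicates and imposes an order (oldest request first) without the caller needing to know either happened.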

Programs

A common need is to coordinate the activities of multiple tasks against particular object sets: we may wish to ensure one task is performed before another, or only conditionally performed, possibly based on the 'outcome' of another task. CS currently has no ability to specify or enforce these constraints: in fact it explicitly disavows this. In this situation:

Curator curator = new Curator();
curator.addTask("task1");
curator.addTask("task2");
curator.curate(myDso);

the curator makes no promises that 'task1' will run before 'task2' - it could in fact be reversed. Nor can a task have any way of 'discovering' whether another task has run, so coordination can't be managed in the task logic itself. There are sound reasons why simple ordering is not supported: there are too many 'contingencies' that simple ordering cannot cope with. For example, suppose that in the above case 'task1' has an error and never properly ran - then task2's assumptions would be mistaken.

A more full-featured and robust mechanism than simple ordering is needed: thus the proposal to add task 'programs'. A program is a set of instructions about how and whether to run sets of tasks. The CS will be responsible for 'compiling' and running these programs, and a 'program' will have the exact same semantics as an atomic task. Namely:

It will return a status code with the same value set as tasks
It will optionally return a 'result' string
It will have a locally-bound logical name
It will be possible to invoke a program wherever a task can be - in admin UI, workflow, batch, etc

What would a task program look like - i.e. what is the program syntax, etc.? Here is a straw-man example:

# Task Program Example
# MIT Libraries - January 2013
first-task
if not @SUCCESS
  report "problem out of the gate"
  return @ERROR:"first-task did not succeed"
end
second-task
if @FAIL
   cleanup-task
elif @ERROR
   report "error on second task"
elif @SKIP
   another-task
   if @SUCCESS
      return cleanup-task
   end
else
   cleanup-task
end
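
To make the intended control flow concrete, the straw-man program above could be read as roughly equivalent to the following Java. This is only one interpretation: the numeric status codes are assumed to match the usual curation status values (-1 error, 0 success, 1 fail, 2 skip), the 'runTask' callback and 'report' buffer are hypothetical, result strings are left out, and the value returned when the program runs off its end is assumed to be the status of the last task run, which the example itself does not specify.

```java
import java.util.function.ToIntFunction;

final class ProgramSketch {
    // Assumed status values; see the lead-in
    static final int ERROR = -1;
    static final int SUCCESS = 0;
    static final int FAIL = 1;
    static final int SKIP = 2;

    /**
     * Runs the example program. 'runTask' maps a logical task name to the
     * status code produced by running that task on the current object.
     */
    static int run(ToIntFunction<String> runTask, StringBuilder report) {
        int status = runTask.applyAsInt("first-task");
        if (status != SUCCESS) {
            report.append("problem out of the gate\n");
            return ERROR;
        }

        status = runTask.applyAsInt("second-task");
        if (status == FAIL) {
            status = runTask.applyAsInt("cleanup-task");
        } else if (status == ERROR) {
            report.append("error on second task\n");
        } else if (status == SKIP) {
            status = runTask.applyAsInt("another-task");
            if (status == SUCCESS) {
                // 'return cleanup-task': the program's status is the cleanup status
                return runTask.applyAsInt("cleanup-task");
            }
        } else {
            status = runTask.applyAsInt("cleanup-task");
        }
        // Assumption: falling off the end yields the last status seen
        return status;
    }
}
```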

Object Selectors

In CS, the unit of curation is a DSpaceObject (which may be an Item, Collection, or Community). Thus the API offers these basic methods (on the Curator class):

public void curate(DSpaceObject object) throws IOException;

public void curate(Context c, String id) throws IOException;

A task may elect to restrict its scope of operation to a particular type or subset of objects (typically, only items, not containers), and can thus apply filters in business logic code to the objects it is given, but often we may wish to perform a given task on a set of objects that do not correspond to any natural container, so filtering will be of no help. For example, we may wish to perform a task on all recently installed items (whatever the collection). We may do this, of course, by writing custom code that pulls the necessary items, then feeds them one-by-one to a curator, but our code is not very portable/repurposable. We could not, e.g., easily use the same code in a command-line context and a UI context, as we have come to expect with CS.

This is the primary motivation for a new feature of the curation API known as object selectors. 'ObjectSelector' is a new interface (which essentially just exposes a DSpaceObject Iterator) that is directly supported by the curation API:

public void curate(ObjectSelector selector) throws IOException;

public void queue(ObjectSelector selector, String queueId) throws IOException;

The curator will perform the configured tasks on all the DSpaceObjects delivered by the selector, and the selector can deliver any set of objects it wishes. Because ObjectSelector is an interface, CS users may write and deploy their own custom selector implementations, but we propose to offer a few general-purpose selector implementations that will be bundled with the curation system. Currently these are:
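
A custom selector could be as small as the sketch below. It assumes that ObjectSelector simply extends Iterator (the text says only that it 'essentially just exposes a DSpaceObject Iterator'); DSpaceObject is reduced to a stand-in carrying a handle, and ListSelector and SelectorDemo are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Stand-in for the real DSpaceObject; only the handle is modeled. */
final class DSpaceObject {
    private final String handle;
    DSpaceObject(String handle) { this.handle = handle; }
    String getHandle() { return handle; }
}

/** Assumed shape of the selector interface: an iterator over objects. */
interface ObjectSelector extends Iterator<DSpaceObject> {
}

/** Delivers a fixed, caller-supplied list of objects. */
final class ListSelector implements ObjectSelector {
    private final Iterator<DSpaceObject> delegate;

    ListSelector(List<DSpaceObject> objects) {
        // Copy so that later changes to the caller's list do not affect iteration
        this.delegate = new ArrayList<>(objects).iterator();
    }

    @Override
    public boolean hasNext() { return delegate.hasNext(); }

    @Override
    public DSpaceObject next() { return delegate.next(); }
}

final class SelectorDemo {
    /** Mimics the loop a curator might run: visit every selected object once. */
    static List<String> curate(ObjectSelector selector) {
        List<String> visited = new ArrayList<>();
        while (selector.hasNext()) {
            visited.add(selector.next().getHandle());
        }
        return visited;
    }
}
```

The same selector instance could then be handed to a command-line tool or a UI action, which is the portability that hand-written 'pull and feed' code lacks.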

SearchSelector

This selector invokes the DSpace native (Lucene) search APIs to obtain sets of objects. In this way, one can easily perform curation tasks on any set of search results. For ease of reuse, the search query string can be stored in a configuration file, and each such configuration can be given a different name. This technique, known as 'named selectors', allows for easy integration in other CS tools. For example (in the command-line tool via the DSpace launcher):

[dspace]/bin/dspace curate -o nanotechnology -t textextract

The argument to the '-o' (object selector) switch is the name of a selector, which we can imagine is a search for all the items whose title contains 'nanotechnology'.

It should be noted that SearchSelector can also be used for 'non-canned' searches: we could expose a search box in a web page, have the user type in a search string and configure a search selector to use this 'live' query.

QuerySelector

This selector queries the database to obtain its objects. In essence, the selector transforms a very simplified user-supplied query string into the SQL necessary to perform the database query. An example can illustrate:

in_archive = '1' AND last_modified > ${today - 7} AND dc.contributor.author = 'Jones'

This query would retrieve all items authored by Jones installed within the last week. The actual SQL is more complex, since joins with the metadata tables are required. For the curious, the syntax of the query language is given below (in Extended Backus-Naur Form):

(* Query syntax EBNF *)
  query = expr , { "AND" , expr } ;
  expr = ( field name | metadata name ) , oper , value ;
  field name = characters , { "_" , characters } ;
  metadata name = characters , "." , characters , [ "." , characters ] ;
  oper = "=" | "<>" | ">" | "<" | ">=" | "<=" | "BETWEEN" | "LIKE" | "IN" ;
  value = literal | variable ;
  literal = "'" , characters , { whitespace , characters } , "'" ;
  variable = "${" , varname , [ ( "+" | "-" ) , number ] , "}" ;
  varname = "today" | handle ;
(* end syntax EBNF *)
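
One step in that transformation is resolving variables such as '${today - 7}' before the SQL is generated. A minimal sketch of that step follows, assuming that the number is a count of days (as the 'last week' example suggests) and that the resolved date is emitted as a quoted ISO-8601 literal. The class and method names are hypothetical, and the 'handle' variable is not covered.

```java
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueryVariables {
    // Matches ${today}, ${today - 7}, ${today+30}; group 1 is the sign, group 2 the number
    private static final Pattern TODAY =
        Pattern.compile("\\$\\{\\s*today\\s*(?:([+-])\\s*(\\d+))?\\s*\\}");

    /** Replaces every 'today' variable in the query with a quoted date literal. */
    static String expand(String query, LocalDate today) {
        Matcher m = TODAY.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            LocalDate resolved = today;
            if (m.group(1) != null) {
                long days = Long.parseLong(m.group(2));
                resolved = "+".equals(m.group(1)) ? today.plusDays(days) : today.minusDays(days);
            }
            // quoteReplacement guards against '$' or '\' being treated specially
            m.appendReplacement(out, Matcher.quoteReplacement("'" + resolved + "'"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Passing the reference date in as a parameter, rather than calling LocalDate.now() inside the method, keeps the expansion deterministic and easy to test.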

Task Recording

Most routine task executions have no lasting or special significance, but some may merit keeping track of. For example, a scan of the Library of Congress page http://id.loc.gov/vocabulary/preservationEvents.html reveals that many preservation events of significance map to currently offered curation tasks. A facility for tracking important tasks may therefore be desirable. CS does emit ordinary DSpace logging messages, but these are interleaved with all other application logging data, so are not suitable for this sort of historical record.

Instead we propose a very simple, but flexible way to monitor task execution, based on a new annotation type for tasks:

@Record
public class ImportantTask extends AbstractCurationTask
...

The presence of this annotation signifies to the CS that when a task of this type is performed, the outcome should be recorded somewhere, if recording has been otherwise activated in the DSpace curation setup. There will be no error (or run-time penalty), if recording has not been activated. The 'outcome' here means the following data elements:

time stamp of task performance
id (handle) of object
EPerson name invoking task (if specified)
logical task name
task status code
task result string (if set)
task 'type' (explained below)
task 'value' (also explained below)

'Recorded' here means only that a class implementing the 'Recorder' interface has been configured. What constitutes 'recording' is up to the implementation, but could include:

logging to a local file
writing to a database
posting to a message queue
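
For the local-file case, a recorder could look something like the sketch below. The 'Recorder' method signature is an assumption (the text names the interface but does not define it); the parameters simply follow the list of data elements above, and JournalRecorder is a hypothetical name.

```java
import java.io.IOException;
import java.io.Writer;

/** Assumed shape of the recording hook; the real signature may differ. */
interface Recorder {
    void record(long timestamp, String objectId, String epersonName,
                String taskName, int status, String result,
                String type, String value) throws IOException;
}

/** Appends one tab-separated line per recorded task outcome. */
final class JournalRecorder implements Recorder {
    private final Writer journal;

    JournalRecorder(Writer journal) {
        this.journal = journal;
    }

    @Override
    public void record(long timestamp, String objectId, String epersonName,
                       String taskName, int status, String result,
                       String type, String value) throws IOException {
        // Optional elements (EPerson, result, type, value) are written as empty fields
        String line = String.join("\t",
            Long.toString(timestamp), objectId, orEmpty(epersonName), taskName,
            Integer.toString(status), orEmpty(result), orEmpty(type), orEmpty(value));
        journal.write(line);
        journal.write('\n');
        journal.flush();
    }

    private static String orEmpty(String s) {
        return s == null ? "" : s;
    }
}
```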

The point is to have a 'hook' into the curation runtime where these records can be captured. The release will likely include a simple local file log/journal recorder as a basic starter implementation. A number of default values may be overridden if needed:

@Record(statusCodes={1, 2})

@Record(type="PREMIS",
        value="Replication")

By default, the recording logic will be invoked regardless of the statusCode returned by the task, but in the above case, we limit it to failures and skips (status codes 1 and 2). Type and value are very useful when a task's work can be expressed in a controlled vocabulary: it would be easy to generate records, e.g., as RDF statements with the id as subject, and type and value referring to ontology-defined terms. A given task may have multiple such descriptions (in different domains, e.g.). Since simple annotations cannot be repeated, we must use a 'container' annotation:

@Records({
   @Record(type="PREMIS", value="Replication"),
   @Record(type="PREMIS", value="Replication"),
   @Record(type="LOC", value="duplication")
})
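
The annotations themselves, and the way the curation runtime might read them, could be sketched as follows. The element names 'statusCodes', 'type' and 'value' come from the examples above; the defaults (an empty 'statusCodes' meaning 'record every status') and the RecordingSupport helper are assumptions made for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

@Retention(RetentionPolicy.RUNTIME)   // must survive to run time to be read reflectively
@Target(ElementType.TYPE)
@interface Record {
    int[] statusCodes() default {};   // assumption: empty means all status codes
    String type() default "";
    String value() default "";
}

/** Container that allows several @Record descriptions on one task class. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Records {
    Record[] value();
}

final class RecordingSupport {
    /** Collects a lone @Record and/or the contents of an @Records container. */
    static List<Record> recordsOf(Class<?> taskClass) {
        List<Record> found = new ArrayList<>();
        Record single = taskClass.getAnnotation(Record.class);
        if (single != null) {
            found.add(single);
        }
        Records container = taskClass.getAnnotation(Records.class);
        if (container != null) {
            found.addAll(Arrays.asList(container.value()));
        }
        return found;
    }

    /** True if this annotation asks for the given status code to be recorded. */
    static boolean shouldRecord(Record record, int status) {
        int[] codes = record.statusCodes();
        if (codes.length == 0) {
            return true;
        }
        for (int code : codes) {
            if (code == status) {
                return true;
            }
        }
        return false;
    }
}

@Record(statusCodes = {1, 2})
final class FailOrSkipTask { }

@Records({
    @Record(type = "PREMIS", value = "Replication"),
    @Record(type = "LOC", value = "duplication")
})
final class ReplicationTask { }
```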

Resource Management

One of the core design objectives of CS was to make tasks as simple to implement as possible: in practice this meant keeping the API 'footprint' (number of methods that a task has to code) very small. In fact, it really only consists of 2 methods, one of which is overloaded:

void init(Curator curator, String taskId) throws IOException;

int perform(DSpaceObject dso) throws IOException;

int perform(Context ctx, String id) throws IOException;

where the third method can usually be converted into the second. One consequence of this is a lack of what one would consider full lifecycle semantics. That is, there is no method by which a task could 'clean itself up' after use. This can entail a few gyrations - or at any rate a certain task design discipline - in some circumstances. Let us take a concrete example: a task that needs to write some data to a stream for each object it receives. The simplest apparent way to code this is:

public class StreamTaskTake1 implements CurationTask
{
   private OutputStream out;

   public void init(Curator curator, String taskId) throws IOException
   {
       out = new FileOutputStream("somewhere", true);
   }

   public int perform(DSpaceObject dso) throws IOException
   {
       // ...
       out.write(dso.getHandle().getBytes());
       // ...
   }
}

but of course this isn't very satisfactory, since the task never closes the stream it opened. The task has no apparent way of determining when it is called for the last time, so there isn't an obvious way around this. (There are in fact several ways - e.g. the task can annotate itself as @Distributive and have complete control over how it is called, but this can add substantial complexity). So we are usually led to a solution like this:

public class StreamTaskTake2 implements CurationTask
{
   private OutputStream out;

   public void init(Curator curator, String taskId) throws IOException
   {
   }

   public int perform(DSpaceObject dso) throws IOException
   {
       // ...
       out = new FileOutputStream("somewhere", true);
       out.write(dso.getHandle().getBytes());
       out.close();
       // ...
   }
}

This version is formally correct, and indeed exhibits the quite desirable trait of not holding a file descriptor when not in use, but we might chafe at the thought that re-opening the file for every object is fairly inefficient IO if this task is invoked on a collection of 1000 items. Thus the idea of curator resource management: suppose we could simply ask the curation system to manage the issue? Like so:

public class StreamTaskTake3 implements CurationTask
{
   private OutputStream out;

   public void init(Curator curator, String taskId) throws IOException
   {
       out = new FileOutputStream("somewhere", true);
       // let the curator worry about this..
       curator.enrollResource(out, "close");
   }

   public int perform(DSpaceObject dso) throws IOException
   {
       // ...
       out.write(dso.getHandle().getBytes());
       // ...
   }
}

That is, the enrollResource method asks the CS to ensure that 'out.close()' is called on the stream once the curator has finished its work. The "close" argument is called the policy, and it is the job of CS to enforce the policy. Currently, we have only looked at 'close' and 'flush' as policies, but it would not be difficult to imagine others.
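
On the curator side, the bookkeeping behind enrollResource might look like the following sketch. It is an assumption about how this could be implemented, not the actual CS code: ResourceRegistry and releaseAll are hypothetical names, and only the 'close' and 'flush' policies mentioned above are handled.

```java
import java.io.Closeable;
import java.io.Flushable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class ResourceRegistry {
    /** One enrolled resource together with the policy to apply to it. */
    private static final class Enrollment {
        final Object resource;
        final String policy;

        Enrollment(Object resource, String policy) {
            this.resource = resource;
            this.policy = policy;
        }
    }

    private final List<Enrollment> enrolled = new ArrayList<>();

    /** Called by a task (via the curator) to hand over responsibility for a resource. */
    void enrollResource(Object resource, String policy) {
        boolean supported =
            ("close".equals(policy) && resource instanceof Closeable)
            || ("flush".equals(policy) && resource instanceof Flushable);
        if (!supported) {
            throw new IllegalArgumentException(
                "Cannot apply policy '" + policy + "' to " + resource.getClass().getName());
        }
        enrolled.add(new Enrollment(resource, policy));
    }

    /**
     * Called once after the last object has been curated. For simplicity the
     * first failure aborts the remaining releases; a fuller version would
     * attempt every resource and report the failures together.
     */
    void releaseAll() throws IOException {
        try {
            for (Enrollment e : enrolled) {
                if ("close".equals(e.policy)) {
                    ((Closeable) e.resource).close();
                } else {
                    ((Flushable) e.resource).flush();
                }
            }
        } finally {
            enrolled.clear();
        }
    }
}
```

Validating the policy at enrollment time means that a task with a misconfigured policy fails in init(), before any objects have been processed, rather than at the very end of a long run.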
