As of release 1.7, DSpace supports running curation tasks, which are described in this section. DSpace 1.7 and subsequent distributions will bundle (include) several useful tasks, but the system also is designed to allow new tasks to be added between releases, both general purpose tasks that come from the community, and locally written and deployed tasks.
...
Since tasks have access to, and can modify, DSpace content, performing tasks is considered an administrative function to be available only to knowledgeable collection editors, repository administrators, sysadmins, etc. No tasks are exposed in the public interfaces.
...
For CS to run a task, the code for the task must of course be included with other deployed code (in [dspace]/lib, the WAR, etc.), but it must also be declared and given a name. This is done via a configuration property in [dspace]/config/modules/curate.cfg as follows:
Code Block |
---|
plugin.named.org.dspace.curate.CurationTask = \
    org.dspace.ctask.general.NoOpCurationTask = noop, \
    org.dspace.ctask.general.ProfileFormats = profileformats, \
    org.dspace.ctask.general.RequiredMetadata = requiredmetadata, \
    org.dspace.ctask.general.ClamScan = vscan, \
    org.dspace.ctask.general.MicrosoftTranslator = translate, \
    org.dspace.ctask.general.MetadataValueLinkChecker = checklinks
|
For each activated task, a key-value pair is added. The key is the fully qualified class name and the value is the taskname used elsewhere to configure the use of the task, as will be seen below. Note that the curate.cfg configuration file, while in the config directory, is located under 'modules'. The intent is that tasks, as well as any configuration they require, will be optional 'add-ons' to the basic system configuration. Adding or removing tasks has no impact on dspace.cfg.
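To make the key-value layout concrete, here is a minimal sketch that splits such a property value into taskname and class name pairs. It is only an illustration of the format described above; the class `TaskDeclarationSketch` and its `parse` method are hypothetical and are not how DSpace's own plugin loading is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only: not part of DSpace.
class TaskDeclarationSketch {

    // Turns "class = name, class = name" into a map of taskname -> class name.
    static Map<String, String> parse(String value) {
        Map<String, String> tasks = new LinkedHashMap<>();
        for (String entry : value.split(",")) {
            String[] pair = entry.split("=", 2);
            if (pair.length == 2) {
                // The key is the fully qualified class name, the value is the taskname.
                tasks.put(pair[1].trim(), pair[0].trim());
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        String value = "org.dspace.ctask.general.NoOpCurationTask = noop, "
                + "org.dspace.ctask.general.ClamScan = vscan";
        parse(value).forEach((name, cls) -> System.out.println(name + " -> " + cls));
    }
}
```

The map is keyed by taskname because that is the identifier the rest of the configuration uses to refer to a task.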
For many tasks, this activation configuration is all that is required to use them. Other tasks need specific configuration of their own. A concrete example is described below, but note that these task-specific configuration property files also reside in [dspace]/config/modules.
A task is just a java class that can contain arbitrary code, but it must have 2 properties:
...
[your-handle-prefix]/0
Each of the above pages exposes a drop-down list of configured tasks, with a button to 'perform' the task, or queue it for later operation (see section below). Not all activated tasks need appear in the Curate tab; you filter them by means of a configuration property. This property also permits you to assign to the task a more user-friendly name than the PluginManager _taskname_. The property resides in [dspace]/config/modules/curate.cfg:
Code Block |
---|
ui.tasknames = \
    profileformats = Profile Bitstream Formats, \
    requiredmetadata = Check for Required Metadata
|
...
The configuration of groups follows the same simple pattern as tasks, using properties in [dspace]/config/modules/curate.cfg. The group is assigned a simple logical name, but also a localizable name that appears in the UI. For example:
Code Block |
---|
# ui.taskgroups contains the list of defined groups, together with a pretty name for UI display
ui.taskgroups = \
    replication = Backup and Restoration Tasks, \
    integrity = Metadata Integrity Tasks, \
    .....
# each group membership list is a separate property, whose value is comma-separated list of logical task names
ui.taskgroup.integrity = profileformats, requiredmetadata
....
|
CS provides the ability to attach any number of tasks to standard DSpace workflows. Using a configuration file, [dspace]/config/workflow-curation.xml, you can declaratively (without coding) wire tasks to any step in a workflow. An example:
Code Block |
---|
<taskset-map>
  <mapping collection-handle="default" taskset="cautious" />
</taskset-map>
<tasksets>
  <taskset name="cautious">
    <flowstep name="step1">
      <task name="vscan">
        <workflow>reject</workflow>
        <notify on="fail">$flowgroup</notify>
        <notify on="fail">$colladmin</notify>
        <notify on="error">$siteadmin</notify>
      </task>
    </flowstep>
  </taskset>
</tasksets>
|
This markup would cause a virus scan to occur during step one of workflow for any collection, and automatically reject any submissions with infected files. It would further notify (via email) both the reviewers (step 1 group), and the collection administrators, if either of these are defined. If it could not perform the scan, the site administrator would be notified.
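As a way of checking what a taskset actually wires together, the following sketch walks such markup with the JDK's DOM parser and lists each task with its workflow action. It is an assumption-laden illustration, not the code DSpace uses to read the file: the class `WorkflowTasksetSketch` is hypothetical, and it is fed only a `tasksets` element, since an XML parser needs a single root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only: not part of DSpace.
class WorkflowTasksetSketch {

    // Returns one "taskset/flowstep/task -> action" line per wired task.
    static List<String> describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            NodeList tasks = doc.getElementsByTagName("task");
            for (int i = 0; i < tasks.getLength(); i++) {
                Element task = (Element) tasks.item(i);
                Element step = (Element) task.getParentNode();
                Element set = (Element) step.getParentNode();
                NodeList wf = task.getElementsByTagName("workflow");
                // A task without a <workflow> child is reported with action "none".
                String action = wf.getLength() > 0 ? wf.item(0).getTextContent().trim() : "none";
                out.add(set.getAttribute("name") + "/" + step.getAttribute("name")
                        + "/" + task.getAttribute("name") + " -> " + action);
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse taskset XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tasksets><taskset name=\"cautious\"><flowstep name=\"step1\">"
                + "<task name=\"vscan\"><workflow>reject</workflow>"
                + "<notify on=\"fail\">$colladmin</notify></task>"
                + "</flowstep></taskset></tasksets>";
        describe(xml).forEach(System.out::println);
    }
}
```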
The notifications use the same procedures that other workflow notifications do, namely email. There is a new email template defined for curation task use: [dspace]/config/emails/flowtask_notify. This may be language-localized or otherwise modified like any other email template.
Like configurable submission, you can assign these task rules per collection, as well as having a default for any collection.
Tasks wired in this way are normally performed as soon as the workflow step is entered, and the outcome action (defined by the 'workflow' element) immediately follows. It is also possible to delay the performance of the task, which keeps the system responsive, by queuing the task instead of directly performing it:
Code Block |
---|
...
<taskset name="cautious">
<flowstep name="step1" queue="workflow">
...
|
This attribute (which must always follow the 'name' attribute in the flowstep element) will cause all tasks associated with the step to be placed on the queue named 'workflow' (or any queue you wish to use, of course), and further has the effect of suspending the workflow. When the queue is emptied (meaning all tasks in it have been performed), the workflow is restarted. Each workflow step may be separately configured.
If these pre-defined ways are not sufficient, you can of course manage curation directly in your code. You would use the CS helper classes. For example:
Code Block |
---|
Collection coll = (Collection)HandleManager.resolveToObject(context, "123456789/4");
Curator curator = new Curator();
curator.addTask("vscan").curate(coll);
System.out.println("Result: " + curator.getResult("vscan"));
|
would do approximately what the command line invocation did. The method 'curate' just performs all the tasks configured (you can add multiple tasks to a curator).
Because some tasks may consume a fair amount of time, it may not be desirable to run them in an interactive context. CS provides a simple API and means to defer task execution, by a queuing system. Thus, using the previous example:
Code Block |
---|
Curator curator = new Curator();
curator.addTask("vscan").queue(context, "monthly", "123456789/4");
|
would place a request on a named queue "monthly" to virus scan the collection. To read (and process) the queue, we could for example:
Code Block |
---|
[dspace]/bin/dspace curate -q monthly |
use the command-line tool, but we could also read the queue programmatically. Any number of queues can be defined and used as needed.
In the administrative UI curation 'widget', there is the ability to both perform a task, but also place it on a queue for later processing.
Few assumptions are made by CS about what the 'outcome' of a task may be (if any): it could, e.g., produce a report to a temporary file, it could modify DSpace content silently, etc. But the CS runtime does provide a few pieces of information whenever a task is performed:
The status code was mentioned above. It is returned to CS whenever a task is called. The complete list of values:
Code Block |
---|
-3 NOTASK - CS could not find the requested task
-2 UNSET - task did not return a status code because it has not yet run
-1 ERROR - task could not be performed
0 SUCCESS - task performed successfully
1 FAIL - task performed, but failed
2 SKIP - task not performed due to object not being eligible
|
In the administrative UI, this code is translated into the word or phrase configured by the ui.statusmessages property (discussed above) for display.
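Calling code usually has to branch on these values, so a small sketch of that mapping may help. The constants below simply copy the numbers from the table; the class `CurationStatusSketch` and its methods are hypothetical, and real code should rely on whatever constants DSpace itself defines rather than on these local copies.

```java
// Hypothetical helper, for illustration only: not part of DSpace.
class CurationStatusSketch {

    // Values copied from the status table; the names are local to this sketch.
    static final int NOTASK = -3;
    static final int UNSET = -2;
    static final int ERROR = -1;
    static final int SUCCESS = 0;
    static final int FAIL = 1;
    static final int SKIP = 2;

    // Maps a status code to the short name used in the table.
    static String describe(int status) {
        switch (status) {
            case NOTASK:  return "NOTASK";
            case UNSET:   return "UNSET";
            case ERROR:   return "ERROR";
            case SUCCESS: return "SUCCESS";
            case FAIL:    return "FAIL";
            case SKIP:    return "SKIP";
            default:      return "UNKNOWN(" + status + ")";
        }
    }

    // True only when the task actually ran to completion, whatever its verdict.
    static boolean wasPerformed(int status) {
        return status == SUCCESS || status == FAIL;
    }

    public static void main(String[] args) {
        for (int code = -3; code <= 2; code++) {
            System.out.println(code + " " + describe(code) + " performed=" + wasPerformed(code));
        }
    }
}
```

Separating "performed" from "succeeded" reflects the table's distinction between FAIL (the task ran and reported a problem) and ERROR (the task could not run at all).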
The task may define a string indicating details of the outcome. This result string is displayed in the 'curation widget' described above:
Code Block |
---|
"Virus 12312 detected on Bitstream 4 of 1234567789/3"
|
CS does not interpret or assign result strings; the task does. A task is not required to assign a result, but the 'best practice' for tasks is to assign one whenever possible.
For very fine-grained information, a task may write to a reporting stream. This stream is sent to standard out, so is only available when running a task from the command line. Unlike the result string, there is no limit to the amount of data that may be pushed to this stream.
The status code and the result string are accessed (or set) by methods on the Curator object:
Code Block |
---|
Curator curator = new Curator();
curator.addTask("vscan").curate(coll);
int status = curator.getStatus("vscan");
String result = curator.getResult("vscan");
|
DSpace 1.8 introduces a new 'idiom' for tasks that require configuration data. It is available to any task whose implementation extends _AbstractCurationTask_, but is completely optional. There are a number of problems that task properties are designed to solve, but to make the discussion concrete we will start with a particular one: the problem of hard-coded configuration file names. A task that relies on configuration data will typically encode a fixed reference to a configuration file name. For example, the virus scan task reads a file called 'clamav.cfg', which lives in [dspace]/config/modules. And thus in the implementation one would find:
Code Block |
---|
host = ConfigurationManager.getProperty("clamav", "service.host"); |
...
Code Block |
---|
org.dspace.ctask.general.ClamAv = vscan, org.community.ctask.ConflictTask = virusscan, .... |
...
then 'taskProperty()' will resolve to [dspace]/config/modules/vscan.cfg when called from the ClamAv task, but [dspace]/config/modules/virusscan.cfg when called from ConflictTask's code. Note that 'vscan' etc. are locally assigned names, so we can always prevent the 'collisions' mentioned, and we make the tasks much more portable, since we remove the 'hard-coding' of config names.
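The resolution rule can be pictured as a lookup from the calling task's class to its locally assigned name, followed by building the file path from that name. The sketch below is a hypothetical stand-in (the class `TaskConfigPathSketch` and method `configFileFor` are invented for illustration) and does not show DSpace's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only: not part of DSpace.
class TaskConfigPathSketch {

    // Builds the per-task config file path from the taskname declared for a class.
    static String configFileFor(Map<String, String> taskNames, String className, String moduleDir) {
        String taskName = taskNames.get(className);
        if (taskName == null) {
            throw new IllegalArgumentException("No task declared for " + className);
        }
        return moduleDir + "/" + taskName + ".cfg";
    }

    public static void main(String[] args) {
        // Declarations taken from the example above: class name -> taskname.
        Map<String, String> names = new HashMap<>();
        names.put("org.dspace.ctask.general.ClamAv", "vscan");
        names.put("org.community.ctask.ConflictTask", "virusscan");

        String dir = "[dspace]/config/modules";
        System.out.println(configFileFor(names, "org.dspace.ctask.general.ClamAv", dir));
        System.out.println(configFileFor(names, "org.community.ctask.ConflictTask", dir));
    }
}
```

Because the file name derives from the locally assigned taskname and not from anything baked into the task's code, two tasks that would otherwise both want the same file name never collide.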
The entire 'API' for task properties is:
Code Block |
---|
public String taskProperty(String name);
public int taskIntProperty(String name, int defaultValue);
public long taskLongProperty(String name, long defaultValue);
public boolean taskBooleanProperty(String name, boolean defaultValue);
|
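The typed variants all follow the same pattern: return the parsed value if the property is present, otherwise the supplied default. A rough stand-in built on `java.util.Properties` is sketched below. The class `TaskPropertiesSketch` is hypothetical, it mirrors only the method names from the list above, and falling back to the default on an unparsable number is an assumption of this sketch, not documented DSpace behavior.

```java
import java.util.Properties;

// Hypothetical stand-in, for illustration only: not the DSpace implementation.
class TaskPropertiesSketch {

    private final Properties props;

    TaskPropertiesSketch(Properties props) {
        this.props = props;
    }

    // Returns the raw value, or null when the property is not defined.
    String taskProperty(String name) {
        return props.getProperty(name);
    }

    // Returns the parsed integer, or defaultValue when missing or unparsable (assumption).
    int taskIntProperty(String name, int defaultValue) {
        String value = props.getProperty(name);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    // Returns the parsed boolean, or defaultValue when the property is missing.
    boolean taskBooleanProperty(String name, boolean defaultValue) {
        String value = props.getProperty(name);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("socket.timeout", "120");
        TaskPropertiesSketch task = new TaskPropertiesSketch(p);
        System.out.println(task.taskIntProperty("socket.timeout", 60));
        System.out.println(task.taskBooleanProperty("scan.failfast", false));
    }
}
```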
...
Another use of task properties is to support multiple task profiles. Suppose we have a task that we want to operate in one of two modes. A good example would be a mediafilter task that produces a thumbnail. We can either create one if it doesn't exist, or run with '-force' which will create one regardless. Suppose this behavior was controlled by a property in a config file. If we configured the task as 'thumbnail', then we would have in [dspace]/config/modules/thumbnail.cfg:
Code Block |
---|
...other properties...
thumbnail.maxheight = 80
thumbnail.maxwidth = 80
forceupdate=false
|
...
DSpace 1.8 includes limited (and somewhat experimental) support for deploying and running tasks written in languages other than Java. Since version 6, Java has provided a standard way (API) to invoke so-called scripting or dynamic language code that runs on the Java Virtual Machine (JVM). Scripted tasks are those written in a language accessible from this API. The exact number of supported languages will vary over time, and the degree of maturity of each language, or its suitability for curation tasks, will also vary significantly. However, preliminary work indicates that Ruby (using the JRuby runtime) and Groovy may prove viable task languages.
Support for scripted tasks does *not* include any DSpace pre-installation of the scripting language itself. This must be done according to the instructions provided by the language maintainers, and typically only requires a few additional jars on the DSpace classpath. Once one or more languages have been installed into the DSpace deployment, task support is fairly straightforward. One new property must be defined in [dspace]/config/modules/curate.cfg:
Code Block |
---|
script.dir = ${dspace.dir}/scripts |
...
An example property for a link checking task written in Ruby might be:
Code Block |
---|
linkchecker = ruby|rubytask.rb|LinkChecker.new
|
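The example value is three fields separated by a pipe character. Judging only from that example, the fields appear to be a language name, a script file, and an expression that constructs the task; that reading is an assumption of the sketch below, as are the class `ScriptedTaskDescriptorSketch` and its field names, which are not DSpace APIs.

```java
// Hypothetical helper, for illustration only: not part of DSpace.
class ScriptedTaskDescriptorSketch {

    final String engine;       // assumed: scripting language name, e.g. "ruby"
    final String scriptFile;   // assumed: script file name, e.g. "rubytask.rb"
    final String constructor;  // assumed: expression creating the task object

    ScriptedTaskDescriptorSketch(String engine, String scriptFile, String constructor) {
        this.engine = engine;
        this.scriptFile = scriptFile;
        this.constructor = constructor;
    }

    // Splits "engine|file|constructor" into its three parts.
    static ScriptedTaskDescriptorSketch parse(String descriptor) {
        // The pipe must be escaped because String.split takes a regular expression.
        String[] parts = descriptor.split("\\|", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields but found " + parts.length);
        }
        return new ScriptedTaskDescriptorSketch(parts[0].trim(), parts[1].trim(), parts[2].trim());
    }

    public static void main(String[] args) {
        ScriptedTaskDescriptorSketch d = parse("ruby|rubytask.rb|LinkChecker.new");
        System.out.println(d.engine + " / " + d.scriptFile + " / " + d.constructor);
    }
}
```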
...
DSpace 1.7 bundles a few tasks and activates two (2) by default to demonstrate the use of the curation system. These may be removed (deactivated by means of configuration) if desired without affecting system integrity. Each task is briefly described here.
This task does absolutely nothing. It is intended as a starting point for developers and administrators wishing to learn more about the curation system.
The task with the taskname 'profileformats' (in the admin UI it is labeled "Profile Bitstream Formats") examines all the bitstreams in an item and produces a table ("profile") which is assigned to the result string. It is activated by default, and is configured to display in the administrative UI. The result string has the layout:
...
This plugin requires a ClamAV daemon installed and configured for TCP sockets. Instructions for installing ClamAV are available at http://www.clamav.net/doc/latest/clamdoc.pdf.
NOTICE: The following directions assume there is a properly installed and configured ClamAV daemon. Refer to the link above for more information about ClamAV.
The Clam anti-virus database must be updated regularly to maintain the most current level of anti-virus protection. Please refer to the ClamAV documentation for instructions about maintaining the anti-virus database.
In [dspace]/config/modules/curate.cfg, activate the task:
...
Code Block |
---|
ui.tasknames = \
    profileformats = Profile Bitstream Formats, \
    requiredmetadata = Check for Required Metadata, \
    vscan = Scan for Viruses
|
...
[dspace]/config/modules
...
Code Block |
---|
# Change if not running on the same host as your DSpace installation.
service.host = 127.0.0.1
# Change if not using standard ClamAV port
service.port = 3310
# Change if longer timeout needed
socket.timeout = 120
# Change only if items have large numbers of bitstreams
scan.failfast = false
|
Curation tasks can be run against container and item DSpace objects by e-persons with administrative privileges. A curation tab will appear in the administrative UI after logging into DSpace:
...
If desired, virus scanning can be enabled as part of the submission process upload file step. In [dspace]/config/modules, edit configuration file submission-curation.cfg:
Code Block |
---|
virus-scan = true
|
To output the results to the console:
Code Block |
---|
[dspace]/bin/dspace curate -t vscan -i <handle of container or item dso> -r - |
...
GUI (Interactive Mode) | FailFast | Expectation |
Container | T | Stop on 1st Infected Bitstream |
Container | F | Stop on 1st Infected Item |
Item | T | Stop on 1st Infected Bitstream |
Item | F | Scan all bitstreams |
Command Line | | |
Container | T | Report on 1st infected bitstream within an item/Scan all contained Items |
Container | F | Report on all infected bitstreams/Scan all contained Items |
Item | T | Report on 1st infected bitstream |
Item | F | Report on all infected bitstreams |
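For a single item, the difference between stopping at the first infected bitstream and examining all of them comes down to an early exit from the scan loop. The following sketch models only that control flow; the class `FailFastScanSketch` is hypothetical, and the boolean array stands in for per-bitstream scan verdicts that a real task would obtain from the ClamAV daemon.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the fail-fast behavior, for illustration only.
class FailFastScanSketch {

    // Returns the indexes of the bitstreams reported as infected.
    // infected[i] stands in for the scan verdict of bitstream i.
    static List<Integer> scanItem(boolean[] infected, boolean failFast) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < infected.length; i++) {
            if (infected[i]) {
                hits.add(i);
                if (failFast) {
                    break; // stop at the first infected bitstream
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        boolean[] verdicts = {false, true, false, true};
        System.out.println("failfast=true:  " + scanItem(verdicts, true));
        System.out.println("failfast=false: " + scanItem(verdicts, false));
    }
}
```

With many bitstreams per item, the early exit saves scanning time at the cost of a less complete report.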
Two link checker tasks, BasicLinkChecker and MetadataValueLinkChecker, can be used to check for broken or unresolvable links appearing in item metadata.
These tasks are intended as prototypes / examples for developers and administrators who are new to the curation system.
These tasks are not configurable.
BasicLinkChecker iterates over all metadata fields ending in "uri" (e.g. dc.relation.uri, dc.identifier.uri, dc.source.uri ...), attempts a GET to the value of the field, and checks for a 200 OK response.
Results are reported in a simple "one row per link" format.
MetadataValueLinkChecker parses all metadata fields for valid HTTP URLs, attempts a GET to those URLs, and checks for a 200 OK response.
Results are reported in a simple "one row per link" format.
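The checking logic described for both tasks can be approximated with the JDK's `HttpURLConnection`. This is a sketch under stated assumptions and not the tasks' actual source: the class `LinkCheckSketch`, the five-second timeouts, the use of -1 for "no response", and the row layout are all invented for illustration.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, for illustration only: not the DSpace link checker code.
class LinkCheckSketch {

    // Selects the fields a BasicLinkChecker-style task would look at.
    static boolean isUriField(String field) {
        return field.endsWith("uri");
    }

    // Issues a GET and returns the HTTP status, or -1 if no response was obtained.
    static int fetchStatus(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000); // assumed timeout, in milliseconds
            conn.setReadTimeout(5000);    // assumed timeout, in milliseconds
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (Exception e) {
            return -1;
        }
    }

    // Formats one row per link; the layout is an assumption of this sketch.
    static String reportRow(String url, int status) {
        return (status == 200 ? "OK" : "FAILED") + " [" + status + "] " + url;
    }

    public static void main(String[] args) {
        String field = "dc.identifier.uri";
        String value = args.length > 0 ? args[0] : "http://example.org/";
        if (isUriField(field)) {
            System.out.println(reportRow(value, fetchStatus(value)));
        }
    }
}
```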
Microsoft Translator uses the Microsoft Translate API to translate metadata values from one source language into one or more target languages.
This task can be configured to process particular fields, and to use a default language if no authoritative language for an item can be found. A Bing API v2 key is needed.
MicrosoftTranslator extends the more generic AbstractTranslator. This now seems wasteful, but a GoogleTranslator had also been written to extend AbstractTranslator. Unfortunately, Google has announced that it is decommissioning its free Translate API service, so that task hasn't been included in DSpace's general set of curation tasks.
Translated fields are added in addition to any existing fields, with the target language code in the 'language' column. This means that running a task multiple times over one item with the same configuration could result in duplicate metadata.
This task is intended as a prototype / example for developers and administrators who are new to the curation system.
An example configuration file can be found in [dspace]/config/modules/translator.cfg.
Code Block |
---|
#---------------------------------------------------------------#
#----------TRANSLATOR CURATION TASK CONFIGURATIONS--------------#
#---------------------------------------------------------------#
# Configuration properties used solely by MicrosoftTranslator #
# Curation Task (uses Microsoft Translation API v2) #
#---------------------------------------------------------------#
## Translation field settings
##
## Authoritative language field
## This will be read to determine the original language an item was submitted in
## Default: dc.language
translate.field.language = dc.language
## Metadata fields you wish to have translated
#
translate.field.targets = dc.description.abstract, dc.title, dc.type
## Translation language settings
##
## If the language field configured in translate.field.language is not present
## in the record, set translate.language.default to a default source language
## or leave blank to use autodetection
#
translate.language.default = en
## Target languages for translation
#
translate.language.targets = de, fr
## Translation API settings
##
## Your Bing API v2 key and/or Google "Simple API Access" Key
## (note to Google users: your v1 API key will not work with Translate v2,
## you will need to visit https://code.google.com/apis/console and activate
## a Simple API Access key)
##
## You do not need to enter a key for both services.
#
translate.api.key.microsoft = YOUR_MICROSOFT_API_KEY_GOES_HERE
translate.api.key.google = YOUR_GOOGLE_API_KEY_GOES_HERE
|