...

For CS to run a task, the code for the task must of course be included with other deployed code (to [dspace]/lib, WAR, etc) but it must also be declared and given a name. This is done via a configuration property in [dspace]/config/modules/curate.cfg as follows:

Code Block
### Task Class implementations
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.NoOpCurationTask = noop
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.ProfileFormats = profileformats
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.RequiredMetadata = requiredmetadata
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.ClamScan = vscan
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.MicrosoftTranslator = translate
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.MetadataValueLinkChecker = checklinks

...

Each of the above pages exposes a drop-down list of configured tasks, with a button to 'perform' the task, or queue it for later operation (see section below). Not all activated tasks need appear in the Curate tab - you filter them by means of a configuration property. This property also permits you to assign to the task a more user-friendly name than the PluginManager taskname. The property resides in [dspace]/config/modules/curate.cfg:

Code Block
curate.ui.tasknames = profileformats = Profile Bitstream Formats
curate.ui.tasknames = requiredmetadata = Check for Required Metadata

When a task is selected from the drop-down list and performed, the tab displays both a phrase interpreting the "status code" of the task execution, and the "result" message if any has been defined. When the task has been queued, an acknowledgement appears instead. You may configure the words used for status codes in curate.cfg (for clarity, language localization, etc):

Code Block
curate.ui.statusmessages = -3 = Unknown Task
curate.ui.statusmessages = -2 = No Status Set
curate.ui.statusmessages = -1 = Error
curate.ui.statusmessages = 0 = Success
curate.ui.statusmessages = 1 = Fail
curate.ui.statusmessages = 2 = Skip
curate.ui.statusmessages = other = Invalid Status
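
As a rough illustration of how such a status table could be consulted, the sketch below maps a numeric status code to its configured phrase. The class name StatusMessages is hypothetical and is not part of DSpace, and the sketch assumes that the "other" entry acts as the fallback for any code that is not listed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: StatusMessages is a hypothetical helper, not a DSpace class.
final class StatusMessages {
    private final Map<String, String> messages = new LinkedHashMap<>();

    // Each entry has the form "<code> = <message>", mirroring curate.ui.statusmessages.
    StatusMessages(String... entries) {
        for (String entry : entries) {
            int eq = entry.indexOf('=');
            if (eq < 0) {
                continue; // skip malformed entries
            }
            messages.put(entry.substring(0, eq).trim(), entry.substring(eq + 1).trim());
        }
    }

    // Returns the phrase for a status code; assumes "other" is the fallback entry.
    String phraseFor(int statusCode) {
        String phrase = messages.get(Integer.toString(statusCode));
        return phrase != null ? phrase : messages.getOrDefault("other", "");
    }

    public static void main(String[] args) {
        StatusMessages m = new StatusMessages(
                "-1 = Error", "0 = Success", "1 = Fail", "other = Invalid Status");
        System.out.println(m.phraseFor(0));
        System.out.println(m.phraseFor(42));
    }
}
```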

As the number of tasks configured for a system grows, a simple drop-down list of all tasks may become too cluttered or large. DSpace 1.8+ provides a way to address this issue, known as task groups. A task group is a simple collection of tasks that the Admin UI will display in a separate drop-down list. You may define as many or as few groups as you please. If no groups are defined, then all tasks that are listed in the ui.tasknames property will appear in a single drop-down list. If at least one group is defined, then the admin UI will display two drop-down lists. The first is the list of task groups, and the second is the list of task names associated with the selected group. A few key points to keep in mind when setting up task groups:

...

Code Block
# ui.taskgroups contains the list of defined groups, together with a pretty name for UI display
curate.ui.taskgroups = replication = Backup and Restoration Tasks
curate.ui.taskgroups = integrity = Metadata Integrity Tasks
.....
# each group membership list is a separate property, whose value is comma-separated list of logical task names
curate.ui.taskgroup.integrity = profileformats, requiredmetadata
....

...

Code Block
Curator curator = new Curator();
curator.addTask("vscan").queue(context, "monthly", "123456789/4");

...

Code Block
host = configurationService.getProperty("clamav.service.host");

and similar. But tasks are supposed to be written by anyone in the community and shared around (without prior coordination), so if another task uses the same configuration file name, there is a name collision here that can't be easily fixed, since the reference is hard-coded in each task. In this case, if we wanted to use both at a given site, we would have to alter the source of one of them - which introduces needless code localization and maintenance.

...

Code Block
host = taskProperty("clamav.service.host");

Note that there is no name of the configuration file even mentioned, just the property name whose value we want. At runtime, the curation system resolves this call to a set of configuration properties, and it uses the name under which the task has been configured as the prefix of those properties. So, for example, if both were installed (in, say, curate.cfg) as:

Code Block
org.dspace.ctask.general.ClamAv = vscan,
org.community.ctask.ConflictTask = virusscan,
....

then "taskProperty("foo")" will resolve to the property named vscan.foo when called from the ClamAv task, but to virusscan.foo when called from ConflictTask's code. Note that "vscan" etc. are locally assigned names, so we can always prevent the "collisions" mentioned, and we make the tasks much more portable, since we remove the "hard-coding" of config names.
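
The prefixing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not DSpace's implementation: TaskPropertyResolver is a hypothetical class, and a plain java.util.Properties object stands in for the DSpace configuration.

```java
import java.util.Properties;

// Illustrative only: a hypothetical stand-in for the lookup the curation
// system performs; this is not DSpace's own code.
final class TaskPropertyResolver {
    private final Properties config;
    private final String taskName; // locally assigned name, e.g. "vscan"

    TaskPropertyResolver(Properties config, String taskName) {
        this.config = config;
        this.taskName = taskName;
    }

    // Resolves "foo" to "<taskName>.foo"; returns null when the property is unset.
    String taskProperty(String name) {
        return config.getProperty(taskName + "." + name);
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("vscan.foo", "value for ClamAv");
        config.setProperty("virusscan.foo", "value for ConflictTask");

        System.out.println(new TaskPropertyResolver(config, "vscan").taskProperty("foo"));
        System.out.println(new TaskPropertyResolver(config, "virusscan").taskProperty("foo"));
    }
}
```

Because the prefix comes from the locally assigned task name rather than from the task source, two tasks asking for the same property name never read each other's values.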

...

Another use of task properties is to support multiple task profiles. Suppose we have a task that we want to operate in one of two modes. A good example would be a mediafilter task that produces a thumbnail. We can either create one if it doesn't exist, or run with "-force" which will create one regardless. Suppose this behavior was controlled by a property in a config file. If we configured the task as "thumbnail", then we would have in (perhaps) [dspace]/config/modules/thumbnail.cfg:

Code Block
...other properties...
thumbnail.thumbnail.maxheight = 80
thumbnail.thumbnail.maxwidth = 80
thumbnail.forceupdate=false

Then, following the pattern above, the thumbnail generating task code would look like:

...

But an obvious use-case would be to want to run force mode and non-force mode from the admin UI on different occasions. To do this, one would have to stop Tomcat, change the property value in the config file, and restart, etc. However, we can use task properties to elegantly rescue us here. All we need to do is go into the config/modules directory, and create a new file perhaps called: thumbnail.force.cfg. In this file, we put the properties:

Code Block
thumbnail.force.thumbnail.maxheight = 80
thumbnail.force.thumbnail.maxwidth = 80
thumbnail.force.forceupdate=true

Then we add a new task (really just a new name, no new code) in curate.cfg:

Code Block
org.dspace.ctask.general.ThumbnailTask = thumbnail,
org.dspace.ctask.general.ThumbnailTask = thumbnail.force

Consider what happens: when we perform the task "thumbnail" (using taskProperties), it uses the thumbnail.* properties and operates in "non-force" profile (since the value is false), but when we run the task "thumbnail.force" the curation system uses the thumbnail.force.* properties. Notice that we did all this via local configuration - we have not had to touch the source code at all to obtain as many "profiles" as we would like.
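
To make the two profiles concrete, here is a small sketch of one piece of task code reading "forceupdate" under two different task names. ThumbnailProfileDemo and its methods are hypothetical, a java.util.Properties object stands in for the DSpace configuration, and the default of false for a missing property is an assumption of this sketch.

```java
import java.util.Properties;

// Illustrative only: one (hypothetical) task implementation, two configured names.
final class ThumbnailProfileDemo {
    // Reads "<taskName>.forceupdate"; assumes false when the property is absent.
    static boolean forceUpdate(Properties config, String taskName) {
        return Boolean.parseBoolean(config.getProperty(taskName + ".forceupdate", "false"));
    }

    // Mirrors the forceupdate values from the two property sets shown above.
    static Properties sampleConfig() {
        Properties config = new Properties();
        config.setProperty("thumbnail.forceupdate", "false");
        config.setProperty("thumbnail.force.forceupdate", "true");
        return config;
    }

    public static void main(String[] args) {
        Properties config = sampleConfig();
        System.out.println(forceUpdate(config, "thumbnail"));
        System.out.println(forceUpdate(config, "thumbnail.force"));
    }
}
```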

...

Support for scripted tasks does not include any DSpace pre-installation of the scripting language itself - this must be done according to the instructions provided by the language maintainers, and typically only requires a few additional jars on the DSpace classpath. Once one or more languages have been installed into the DSpace deployment, task support is fairly straightforward. One new property must be defined in [dspace]/config/modules/curate.cfg:

Code Block
curate.script.dir = ${dspace.dir}/scripts

...

DSpace item metadata can contain any number of identifiers or other field values that participate in networked information systems. For example, an item may include a DOI which is a controlled identifier in the DOI registry. Many web services exist to leverage these values, by using them as 'keys' to retrieve other useful data. In the DOI case for example, CrossRef provides many services that given a DOI will return author lists, citations, etc. The MetadataWebService task enables the use of such services, and allows you to obtain and (optionally) add to DSpace metadata the results of any web service call to any service provider. You simply need to describe what service you want to call, and what to do with the results. Using the task code ([taskcode]), you can create as many distinct tasks as you have services you want to call.

Each task description lives in a configuration file in 'config/modules' (or in your local.cfg), and is a simple properties file, like all other DSpace configuration files. The name of the configuration file is the task name you assign to it (see Configuration Reference). All of the settings associated with a given task should be prepended with the task name (as assigned in config/modules/curate.cfg). For example, if the task name is issn2pubname in curate.cfg, then all settings should start with "issn2pubname.". Your settings can either be set in your local.cfg, or in a new configuration file which is included (include = path/to/new/file.cfg) into either your local.cfg or the dspace.cfg. See the Configuration Reference for examples of including configuration files, or modifying your local.cfg.

There are a few required properties you must configure for any service, and for certain services, a few additional ones. An example will illustrate best.

...

Suppose items (holding journal articles) include 'dc.identifier.issn' when available. We might also want to catalog the publisher name (in 'dc.publisher'). The cataloger could look up the name given the ISSN in various sources, but this 'research' is tedious, costly and error-prone. There are many good quality, free web services that can furnish this information. So we will configure a MetadataWebService task to call a service, and then automatically assign the publisher name to the item metadata. As noted above, all that is needed is a description of the service, and what to do with the results. Create a new file in 'config/modules' called 'issn2pubname.cfg' (or whatever is mnemonically useful to you). The first property in this file describes the service in a 'template'. The template is just the URL to call the web service, with parameters to substitute values in. Here we will use the 'Sherpa/Romeo' service:

Code Block
[taskcode].template=http://www.sherpa.ac.uk/romeo/api29.php?issn={dc.identifier.issn}
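
As a rough sketch of how a template of this kind can be filled in, the code below replaces each '{field}' placeholder with the first value of that metadata field. TemplateFiller is a hypothetical helper, not the MetadataWebService code; the metadata is modelled as a plain map, and leaving a placeholder untouched when the item has no value for it is an assumption of this sketch. It uses Matcher.appendReplacement with a StringBuilder, which requires Java 9 or later.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical helper, not part of DSpace.
final class TemplateFiller {
    // Matches "{some.field.name}" and captures the field name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^}]+)\\}");

    // Replaces each {field} with the first value of that field.
    // Assumption: placeholders without a value are left as they are.
    static String fill(String template, Map<String, List<String>> metadata) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            List<String> values = metadata.get(m.group(1));
            String replacement = (values == null || values.isEmpty()) ? m.group() : values.get(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "http://www.sherpa.ac.uk/romeo/api29.php?issn={dc.identifier.issn}";
        // The ISSN values here are made-up sample data.
        Map<String, List<String>> metadata =
                Map.of("dc.identifier.issn", List.of("1234-5678", "8765-4321"));
        System.out.println(fill(template, metadata));
    }
}
```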

When the task runs, it will replace '{dc.identifier.issn}' with the value of that field in the item. If the field has multiple values, the first one will be used. As a web service, the call to the above URL will return an XML document containing information (including the publisher name) about that ISSN. We need to describe what to do with this response document, i.e. what elements we want to extract, and what to do with the extracted content. This description is encoded in a property called the 'datamap'. Using the example service above we might have:

Code Block
[taskcode].datamap=//publisher/name=>dc.publisher,//romeocolor

...

The third part (here 'dc.publisher') is simply the name of the metadata field to be updated. These two mandatory properties (template and datamap) are sufficient to describe a large number of web services. All that is required to enable this task is to edit 'config/modules/curate.cfg' (or your local.cfg), and add 'issn2pubname' to the list of tasks:

Code Block
... other defined tasks
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.MetadataWebService = issn2pubname
... other metadata web service tasks
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.MetadataWebService = doi2crossref

If you wish the task to be available in the Admin UI, see the Invocation from the Admin UI documentation (above) about how to configure it. The remaining sections describe some more specialized needs using the MetadataWebService task.

...

For some web services, protocol and other information is expressed not in the service URL, but in HTTP headers. Examples might be HTTP basic auth tokens, or requests for a particular media type response. In these cases, simply add a property to the configuration file (our example was 'issn2pubname.cfg') containing all headers you wish to transmit to the service:

Code Block
[taskcode].headers=Accept: application/xml||Cache-Control: no-cache

You can specify any number of headers; just separate them with a 'double-pipe' ('||'). Ensure that any commas in values are escaped (with backslash comma, i.e. '\,').

Transformations

One potential problem with the simple parameter substitutions performed by the task is that the service might expect a different format or expression of a value than the way it is stored in the item metadata. For example, a DOI service might expect a bare prefix/suffix notation ('10.000/12345'), whereas the DSpace metadata field might have a URI representation ('http://dx.doi.org/10.000/12345'). In these cases one can declare a 'transformation' of a value in the template. For example:

Code Block
[taskcode].template=http://www.crossref.org/openurl/?id={doi:dc.relation.isversionof}&format=unixref

The 'doi:' prepended to the metadata field name declares that the value of the 'dc.relation.isversionof' field should be transformed before the substitution into the template using a transformation named 'doi'.  The transformation is itself defined in the same configuration file as follows:

Code Block
[taskcode].transform.doi=match 10. trunc 60

...

When the task is run, if the transformation results in an invalid state (e.g. cutting more characters than there are in the value), the un-transformed value will be used and the condition will be logged.  Transformations may also be applied to values returned from the web service. That is, one can apply the transformation to a value before assigning it to a metadata field. In this case, the declaration occurs in the datamap property, not the template:

Code Block
[taskcode].datamap=//publisher/name=>shorten:dc.publisher,//romeocolor

...

Normally a task result string appears in a window in the admin UI after it has been invoked. The MetadataWebService task will concatenate all the values declared in the 'datamap' property and place them in the result string using the format: 'name:value name:value' for as many values as declared. In the example above we would get a string like 'publisher: Nature romeocolor: green'. This format is fine for simple display purposes, but can be tricky if the values contain spaces. You can override the space separator using an optional property 'separator' (put in the config file, with all other properties). If you use:

Code Block
[taskcode].separator=||

for example, it becomes easy to parse the result string and preserve spaces in the values. This use of the result string can be very powerful, since you are essentially creating a map of returned values, which can then be used to populate a user interface, or any other way you wish to exploit the data (drive a workflow, etc).
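
A minimal sketch of such parsing is shown below, assuming the result string has the form 'name:value' pairs joined by the configured separator. ResultStringParser is a hypothetical helper, and the exact whitespace around the colon in real task output is an assumption, so the sketch trims both sides.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical helper, not part of DSpace.
final class ResultStringParser {
    // Splits "name:value<sep>name:value" into an ordered name -> value map.
    // Only the first colon in each pair separates name from value.
    static Map<String, String> parse(String result, String separator) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String pair : result.split(Pattern.quote(separator))) {
            int colon = pair.indexOf(':');
            if (colon <= 0) {
                continue; // skip fragments without a name
            }
            values.put(pair.substring(0, colon).trim(), pair.substring(colon + 1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample result string; the values are made up for illustration.
        Map<String, String> values =
                parse("publisher:Nature Publishing Group||romeocolor:green", "||");
        System.out.println(values.get("publisher"));
        System.out.println(values.get("romeocolor"));
    }
}
```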

...

A few limitations should be noted. First, since the response parsing utilizes XPath, the service can only operate on XML (not JSON) response documents. Most web services can provide either, so this should not be a major obstacle. The MetadataWebService can be used in many ways: showing an admin a value in the result string in a UI, run in a batch to update a set of items, etc. One excellent configuration is to wire these tasks into submission workflow, so that 'automatic cataloging' of many fields can be performed on ingest.

...

In [dspace]/config/modules/curate.cfg, activate the task:

  • Add the plugin to the comma separated list of curation tasks.
Code Block
### Task Class implementations
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.NoOpCurationTask = noop
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.ProfileFormats = profileformats
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.RequiredMetadata = requiredmetadata
# This is the ClamAV scanner plugin
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.ClamScan = vscan
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.MicrosoftTranslator = translate
plugin.named.org.dspace.curate.CurationTask = org.dspace.ctask.general.MetadataValueLinkChecker = checklinks
  • Optionally, add the vscan friendly name to the configuration to enable it in the administrative user interface.
Code Block
curate.ui.tasknames = profileformats = Profile Bitstream Formats
curate.ui.tasknames = requiredmetadata = Check for Required Metadata
curate.ui.tasknames = checklinks = Check Links in Metadata
# Enable ClamAV from UI
curate.ui.tasknames = vscan = Scan for Viruses
  • In [dspace]/config/modules, edit configuration file clamav.cfg:
Code Block
clamav.service.host = 127.0.0.1
# Change if not running on the same host as your DSpace installation.
clamav.service.port = 3310
# Change if not using standard ClamAV port
clamav.socket.timeout = 120
# Change if longer timeout needed
clamav.scan.failfast = false
# Change only if items have large numbers of bitstreams
  • Finally, if desired, virus scanning can be enabled as part of the submission process upload file step. In [dspace]/config/modules, edit configuration file submission-curation.cfg:
Code Block
submission-curation.virus-scan = true

Task Operation from the Administrative user interface

...

If desired virus scanning can be enabled as part of the submission process upload file step. In [dspace]/config/modules, edit configuration file submission-curation.cfg:

Code Block
submission-curation.virus-scan = true

Task Operation from the curation command line client

...

Table 1 – Virus Scan Results Table

GUI (Interactive Mode)   FailFast   Expectation
Container                T          Stop on 1st Infected Bitstream
Container                F          Stop on 1st Infected Item
Item                     T          Stop on 1st Infected Bitstream
Item                     F          Scan all bitstreams

Command Line             FailFast   Expectation
Container                T          Report on 1st infected bitstream within an item/Scan all contained Items
Container                F          Report on all infected bitstreams/Scan all contained Items
Item                     T          Report on 1st infected bitstream
Item                     F          Report on all infected bitstreams
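
The Item rows of Table 1 can be read as a simple loop: with FailFast set, scanning stops at the first infected bitstream; otherwise every bitstream is scanned and all infections are reported. The sketch below only illustrates that reading. FailFastScan is a hypothetical class, bitstreams are represented by their names, and the isInfected predicate stands in for the real call to the ClamAV service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: a hypothetical model of the Item rows in Table 1.
final class FailFastScan {
    // Returns the names of the bitstreams reported as infected.
    // With failFast, scanning stops after the first infected bitstream.
    static List<String> scanItem(List<String> bitstreams,
                                 Predicate<String> isInfected,
                                 boolean failFast) {
        List<String> infected = new ArrayList<>();
        for (String bitstream : bitstreams) {
            if (isInfected.test(bitstream)) {
                infected.add(bitstream);
                if (failFast) {
                    break; // do not scan the remaining bitstreams
                }
            }
        }
        return infected;
    }

    public static void main(String[] args) {
        List<String> bitstreams = List.of("a.pdf", "b.exe", "c.doc");
        // Stand-in for the scanner: treat everything except a.pdf as infected.
        Predicate<String> isInfected = name -> !name.equals("a.pdf");
        System.out.println(scanItem(bitstreams, isInfected, true));
        System.out.println(scanItem(bitstreams, isInfected, false));
    }
}
```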

Link Checkers

Two link checker tasks, BasicLinkChecker and MetadataValueLinkChecker, can be used to check for broken or unresolvable links appearing in item metadata.

...

An example configuration file can be found in [dspace]/config/modules/translator.cfg.

Code Block
#---------------------------------------------------------------#
#----------TRANSLATOR CURATION TASK CONFIGURATIONS--------------#
#---------------------------------------------------------------#
# Configuration properties used solely by MicrosoftTranslator   #
# Curation Task (uses Microsoft Translation API v2)             #
#---------------------------------------------------------------#
## Translation field settings
##
## Authoritative language field
## This will be read to determine the original language an item was submitted in
## Default: dc.language

translator.field.language = dc.language

## Metadata fields you wish to have translated
#
translator.field.targets = dc.description.abstract, dc.title, dc.type

## Translation language settings
##
## If the language field configured in translator.field.language is not present
## in the record, set translator.language.default to a default source language
## or leave blank to use autodetection
#
translator.language.default = en

## Target languages for translation
#
translator.language.targets = de, fr

## Translation API settings
##
## Your Bing API v2 key and/or Google "Simple API Access" Key
## (note to Google users: your v1 API key will not work with Translate v2,
## you will need to visit https://code.google.com/apis/console and activate
## a Simple API Access key)
##
## You do not need to enter a key for both services.
#
translator.api.key.microsoft = YOUR_MICROSOFT_API_KEY_GOES_HERE
translator.api.key.google = YOUR_GOOGLE_API_KEY_GOES_HERE

...