As of release 1.7, DSpace supports running curation tasks, which are described in this section. DSpace 1.7 and subsequent distributions will bundle (include) several useful tasks, but the system also is designed to allow new tasks to be added between releases, both general purpose tasks that come from the community, and locally written and deployed tasks.
The goal of the curation system ('CS') is to provide a simple, extensible way to manage routine content operations on a repository. These operations are known to CS as 'tasks', and they can operate on any DSpaceObject (i.e. subclasses of DSpaceObject) - which means Communities, Collections, and Items - viz. core data model objects. Tasks may elect to work on only one type of DSpace object - typically an Item - and in this case they may simply ignore other data types (tasks have the ability to 'skip' objects for any reason). The DSpace core distribution will provide a number of useful tasks, but the system is designed to encourage local extension - tasks can be written for any purpose, and placed in any java package. This gives DSpace sites the ability to customize the behavior of their repository without having to alter - and therefore manage synchronization with - the DSpace source code. What sorts of activities are appropriate for tasks?
Some examples:
Since tasks have access to, and can modify, DSpace content, performing tasks is considered an administrative function to be available only to knowledgeable collection editors, repository administrators, sysadmins, etc. No tasks are exposed in the public interfaces.
For CS to run a task, the code for the task must of course be included with other deployed code (to [dspace]/lib, WAR, etc) but it must also be declared and given a name. This is done via a configuration property in [dspace]/config/modules/curate.cfg as follows:
plugin.named.org.dspace.curate.CurationTask = \
    org.dspace.curate.ProfileFormats = profileformats, \
    org.dspace.curate.RequiredMetadata = requiredmetadata, \
    org.dspace.curate.ClamScan = vscan
For each activated task, a key-value pair is added. The key is the fully qualified class name and the value is the taskname used elsewhere to configure the use of the task, as will be seen below. Note that the curate.cfg configuration file, while in the config directory, is located under 'modules'. The intent is that tasks, as well as any configuration they require, will be optional 'add-ons' to the basic system configuration. Adding or removing tasks has no impact on dspace.cfg.
For many tasks, this activation configuration is all that is required to use them. Other tasks need specific configuration of their own. A concrete example is described below, but note that these task-specific configuration property files also reside in [dspace]/config/modules.
A task is just a java class that can contain arbitrary code, but it must have 2 properties:
First, it must provide a no argument constructor, so it can be loaded by the PluginManager. Thus, all tasks are 'named' plugins, with the taskname being the plugin name.
Second, it must implement the interface 'org.dspace.curate.CurationTask'.
The CurationTask interface is almost a 'tagging' interface, and only requires a few very high-level methods be implemented. The most significant is:
int perform(DSpaceObject dso);
The return value should be a code describing one of 4 conditions:
If a task extends the AbstractCurationTask class, that is the only method it needs to define.
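As an illustration of the two requirements above, here is a minimal sketch of a task. It is self-contained so that it compiles without DSpace: `DSpaceObject`, `Item` and `CurationTask` are reduced stand-ins for the real types, and `HasBitstreamsTask` and `bitstreamCount()` are hypothetical names invented for this example. The status values are the ones listed later in this section.

```java
// Hypothetical task: an item passes when it holds at least one bitstream.
class HasBitstreamsTask implements CurationTask {
    // Status values as listed in the status code discussion of this document.
    static final int SUCCESS = 0;
    static final int FAIL = 1;
    static final int SKIP = 2;

    // The implicit no-argument constructor lets a plugin loader instantiate the task.

    @Override
    public int perform(DSpaceObject dso) {
        if (!(dso instanceof Item)) {
            return SKIP; // communities and collections are not handled here
        }
        return ((Item) dso).bitstreamCount() > 0 ? SUCCESS : FAIL;
    }

    public static void main(String[] args) {
        Item empty = new Item() {
            public String getHandle() { return "123456789/4"; }
            public int bitstreamCount() { return 0; }
        };
        System.out.println("status = " + new HasBitstreamsTask().perform(empty));
    }
}

// Stand-ins so the sketch compiles without DSpace on the classpath.
// The real types are org.dspace.content.DSpaceObject and
// org.dspace.curate.CurationTask, which declare more than is shown here.
interface DSpaceObject {
    String getHandle();
}

interface Item extends DSpaceObject {
    int bitstreamCount(); // hypothetical accessor, for illustration only
}

interface CurationTask {
    int perform(DSpaceObject dso);
}
```

A real task would import the DSpace types instead of declaring stand-ins, and would typically extend AbstractCurationTask so that perform is the only method left to write.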
Tasks are invoked using CS framework classes that manage a few details (to be described below), and this invocation can occur wherever needed, but CS offers great versatility 'out of the box':
A simple tool 'CurationCli' provides access to CS via the command line. This tool bears the name 'curate' in the DSpace launcher. For example, to perform a virus check on collection '4':
[dspace]/bin/dspace curate -t vscan -i 123456789/4
The complete list of arguments:
-t taskname: name of task to perform
-T filename: name of file containing list of tasknames
-e epersonID: (email address) will be superuser if unspecified
-i identifier: Id of object to curate. May be (1) a handle (2) a workflow Id or (3) 'all' to operate on the whole repository
-q queue: name of queue to process - -i and -q are mutually exclusive
-v emit verbose output
-r - emit reporting to standard out
As with other command-line tools, these invocations could be placed in a cron table and run on a fixed schedule, or run on demand by an administrator.
In the XMLUI, there is a 'Curate' tab (appearing within the 'Edit Community/Collection/Item') that exposes a drop-down list of configured tasks, with a button to 'perform' the task, or queue it for later operation (see section below). Not all activated tasks need appear in the Curate tab - you filter them by means of a configuration property. This property also permits you to assign to the task a more user-friendly name than the PluginManager taskname. The property resides in [dspace]/config/modules/curate.cfg:
ui.tasknames = \
    profileformats = Profile Bitstream Formats, \
    requiredmetadata = Check for Required Metadata
When a task is selected from the drop-down list and performed, the tab displays both a phrase interpreting the 'status code' of the task execution, and the 'result' message if any has been defined. When the task has been queued, an acknowledgement appears instead. You may configure the words used for status codes in curate.cfg (for clarity, language localization, etc.):
ui.statusmessages = \
    -3 = Unknown Task, \
    -2 = No Status Set, \
    -1 = Error, \
    0 = Success, \
    1 = Fail, \
    2 = Skip, \
    other = Invalid Status
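The lookup this property implies, including the 'other' fallback, might be sketched as follows. This is not the DSpace UI code: `StatusMessagesSketch` and `labelFor` are hypothetical names, and the input is assumed to be the property value after its line continuations have been joined.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not DSpace code: resolves a status code to its label.
class StatusMessagesSketch {
    private final Map<String, String> labels = new HashMap<>();

    // Accepts the joined property value, e.g. "-1 = Error, 0 = Success, other = Invalid Status".
    StatusMessagesSketch(String propertyValue) {
        for (String entry : propertyValue.split(",")) {
            String[] pair = entry.split("=", 2);
            if (pair.length == 2) {
                labels.put(pair[0].trim(), pair[1].trim());
            }
        }
    }

    // Falls back to the 'other' entry when the code has no label of its own.
    String labelFor(int statusCode) {
        String label = labels.get(Integer.toString(statusCode));
        return label != null ? label : labels.get("other");
    }

    public static void main(String[] args) {
        String value = "-3 = Unknown Task, -2 = No Status Set, -1 = Error, "
                + "0 = Success, 1 = Fail, 2 = Skip, other = Invalid Status";
        StatusMessagesSketch messages = new StatusMessagesSketch(value);
        System.out.println(messages.labelFor(0)); // Success
        System.out.println(messages.labelFor(7)); // Invalid Status
    }
}
```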
CS provides the ability to attach any number of tasks to standard DSpace workflows. Using a configuration file [dspace]/config/workflow-curation.xml, you can declaratively (without coding) wire tasks to any step in a workflow. An example:
<taskset-map>
    <mapping collection-handle="default" taskset="cautious" />
</taskset-map>
<tasksets>
    <taskset name="cautious">
        <flowstep name="step1">
            <task name="vscan">
                <workflow>reject</workflow>
                <notify on="fail">$flowgroup</notify>
                <notify on="fail">$colladmin</notify>
                <notify on="error">$siteadmin</notify>
            </task>
        </flowstep>
    </taskset>
</tasksets>
This markup would cause a virus scan to occur during step one of workflow for any collection, and automatically reject any submissions with infected files. It would further notify (via email) both the reviewers (step 1 group), and the collection administrators, if either of these are defined. If it could not perform the scan, the site administrator would be notified.
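To show how the nesting of taskset, flowstep and task reads, the sketch below parses a shortened copy of the markup with the JDK's DOM parser and prints one line per task. It is not how DSpace itself loads the file. The class name `WorkflowCurationSketch` and the wrapping `<root>` element are assumptions made only so the fragment forms a single well-formed document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader, for illustration only.
class WorkflowCurationSketch {

    // Shortened copy of the example; <root> is an assumed wrapper element.
    static final String SAMPLE = "<root>"
            + "<taskset-map><mapping collection-handle=\"default\" taskset=\"cautious\"/></taskset-map>"
            + "<tasksets><taskset name=\"cautious\"><flowstep name=\"step1\">"
            + "<task name=\"vscan\"><workflow>reject</workflow>"
            + "<notify on=\"fail\">$flowgroup</notify></task>"
            + "</flowstep></taskset></tasksets>"
            + "</root>";

    // Returns "taskset/flowstep/task -> workflow action" for every task element.
    static List<String> describe(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
        List<String> lines = new ArrayList<>();
        NodeList tasks = doc.getElementsByTagName("task");
        for (int i = 0; i < tasks.getLength(); i++) {
            Element task = (Element) tasks.item(i);
            Element step = (Element) task.getParentNode();
            Element set = (Element) step.getParentNode();
            NodeList actions = task.getElementsByTagName("workflow");
            String action = actions.getLength() > 0
                    ? actions.item(0).getTextContent().trim() : "(none)";
            lines.add(set.getAttribute("name") + "/" + step.getAttribute("name")
                    + "/" + task.getAttribute("name") + " -> " + action);
        }
        return lines;
    }

    public static void main(String[] args) {
        describe(SAMPLE).forEach(System.out::println);
    }
}
```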
The notifications use the same procedures that other workflow notifications do - namely email. There is a new email template defined for curation task use: [dspace]/config/emails/flowtask_notify. This may be language-localized or otherwise modified like any other email template.
Like configurable submission, you can assign these task rules per collection, as well as having a default for any collection.
If these pre-defined ways are not sufficient, you can of course manage curation directly in your code. You would use the CS helper classes. For example:
Collection coll = (Collection)HandleManager.resolveToObject(context, "123456789/4");
Curator curator = new Curator();
curator.addTask("vscan").curate(coll);
System.out.println("Result: " + curator.getResult("vscan"));
would do approximately what the command line invocation did. The method 'curate' just performs all the tasks configured (you can add multiple tasks to a curator).
Because some tasks may consume a fair amount of time, it may not be desirable to run them in an interactive context. CS provides a simple API and means to defer task execution, by a queuing system. Thus, using the previous example:
Curator curator = new Curator();
curator.addTask("vscan").queue(context, "monthly", "123456789/4");
would place a request on a named queue "monthly" to virus scan the collection. To read (and process) the queue, we could for example use the command-line tool:
[dspace]/bin/dspace curate -q monthly
but we could also read the queue programmatically. Any number of queues can be defined and used as needed.
In the administrative UI curation 'widget', it is possible both to perform a task and to place it on a queue for later processing.
Few assumptions are made by CS about what the 'outcome' of a task may be (if any) - it could, e.g., produce a report to a temporary file, it could modify DSpace content silently, etc. But the CS runtime does provide a few pieces of information whenever a task is performed:
The status code was mentioned above. It is returned to CS whenever a task is called. The complete list of values:
-3 NOTASK - CS could not find the requested task
-2 UNSET - task did not return a status code because it has not yet run
-1 ERROR - task could not be performed
0 SUCCESS - task performed successfully
1 FAIL - task performed, but failed
2 SKIP - task not performed due to object not being eligible
In the administrative UI, this code is translated into the word or phrase configured by the ui.statusmessages property (discussed above) for display.
The task may define a string indicating details of the outcome. This result is displayed in the 'curation widget' described above:
"Virus 12312 detected on Bitstream 4 of 1234567789/3"
CS does not interpret or assign result strings; the task does. A task need not assign a result, but the 'best practice' for tasks is to assign one whenever possible.
For very fine-grained information, a task may write to a reporting stream. This stream is sent to standard out, so is only available when running a task from the command line. Unlike the result string, there is no limit to the amount of data that may be pushed to this stream.
The status code and the result string are accessed (or set) by methods on the Curator object:
Curator curator = new Curator();
curator.addTask("vscan").curate(coll);
int status = curator.getStatus("vscan");
String result = curator.getResult("vscan");
CS looks for, and will use, certain java annotations in the task Class definition that can help it invoke tasks more intelligently. An example may explain best. Since tasks operate on DSOs that can either be simple (Items) or containers (Collections, and Communities), there is a fundamental problem or ambiguity in how a task is invoked: if the DSO is a collection, should the CS invoke the task on each member of the collection, or does the task 'know' how to do that itself? The decision is made by looking for the @Distributive annotation: if present, CS assumes that the task will manage the details, otherwise CS will walk the collection, and invoke the task on each member. The java class would be defined:
@Distributive
public class MyTask implements CurationTask
A related issue concerns how non-distributive tasks report their status and results: the status will normally reflect only the last invocation of the task in the container, so important outcomes could be lost. If a task declares itself @Suspendable, however, the CS will cease processing when it encounters a FAIL status. When used in the UI, for example, this would mean that if our virus scan is running over a collection, it would stop and return status (and result) to the scene on the first infected item it encounters. You can even tune @Suspendable tasks more precisely by annotating what invocations you want to suspend on. For example:
@Suspendable(invoked=Curator.Invoked.INTERACTIVE)
public class MyTask implements CurationTask
would mean that the task would suspend if invoked in the UI, but would run to completion if run on the command-line.
Only a few annotation types have been defined so far, but as the number of tasks grows, we can look for common behavior that can be signaled by annotation. For example, there is a @Mutative type, which tells CS that the task may alter (mutate) the object it is working on.
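The dispatch rules described for @Distributive and @Suspendable can be sketched with reflection as below. This is a simplified model, not the Curator implementation: the annotations and the `Task` interface are local stand-ins, objects are represented by plain identifier strings, and the `invoked` attribute of @Suspendable is left out.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;
import java.util.List;

// Simplified model of the container walk, for illustration only.
class ContainerWalkSketch {
    static final int UNSET = -2;
    static final int SUCCESS = 0;
    static final int FAIL = 1;

    // Returns the status of the last invocation that was made.
    static int curate(Task task, String containerId, List<String> memberIds) {
        Class<?> type = task.getClass();
        if (type.isAnnotationPresent(Distributive.class)) {
            return task.perform(containerId); // the task walks the container itself
        }
        boolean suspendable = type.isAnnotationPresent(Suspendable.class);
        int status = UNSET;
        for (String id : memberIds) {
            status = task.perform(id);
            if (suspendable && status == FAIL) {
                break; // stop so the failing member's outcome is not overwritten
            }
        }
        return status;
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList("clean-1", "infected-2", "clean-3");
        System.out.println(curate(new StopOnInfected(), "coll", members)); // 1
        System.out.println(curate(new KeepGoing(), "coll", members));      // 0
    }
}

// Stand-ins for the annotations of the same name in org.dspace.curate.
@Retention(RetentionPolicy.RUNTIME) @interface Distributive {}
@Retention(RetentionPolicy.RUNTIME) @interface Suspendable {}

// Stand-in task contract working on identifier strings.
interface Task {
    int perform(String objectId);
}

// Hypothetical scan that fails on identifiers starting with "infected".
@Suspendable
class StopOnInfected implements Task {
    int calls = 0;
    public int perform(String objectId) {
        calls++;
        return objectId.startsWith("infected") ? ContainerWalkSketch.FAIL : ContainerWalkSketch.SUCCESS;
    }
}

// Same logic without @Suspendable: the walk continues past the failure.
class KeepGoing implements Task {
    int calls = 0;
    public int perform(String objectId) {
        calls++;
        return objectId.startsWith("infected") ? ContainerWalkSketch.FAIL : ContainerWalkSketch.SUCCESS;
    }
}
```

The second call illustrates the lost-outcome problem: without @Suspendable, the final status reflects only the last, clean member.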
DSpace 1.7 bundles a few tasks and activates two (2) by default to demonstrate the use of the curation system. These may be removed (deactivated by means of configuration) if desired without affecting system integrity. Each task is briefly described here.
The task with the taskname 'profileformats' (in the admin UI it is labeled "Profile Bitstream Formats") examines all the bitstreams in an item and produces a table ("profile") which is assigned to the result string. It is activated by default, and is configured to display in the administrative UI. The result string has the layout:
10 (K) Portable Network Graphics
5 (S) Plain Text
where the left column is the count of bitstreams of the named format and the letter in parentheses is an abbreviation of the repository-assigned support level for that format:
U Unsupported
K Known
S Supported
The profiler will operate on any DSpace object. If the object is an item, then only that item's bitstreams are profiled; if a collection, all the bitstreams of all the items; if a community, all the items of all the collections of the community.
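How a result string in that layout might be assembled from per-bitstream format information is sketched below. It does not reproduce the bundled task: `FormatProfileSketch` is a hypothetical name, bitstreams are reduced to pairs of format name and support-level letter, and sorting rows by format name is a choice made for this example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class FormatProfileSketch {

    // Each row of the input is {format name, support-level letter}.
    static String profile(String[][] bitstreams) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by format name
        Map<String, String> levels = new TreeMap<>();
        for (String[] bitstream : bitstreams) {
            counts.merge(bitstream[0], 1, Integer::sum);
            levels.put(bitstream[0], bitstream[1]);
        }
        StringBuilder table = new StringBuilder();
        for (Map.Entry<String, Integer> row : counts.entrySet()) {
            table.append(row.getValue())
                 .append(" (").append(levels.get(row.getKey())).append(") ")
                 .append(row.getKey()).append('\n');
        }
        return table.toString();
    }

    public static void main(String[] args) {
        String[][] bitstreams = {
            {"Portable Network Graphics", "K"},
            {"Plain Text", "S"},
            {"Portable Network Graphics", "K"},
        };
        System.out.print(profile(bitstreams));
    }
}
```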
The 'requiredmetadata' task examines item metadata and determines whether fields that the web submission (input-forms.xml) marks as required are present. It sets the result string to indicate either that all required fields are present, or constructs a list of metadata elements that are required but missing. When the task is performed on an item, it will display the result for that item. When performed on a collection or community, the task will be performed on each item, and will display the last item result. If all items in the community or collection have all required fields, that will be the result for the last item in the collection. If the task fails for any item (i.e. the item does not have all required fields), the process is halted. This way the results for the 'failed' items are not lost.
The 'vscan' task performs a virus scan on the bitstreams of items using the ClamAV software product.
Clam AntiVirus is an open source (GPL) anti-virus toolkit for UNIX. A port for Windows is also available. The virus scanning curation task interacts with the ClamAV virus scanning service to scan the bitstreams contained in items, reporting on infection(s). Like other curation tasks, it can be run against a container or item, in the GUI or from the command line. ClamAV should be installed according to the documentation at http://www.clamav.net. It should not be installed in the dspace installation directory. You may install it on the same machine as your dspace installation, or on another machine which has been configured properly.
This plugin requires a ClamAV daemon installed and configured for TCP sockets. Instructions for installing ClamAV (http://www.clamav.net/doc/latest/clamdoc.pdf)
NOTICE: The following directions assume there is a properly installed and configured clamav daemon. Refer to links above for more information about ClamAV.
The Clam anti-virus database must be updated regularly to maintain the most current level of anti-virus protection. Please refer to the ClamAV documentation for instructions about maintaining the anti-virus database.
In [dspace]/config/modules/curate.cfg, activate the task:
### Task Class implementations
plugin.named.org.dspace.curate.CurationTask = \
    org.dspace.curate.ProfileFormats = profileformats, \
    org.dspace.curate.RequiredMetadata = requiredmetadata, \
    org.dspace.curate.ClamScan = vscan
ui.tasknames = \
    profileformats = Profile Bitstream Formats, \
    requiredmetadata = Check for Required Metadata, \
    vscan = Scan for Viruses
In [dspace]/config/modules, edit configuration file clamav.cfg:
service.host = 127.0.0.1
Change if not running on the same host as your DSpace installation.
service.port = 3310
Change if not using standard ClamAV port
socket.timeout = 120
Change if longer timeout needed
scan.failfast = false
Change only if items have large numbers of bitstreams
Curation tasks can be run against container and item dspace objects by e-persons with administrative privileges. A curation tab will appear in the administrative UI after logging into DSpace:
To output the results to the console:
[dspace]/bin/dspace curate -t vscan -i <handle of container or item dso> -r -
Or capture the results in a file:
[dspace]/bin/dspace curate -t vscan -i <handle of container or item dso> -r - > /<path...>/<name>
GUI (Interactive Mode) | FailFast | Expectation
Container | T | Stop on 1st Infected Bitstream
Container | F | Stop on 1st Infected Item
Item | T | Stop on 1st Infected Bitstream
Item | F | Scan all bitstreams

Command Line | FailFast | Expectation
Container | T | Report on 1st infected bitstream within an item/Scan all contained Items
Container | F | Report on all infected bitstreams/Scan all contained Items
Item | | Report on 1st infected bitstream
Item | | Report on all infected bitstreams
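At the level of a single item, the difference the FailFast setting makes can be read as the loop below. This is a sketch of the expected behaviour only, not the ClamScan code: `FailFastSketch` and `scanItem` are hypothetical names, and scan results are supplied as a ready-made boolean array instead of coming from the ClamAV service.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of item-level scanning, for illustration only.
class FailFastSketch {

    // infected[i] is true when bitstream i is infected.
    // Returns the indexes of the bitstreams that get reported.
    static List<Integer> scanItem(boolean[] infected, boolean failFast) {
        List<Integer> reported = new ArrayList<>();
        for (int i = 0; i < infected.length; i++) {
            if (infected[i]) {
                reported.add(i);
                if (failFast) {
                    break; // stop at the first infected bitstream
                }
            }
        }
        return reported;
    }

    public static void main(String[] args) {
        boolean[] infected = {false, true, true};
        System.out.println(scanItem(infected, true));  // [1]
        System.out.println(scanItem(infected, false)); // [1, 2]
    }
}
```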