Benefits of the Curation System

The DSpace curation system can be invoked on a:

  1. Site
  2. Community
  3. Collection
  4. Item.

Curation system tasks can be run:

  1. Immediately
  2. Queued for background execution - uses separate memory

Limitations of the Curation System

  • Curation system does not persist output (only permits STDOUT redirection)
  • Curation system produces only text
    • HTML would be useful for an admin report
    • JSON would be useful for REST
  • Curation system does not take parameters
    • Every variation of a task requires a new module

Proposal

  • Allow tasks to produce output as HTML and/or JSON
    • XML if needed
    • It is important to allow links to be included in the output
  • Allow persistence of output
    • Database?
    • File system?
  • Create a mechanism to clean up curation output
  • Allow parameters to be passed to a task
  • Expose a REST endpoint
    • to initiate a curation task (with / without parameters)
    • to queue a curation task (with / without parameters)
    • to review curation task output



5 Comments

  1. As discussed in the dev meeting today, this relates to https://github.com/DSpace/Rest7Contract/pull/17. The goal is to make the API flexible enough that it can be used to invoke or schedule curation tasks and (some) CLI commands from the launcher.xml file.


    The contract already tries to provide the following features of your proposal:

    • Allow persistence of output (to file system)
    • Create a mechanism to clean up curation output
    • Allow parameters to be passed to a task (as CLI commands also have different parameters)


    Currently the contract does not mention any output type or restrictions (HTML, JSON, plain text). How would you design this?

    1. Tom Desair, I will move this discussion to the PR.

      I see that you have addressed the ability to pass parameters to a script and to set an output file for a script.

      I propose that this capability be pushed all the way into the curation infrastructure.  If this were done, then any curation task added to this system would also be invokable as a CLI script.

      Regarding the output file, I suppose that each script should produce the output that makes sense. 

      When running a task from the CLI, perhaps the following rule makes sense: if the output file is TXT and it is under a certain size, dump the output to the terminal.  Otherwise, print a URL to the output file.
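
      The rule above could be sketched roughly as follows. This is only an illustration: `TaskOutputPresenter`, `shouldPrintInline`, `present`, and the 64 KiB cutoff are hypothetical names and values, not part of DSpace or the REST contract, and the URL is assumed to be supplied by whatever component stores the output file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of DSpace. */
final class TaskOutputPresenter {
    /** Assumed cutoff; the proposal only says "under a certain size". */
    static final long MAX_INLINE_BYTES = 64 * 1024;

    /** True when the output is a .txt file small enough to print directly. */
    static boolean shouldPrintInline(String fileName, long sizeBytes) {
        boolean isText = fileName.toLowerCase(Locale.ROOT).endsWith(".txt");
        return isText && sizeBytes <= MAX_INLINE_BYTES;
    }

    /**
     * Returns what the CLI would print: the file contents for a small text
     * file, otherwise the URL under which the output can be retrieved.
     */
    static String present(Path outputFile, String outputUrl) throws IOException {
        String name = outputFile.getFileName().toString();
        if (shouldPrintInline(name, Files.size(outputFile))) {
            return new String(Files.readAllBytes(outputFile), StandardCharsets.UTF_8);
        }
        return outputUrl;
    }
}
```

      Keeping the decision in a small pure method makes the threshold easy to test and to make configurable later.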

  2. Notes from Dev Meeting on 8/8/2018

    Tim Donohue [8:25 AM]
    Sure, we can move on to Curation System though, if you'd rather.

    This one already has a wiki page of brainstorms: https://wiki.duraspace.org/display/~terrywbrady/Curation+System+Needs

    Terry Brady [8:26 AM]
    When I first started working in DSpace, I needed to develop some simple extensions - mostly reporting stuff.
    Curation tasks seemed like an easy way to develop and deploy a simple add-on.
    But, there are some limitations to the current curation process. It does not take parameters (other than a scope handle) and it does not really persist output.
    I *think* that many CLI and Admin functions could be reduced to curation tasks if the input/output issues around curation were resolved.

    Mark Wood [8:28 AM]
    I recall something like parameters for tasks, but it's really hard to find any information about them.

    Terry Brady [8:28 AM]
    As we move to the REST7 API, it will become more complicated to make features available both to (1) the Angular UI and (2) the CLI interface. Perhaps curation could solve this.
    The only params I remember for curation are (1) write text output to STDOUT and (2) run immediately vs queue for later

    Mark Wood [8:30 AM]
    They're called Task Properties.

    Tim Donohue [8:30 AM]
    Most curation params are actually *configuration*
    (so, it's accurate to say you cannot pass params on commandline or similar)

    Mark Wood [8:30 AM]
    It's not quite the same thing, but it does allow configuring the same task code to be run in more than one way.

    Alexander Sulfrian [8:31 AM]
    Yes, task properties are a workaround for missing parameters.

    Terry Brady [8:31 AM]
    That is good to know.

    Mark Wood [8:31 AM]
    I can see that properties may not be flexible enough.

    Tim Donohue [8:32 AM]
    In any case, I agree that Curation Tasks are limited...especially in output format. And that they take input more from configuration (instead of params)

    Terry Brady [8:32 AM]
    I remembered my brainstorming on this as a possible way to address @mwood’s bulk operations needs.

    Tim Donohue [8:34 AM]
    Regarding Curation Task output, I think the most logical extension there would be to support JSON output. To bring Curation Tasks to the REST API would require either the current text output (embedded in JSON) or straight JSON output
    I don't see as much usefulness though to HTML or XML output...as our REST API speaks entirely JSON, and the DSpace 7 UI can always format that output into HTML

    Mark Wood [8:35 AM]
    Tasks have been rather free to write anything they like. Other than wrapping strings to make them legal JSON, it may take a lot of work (and cramp a lot of style) to structure the output.

    Terry Brady [8:35 AM]
    That makes sense. We might also want to generate some html fragments as reports.

    Alexander Sulfrian [8:36 AM]
    Would be good if the UI/reporting after running a curation task on multiple items can be improved.

    Mark Wood [8:36 AM]
    If we get structured stuff out, it can be transformed to HTML or anything else.

    Tim Donohue [8:36 AM]
    @sulfrian: yes, I think that's caused mostly by the text-based output format. It's hard to display plain text in a UI in a "pretty way"

    Terry Brady [8:37 AM]
    Sounds good. As long as link-like things can be written out, the format is less important.

    Alexander Sulfrian [8:37 AM]
    @tdonohue Currently only the result of the last item is displayed. That's a bit unexpected for users.

    Tim Donohue [8:39 AM]
    @sulfrian: I think that's a result of the lack of persistence of the output....you get the full output on commandline (as it's written out as each item is processed). In the UI though, it's hard to write output during processing without Javascript/dynamic output.
    If we persisted the output, we could load it all together and provide a full view of the output (in the UI)
    Or, with the Angular UI, we might be able to build the dynamics here a bit better than the current UIs

    Mark Wood [8:41 AM]
    Sounds like the first thing we need is to replace setResult(String) with addResult(String). On the console it just writes to the console; in a webapp, it is accumulated somewhere, or fed out via AJAX or whatever.

    Terry Brady [8:41 AM]
    It could be useful to share some of this discussion in the DSpace7 meeting to see if this approach could make any of that development work easier...

    Tim Donohue [8:41 AM]
    This is a good discussion to have now though, as it's not a feature that is enabled/built yet in DSpace 7.
    I agree these ideas should be documented somewhere for DSpace 7 team. I'm not sure it should go into a DSpace 7 meeting yet though, until we are ready to work on it. But, we could create a ticket for DSpace 7 REST API to discuss implementation ideas

    Mark Wood [8:44 AM]
    Likewise Curator.getResult() should return List<String>.

    Tim Donohue [8:45 AM]
    The reality here though is that it's highly likely we *won't* be able to rebuild or heavily enhance Curation Tasks in DSpace 7 (we just don't have the time to redesign everything)....but, if there are minor enhancements necessary to help make it work better for REST / Angular, those could/should happen
    And it sounds like we've identified at least a few minor enhancements.... namely persisting the output (or feeding via AJAX like streams)....and possibly looking towards a JSON output format

    Terry Brady [8:46 AM]
    (I need to step away for 5 min. I will rejoin you all in a moment)

    Tim Donohue [8:46 AM]
    No worries
    Ok, so it sounds like we are wrapping up this discussion. I think the task here is to create a ticket (or two) on implementation ideas for the DSpace 7 team

    Mark Wood [8:47 AM]
    I'm not sure we can do much more than {"string", "string"...}

    Tim Donohue [8:48 AM]
    @mwood: we may not be able to. I'm uncertain as well...but I think we should be able to "stream" updates to the Angular UI (to allow it to "persist" output at least until the task completes)
    I can write that up in an Angular UI ticket as an idea/brainstorm (and link in this discussion)

    Terry Brady [8:49 AM]
    I am back...

    Mark Wood [8:50 AM]
    Curation really doesn't have much structure other than "ran task on object 1; ran task on object 2...."

    Terry Brady [8:51 AM]
    In some instances, we will want simple feedback from curation (a message) and in some instances we will want feedback persisted (a report that requires follow-up action). It would be nice to have a curation system option that could do either.

    Tim Donohue [8:52 AM]
    @terrywbrady: I think I agree. I think we need to separate here what is "doable in DSpace 7" versus what is likely "future enhancements"
    I suspect the doable in DSpace 7 is more about taking the current system & making sure the UI is better (i.e. streaming results to the UI, so that it can display them all in a nice format)

    Mark Wood [8:53 AM]
    An AbstractCurationTask has several getXXXProperty() methods, and these could be extended with "dynamic" properties that are taken from the request rather than configuration. I think tasks could easily not care how their properties were set.

    Tim Donohue [8:53 AM]
    Future enhancements could include a bigger overhaul to find a place to persist reports (more permanently), etc

    Mark Wood [8:54 AM]
    Commandline task runs can already save reports wherever they like. Where would a GUI run usefully save reports? Probably just build a document that can be saved by the browser.

    Tim Donohue [8:55 AM]
    @mwood: if a report were saved in a semi-structured format on the backend, then the front-end should be able to transform it into JSON (for the UI) in the same way that it would do so for a "live" task.
    But, I think that's likely out-of-scope for DSpace 7 timelines...nonetheless, it's worth thinking about / brainstorming in a Wiki page for future enhancements

    Terry Brady [8:56 AM]
    Since curation pushes some tasks to a background queue, there may not always be a UI to persist.

    Mark Wood [8:56 AM]
    The question is: where does it go? Format as JSON or whatever you wish, then make it an "attachment". I forget the details, but I've done that before.

    Terry Brady [8:56 AM]
    (Tim, no need to repeat the caution about DSpace 7 scope)

    Mark Wood [8:57 AM]
    Background: good point. We'd need a place to store reports, then.
    I think the original idea may have been "just log the details." There's some special support for logging. But the logs are a junkpile already....

    Terry Brady [8:58 AM]
    I like that the notion of foreground/background is already there. A developer does not need to decide which approach to use until execution time.

    Tim Donohue [8:58 AM]
    @mwood: To be honest, if the output is plain text...it could be a plain text file (like a log file). If the REST API knows to read each line of that file and "stream" to the Angular UI, it could look very similar to what would be streamed in "live" output
    But, that's just if the output remains plain text. We also could define a more structured format for the output (JSON or similar)
    I think there's promise here on incremental improvement to this.... first, in DSpace 7, get the full results "streamable" to the UI (so they all can be displayed, just like you'd see in STDOUT). Then, in future, find a place to archive those results on backend, and "stream" from archived location

    Pablo Prieto [9:00 AM]
    Hi all

    Tim Donohue [9:01 AM]
    Hi @Pablo Prieto

    Mark Wood [9:01 AM]
    If you're going to trigger a curation run in the GUI, walk away, and review the results later *in the GUI* then we'll need to issue a "run identifier" that you can copy/paste, save, and copy/paste later to retrieve the results.

    Tim Donohue [9:02 AM]
    In any case, I'm realizing we are now at the top of the hour. I think the next steps here are to (1) create an implementation brainstorm ticket for DSpace 7... (2) Update the wiki page proposal (for future enhancements) with some of these ideas
    I can create the DSpace 7 ticket (I think it's likely an Angular ticket initially, until we get a better handle on what would need to happen in REST API)

    Terry Brady [9:03 AM]
    I will capture these notes on the wiki page. I probably will not have time to organize them.

    Mark Wood [9:05 AM]
    I'd just like to say: don't GUIfy task output too quickly. Let the UI render it; retain any structure somewhat abstractly. I tend to want to build pipelines in scripts rather than sit there and drive everything manually.

    @mwood: I think that'll likely happen naturally.  I don't think we'll have time to change task output in DSpace 7, TBH.  So, output format changes may need to wait.  But, I think, with Angular, we should be able to "stream" task output to the UI...so that the Angular UI output looks more like STDOUT output.
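
    One possible reading of the addResult(String) / List<String> idea from the meeting notes above, combined with the "wrap strings to make them legal JSON" remark, is sketched here. `ResultAccumulator`, `getResults`, and `toJsonArray` are hypothetical names used only for illustration; this is not the existing Curator class, and the JSON escaping is a minimal hand-rolled version rather than a library call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the result handling of a curator; not DSpace code. */
final class ResultAccumulator {
    private final List<String> results = new ArrayList<>();

    /** Appends one result instead of overwriting the previous one. */
    void addResult(String result) {
        results.add(result);
    }

    /** Returns every result recorded so far, in insertion order. */
    List<String> getResults() {
        return Collections.unmodifiableList(results);
    }

    /** Renders the results as a JSON array of strings, e.g. ["a","b"]. */
    String toJsonArray() {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < results.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"');
            for (char c : results.get(i).toCharArray()) {
                switch (c) {
                    case '"':  sb.append("\\\""); break;
                    case '\\': sb.append("\\\\"); break;
                    case '\n': sb.append("\\n");  break;
                    case '\r': sb.append("\\r");  break;
                    case '\t': sb.append("\\t");  break;
                    default:
                        if (c < 0x20) {
                            // Remaining control characters need \\uXXXX escapes.
                            sb.append(String.format("\\u%04x", (int) c));
                        } else {
                            sb.append(c);
                        }
                }
            }
            sb.append('"');
        }
        return sb.append(']').toString();
    }
}
```

    With something like this, a console runner could print each entry as it arrives, while a web or REST runner could return the whole list once the task completes.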

  3. The curation system seems a bit incomplete in terms of reporting.  Curator.report(String), for example, does nothing unless setReporter(String) has been passed "-", in which case it writes to standard output.  Probably setReporter should be passed a Writer or even an Appendable, or perhaps some List, and just assume that this will send output somewhere appropriate.  It would be up to the code instantiating Client to set a useful reporting sink – the CLI runner would send it to standard output, the GUI runner would gather it and ship it back to the browser, and the queue runner might write it to a file on the server.

    I think that task authors could use some guidance about what to log, what to report, what to set as result, etc. and what happens when you do.
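
    A minimal sketch of the Appendable-based reporter suggested above, assuming a simplified curator class. `ReportingCurator` is a hypothetical name for illustration; the setReporter(Appendable) signature is the proposed change, not the current API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a curator with a pluggable reporting sink. */
final class ReportingCurator {
    private Appendable reporter; // null means no sink has been configured

    /** The code that creates the curator decides where reports go. */
    void setReporter(Appendable reporter) {
        this.reporter = reporter;
    }

    /** Writes one report line to the configured sink, if any. */
    void report(String message) {
        if (reporter == null) {
            return; // nothing configured: silently drop, as described above
        }
        try {
            reporter.append(message).append('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    Under this sketch a CLI runner could pass `System.out` (a `PrintStream` is an `Appendable`), a GUI runner could pass a `StringBuilder` and ship its contents back to the browser, and a queue runner could pass a `java.io.FileWriter`.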

  4. Curator could hold a Map of per-run parameters.  AbstractCurationTask has access to the Curator instance, and its parameter methods could look up a name in that Map before consulting the DSpace configuration.  Task code would need no modification; just ask for parameters and don't worry about whence came their values.  The queued runner would need a place to stash the Map entries.
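
    The lookup order described above might look roughly like this. `RunParameters` and its methods are hypothetical, and a plain `java.util.Properties` object stands in for the DSpace configuration; the real getXXXProperty() methods on AbstractCurationTask are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical sketch: per-run parameters that shadow configured values. */
final class RunParameters {
    private final Map<String, String> perRun = new HashMap<>();
    private final Properties configuration; // stand-in for DSpace configuration

    RunParameters(Properties configuration) {
        this.configuration = configuration;
    }

    /** Sets a parameter for this run only, e.g. from a request or the command line. */
    void setParameter(String name, String value) {
        perRun.put(name, value);
    }

    /** Per-run value if present, otherwise the configured value (or null). */
    String getProperty(String name) {
        String value = perRun.get(name);
        return value != null ? value : configuration.getProperty(name);
    }
}
```

    A task calling `getProperty` would not need to know whether the value came from the request or from configuration, which matches the "task code would need no modification" goal.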