JIRA Reference: https://jira.duraspace.org/browse/DS-638

Proposed: "This patch uses JHOVE to provide rough-and-ready format checking by identifying that the file/bitstream extension matches formats verifiable by JHOVE. (Currently DSpace accepts a deposit's file extension as gospel, so a user could tack a ".txt" extension onto a GIF and DSpace would assign the incorrect format to the file based on that incorrect extension.) This patch also contains code to check the file for the presence of viruses."
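To make the mismatch described in the proposal concrete, here is a minimal sketch of a content-based check. It is not the patch's code and does not use JHOVE: the class and method names are hypothetical, and it only recognises the GIF signature ("GIF87a" or "GIF89a" in the first six bytes), whereas JHOVE validates far more than a leading signature.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of DSpace or JHOVE. */
final class ExtensionCheck {

    // The two GIF signatures that can open a GIF file.
    private static final byte[] GIF87A = "GIF87a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF89A = "GIF89a".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the content starts with a GIF signature. */
    static boolean looksLikeGif(byte[] content) {
        return startsWith(content, GIF87A) || startsWith(content, GIF89A);
    }

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Returns true when the file name and the bytes disagree about the file
     * being a GIF. Only GIF is handled in this sketch.
     */
    static boolean extensionContradictsContent(String fileName, byte[] content) {
        boolean claimsGif = extensionOf(fileName).equals("gif");
        return claimsGif != looksLikeGif(content);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, a GIF renamed to "picture.txt" is flagged as a mismatch, while the same bytes named "picture.gif" pass. A real check would hand the bitstream to a format identification tool rather than compare a single signature.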

DCAT review: This patch seems to be doing two things: a) integrating with JHOVE, something that would be of strong interest for any repository that aims to preserve its contents in the long term, and b) using virus checking tools (based on ClamAV) as part of that process. One can imagine that it would be interesting to have the virus tools without using the JHOVE package, so it may be worth exploring separating these. It may well be useful to encourage a community discussion about what tools would be useful, now that the curation framework is there (although I haven't had the time to check what it actually does).

The ticket is already assigned to Richard Rodgers as he will need to assess how this would work with the new curation framework that came with 1.7.

DCAT initial assessment: Relevant; Medium-High or High priority

Next steps: Initially it would be useful to check with Richard Rodgers what his take on this is (which I'm happy to do). As Jim is also from Michigan, and the proposal seems to originate there, he may be able to provide more detail.

  • If you agree with the above assessment and have no additional comments, you can simply respond with a +1.
  • If you disagree but have no comments, a -1 works, and if you have no opinion at all, 0 is fine. (And encouraged, since that means we know you've had a chance to weigh in.)
  • If you do have comments or other ideas, you're not limited to the numbers, of course. So please do share your thoughts!

7 Comments

  1. +1 (though I have an obvious bias and vested interest regarding this one, since we developed it here)

  2. +1 I don't want to sidetrack this conversation, but I would like to propose wider changes. I would like to see the file upload moved to be the first step in the submission process. I would like to see its file type verified as described above, and I would like to see it checked for viruses. Then, depending on the file and package types, we could extract metadata from the uploaded file and prepopulate the item metadata. There are a number of package types to which this would apply, e.g. METS and SCORM, but perhaps the most interesting are the docx zip files that can now be output from Microsoft Word using the Author plugin.

    Most of the functionality to do this is already available in DSpace; it just needs to be pulled together. I would happily volunteer to undertake the work for 1.8, as it's in line with the requirements of my institution.

  3. Just a note to mention that the Developers discussed this DCAT review on Feb 23, 2011. Elin Stangeland was also in attendance.

    Full discussion thread is available in our IRC logs for that day: http://irclogs.duraspace.org/index.php?date=2011-02-23

    Here's a brief summary of what we came up with:

    • First, this request is actually 2 separate requests: (1) Integrated Virus Scanning, and (2) Integrated File Format Verification
    • The DSpace 1.7.0 Curation System already comes with a Virus Scanning tool (requires & uses ClamAV virus scanner: http://www.clamav.net)
    • MIT (Richard Rodgers) is also currently working on a Curation Task to perform File Format Verification via DROID (http://www.nationalarchives.gov.uk/PRONOM/). This will be released in 1.7.1 or 1.8.0.
    • The 1.7.0 Curation System currently is only integrated into the DSpace Administration UI and DSpace Workflow Approval processes (it can also be kicked off via the Commandline). Any curation task can be kicked off automatically during a Workflow approval step, or "on demand" via the Administration UI or commandline.
    • So, one feature that is definitely missing is the ability to kick off these Curation tasks automatically from the DSpace Submission UI
    • Robin Taylor is currently looking into similar Submission UI changes (and also wants to validate a file during submission), so he volunteered to investigate some of this work. Elin & Robin agreed to get in touch about this work.
  4. To conclude on this issue - the virus part of this patch has been picked up and amended slightly to allow virus checks on deposit in addition to workflow. It should be included in DSpace 1.8 (Robin Taylor is responsible for this release).

    The file format verification will be taken forward as well. I'm being told that there is work going on using DROID, initially within the curation framework (if this is not the case, then this patch will be brought forward instead). JHOVE is more extensive than DROID (in fact it utilises DROID for some tasks), so it is still worth considering in the longer term. I'll keep an eye on things, including the JHOVE2 developments, but hope that everyone is OK with me (and Robin) closing this particular discussion.