<?xml version="1.0" encoding="utf-8"?>
<html>
<p>author: Grace Carpenter

date: August 2006</p>
<p>TechMDExtractor is a command-line tool for running Jhove on
the DSpace asset store. It will determine if each
bitstream is a valid and/or well-formed instance of the format it
purports to be. If an identifier is specified, processing will be
limited to the given Community, Collection, or Item. If verbose
processing is specified, all the extracted technical metadata will be
sent to standard output.</p>

About the Design

<p>In order to make Jhove work with DSpace, I had to create two classes
that wrap two of the main Jhove classes. These classes,
org.dspace.app.techmdextractor.jhove.DSJhoveBase and
org.dspace.app.techmdextractor.jhove.DSConfigHandler, essentially
re-write the code in the corresponding Jhove classes
(edu.harvard.hul.ois.jhove.JhoveBase and
edu.harvard.hul.ois.jhove.ConfigHandler, respectively). DSJhoveBase
initializes the Jhove modules, and also provides the main
entry points for DSpace to parse bitstreams. DSConfigHandler has
code to parse the DSpace-specific elements of the Jhove
configuration file (jhove.conf).</p>
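<p>To illustrate the kind of work DSConfigHandler does, here is a minimal sketch of a SAX handler that pairs each module class with the DSpace format name that follows it. It is not the actual DSConfigHandler source: the class name FormatNameHandler, the parse helper, and the exact nesting of the &lt;module&gt;, &lt;class&gt;, and &lt;dspace:format-name&gt; elements are assumptions made for illustration.</p>

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not the real DSConfigHandler.
class FormatNameHandler extends DefaultHandler {
    // Module class name -> DSpace format short description.
    final Map<String, String> formatsByModule = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentClass;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting character data for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (qName.equals("class")) {
            currentClass = value;
        } else if (qName.equals("dspace:format-name") && currentClass != null) {
            formatsByModule.put(currentClass, value);
        } else if (qName.equals("module")) {
            currentClass = null; // leaving the module, forget its class
        }
        text.setLength(0);
    }

    // Parses a configuration document held in a string (assumed layout).
    static Map<String, String> parse(String xml) {
        FormatNameHandler handler = new FormatNameHandler();
        try {
            // The factory is not namespace-aware by default, so the
            // prefixed element name arrives in qName exactly as written.
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration", e);
        }
        return handler.formatsByModule;
    }
}
```

<p>Calling FormatNameHandler.parse on the text of a configuration file returns a map from module class to format name, which is roughly the information needed to match a bitstream's registered format to a Jhove module.</p>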

Configuring TechMDExtractor

<ol>
<li><p>Apply the dspace-preingest patch to your DSpace installation,
and follow the instructions for configuring it. (TechMDExtractor has some build-time
dependencies on the Pre-ingest project: it implements two of its
interfaces,
org.dspace.workflow.PreIngestFilter and
org.dspace.workflow.FilterResult.)</p></li>

<li><p>Check out the TechMDExtractor project from CVS.</p></li>
<li><p>Modify the file TechMDExtractor/config/jhove.conf to
reflect the specifics of your DSpace installation. In particular,
the things that <strong>must</strong> be modified are:</p>
<ul>
<li>the &lt;tempDirectory&gt; element must contain a directory with appropriate
permissions for the Jhove executable to write to</li>
<li>the &lt;dspace:format-name&gt; element that follows each
&lt;module&gt;/&lt;class&gt; element
must contain the short description of the format as it appears in your
bitstreamformatregistry table. The jhove.conf file
contains the default short
descriptions for DSpace formats, so you don't have to worry about this
if you haven't edited the bitstreamformatregistry table.</li>

</ul>
</li>
<li><p>If you wish, configure logging for the non-DSpace-specific code
in Jhove by editing TechMDExtractor/config/jhoveLogging.properties.</p>
<p>Note that the Jhove code actually uses two different logging APIs:
Java logging (java.util.logging) for most of Jhove, and log4j for the DSpace-specific
initialization and top-level execution code. For debugging set-up problems,
you should be able to get most of the information you need from the
regular DSpace logs. If you want to debug format-specific parsing issues,
you should modify the file TechMDExtractor/config/jhoveLogging.properties, which
will be placed in your <i>[dspace]</i>/config/ directory
at build-time.</p>
</li>
<li><p>From the TechMDExtractor directory, type ant install. After
the build process has completed, verify that the following jars
are in your <i>[dspace]</i>/lib directory:</p>

<ul>
<li>tmdExtractor.jar</li>
<li>jhove.jar</li>
<li>jhove-handler.jar</li>
<li>jhove-module.jar</li>
</ul>

<p>Running ant install should also place the above jars
in your <i>[dspace-source]</i>/lib directory, for
use in the Workflow Pre-ingest step.</p>

<p>The files TechMDExtractor/config/jhove.conf and
TechMDExtractor/config/jhoveLogging.properties should have been
copied into your <i>[dspace]</i>/config directory.</p>
</li>
<li><p>Don't forget that the dspace.cfg file in your
<i>[dspace]</i>/config directory must be modified,
as specified in the Workflow Pre-ingest instructions.</p>

<p>Note that the Jhove initialization code
(in org.dspace.app.techmdextractor.jhove.JhoveExtractor) also
checks for the configuration variable jhove.sax.class. This
is because I always get errors when parsing the Jhove configuration
file, although they don't cause the code to fail. See the "Known Issues"
section of the documentation for more information.</p>
</li>
</ol>
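<p>The verification parts of the steps above (the four jars, the two copied configuration files, and a writable temporary directory) can be checked mechanically. The following is a minimal sketch, assuming the file names listed above. InstallCheck and its methods are hypothetical helpers and are not part of TechMDExtractor or DSpace.</p>

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not shipped with TechMDExtractor.
class InstallCheck {
    // Jars that "ant install" should place in [dspace]/lib.
    static final String[] JARS = {
        "tmdExtractor.jar", "jhove.jar", "jhove-handler.jar", "jhove-module.jar"
    };
    // Files that "ant install" should copy into [dspace]/config.
    static final String[] CONFIGS = {"jhove.conf", "jhoveLogging.properties"};

    // Returns the expected files (relative to the DSpace directory) that are absent.
    static List<String> missing(Path dspaceDir) {
        List<String> missing = new ArrayList<>();
        for (String jar : JARS) {
            if (!Files.isRegularFile(dspaceDir.resolve("lib").resolve(jar))) {
                missing.add("lib/" + jar);
            }
        }
        for (String cfg : CONFIGS) {
            if (!Files.isRegularFile(dspaceDir.resolve("config").resolve(cfg))) {
                missing.add("config/" + cfg);
            }
        }
        return missing;
    }

    // True if the directory named in <tempDirectory> exists and is writable.
    static boolean tempDirUsable(Path tempDir) {
        return Files.isDirectory(tempDir) && Files.isWritable(tempDir);
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: InstallCheck <dspace-dir> <jhove-temp-dir>");
            return;
        }
        for (String name : missing(Paths.get(args[0]))) {
            System.out.println("missing: " + name);
        }
        if (!tempDirUsable(Paths.get(args[1]))) {
            System.out.println("temp directory is not writable: " + args[1]);
        }
    }
}
```

<p>Run it with your <i>[dspace]</i> directory and the directory you entered in the &lt;tempDirectory&gt; element as the two arguments. No output means every check passed.</p>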

Running TechMDExtractor

<p>From the <i>[dspace]</i>/bin directory, type</p>

dsrun org.dspace.app.techmdextractor.ExtractorManager -h

<p>You'll get a list of command-line options for running the program. Note
that the code for the TechMDExtractor is based heavily on (OK, stolen from)
the MediaFilter code, so many of the options are similar.</p>

Files Changed

<ul>
<li>config/dspace.cfg</li>
</ul>

Files Added

The source code may be found online under CVS here:

http://libaxis1.mit.edu/viewcvs/sandbox/TechMDExtractor/

<ul>
<li>config/jhove.conf</li>

<li>config/jhoveLogging.properties</li>
<li>src/org/dspace/app/techmdextractor/jhove/DSConfigHandler.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/DSJhoveBase.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhoveExtractor.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhoveFilterResult.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhovePreIngestFilter.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhoveTechMD.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/ExtractorManager.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/TechMDExtractorException.java</li>

<li>build.xml</li>
</ul>

Known Issues

<ul><li><p>SAXParser problem: the SAX parser complains when it
parses the jhove.conf file. The messages I get are:</p>

[Warning] jhove.conf:6:39:SchemaLocation: schemaLocation value
= 'http://hul.harvard.edu/oi/xml/xsd/jhove/1.3/jhoveConfig.xsd' must
have even number of URI's.
[Error] jhove.conf:6:39: cvc-elt.1: Cannot find the declaration of
element 'jhoveConfig'

<p>If you use Jhove 'out of the box', you won't receive these errors.
I believe that Jhove as a stand-alone uses the default Java SAX
parser (Crimson?), whereas DSpace is using Xerces. It seems that
the different parsers probably need to be configured differently. I
don't think the error messages are a problem for the
config file, but I'm not sure how
this affects the parsing of XML docs submitted to Jhove.
I started to play around with this, and the TechMDExtractor
code actually checks the dspace.cfg file to see if a
parser is specified (jhove.sax.class=<i>sax parser name</i>).
This needs investigation.</p>

</li>
</ul>
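<p>For anyone picking up that investigation, the sketch below shows one way a configured parser class could be honored, with a fallback to the JAXP default when jhove.sax.class is not set. It is an illustration only and is not the code in JhoveExtractor. SaxReaderChooser is a hypothetical name, and a plain java.util.Properties object stands in for the DSpace configuration.</p>

```java
import java.util.Properties;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical sketch; not the actual TechMDExtractor initialization code.
class SaxReaderChooser {
    // Returns an XMLReader of the class named by jhove.sax.class,
    // or the JAXP default reader when the property is absent or blank.
    static XMLReader choose(Properties config) {
        String className = config.getProperty("jhove.sax.class");
        try {
            if (className != null && !className.trim().isEmpty()) {
                // The named class must implement XMLReader and have a
                // public no-argument constructor.
                Object reader = Class.forName(className.trim())
                        .getDeclaredConstructor().newInstance();
                return (XMLReader) reader;
            }
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(
                "Could not create SAX parser: " + className, e);
        }
    }
}
```

<p>With Xerces on the classpath, a value such as org.apache.xerces.parsers.SAXParser would select it explicitly. Whether doing so silences the messages above is untested here.</p>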

</html>
