<?xml version="1.0" encoding="utf-8"?>
<html>
<p>author: Grace Carpenter

date: August 2006</p>
<p>TechMDExtractor is a command-line tool for running Jhove on
the DSpace asset store. It determines whether each
bitstream is a well-formed and/or valid instance of the format it
purports to be. If an identifier is specified, processing is
limited to the given Community, Collection, or Item. If verbose
processing is specified, all the extracted technical metadata is
sent to standard output.</p>

About the Design

<p>In order to make Jhove work with DSpace, I had to create two classes
that wrap two of the main Jhove classes. These classes,
org.dspace.app.techmdextractor.jhove.DSJhoveBase and
org.dspace.app.techmdextractor.jhove.DSConfigHandler, essentially
re-write the code in the corresponding Jhove classes
(edu.harvard.hul.ois.jhove.JhoveBase and
edu.harvard.hul.ois.jhove.ConfigHandler, respectively). DSJhoveBase
initializes the Jhove modules, and also provides the main
entry points for DSpace to parse bitstreams. DSConfigHandler has
code to parse the DSpace-specific elements of the Jhove
configuration file (jhove.conf).</p>
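<p>To make the role of DSConfigHandler more concrete, here is a minimal sketch of a SAX handler that records a DSpace format name for each Jhove module class. It is not the actual DSConfigHandler code. The class name FormatNameHandler, the sample configuration, the placeholder namespace URI, and the assumption that &lt;dspace:format-name&gt; sits next to &lt;class&gt; inside &lt;module&gt; are all illustrative guesses.</p>

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for the DSpace-specific part of a Jhove config handler. */
final class FormatNameHandler extends DefaultHandler {

    /** Assumed layout of jhove.conf; the namespace URI is a placeholder. */
    static final String SAMPLE =
          "<jhoveConfig xmlns:dspace='urn:example:dspace'>"
        + "<tempDirectory>/tmp</tempDirectory>"
        + "<module><class>edu.harvard.hul.ois.jhove.module.PdfModule</class>"
        + "<dspace:format-name>Adobe PDF</dspace:format-name></module>"
        + "<module><class>edu.harvard.hul.ois.jhove.module.JpegModule</class>"
        + "<dspace:format-name>JPEG</dspace:format-name></module>"
        + "</jhoveConfig>";

    private final Map<String, String> formatByModule = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentClass;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0);               // start collecting text for this element
        if (qName.equals("module")) {
            currentClass = null;         // a new module has no class yet
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String value = text.toString().trim();
        if (qName.equals("class")) {
            currentClass = value;        // remember the module class just read
        } else if (qName.equals("dspace:format-name") && currentClass != null) {
            formatByModule.put(currentClass, value);
        } else if (qName.equals("module")) {
            currentClass = null;
        }
        text.setLength(0);
    }

    /** Parses a configuration string and returns module class -> format name. */
    static Map<String, String> parse(String xml) {
        FormatNameHandler handler = new FormatNameHandler();
        try {
            // The default factory is not namespace-aware, so qName keeps the prefix.
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration", e);
        }
        return handler.formatByModule;
    }

    public static void main(String[] args) {
        parse(SAMPLE).forEach((module, format) -> System.out.println(module + " -> " + format));
    }
}
```

<p>A map like this is one way the extractor could look up which DSpace bitstream format each Jhove module is meant to handle. The real classes may organize this differently.</p>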

Configuring TechMDExtractor

<ol>
<li><p>Apply the dspace-preingest patch to your DSpace installation,
and follow the instructions for configuring it. (TechMDExtractor has
build-time dependencies on the Pre-ingest project because it implements
two of its interfaces: <code>org.dspace.workflow.PreIngestFilter</code> and
<code>org.dspace.workflow.FilterResult</code>.)</p></li>

<li><p>Check out the TechMDExtractor project from CVS.</p></li>
<li><p>Modify the file <code>TechMDExtractor/config/jhove.conf</code> to
reflect the specifics of your DSpace installation. In particular,
the things that <strong>must</strong> be modified are:</p>
<ul>
<li>the <code>&lt;tempDirectory&gt;</code> element must contain a directory with appropriate
permissions for the Jhove executable to write to</li>
<li>the <code>&lt;dspace:format-name&gt;</code> element that follows each
<code>&lt;module&gt;/&lt;class&gt;</code> element
must contain the short description of the format as it appears in your
bitstreamformatregistry table. <code>jhove.conf</code> contains the default short
descriptions for DSpace formats, so you don't have to worry about this
if you haven't edited the bitstreamformatregistry table.</li>

</ul>
</li>
<li><p>If you wish, configure logging for the non-DSpace-specific code
in Jhove by editing <code>TechMDExtractor/config/jhoveLogging.properties</code>.</p>
<p>Note that the Jhove code actually uses two different logging APIs:
Java logging (java.util.logging) for most of Jhove, and log4j for the DSpace-specific
initialization and top-level execution code. For debugging set-up problems,
you should be able to get most of the information you need from the
regular DSpace logs. If you want to debug format-specific parsing issues,
you should modify the file <code>TechMDExtractor/config/jhoveLogging.properties</code>, which
will be placed in your <code><i>[dspace]</i>/config/</code> directory
at build-time.</p>
</li>
<li><p>From the TechMDExtractor directory, type <code>ant install</code>. After
the build process has completed, verify that the following jars
are in your <code><i>[dspace]</i>/lib</code> directory:</p>

<ul>
<li><code>tmdExtractor.jar</code></li>
<li><code>jhove.jar</code></li>
<li><code>jhove-handler.jar</code></li>
<li><code>jhove-module.jar</code></li>
</ul>

<p>Running <code>ant install</code> should also place the above jars
in your <code><i>[dspace-source]</i>/lib</code> directory, for
use in the Workflow Pre-ingest step.</p>

<p>The files <code>TechMDExtractor/config/jhove.conf</code> and
<code>TechMDExtractor/config/jhoveLogging.properties</code> should have been
copied into your <code><i>[dspace]</i>/config</code> directory.</p>
</li>
<li><p>Don't forget that the dspace.cfg file in your
<code><i>[dspace]</i>/config</code> directory must be modified,
as specified in the Workflow Pre-ingest instructions.</p>

<p>Note that the Jhove initialization code
(in <code>org.dspace.app.techmdextractor.jhove.JhoveExtractor</code>) also
checks for the configuration variable <code>jhove.sax.class</code>. I added
this check because I always get errors when parsing the Jhove configuration
file, although they don't cause the code to fail. See the "Known Issues"
section of the documentation for more information.</p>
</li>
</ol>
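<p>For the logging step above, the following sketch shows how a java.util.logging properties file can raise the level for the Jhove packages only. The property values are assumptions for illustration; the actual contents of jhoveLogging.properties may differ, and the class name JhoveLoggingSketch is hypothetical. The properties are loaded from a string here so that the example is self-contained.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

/** Illustrative only: shows per-package levels in a java.util.logging config. */
final class JhoveLoggingSketch {

    /** Assumed properties; the real jhoveLogging.properties may look different. */
    static final String SAMPLE_PROPERTIES =
          "handlers=java.util.logging.ConsoleHandler\n"
        + ".level=WARNING\n"                                // default for everything else
        + "java.util.logging.ConsoleHandler.level=ALL\n"
        + "edu.harvard.hul.ois.jhove.level=FINE\n";         // more detail for Jhove code

    /** Replaces the current java.util.logging configuration with the given one. */
    static void configure(String properties) {
        try {
            LogManager.getLogManager().readConfiguration(
                new ByteArrayInputStream(properties.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read logging configuration", e);
        }
    }

    /** Reports whether a message at the given level would pass the logger's level check. */
    static boolean wouldLog(String loggerName, Level level) {
        return Logger.getLogger(loggerName).isLoggable(level);
    }

    public static void main(String[] args) {
        configure(SAMPLE_PROPERTIES);
        // Loggers below edu.harvard.hul.ois.jhove inherit the FINE level.
        System.out.println(wouldLog("edu.harvard.hul.ois.jhove.module.PdfModule", Level.FINE));
        // Unrelated loggers fall back to the root level (WARNING).
        System.out.println(wouldLog("org.example.other", Level.FINE));
    }
}
```

<p>This only affects the java.util.logging side. The DSpace-specific code logs through log4j, so its output is controlled by the regular DSpace logging configuration.</p>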

Running TechMDExtractor

<p>From the <code><i>[dspace]</i>/bin</code> directory, type</p>

<pre>dsrun org.dspace.app.techmdextractor.ExtractorManager -h</pre>

<p>You'll get a list of command-line options for running the program. Note
that the code for the TechMDExtractor is based heavily on (OK, stolen from)
the MediaFilter code, so many of the options are similar.</p>

Files Changed

<ul>
<li>config/dspace.cfg</li>
</ul>

Files Added

The source code may be found online under CVS here:

http://libaxis1.mit.edu/viewcvs/sandbox/TechMDExtractor/

<ul>
<li>config/jhove.conf</li>

<li>config/jhoveLogging.properties</li>
<li>src/org/dspace/app/techmdextractor/jhove/DSConfigHandler.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/DSJhoveBase.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhoveExtractor.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhoveFilterResult.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhovePreIngestFilter.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhoveTechMD.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/ExtractorManager.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/TechMDExtractorException.java</li>

<li>build.xml</li>
</ul>

Known Issues

<ul><li><p>SAXParser problem: the SAX parser complains when it
parses the jhove.conf file. The messages I get are:</p>

<pre>[Warning] jhove.conf:6:39:SchemaLocation: schemLocation value
= 'http://hul.harvard.edu/oi/xml/xsd/jhove/1.3/jhoveConfig.xsd' must
have even number of URI's.
[Error] jhove.conf:6:39: cvc-elt.1: Cannot find the declaration of
element 'jhoveConfig'</pre>

<p>If you use Jhove 'out of the box', you won't receive these errors.
I believe that Jhove as a stand-alone uses the default Java SAX
parser (Crimson?), whereas DSpace is using Xerces. It seems that
the different parsers probably need to be configured differently. I
don't think the error messages are a problem for the
config file, but I'm not sure how
this affects the parsing of XML docs submitted to Jhove.
I started to play around with this, and the TechMDExtractor
code actually checks the dspace.cfg file to see if a
parser is specified (<code>jhove.sax.class=<i>sax parser name</i></code>).
This needs investigation.</p>

</li>
</ul>
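<p>The jhove.sax.class check described above could be sketched roughly as follows. This is not the TechMDExtractor code: java.util.Properties stands in for dspace.cfg, and the names SaxReaderSketch and createReader are hypothetical. The idea is simply to use the named SAX parser class when one is configured and to fall back to the platform default otherwise.</p>

```java
import java.util.Properties;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

/** Illustrative only: picks a SAX parser based on an optional configuration value. */
final class SaxReaderSketch {

    /** Configuration key named in the documentation. */
    static final String SAX_CLASS_KEY = "jhove.sax.class";

    /**
     * Returns an XMLReader. If the configuration names a parser class, that
     * class is instantiated; otherwise the JAXP default parser is used.
     */
    static XMLReader createReader(Properties config) {
        String saxClass = config.getProperty(SAX_CLASS_KEY);
        try {
            if (saxClass == null || saxClass.trim().isEmpty()) {
                // No parser configured: let JAXP choose the default implementation.
                return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            }
            // Parser configured: load it by name; it must implement XMLReader.
            Object reader = Class.forName(saxClass.trim())
                                 .getDeclaredConstructor()
                                 .newInstance();
            return (XMLReader) reader;
        } catch (Exception e) {
            throw new IllegalStateException(
                "Cannot create SAX reader for '" + saxClass + "'", e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = createReader(new Properties());
        System.out.println("Using parser: " + reader.getClass().getName());
    }
}
```

<p>Printing the class name of the reader, as main does, is a quick way to see which parser is actually in use in a given environment, which may help when comparing stand-alone Jhove with Jhove running inside DSpace.</p>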

</html>