Choosing a Tool for Object Identification, Validation, and Metadata Extraction

_Grace Carpenter, MIT, November 2004_

The DSpace@Cambridge team has chosen JHOVE as the tool to integrate into DSpace for object identification, validation, and metadata extraction, as part of the Digital Preservation Project.

We evaluated two tools: the National Library of New Zealand's Metadata Extractor (referred to here as NLNZ), and Harvard University Library's JHOVE. As far as we know, these are the only two tools that are Open Source (or plan to be) and that provide identification and metadata extraction for digital objects.

The tools have some basic similarities. Both have been designed with extensibility in mind: modules or adapters can be written for new formats and fairly easily added into the system. Both tools output the extracted metadata in XML (or plain text, in the case of JHOVE), which provides the possibility of reformatting or filtering. And both tools provide a GUI and a command line interface. Although I used the GUI at times during the evaluation process if it seemed faster, the tool's integration into DSpace will be purely programmatic, so I haven't included any comments on the GUI here.

Below are the main reasons that we felt JHOVE was a better fit for the project.

Formats Parsed/Identified

JHOVE

- AIFF
- ASCII
- Bytestream
- GIF
- JPEG
- JPEG2000
- PDF
- TIFF
- UTF8
- WAVE
- XML

NLNZ

- BMP
- WAV
- TIFF 6.0
- Word 2
- WordOle
- Default (bytestream)

The existing JHOVE adapters are primarily for non-proprietary formats, whereas NLNZ has chosen to focus more on proprietary/Microsoft formats. Since DSpace at MIT is choosing to support primarily non-proprietary formats, JHOVE's priorities are more in line with ours.

Format Identification Method

There are several ways to determine an object's format: by file extension (which is typically mapped to a MIME type); by reading the "magic numbers" at the beginning of the file; by parsing an object in its entirety and determining whether it's a valid instance of a certain format; or some combination of the above. DSpace currently identifies objects by the file extension, which is not especially reliable, nor very specific (it doesn't, for example, include version information).
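To make the difference between the first two methods concrete, here is a minimal sketch in Python. It is not code from DSpace, JHOVE, or NLNZ; the function names and the small extension table are hypothetical and exist only for illustration. The byte signatures themselves are the standard ones for these formats.

```python
import os

# Leading byte signatures ("magic numbers") for a few well-known formats.
# Each entry is (signature bytes, format label).
SIGNATURES = [
    (b"GIF87a", "GIF 87a"),
    (b"GIF89a", "GIF 89a"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]

# Hypothetical extension table, standing in for a format registry.
EXTENSIONS = {
    ".gif": "GIF",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".pdf": "PDF",
    ".tif": "TIFF",
    ".tiff": "TIFF",
}


def identify_by_extension(filename):
    """Guess the format from the file name alone; the content is never read."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext)


def identify_by_magic(header):
    """Guess the format from the first bytes of the file content."""
    for signature, label in SIGNATURES:
        if header.startswith(signature):
            return label
    return None
```

A GIF that has been renamed to `report.pdf` shows the weakness of the first method: `identify_by_extension("report.pdf")` reports PDF, while `identify_by_magic` applied to the file's first bytes reports GIF along with its version. Neither method says anything about whether the rest of the file is valid.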

NLNZ also uses file extensions to identify an object's format, and therefore doesn't give us anything beyond what we already have for identification. JHOVE's default method of identifying an object's format is to iterate through its collection of modules/adapters and do a complete parse of the object with each adapter. If adapter X is able to parse the object successfully, then JHOVE determines that the object is of format X, and stops iterating through the adapters. It is also possible, though, to specify identification by magic number only.
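The iterate-and-parse approach described above can be sketched roughly as follows. This is an illustration of the general idea in Python, not JHOVE's actual API: the module list, the parser functions, and `identify_by_parse` are all hypothetical, and the "parsers" are deliberately trivial stand-ins for real format modules.

```python
import xml.etree.ElementTree as ET


def parse_xml(data):
    ET.fromstring(data)  # raises ParseError if the data is not well-formed XML


def parse_ascii(data):
    data.decode("ascii")  # raises UnicodeDecodeError on any non-ASCII byte


def parse_utf8(data):
    data.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8


def parse_bytestream(data):
    pass  # any sequence of bytes is a valid bytestream


# Hypothetical module list. More specific formats come first, because the
# loop below stops at the first module that parses the object successfully.
MODULES = [
    ("XML", parse_xml),
    ("ASCII", parse_ascii),
    ("UTF-8", parse_utf8),
    ("Bytestream", parse_bytestream),
]


def identify_by_parse(data, modules=MODULES):
    """Return the name of the first module that fully parses the data."""
    for name, parse in modules:
        try:
            parse(data)
        except Exception:
            continue  # this module rejected the object; try the next one
        return name
    return None
```

In this sketch the order of the modules matters: a well-formed XML document in plain ASCII would also pass the ASCII check, so the more specific module has to be tried first, and the catch-all bytestream module goes last.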

How to definitively identify an object's format is still an open question; we will probably make DSpace configurable so that each DSpace instance can determine which method to use. The general consensus, though, is that using file extensions alone is incomplete and susceptible to error.

Extracted Metadata

Since there is only one format (TIFF) that is parsed by both NLNZ and JHOVE, I don't have much of a basis for comparison. I'm also in the unfortunate position of having to judge by quantity, rather than quality, of metadata, since I don't know the specifics of metadata for various formats. But here are my observations.

In the end I wasn't able to use even TIFF files as a basis for comparison, since I wasn't able to successfully parse a TIFF file with NLNZ. In one case it appeared to be because the file was in big-endian format, and NLNZ seems to parse only little-endian. In the other two cases, the log recorded an array-out-of-bounds exception, but with no further information. As a result I wasn't able to do any comparison of the actual metadata produced by the two tools. However, from glancing at NLNZ's TIFF DTD, my suspicion is that JHOVE extracts a more exhaustive set of metadata, which includes, for instance, NISO image data.
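For context on the byte-order issue: a TIFF file declares its own byte order in its first two bytes ("II" for little-endian, "MM" for big-endian), followed by the number 42 and the offset of the first image file directory, both written in that byte order. The sketch below shows how a parser can honor either order. It is my own illustration in Python, not code from NLNZ or JHOVE, and the function name is hypothetical.

```python
import struct


def read_tiff_header(header):
    """Return (byte_order, first_ifd_offset) from the first 8 bytes of a TIFF."""
    if len(header) < 8:
        raise ValueError("a TIFF header is 8 bytes long")
    order_mark = header[:2]
    if order_mark == b"II":
        prefix, byte_order = "<", "little-endian"
    elif order_mark == b"MM":
        prefix, byte_order = ">", "big-endian"
    else:
        raise ValueError("unrecognized TIFF byte-order mark")
    # A 16-bit magic number and a 32-bit offset, both in the declared order.
    magic, ifd_offset = struct.unpack(prefix + "HI", header[2:8])
    if magic != 42:
        raise ValueError("missing TIFF magic number 42")
    return byte_order, ifd_offset
```

A parser that assumes little-endian throughout would misread every multi-byte value in an "MM" file, which is consistent with, though not proof of, the failure I saw.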

It looks to me as though JHOVE's design for handling extracted metadata in XML form is much more open-ended and therefore much more accommodating of future changes to format specifications. While NLNZ contains a DTD for each format which specifies each metadata element, JHOVE uses a schema that consists primarily of non-specific <value> tags associated with <property> tags.
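The appeal of the generic approach can be illustrated with a short Python sketch. The element layout here (a `<name>` and a `<values>` wrapper inside each `<property>`) is a simplification I've assumed for the example, not JHOVE's exact schema, and the sample metadata values are made up.

```python
import xml.etree.ElementTree as ET


def property_element(name, value):
    """Build a generic <property> element for a scalar, a list, or a dict."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    values = ET.SubElement(prop, "values")
    if isinstance(value, dict):
        # Nested properties: each key becomes a child <property>.
        for child_name, child_value in value.items():
            values.append(property_element(child_name, child_value))
    else:
        items = value if isinstance(value, (list, tuple)) else [value]
        for item in items:
            ET.SubElement(values, "value").text = str(item)
    return prop


# Made-up sample metadata; a new key needs no change to the output schema.
metadata = {
    "ByteOrder": "big-endian",
    "ImageWidth": 640,
    "BitsPerSample": [8, 8, 8],
}
root = property_element("TIFFMetadata", metadata)
xml_text = ET.tostring(root, encoding="unicode")
```

Because every piece of metadata is expressed with the same few elements, a property added in a later version of a format specification changes only the data, not the schema. With a per-format DTD that names each element, the DTD itself has to be revised.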

Validation

Determining what makes an object a valid instance of a certain format is a complex issue. JHOVE has attempted to define this for each of the formats it supports, while NLNZ doesn't include any validation functionality.

Other issues

Documentation
NLNZ could be improved by the addition of Javadocs.

'Complex' objects

NLNZ provides the option of grouping files into 'complex objects' and then parsing them together, and combining the output. This seems like an appealing option, although I haven't really explored its utility.

Logging
NLNZ includes a home-grown logging class, but there doesn't seem to be a way to configure it. JHOVE provides no logging at all.

Communication, funding
The fact that JHOVE was created just a few miles from MIT was definitely a point in its favor, since it means we won't have to deal (at least in the present project configuration) with cross-hemisphere communication. JHOVE also has good prospects for ongoing funding and development, whereas there's been little information about the future development of NLNZ.
