Following Scott Yeadon and Larry Stone's suggestion that we use the plugin framework for format identification and validation tools, I began looking into how we might fit alternatives to JHOVE into this framework. At the moment I believe there is only one other serious competitor, the UK National Archives' DROID tool (as far as I can tell, the National Library of New Zealand's Metadata Extractor is not being maintained or updated). DROID looks promising, but integrating it into DSpace raises a number of questions. The tools are different enough that I thought I should back up and revisit some of the basic issues with format identification. As per Larry's suggestion, below are some scenarios/use cases involving file formats (sorry, I'm too lazy to make these into formal use cases). Please feel free to insert comments.

Scenarios:

Issues

Format Identifiers

In the future there may be one standard scheme for identifying formats, which will all be described in the Global Digital Format Registry... we hope. Or perhaps PRONOM's PUIDs (persistent unique identifiers) will be used. Until then, we're stuck with the problem that each tool has its own list of formats, and its own scheme for format identifiers. How do we map the different format identifiers to our own?

For JHOVE, which deals with 12 formats, this isn't a big issue. DROID, however, currently has over 100 formats to which it has assigned PUIDs, and will probably have several hundred more in the future. Although DROID's creators hope to create a web service to access the format database, I don't think we should plan on relying on it anytime soon.

MIME types seem like the most obvious option for translating between different systems. At the moment they are the closest thing to a globally accepted scheme for format identifiers. Although MIME types do not have sufficient granularity for preservation purposes (e.g., no version information), they provide most of what DSpace needs for current, everyday functioning. Nevertheless, there are some drawbacks to using MIME types:

1) Some formats may not have a MIME type.

For example, the files that comprise an ESRI shapefile (.shp, .dbf) apparently do not have a MIME type.

2) Two (or more) formats may have the same MIME type.

Right now this is an issue with internal formats; are there any other formats for which this is a problem?

3) DROID doesn't output MIME types, at least for the time being.

It seems like it might be possible to persuade the DROID authors to include them, or to change the code ourselves.
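To make the mapping question concrete, here is a minimal sketch of one way it could work: look up the tool-specific identifier first, and fall back to MIME type only when the tool's ID is unknown to us. Everything here is an assumption for illustration. The function name, the two lookup tables, and their entries are hypothetical, and the local format names stand in for whatever we actually keep in the bitstream format registry.

```python
from typing import Optional

# Hypothetical lookup tables. The entries are illustrative only,
# not a vetted mapping of PUIDs or MIME types to local formats.
PUID_TO_LOCAL = {
    "fmt/61": "Microsoft Excel",
    "fmt/62": "Microsoft Excel",
}
MIME_TO_LOCAL = {
    "application/vnd.ms-excel": "Microsoft Excel",
    "application/pdf": "Adobe PDF",
}


def resolve_local_format(puid: Optional[str] = None,
                         mime: Optional[str] = None) -> Optional[str]:
    """Return a local format name, preferring the tool-specific ID over MIME type."""
    if puid is not None and puid in PUID_TO_LOCAL:
        return PUID_TO_LOCAL[puid]
    if mime is not None and mime in MIME_TO_LOCAL:
        return MIME_TO_LOCAL[mime]
    # Neither identifier is known; the caller decides what to do next.
    return None
```

Trying the tool-specific ID first means drawbacks 1 and 2 above only matter for formats the tool cannot name; the cost is that somebody has to maintain the first table by hand.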

Format Descriptions

What text do we use to describe the format to the user? Right now we use the short_desc field of bitstream_format. If we import many of DROID's formats, adding the short_desc text to each format could be time-intensive. Would MIME type work, if no short_desc is immediately available?
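If MIME type is acceptable as a stand-in, the fallback could be as simple as the sketch below. The function name and the final "Unknown" label are my own assumptions; only short_desc comes from the existing schema.

```python
from typing import Optional


def display_label(short_desc: Optional[str], mime_type: Optional[str]) -> str:
    """Pick the text shown to the user for a format.

    Prefers the curated short description, then the MIME type,
    then a generic placeholder (the placeholder text is an assumption).
    """
    if short_desc and short_desc.strip():
        return short_desc.strip()
    if mime_type and mime_type.strip():
        return mime_type.strip()
    return "Unknown"
```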

JHOVE Issues

Identification, validation, and technical metadata extraction are all done as part of a single process. Can we develop an architecture that will separate identification and validation into two steps (not sure about metadata extraction), but not repeat work unnecessarily?
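One possible shape for this, sketched under assumptions: wrap the combined tool in an object that runs it at most once per file and caches the result, then expose identification and validation as separate calls that both read from the cache. The class, the result fields, and the stand-in tool function below are all hypothetical and are not JHOVE's API. Keying the cache by file path is also a simplification; a checksum would be a safer key.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AnalysisResult:
    """Everything a combined tool reports in a single pass (hypothetical fields)."""
    format_id: str
    valid: bool


class CachedAnalyzer:
    """Exposes identify() and validate() separately, but runs the tool once per file."""

    def __init__(self, run_tool: Callable[[str], AnalysisResult]) -> None:
        self._run_tool = run_tool
        self._cache: Dict[str, AnalysisResult] = {}
        self.tool_runs = 0  # counts real tool invocations, for illustration

    def _analyze(self, path: str) -> AnalysisResult:
        if path not in self._cache:
            self._cache[path] = self._run_tool(path)
            self.tool_runs += 1
        return self._cache[path]

    def identify(self, path: str) -> str:
        return self._analyze(path).format_id

    def validate(self, path: str) -> bool:
        return self._analyze(path).valid


def fake_tool(path: str) -> AnalysisResult:
    """Stand-in for a real tool run; always reports the same made-up result."""
    return AnalysisResult(format_id="example-format", valid=True)


analyzer = CachedAnalyzer(fake_tool)
analyzer.identify("report.pdf")   # runs the tool
analyzer.validate("report.pdf")   # served from the cache
```

A tool that only identifies (DROID, say) could sit behind the same two calls with validate() delegating elsewhere, which is roughly what the plugin framework would need.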

DROID Issues

In certain situations DROID may return several different format IDs for one bitstream. For instance, this is the output I got for an Excel 2002 file:

<?xml version="1.0" encoding="UTF-8"?>
<FileCollection DROIDVersion="V1.0" SigFileVersion="9" DateCreated="2006-04-24T16:31:23">
  <IdentificationFile Name="c:\docs\dpmatrix.xls" IdentQuality="Positive" Warning="" >
    <FileFormatHit HitStatus="Positive (Generic Format)" FormatName="Binary Interchange File Format (BIFF) Workbook" FormatVersion="8" FormatPUID="fmt/61" HitWarning="" />
    <FileFormatHit HitStatus="Positive (Generic Format)" FormatName="Binary Interchange File Format (BIFF) Workbook" FormatVersion="8X" FormatPUID="fmt/62" HitWarning="" />
    <FileFormatHit HitStatus="Positive (Specific Format)" FormatName="OLE2 Compound Document Format" FormatPUID="fmt/111" HitWarning="Possible file extension mismatch" />
  </IdentificationFile>
</FileCollection>

(I'm a little mystified as to why the OLE2 ID is considered more specific than BIFF.) How would we choose which format to map to a DSpace BitstreamFormat?
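One policy we could try, sketched here as a strawman rather than a recommendation: ignore any hit that carries a warning, map the remaining PUIDs to our own formats, and accept the result only if they all land on the same local format. For the output above, fmt/61 and fmt/62 would both map to a single spreadsheet format, so the ambiguity disappears at our level of granularity. The mapping table and function name are hypothetical, and the parsing assumes DROID's output has exactly the element and attribute names shown in the sample (with the XML declaration left off the embedded copy).

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Copy of the sample output above, minus the XML declaration.
SAMPLE = r"""<FileCollection DROIDVersion="V1.0" SigFileVersion="9" DateCreated="2006-04-24T16:31:23">
  <IdentificationFile Name="c:\docs\dpmatrix.xls" IdentQuality="Positive" Warning="" >
    <FileFormatHit HitStatus="Positive (Generic Format)" FormatName="Binary Interchange File Format (BIFF) Workbook" FormatVersion="8" FormatPUID="fmt/61" HitWarning="" />
    <FileFormatHit HitStatus="Positive (Generic Format)" FormatName="Binary Interchange File Format (BIFF) Workbook" FormatVersion="8X" FormatPUID="fmt/62" HitWarning="" />
    <FileFormatHit HitStatus="Positive (Specific Format)" FormatName="OLE2 Compound Document Format" FormatPUID="fmt/111" HitWarning="Possible file extension mismatch" />
  </IdentificationFile>
</FileCollection>"""

# Hypothetical mapping from PUIDs to local format names (illustrative entries only).
PUID_TO_LOCAL = {
    "fmt/61": "Microsoft Excel",
    "fmt/62": "Microsoft Excel",
}


def choose_local_format(droid_xml: str) -> Optional[str]:
    """Return the single local format all warning-free hits agree on, else None."""
    root = ET.fromstring(droid_xml)
    candidates = set()
    for hit in root.iter("FileFormatHit"):
        if hit.get("HitWarning"):
            continue  # skip hits that DROID itself flagged
        local = PUID_TO_LOCAL.get(hit.get("FormatPUID", ""))
        if local is not None:
            candidates.add(local)
    if len(candidates) == 1:
        return candidates.pop()
    return None  # no usable hit, or hits disagree: leave it for a human


print(choose_local_format(SAMPLE))
```

This sidesteps the generic/specific question entirely, which may or may not be a good thing. When the surviving hits map to different local formats we would still need a tie-breaking rule or a manual step.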