Following Scott Yeadon and Larry Stone's suggestion that we use the plugin framework for format identification and validation tools, I began looking into how we might fit alternatives to JHOVE into this framework. At the moment I believe there is only one other serious competitor, the UK National Archives' DROID tool (as far as I can tell, the National Library of New Zealand's Metadata Extractor is not being maintained or updated). DROID looks promising, but its possible integration into DSpace raises a number of questions. The tools are different enough that I thought I should back up and revisit some of the basic issues with format identification. As per Larry's suggestion, below are some scenarios/use cases involving file formats (sorry, I'm too lazy to make these into formal use cases). Please feel free to insert comments.

Scenarios:

The browser needs the MIME type to display the document correctly.
A user is looking for data files on a particular topic. She finds a file that doesn't have an extension, or that has an extension she doesn't recognize. Knowing the format helps her identify whether the file contains data or narrative before going to the trouble of opening it.
A user wants to look at an image stored in DSpace. The browser opens the file with the default application for that format, but the user wants to know the format so that he can choose different software for viewing it.
The sys admin needs to review the license file for a particular item. The license "format" helps him rapidly identify the correct file. (Note: couldn't he get this information just as easily from the name of the file?)
The sys admin needs to identify all instances of an obsolete version of a format, so that she can migrate them to a current version.
The sys admin has a file of a format that's not in the DSpace format registry, and wants to register the format so that DSpace will be able to correctly identify the MIME type in the future.
A user finds a file that appears to be of a particular MIME type, but can't open it with the current version of the software associated with that MIME type. The file appears to be in an obsolete version of the format. The user or sys admin needs to know which version it is so that he can either use different software or migrate the file.
A user's browser doesn't recognize the format of a file she wants to access. She needs to know the format so that she can install the necessary software to open the file.
A user submits an item to DSpace. The format identification tool in DSpace cannot unambiguously identify the file format, so DSpace must present a list of possible formats from which the user can choose, or provide a means for the user to enter a new format.
A METS package is transferred from one instance of DSpace to another. The DSpace instances must have the same format identifiers for the format types referred to in the METS package, or there must be a way for the ingesting instance of DSpace to add the relevant format identifiers if they don't exist in its registry.

Issues

Format Identifiers

In the future there may be one standard scheme for identifying formats, all of which will be described in the Global Digital Format Registry... we hope. Or perhaps PRONOM's PUIDs (persistent unique identifiers) will be used. Until then, we're stuck with the problem that each tool has its own list of formats, and its own scheme for format identifiers. How do we map the different format identifiers to our own?

For JHOVE, which deals with 12 formats, this isn't a big issue. DROID, however, has currently assigned PUIDs to over 100 formats, and will probably cover several hundred more in the future. Although DROID's creators hope to create a web service for accessing the format database, I don't think we should plan on relying on it anytime soon.
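To make the mapping question concrete, here is a minimal sketch of a local crosswalk table keyed by tool name plus the tool's own identifier. Everything in it is an assumption for illustration: the names `LocalFormat`, `CROSSWALK` and `to_local_format` are hypothetical, and the entries are examples, not a proposed registry. The two PUIDs are the ones DROID reported for the Excel file further down this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalFormat:
    """Hypothetical stand-in for a row in the local format registry."""
    short_desc: str
    mime_type: Optional[str]


# Illustrative crosswalk: (tool name, tool's identifier) -> local format.
# The entries are examples only, not a vetted mapping.
CROSSWALK = {
    ("DROID", "fmt/61"): LocalFormat("Microsoft Excel", "application/vnd.ms-excel"),
    ("DROID", "fmt/62"): LocalFormat("Microsoft Excel", "application/vnd.ms-excel"),
    ("JHOVE", "PDF-hul"): LocalFormat("Adobe PDF", "application/pdf"),
}


def to_local_format(tool: str, external_id: str) -> Optional[LocalFormat]:
    """Return the local format for a tool-specific identifier, or None if unmapped."""
    return CROSSWALK.get((tool, external_id))
```

A lookup that returns None is the case where a sys admin would have to register a new format or extend the table by hand, which is where the maintenance cost of several hundred DROID formats would show up.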

MIME types seem like the most obvious option for going between different systems. At the moment they are the closest thing to a globally accepted scheme for format identifiers. Although MIME types do not have sufficient granularity for preservation purposes (e.g., no version information), they provide most of what DSpace needs for current, everyday functioning. Nevertheless there are some drawbacks to using MIME types:

1) Some formats may not have a MIME type.

For example, the files that comprise an ESRI shapefile (.shp, .dbf) apparently do not have a MIME type.

2) Two (or more) formats may have the same MIME type.

Right now this is an issue with internal formats; are there any other formats for which this is a problem?

3) DROID doesn't output MIME types, at least for the time being.

It seems like it might be possible to persuade the DROID authors to include them, or we could change the code ourselves.
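The first two drawbacks can be checked mechanically against whatever registry we end up with. The sketch below assumes the registry can be read as simple (description, MIME type) pairs; the function name `mime_type_problems` and the sample rows are made up for illustration and do not reflect the actual contents of any registry.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def mime_type_problems(
    registry: Iterable[Tuple[str, Optional[str]]]
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Report formats with no MIME type and MIME types shared by several formats."""
    missing: List[str] = []
    by_mime: Dict[str, List[str]] = defaultdict(list)
    for short_desc, mime_type in registry:
        if not mime_type:
            missing.append(short_desc)          # drawback 1: no MIME type at all
        else:
            by_mime[mime_type].append(short_desc)
    # drawback 2: keep only MIME types that more than one format claims
    shared = {m: names for m, names in by_mime.items() if len(names) > 1}
    return missing, shared


# Invented sample rows, for illustration only.
SAMPLE_REGISTRY = [
    ("Plain text", "text/plain"),
    ("Deposit license (internal)", "text/plain"),
    ("Adobe PDF", "application/pdf"),
    ("ESRI shapefile component", None),
]
```

Running this over a real registry would at least tell us how often MIME type alone fails as a key before we commit to it.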

Format Descriptions

What text do we use to describe the format to the user? Right now we use the short_desc field of bitstream_format. If we import many of DROID's formats, adding the short_desc text to each format could be time-intensive. Would MIME type work, if no short_desc is immediately available?
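If we did fall back to MIME type, the display logic could be as small as the sketch below. The function name `display_label` and the final "Unknown" label are assumptions, not existing DSpace behaviour.

```python
from typing import Optional


def display_label(short_desc: Optional[str], mime_type: Optional[str]) -> str:
    """Pick the text shown to the user for a format.

    Prefer the human-written description, then the MIME type,
    then a generic placeholder (an assumed label).
    """
    if short_desc and short_desc.strip():
        return short_desc.strip()
    if mime_type and mime_type.strip():
        return mime_type.strip()
    return "Unknown"
```

This keeps imported formats usable right away while leaving room to fill in descriptions later.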

JHOVE Issues

Identification, validation, and technical metadata extraction are all done as part of a single process. Can we develop an architecture that will separate identification and validation into two steps (not sure about metadata extraction), but not repeat work unnecessarily?
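One possible shape for this, sketched under assumptions: run the single-pass tool once per bitstream, cache its report (keyed by something like the bitstream's checksum), and expose identification and validation as two separate calls that both read from the cache. `ToolReport` and `CachedFormatTool` are hypothetical names, and `run_tool` stands in for whatever actually invokes the tool; nothing here calls JHOVE itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ToolReport:
    """Hypothetical summary of one single-pass tool run."""
    format_id: str
    valid: bool


class CachedFormatTool:
    """Expose identify() and validate() separately over a single tool run."""

    def __init__(self, run_tool: Callable[[bytes], ToolReport]):
        self._run_tool = run_tool            # the expensive single-pass call
        self._cache: Dict[str, ToolReport] = {}
        self.runs = 0                        # number of real tool invocations

    def _report(self, key: str, data: bytes) -> ToolReport:
        # Only invoke the tool the first time a given key is seen.
        if key not in self._cache:
            self._cache[key] = self._run_tool(data)
            self.runs += 1
        return self._cache[key]

    def identify(self, key: str, data: bytes) -> str:
        return self._report(key, data).format_id

    def validate(self, key: str, data: bytes) -> bool:
        return self._report(key, data).valid
```

With this arrangement a caller that only needs the format never asks about validity, and a later validation step reuses the earlier run instead of repeating it. Whether technical metadata extraction belongs in the same report is left open here.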

DROID Issues

In certain situations DROID may return several different format IDs for one bitstream. For instance, this is the output I got for an Excel 2002 file:

<?xml version="1.0" encoding="UTF-8"?>
<FileCollection DROIDVersion="V1.0" SigFileVersion="9" DateCreated="2006-04-24T16:31:23">
  <IdentificationFile Name="c:\docs\dpmatrix.xls" IdentQuality="Positive" Warning="" >
    <FileFormatHit HitStatus="Positive (Generic Format)" FormatName="Binary Interchange File Format (BIFF) Workbook" FormatVersion="8" FormatPUID="fmt/61" HitWarning="" />
    <FileFormatHit HitStatus="Positive (Generic Format)" FormatName="Binary Interchange File Format (BIFF) Workbook" FormatVersion="8X" FormatPUID="fmt/62" HitWarning="" />
    <FileFormatHit HitStatus="Positive (Specific Format)" FormatName="OLE2 Compound Document Format" FormatPUID="fmt/111" HitWarning="Possible file extension mismatch" />
  </IdentificationFile>
</FileCollection>

(I'm a little mystified as to why the OLE2 ID is considered more specific than BIFF.) How would we choose which format to map to a DSpace BitstreamFormat?
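To explore that question, here is a rough sketch that reads output shaped like the report above and applies one possible policy: accept the result only when every hit agrees on a single PUID, and otherwise hand the full list of candidates back so the user can choose (as in the submission scenario). The element and attribute names come from the output above; the function names `droid_hits` and `choose_puid` and the policy itself are assumptions, not anything DROID prescribes. The sample is the same report with the XML declaration and DateCreated attribute left out for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

# Raw string so the backslashes in the Windows path are kept literally.
SAMPLE = r"""<FileCollection DROIDVersion="V1.0" SigFileVersion="9">
  <IdentificationFile Name="c:\docs\dpmatrix.xls" IdentQuality="Positive" Warning="">
    <FileFormatHit HitStatus="Positive (Generic Format)" FormatName="Binary Interchange File Format (BIFF) Workbook" FormatVersion="8" FormatPUID="fmt/61" HitWarning="" />
    <FileFormatHit HitStatus="Positive (Generic Format)" FormatName="Binary Interchange File Format (BIFF) Workbook" FormatVersion="8X" FormatPUID="fmt/62" HitWarning="" />
    <FileFormatHit HitStatus="Positive (Specific Format)" FormatName="OLE2 Compound Document Format" FormatPUID="fmt/111" HitWarning="Possible file extension mismatch" />
  </IdentificationFile>
</FileCollection>"""


def droid_hits(xml_text: str) -> Dict[str, List[dict]]:
    """Map each identified file name to its list of format hits."""
    root = ET.fromstring(xml_text)
    results: Dict[str, List[dict]] = {}
    for ident in root.iter("IdentificationFile"):
        results[ident.get("Name")] = [
            {
                "puid": hit.get("FormatPUID"),
                "name": hit.get("FormatName"),
                "version": hit.get("FormatVersion"),  # None when absent
                "status": hit.get("HitStatus"),
                "warning": hit.get("HitWarning"),
            }
            for hit in ident.iter("FileFormatHit")
        ]
    return results


def choose_puid(hits: List[dict]) -> Tuple[Optional[str], List[dict]]:
    """Return (puid, []) if all hits agree, else (None, candidates)."""
    puids = {hit["puid"] for hit in hits}
    if len(puids) == 1:
        return puids.pop(), []
    return None, hits
```

For the Excel file above this policy gives up and returns all three candidates, which suggests we would need either a ranking rule (for example, by HitStatus or HitWarning) or a mapping of several PUIDs onto one BitstreamFormat.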