
...

Here is a summary of all of the proposed changes, by section.

Content Model

...

Bitstream

Each Bitstream still refers to a BitstreamFormat object to identify its data format. In addition, the Bitstream gains two new properties:

  1. A format confidence metric, which indicates (on a coarse symbolic scale) the certainty of the identification of its format, reflecting both accuracy and precision.
  2. The source of the format identification, indicating the tool or mechanism responsible for the format technical metadata in this Bitstream.

...

BitstreamFormat

Although outwardly similar and largely backward-compatible, the BitstreamFormat has been completely gutted and re-implemented. It now serves as a local "cache" of format technical metadata and holds one or more external format identifiers, each of which refers to a complete technical metadata record in an external data format registry.

...

We add a plugin interface to provide access to external data format registries. Each registry is modeled as an implementation of the FormatRegistry interface. The interface is fairly simple: it only supports "importing" a format description into the local BitstreamFormat cache, updating an existing format, and a few queries.

...

Automatic Format Identification

The old org.dspace.content.FormatIdentifier is replaced by a configurable, extensible, plugin-based format identification framework. It is not part of the format registry plugin, because while some format recognition services live in a registry's software suite, others are independent of any registry.

...

The following exceptions are thrown when a fatal error is encountered in the format registry and identification framework. They are similar in meaning to existing exceptions in the DSpace API, such as AuthorizeException – signalling a fatal error with enough context and explanation to communicate the cause to the user or administrator.

...

FormatRegistryException

Sent when there is a fatal error while accessing an external format registry or updating the local cache of format metadata in the DSpace RDBMS. Can be caused by incorrect or missing configuration entries, network problems, filesystem problems, etc.

...

FormatRegistryNotFoundException

A subclass of FormatRegistryException, this exception is sent in particular cases when looking up an external identifier fails although it should have been found (e.g. since it had been found before). In the common case of looking up an identifier for the first time, e.g. through BitstreamFormat.findByIdentifier(), no exception gets thrown because failure is a possibly-expected result.

...

FormatIdentifierException

Thrown when a format identification method encounters a fatal error which would cause it to return a false negative result. For example, if its configuration is missing or incorrect, the method throws this exception rather than silently failing. Simply failing to identify a format is in the realm of expected results and does not cause an exception.

...

FormatIdentifierTemporaryException

Thrown by the format identification method when it fails because of a "temporary" problem, e.g. when a network resource is not available. This subclass of FormatIdentifierException tells the identifier manager that it may succeed when retried later.

...

Bitstream

The most significant change is that the Bitstream now remembers the confidence of its format identification, an enumerated value which indicates the certainty and source of its format identification. There is also a convenience method to access the automatic format identification: since it is almost always used to set the Bitstream's format anyway, this improves code clarity.

Here is an interface view of the API additions:

...

FormatConfidence

Panel

// Ordered symbolic values of format-identification confidence:
// (See Automatic Format Identification section for details)
package org.dspace.content;
public enum FormatConfidence
{
// No format was identified. The unknown format is assigned.
UNIDENTIFIED,
//
// A format was found but it was based on "circumstantial evidence", i.e. external properties of the Bitstream such as its name or a MIME type attached to it on ingest.
CIRCUMSTANTIAL,
//
// The data format was determined by coarse examination of the contents of the Bitstream and comparison against the known characteristics of generic formats such as plain text, or comma-separated-values files.
HEURISTIC,
//
// The format was identified by matching its content positively against an internal signature that describes a "generic" (supertype) format or family of formats.
// (Written with an underscore because a hyphen is not legal in a Java identifier.)
POSITIVE_GENERIC,
//
// The format was identified by matching its content positively against an internal signature that describes a specific, precise, data format.
POSITIVE_SPECIFIC,
//
// Contents of Bitstream are validated as conforming to the identified
// format; this is the highest confidence reached by automatic identification.
VALIDATED,
//
// Format is derived from reliable technical metadata supplied at the time
// the Bitstream was ingested; if applied it is given a priority
// that overrides automatic identifications. Ingest-derived
// formats with a low level of confidence should be assigned CIRCUMSTANTIAL.
INGEST,
//
// Format was identified interactively by a user, which acts as an override of automatic format identification.
MANUAL
}
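Because the confidence values are ordered, two identifications can be compared directly. The following is only a minimal sketch of that idea: it declares its own local copy of the enumeration so that it compiles on its own, and the helper names `outranks` and `best` are invented for illustration and are not part of the proposed API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Local copy of the enumeration so the sketch is self-contained.
// Declaration order defines the ranking, lowest confidence first.
enum FormatConfidence {
    UNIDENTIFIED, CIRCUMSTANTIAL, HEURISTIC, POSITIVE_GENERIC,
    POSITIVE_SPECIFIC, VALIDATED, INGEST, MANUAL
}

class ConfidenceOrderDemo {
    // Hypothetical helper: true when "candidate" ranks strictly above "current".
    static boolean outranks(FormatConfidence candidate, FormatConfidence current) {
        return candidate.compareTo(current) > 0;
    }

    // Hypothetical helper: the highest confidence in a non-empty list.
    static FormatConfidence best(List<FormatConfidence> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        System.out.println(outranks(FormatConfidence.VALIDATED, FormatConfidence.HEURISTIC));
        System.out.println(best(Arrays.asList(
            FormatConfidence.CIRCUMSTANTIAL, FormatConfidence.POSITIVE_SPECIFIC)));
    }
}
```

Relying on declaration order keeps the comparison trivial, at the cost that reordering the constants silently changes the ranking.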

...

Panel

public class Bitstream extends DSpaceObject
{
// Returns the "confidence" recorded when the format of this Bitstream was identified.
public FormatConfidence getFormatConfidence();
//
// Sets the value returned by getFormatConfidence()
public void setFormatConfidence(FormatConfidence value);
//
// Returns the source that identified the format of this Bitstream.
public String getFormatSource();
//
// Sets the value returned by getFormatSource()
public void setFormatSource(String value);
}

...

BitstreamFormat

The BitstreamFormat class, which we will abbreviate as BSF, is essentially gutted and replaced with a new implementation. As described above, it now serves as a local "cache" of technical metadata that comes from external data format registries. Every BSF is bound to at least one format identifier in an external registry so its format technical metadata can be expressed in a way that is recognized outside of DSpace.

...

In the current (version 1.4.x) codebase, the BitstreamFormat object has acquired uses and meanings beyond simply describing a Bitstream's data format – but these interfere with its intended purpose in preservation activities. For example, the BSF has an internal flag which directs the UI to hide Bitstreams of that format from casual view. In an unmodified DSpace installation, the "License" BSF is the only one for which internal is true, to keep deposit license files from appearing in the Web UI. Unfortunately, this usage cripples "License" as an actual format descriptor, since it gets applied to all sorts of Bitstreams that contain licensing information no matter what their actual format. XML, plain-text, and RDF files are all tagged with the "License" BSF to make them disappear, yet it says nothing about the format of their contents.

...

This is the new formal definition of a BitstreamFormat:

  • Each BSF represents a description of a single, unique data format; there is exactly one BSF for each distinct data format referenced by Bitstreams in the DSpace archive.
  • A BSF is bound to one or more entries in external data format registries. The identifiers are logically all peers, although the metadata cached in the BSF is only imported (or updated) from one of them.
    • All external format identifiers which describe the equivalent format must be bound to the same DSpace BSF – in other words, there should never be two BSFs describing the same conceptual format, such as "PDF Version 1.2"; one BSF encompasses all synonym external identifiers.
  • The BSF's function is to describe the data format of the contents of a Bitstream, and nothing more.
    • Application code must not "overload" a BSF with additional implicit meanings, such as marking Bitstreams invisible in a UI or indicating a function such as the deposit license.
  • One special BSF, the unknown format, represents the unknown or unidentified data format.
  • Every Bitstream refers to exactly one BSF:
    • If its format has not been assigned or identified, it is the unknown format.
    • This allows an application to assume every Bitstream has a valid BSF with all of its attendant properties, so e.g. it can get a valid MIME type.
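The binding rules above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the classes `Bsf` and `BsfIndex`, their methods, and the registry names used with them are hypothetical stand-ins, not the proposed DSpace classes, and falling back to the unknown format inside `resolve` is a simplification chosen for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a BitstreamFormat: a name plus its bound identifiers.
class Bsf {
    final String name;
    final Set<String> identifiers = new LinkedHashSet<>(); // keeps binding order
    Bsf(String name) { this.name = name; }
}

class BsfIndex {
    private final Map<String, Bsf> byIdentifier = new HashMap<>();
    final Bsf unknown = new Bsf("Unknown");

    // Bind an external identifier to a BSF. One BSF may hold many identifiers,
    // but an identifier may never be shared by two different BSFs.
    void bind(Bsf bsf, String nsIdentifier) {
        Bsf owner = byIdentifier.get(nsIdentifier);
        if (owner != null && owner != bsf) {
            throw new IllegalStateException(
                nsIdentifier + " is already bound to " + owner.name);
        }
        byIdentifier.put(nsIdentifier, bsf);
        bsf.identifiers.add(nsIdentifier);
    }

    // Never returns null: identifiers with no binding map to the unknown format,
    // so callers can always rely on having a valid BSF.
    Bsf resolve(String nsIdentifier) {
        return byIdentifier.getOrDefault(nsIdentifier, unknown);
    }
}
```

Rejecting a second owner at bind time is what keeps synonym identifiers collected under a single BSF rather than spread across duplicates.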

In the content model, a BitstreamFormat aggregates all the format-related technical metadata for Bitstreams of its type. Not only does this save space, it lets an administrator make changes and adjustments to that metadata easily in one place.

...

This new implementation makes the BitstreamFormat a local cache for the relevant format metadata, but mainly it acts as a reference to the full technical metadata found in one or more external format registries (like the GDFR). It only caches the metadata immediately needed by DSpace, such as MIME type, name, and description. This is adequate for everyday operation of the archive; DSpace never has to go to the external format registry for metadata.

...

The standard namespace values are available as public static fields on the FormatRegistryManager class. The LOC namespace is not really a registry yet but it makes sense to reserve the namespace since it is a significant source of format technical metadata.

...

Note that MIME types cannot be BSF identifiers because they violate the rule that only one BSF may be bound to each identifier. MIME types are imprecise: many BSFs have the same MIME type; e.g. many XML-based formats are tagged "text/xml".

Apple UTIs

Apple Computer has developed what is essentially an alternative to MIME types called Uniform Type Identifiers (UTIs). It is an interesting development, although not directly relevant. Although the UTI database is, in a sense, a registry of format identifiers, it is not a good candidate for use in DSpace for several reasons:

...

Removing the "Internal" Flag

This renovation removes BitstreamFormat's "internal" flag, which was originally intended to guide UI applications in hiding certain classifications of Bitstreams.

...

Its presence is actually harmful: not only does it have nothing to do with describing the format of the data, it actually encouraged usage that obscures the data format. The one "License" BSF was applied to all Bitstreams containing an Item's deposit license and Creative Commons licenses, no matter what their actual data formats. The Creative Commons license consists of three Bitstreams of distinct actual formats – e.g. one is RDF. It is misnamed with the "License" format so it will not be properly preserved.

In the DSpace@MIT registry, I have determined that "License" is the only BitstreamFormat for which "internal" is true, and that Bitstreams whose format is "License" appear only in Bundles named "LICENSE" and "CC_LICENSE". Therefore, we can determine the internal-ness (i.e. invisibility) of a Bitstream by its owning Bundle as accurately as by the bogus BSF. It makes no practical difference which technique is used, but the Bundle-name cue is a better fit with the current content model. It works just as well under the DSpace+2.0 content model, since bundles evolve into "manifestations".
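The Bundle-name cue is simple enough to show in a few lines. This is a minimal sketch, not proposed code: the class `InternalBundles` and the method `isInternal` are hypothetical names, and the set of Bundle names is just the two found in the survey described above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class InternalBundles {
    // Bundle names whose Bitstreams are hidden from casual view.
    private static final Set<String> INTERNAL =
        new HashSet<>(Arrays.asList("LICENSE", "CC_LICENSE"));

    // Hypothetical helper: decide invisibility from the owning Bundle's name
    // instead of from the format of the Bitstream.
    static boolean isInternal(String bundleName) {
        return bundleName != null && INTERNAL.contains(bundleName);
    }

    public static void main(String[] args) {
        System.out.println(isInternal("CC_LICENSE"));
        System.out.println(isInternal("ORIGINAL"));
    }
}
```

With this test in place, a Bitstream in a license Bundle can carry its true format (RDF, plain text, and so on) and still be hidden.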

...

Panel
BitstreamFormat Properties

|| Property || Source || Mod || Description ||
| Name | Registry, Can override | Yes | Brief, human-readable description of this format, for listing it in menus. Used to be "short description". |
| Description | Registry, Can override | Yes | Detailed human-readable explanation of the format including its unique aspects. |
| Identifier | Registry | No | List of all namespaced identifiers linking this BSF to an entry in an established data format registry. A BSF must have at least one identifier. This list is ordered; the first member names the external registry entry that was originally imported to create this BSF. |
| Support-level | User-entered | Yes | Encoding of the local DSpace archive's policy regarding preservation of Bitstreams encoded in this format. Value must be one of: 1. Unset - Policy not yet initialized, flags format entries that need attention from the DSpace administrator. 2. Unrecognized - Format cannot be identified. 3. Known - Format was identified but preservation services are not promised. 4. Supported - Bitstream will be preserved. |
| MIME-type | Registry, Can override | Yes | Canonical MIME type (Internet data type) that describes this format. This is where the Content-Type header's value comes from, when delivering a Bitstream by HTTP. |
| Extension | Registry, Can override | Yes | The canonical filename extension to apply to unnamed Bitstreams when delivering content over HTTP and in DIPs. (NOTE: Some format registries have a list of filename extensions, used to help identify formats, but we only need the canonical extension in the BSF model.) |
| LastUpdated | System | No | Timestamp when this BSF was imported or last updated from its home registry. |

...

BitstreamFormat API

Here is the new API for BitstreamFormat.

Add the following methods:

...

Panel

static BitstreamFormat create(Context context);
void delete();
static BitstreamFormat find(Context context, int id);
static BitstreamFormat findUnknown(Context context);
static BitstreamFormat[] findAll(Context context);
String getDescription();
int getID();
String getMIMEType();
int getSupportLevel();
static int getSupportLevelID(String slevel);
void setDescription(String s);
void setMIMEType(String s);
void setSupportLevel(int sl);
void update();

...

FormatRegistryManager

Following DSpace coding conventions, the factory and static class for a service is named with the suffix -Manager. The FormatRegistryManager class gives access to instances of FormatRegistry. Since a format identifier is directed to a FormatRegistry implementation by its namespace, the Manager also takes care of selecting the right instance for a namespaced identifier. This lets applications use namespaced identifiers without worrying about taking them apart to choose a registry instance.

Since all of the FormatRegistryManager's state is effectively managed by the Plugin Manager, it does not need any state itself and only has static methods.

...

Panel

public class FormatRegistryManager
{
// Namespace for internal format registry - contains only "Unknown"
public static final String INTERNAL_NAMESPACE = "Internal";
//
// Name of the unknown format:
public static final String UNKNOWN_FORMAT_IDENTIFIER = "Unknown";
//
// Applications should use this as default mime-type.
public static final String DEFAULT_MIME_TYPE = "application/octet-stream";
//
// Returns possibly-localized human-readable name of Unknown format.
public static String getUnknownFormatName(Context context);
//
// Returns registry plugin for external format identifier namespace
public static FormatRegistry find(String namespace);
//
// Returns array of *all* Namespace strings, even "artifacts" no longer configured.
public static String[] getAllNamespaces(Context context)
throws SQLException, AuthorizeException;
//
// Returns array of the Namespaces of all currently configured external registries.
public static String[] getRegistryNamespaces();
//
// Calls appropriate registry plugin to import format bound to a namespaced identifier.
// Returns null on error.
public static BitstreamFormat importExternalFormat(Context context, String namespace, String identifier)
throws FormatRegistryException, AuthorizeException;
//
// Calls appropriate registry plugin to update format bound to a namespaced identifier.
// When force is true, update even when external format has not been modified.
public static void updateBitstreamFormat(Context context, BitstreamFormat existing, String namespace, String identifier, boolean force)
throws FormatRegistryException, AuthorizeException;
//
// Calls appropriate registry plugin to compare two namespaced
// identifiers (which must be in the same namespace).
public static boolean conformsTo(String nsIdent1, String nsIdent2)
throws FormatRegistryException;
//
// Creates a namespaced identifier out of separate namespace and registry-specific identifier.
public static String makeIdentifier(String namespace, String identifier);
//
// Returns the namespace or identifier portion of a namespaced identifier.
public static String namespaceOf(String nsIdentifier);
public static String identifierOf(String nsIdentifier);
}
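To illustrate how the three identifier helpers relate to one another, here is a minimal sketch. The text does not say how a namespace and an identifier are joined, so the colon separator is purely an assumption, and the class name `NamespacedIds` is hypothetical; only the method names come from the listing above.

```java
class NamespacedIds {
    // ASSUMPTION: the real separator is not specified; ':' is used for illustration.
    static final char SEPARATOR = ':';

    // Joins a namespace and a registry-specific identifier.
    static String makeIdentifier(String namespace, String identifier) {
        if (namespace.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("Namespace may not contain " + SEPARATOR);
        }
        return namespace + SEPARATOR + identifier;
    }

    // Everything before the first separator.
    static String namespaceOf(String nsIdentifier) {
        return nsIdentifier.substring(0, splitPoint(nsIdentifier));
    }

    // Everything after the first separator; the identifier itself may contain more.
    static String identifierOf(String nsIdentifier) {
        return nsIdentifier.substring(splitPoint(nsIdentifier) + 1);
    }

    private static int splitPoint(String nsIdentifier) {
        int i = nsIdentifier.indexOf(SEPARATOR);
        if (i < 0) {
            throw new IllegalArgumentException("Not a namespaced identifier: " + nsIdentifier);
        }
        return i;
    }
}
```

Splitting at the first separator means registry-specific identifiers are free to contain the separator character, while namespaces are not.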

...

FormatRegistry

The FormatRegistry interface models an external data format registry. We define data format registry as any formally organized and administered collection of technical metadata about data formats. This may include a collection published mainly for human consumption such as the Library of Congress Sustainability of Digital Formats format catalog, as well as those accessible through public APIs such as the GDFR and DROID. The only requirement is that the data formats are named by unchanging, unique identifiers.

A format registry becomes available to DSpace through a Named Plugin implementing the FormatRegistry interface.

A registry has these functions, explained in detail below:

  • Resolve a reference to a format identifier, and create a new BitstreamFormat representing the external format.
  • Update the local cache of metadata in a BitstreamFormat to the current state of the external registry.
  • Answer "conformance" queries - judge whether a Bitstream in one format would also conform to another format (i.e. the former is a subtype of the latter).
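The conformance query can be pictured as a walk up a chain of supertypes. The sketch below is a toy model only: `ToyRegistry`, `declare`, and the idea of storing a single parent per format are assumptions made for illustration, and a real registry may model format relationships quite differently.

```java
import java.util.HashMap;
import java.util.Map;

class ToyRegistry {
    // Maps each format identifier to its immediate supertype (null for a root).
    private final Map<String, String> parentOf = new HashMap<>();

    // Hypothetical setup call: record a format and its supertype, if any.
    void declare(String identifier, String parent) {
        parentOf.put(identifier, parent);
    }

    // True when "sub" is the same format as "sup" or a (transitive) subtype of it.
    boolean conformsTo(String sub, String sup) {
        int guard = parentOf.size() + 1; // bounds the walk even if data is cyclic
        for (String cur = sub; cur != null && guard-- > 0; cur = parentOf.get(cur)) {
            if (cur.equals(sup)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, after declaring a comma-separated-values format as a subtype of plain text, a query from the former to the latter succeeds while the reverse query fails.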

...

Format registry implementations are tightly coupled with DSpace. By this we mean they must be able to respond to frequent queries quickly, with low latency and high reliability. The format registry must be available to complete some common operations such as ingestion and selection of applications like MediaFilter.

This is only likely to be an issue at all with registries that are attached through the network. Registries that exist in local data files or RDBMS tables can share server resources with the DSpace archive.

Network-based registries might use a local "cache" server sharing the DSpace host to increase reliability. The GDFR architecture explicitly encourages this sort of configuration. Otherwise, it might be necessary for the FormatRegistry implementation to add caching of its own to increase performance.

...

Although some of the tools to automatically identify formats are tied to format registries, this registry interface does not have anything to do with format identification. The identification tools are accessed through a separate plugin interface, discussed below.

...

FormatRegistry API

Here is the API of the FormatRegistry. The plugin's name is also the DSpace string value representing its namespace. It is implemented as a self-named plugin, so that the instance itself knows its namespace without depending on each DSpace administrator to get it right. The namespaces must be consistent between DSpace installations so that format technical metadata (i.e. PREMIS elements in AIPs) can be meaningfully exchanged.

...

The unknown BSF is installed with the system, but for consistency, it is also derived from an entry in an "external" format registry. Since it is the only BSF which is absolutely mandatory, this registry must always be available, so it is a hard-coded registry that is always configured.

The FormatRegistryManager maps the namespace DSpace-Internal to a special registry object which only recognizes the "Unknown" format identifier. The first reference to that format identifier, e.g. by the method BitstreamFormat.findUnknown(), "imports" it to create the unknown format BSF.

...

To add entries to the Provisional format registry, the DSpace administrator edits its configuration file (in a documented XML format similar to the current bitstream-formats.xml initialization file) and restarts any relevant DSpace processes. Since changes should be very infrequent this should not be a burden.

...

The "DSpace" registry includes most of the traditional, loosely-defined, format names, like "Text", "Adobe PDF", "HTML". It offers a simple solution for DSpace administrators who do not need precise and detailed format identification, nor the digital preservation tools that require it. Since it includes most of the formats from previous DSpace releases under their same names, it also gives a degree of backward-compatibility.

...

As soon as a new format does become available in some external registry, you can add the new external identifier to its BitstreamFormat, perhaps updating the BSF's local metadata from its external registry.

Ideally, you will only employ Provisional formats when there will eventually be an entry in a globally-recognized registry for the format. For example, if you are adding a format to the GDFR but need to apply it to a Bitstream immediately, before the GDFR editorial process accepts it, you could create it in the Provisional registry to have it available immediately. Later, once the GDFR has an entry for it, add the GDFR identifier to the BitstreamFormat you already created. Then, DIPs of objects in that format will bear the GDFR format identifier that is recognizable to other archives, and your Bitstreams will also have linkage to any preservation metadata in the GDFR.

...

The framework is a common API to which format identification services conform. This lets DSpace treat them as a "stack" of plugin implementations, trying each one in turn and choosing the best of all their results. The API consists of:

  • FormatHit
    A class encapsulating the information returned by a single potential format identification "hit"
  • FormatConfidence
    A set of enumerated values that quantifies the "confidence" (certainty, accuracy) of format identifications for comparison on a common basis.
  • FormatIdentifier
    Interface of the plugin class that actually identifies formats.
  • FormatIdentifierManager
    Static class to operate the plugin stack and return a format identification verdict.

The

...

FormatHit Object

A FormatHit is a record of the results of one format-identification match. It contains the following fields:

...

Each attempt at automatic identification of a Bitstream's format returns a Collection of FormatHit objects, representing the possible matches. The list is sorted by accuracy and confidence of hit.

Confidence of Format Identification Hits

The FormatHit includes a confidence metric, which represents the accuracy and certainty of the identification. It is an enumerated type of ordered, symbolic values implemented as a Java 5 enumeration.

The specific values are described above, under the description of the FormatConfidence object.

FormatHit includes a confidence rating so hits can be compared on the basis of confidence, and so it can be stored in the Bitstream object whose format was identified.

The confidence values have a greater range and granularity than seems possible given DSpace's simple format model; i.e. DSpace does not distinguish between "generic" and "specific" formats. However, the actual automatic format identification is done by plugin implementations, some of which are driven by external format registries. These have access to more sophisticated format models and data, including notions of format granularity, so the confidence metrics reflect that.

...

FormatIdentifier Interface

Automatic format identification is accomplished by plugins implementing the FormatIdentifier interface. Each plugin applies its own technique toward identifying the format of the Bitstream. There is no direct relationship between external data format registries and format identifying plugins: a single plugin can utilize several registries or none, and different plugins can use the same external registry.

Note that the FormatHit returned by the identification process contains an external format identifier, not a BitstreamFormat. The archive administrator is responsible for ensuring that all external format identifiers returned by automatic identification methods can be imported, i.e. that the relevant registries are configured.

...

Panel

package org.dspace.content.format;
public interface FormatIdentifier
{
// identify format of "target", add hits to "results"
public List<FormatHit> identifyFormat(Context context, Bitstream target, List<FormatHit> results)
throws FormatIdentifierException, AuthorizeException;
}

The identifyFormat() method attempts to identify the data format of the given Bitstream, and delivers its results by adding a new FormatHit at the appropriate point on the results list. It returns the resulting list (possibly either modified or replaced) as its value; the caller must anticipate that it may be a different Object.

An identifier method can add results anywhere in the result list, or use the default algorithm implemented by FormatHit.addToResults(). It is described in the next section, Implementing Automatic Format Identification.

The identifier method should throw FormatIdentifierException when it encounters a fatal error that prevents it from properly identifying the format of the Bitstream. Otherwise, there would be no way to tell the difference between a Bitstream that does not match any of the formats this method identifies, and a fatal error in the identifying code (e.g. configuration problem), since the results list is simply returned unmodified in both cases.

When the identifier cannot return valid results because of a temporary condition that may be cleared up later – e.g. a network resource that is temporarily unavailable – it can throw the FormatIdentifierTemporaryException to indicate that the results may change in the future.

A note on the object lifecycle: One instance of FormatIdentifier is created per JVM; it gets cached and reused. The identifyFormat() method is assumed to be thread-safe. If it is not, the implementing class should have it call an internal method which is synchronized on itself.

Typically, only the internal FormatIdentifierManager code ever calls identification methods.
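The exception contract described above can be sketched with a much-reduced model. Everything here is an assumption made for illustration: hits are plain strings, `SimpleIdentifier` and `IdentifierStack` are hypothetical names, the two exception classes are local stand-ins, and the policy of noting temporary failures while letting fatal ones propagate is one possible reading, not documented manager behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-ins for the two exception types named in the text.
class FormatIdentifierException extends Exception {
    FormatIdentifierException(String message) { super(message); }
}

class FormatIdentifierTemporaryException extends FormatIdentifierException {
    FormatIdentifierTemporaryException(String message) { super(message); }
}

// Hypothetical, reduced plugin interface: hits are just identifier strings.
interface SimpleIdentifier {
    List<String> identifyFormat(String bitstreamName, List<String> results)
        throws FormatIdentifierException;
}

class IdentifierStack {
    // Messages from plugins that failed temporarily and are worth retrying.
    final List<String> retryLater = new ArrayList<>();

    List<String> run(List<SimpleIdentifier> plugins, String bitstreamName)
            throws FormatIdentifierException {
        List<String> results = new ArrayList<>();
        for (SimpleIdentifier plugin : plugins) {
            try {
                // The plugin may return a different list object, so always reassign.
                results = plugin.identifyFormat(bitstreamName, results);
            } catch (FormatIdentifierTemporaryException e) {
                retryLater.add(e.getMessage()); // transient: note it and continue
            }
            // Any other FormatIdentifierException is fatal and propagates.
        }
        return results;
    }
}
```

Catching the subclass first is what lets the caller separate "try again later" from "this plugin is broken"; an empty result list alone could not express either.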

...

FormatIdentifierManager

This is a static class to operate the plugin stack and return a format identification verdict. Applications use it instead of calling the FormatIdentifier plugin stack directly.

Panel

package org.dspace.content.format;
public class FormatIdentifierManager
{
// identify all formats matching "target", returning raw hit list.
public static List<FormatHit> identifyAllFormats(Context context, Bitstream target)
throws AuthorizeException;
//
// identify format of "target", returning best hit (never null).
public static FormatHit identifyFormat(Context context, Bitstream target)
throws AuthorizeException;
//
// identify format of "target", AND set results in the Bitstream
public static void identifyAndSetFormat(Context context, Bitstream target)
throws SQLException, AuthorizeException, FormatRegistryException;
}

The identifyFormat method always returns a hit. If the Bitstream was not successfully identified, it makes up a hit containing the unknown format.

...

The format identification framework is based on a sequence plugin, which gives administrators complete freedom to add and rearrange identification methods.

The FormatIdentifier.identifyFormat() method is very powerful; it actually controls the entire process of automatic format identification, even though it is called from deep within the framework. The FormatIdentifierManager only calls the stack of identifier methods in order and collects the results they provide. Each DSpace administrator has complete control of the methods run and the order of their execution, and the methods determine the results. The format identification API was designed to be very flexible, and also to make it easy to implement new identification methods.

Each implementation of FormatIdentifier.identifyFormat() can do whatever it wants with the Bitstream and list of results it is given. It might be a "filter" method that prunes the results of any below a certain level of confidence. It could look at other results and try to refine them, or reorder them.

...

Note that some plugins may depend on other identification methods running before they do because they refine an identification already found on the results list. Special relationships like that must be well documented so the administrator is aware of them.

Each FormatIdentifier plugin applies its special knowledge or resources to attempt to identify the format of the Bitstream; it is not responsible for solving the whole problem. For example, take a plugin that executes a heuristic to detect comma-separated-values files. It might collaborate with another method that detects plain-text files, so that it only applies its algorithm to refine the format identification if it sees from the results that the file is plain text.

...

One problem that has not yet been completely addressed by this design is that many format-identification methods require random access to the contents of a Bitstream, but the Bitstream API only offers serial access through a Java InputStream. Random access means reading a sequence of bytes from the Bitstream starting at any point in its extent; this is very helpful when looking for an internal signature to identify the file, since the signature may be located relative to the end of the file or at some large offset into it.

There are techniques to compensate for the lack of random access, although they sacrifice efficiency. It may also be necessary to add a method to Bitstream to retrieve a random-access stream when the underlying storage implementation supports it.
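One such compensating technique is to spool the serial stream to a temporary file and then seek within that copy. The sketch below shows the idea using only standard Java I/O; the class `SpooledAccess`, the method `readAt`, and the convention that a negative offset counts back from the end are all assumptions made for this example, and copying the whole stream is exactly the efficiency cost mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SpooledAccess {
    // Copies the serial stream to a temporary file, then reads "length" bytes
    // starting at "offset". A negative offset counts back from the end of the data.
    static byte[] readAt(InputStream in, long offset, int length) throws IOException {
        Path spool = Files.createTempFile("bitstream", ".spool");
        try {
            Files.copy(in, spool, StandardCopyOption.REPLACE_EXISTING);
            try (RandomAccessFile raf = new RandomAccessFile(spool.toFile(), "r")) {
                long start = offset >= 0 ? offset : raf.length() + offset;
                if (start < 0 || start + length > raf.length()) {
                    throw new IOException("Requested range lies outside the data");
                }
                byte[] buffer = new byte[length];
                raf.seek(start);
                raf.readFully(buffer);
                return buffer;
            }
        } finally {
            Files.deleteIfExists(spool); // never leave the spool file behind
        }
    }
}
```

A signature check near the end of the data, such as looking for a trailing end-of-file marker, then becomes a single call with a negative offset.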

...

Follow these steps when comparing two format identification hits to determine which has priority. This is implemented as the method FormatHit.compareTo().

  1. If the Namespace of the identifier cannot be resolved (looked up), i.e. because there is no FormatRegistry configured for it, that hit loses. See FormatRegistryManager.find().
  2. Order hits by their FormatConfidence index. Thus, hits based on the content of the Bitstream rate more highly than ones based on external attributes like the name.
  3. Between two equal hits, if one has a non-null warning it is ranked lower.
  4. When a hit has a conflict (that is, there is a lower-ranked hit which disagrees with it because the MIME type is different or similar), it is ranked below hits of the same confidence.
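The four steps above can be sketched as a single comparison function. This is only an illustration under stated assumptions: {{SketchHit}} is a hypothetical stand-in for {{FormatHit}}, its field names are invented for the example, confidence is modeled as an integer index where higher means more certain, and the steps are applied in the order listed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for FormatHit; every field name here is an assumption. */
final class SketchHit {
    final String format;
    final boolean resolvable;  // step 1: a FormatRegistry is configured for the namespace
    final int confidence;      // step 2: confidence index, higher is more certain
    final String warning;      // step 3: null when there is no warning
    final boolean conflict;    // step 4: a lower-ranked hit disagrees with this one

    SketchHit(String format, boolean resolvable, int confidence, String warning, boolean conflict) {
        this.format = format;
        this.resolvable = resolvable;
        this.confidence = confidence;
        this.warning = warning;
        this.conflict = conflict;
    }
}

final class HitPriority {
    /** Negative when a outranks b, positive when b outranks a, zero on a tie. */
    static int compare(SketchHit a, SketchHit b) {
        // Step 1: a hit whose namespace cannot be resolved loses.
        if (a.resolvable != b.resolvable) {
            return a.resolvable ? -1 : 1;
        }
        // Step 2: higher confidence ranks first.
        if (a.confidence != b.confidence) {
            return Integer.compare(b.confidence, a.confidence);
        }
        // Step 3: between otherwise equal hits, one with a warning ranks lower.
        boolean aWarned = a.warning != null;
        boolean bWarned = b.warning != null;
        if (aWarned != bWarned) {
            return aWarned ? 1 : -1;
        }
        // Step 4: a conflicted hit ranks below hits of the same confidence.
        if (a.conflict != b.conflict) {
            return a.conflict ? 1 : -1;
        }
        return 0;
    }

    public static void main(String[] args) {
        List<SketchHit> hits = new ArrayList<>();
        hits.add(new SketchHit("Plain Text (by name)", true, 1, null, false));
        hits.add(new SketchHit("XML (with warning)", true, 3, "not well-formed", false));
        hits.add(new SketchHit("XML", true, 3, null, false));
        hits.add(new SketchHit("Unconfigured namespace", false, 5, null, false));
        hits.sort(HitPriority::compare);   // best hit first
        for (SketchHit h : hits) {
            System.out.println(h.format);
        }
    }
}
```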

...

This is the default algorithm that is implemented by {{FormatIdentifier.identifyFormat()}} methods that simply call the {{FormatHit}}'s {{addToResults()}} method on each hit they develop.

...

  1. Start with an empty results list.
  2. Call the {{FormatIdentifier.identifyFormat()}} method of each plugin in the sequence in turn:
    • Passing it the Bitstream and list of accumulated results so it can add new results.
    • If it has a better-confidence match than the current head of the list, that hit becomes the new head of the list.
    • Otherwise the hit gets appended to the end of the list.
  3. When finished, the head of the list is the best format match.
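A minimal sketch of this accumulation loop follows. The types are simplified, hypothetical stand-ins ({{AccumHit}}, {{AccumIdentifier}}, {{AccumulateSketch}}), the Bitstream is reduced to a byte array and a name, and "better" is decided on an integer confidence alone rather than on the full comparison rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for FormatHit, reduced to a label and a confidence index. */
final class AccumHit {
    final String format;
    final int confidence;   // assumption: higher is more certain

    AccumHit(String format, int confidence) {
        this.format = format;
        this.confidence = confidence;
    }

    /** Default placement: new head if better than the current head, otherwise appended. */
    void addToResults(List<AccumHit> results) {
        if (results.isEmpty() || confidence > results.get(0).confidence) {
            results.add(0, this);
        } else {
            results.add(this);
        }
    }
}

/** Hypothetical plugin shape; the Bitstream is modeled as content plus a name. */
interface AccumIdentifier {
    void identifyFormat(byte[] content, String name, List<AccumHit> results);
}

final class AccumulateSketch {
    /** Runs every plugin in sequence over one shared, initially empty results list. */
    static List<AccumHit> run(List<AccumIdentifier> plugins, byte[] content, String name) {
        List<AccumHit> results = new ArrayList<>();
        for (AccumIdentifier plugin : plugins) {
            plugin.identifyFormat(content, name, results);
        }
        return results;   // the head, if any, is the best match
    }

    public static void main(String[] args) {
        AccumIdentifier byExtension = (c, n, r) -> {
            if (n.endsWith(".txt")) {
                new AccumHit("Plain Text", 1).addToResults(r);
            }
        };
        AccumIdentifier bySignature = (c, n, r) -> {
            if (new String(c).startsWith("<?xml")) {
                new AccumHit("XML", 3).addToResults(r);
            }
        };
        List<AccumHit> out = run(Arrays.asList(byExtension, bySignature),
                "<?xml version=\"1.0\"?><a/>".getBytes(), "notes.txt");
        for (AccumHit h : out) {
            System.out.println(h.format + " (" + h.confidence + ")");
        }
    }
}
```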

To select a {{BitstreamFormat}} from the results, follow these steps:

  1. Starting with the first result, take the first format identifier and namespace that can successfully be resolved into a {{BitstreamFormat}} (importing a new one if necessary).
  2. If no {{BitstreamFormat}} is available, the result is the unknown format, and the confidence is set to {{UNIDENTIFIED}}.

This logic is encapsulated in the {{FormatIdentifierManager.identifyFormat()}} method.
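The selection steps might look roughly like the following sketch. Everything in it is an assumption made for illustration: registries are modeled as functions from an identifier to a format name (returning null when nothing can be imported), the local BitstreamFormat cache is a plain map, confidence values are strings, and the names {{ExternalHit}}, {{FormatChoice}} and {{FormatSelector}} are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical hit: an external identifier in a namespace, plus its confidence. */
final class ExternalHit {
    final String namespace;
    final String identifier;
    final String confidence;

    ExternalHit(String namespace, String identifier, String confidence) {
        this.namespace = namespace;
        this.identifier = identifier;
        this.confidence = confidence;
    }
}

/** Hypothetical result of the selection: a local format name and a confidence. */
final class FormatChoice {
    final String formatName;
    final String confidence;

    FormatChoice(String formatName, String confidence) {
        this.formatName = formatName;
        this.confidence = confidence;
    }
}

final class FormatSelector {
    static final String UNKNOWN = "Unknown";
    static final String UNIDENTIFIED = "UNIDENTIFIED";

    /**
     * Walks the results best-first and returns the first hit that resolves to a
     * local format, importing it into the cache when a registry can supply it.
     */
    static FormatChoice select(List<ExternalHit> results,
                               Map<String, Function<String, String>> registries,
                               Map<String, String> localCache) {
        for (ExternalHit hit : results) {
            String key = hit.namespace + ":" + hit.identifier;
            String cached = localCache.get(key);
            if (cached != null) {
                return new FormatChoice(cached, hit.confidence);
            }
            Function<String, String> registry = registries.get(hit.namespace);
            if (registry == null) {
                continue;   // no registry configured for this namespace
            }
            String imported = registry.apply(hit.identifier);
            if (imported != null) {
                localCache.put(key, imported);   // "import" into the local cache
                return new FormatChoice(imported, hit.confidence);
            }
        }
        // Step 2: nothing resolved, so fall back to the unknown format.
        return new FormatChoice(UNKNOWN, UNIDENTIFIED);
    }
}
```

Keeping the fallback in one place means callers never have to handle a missing format themselves.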

If an application wants to generate a dialog showing all of the results of an automatic format identification (e.g. to give an interactive user the chance to second-guess the automatically-chosen format) it could call the plugins and process the results according to the algorithm above. We don't anticipate anyone wanting such a service, but if it comes up, we can always add another method to {{FormatIdentifierManager}}.

Applying the results to a Bitstream

The properties of a {{Bitstream}} describing its format and the confidence of its identification have analogues within the {{FormatHit}} structure. The logic to map between them is encapsulated within the {{FormatHit.applyToBitstream()}} method, so there is only one piece of code to update if either of those objects changes in the future.

Implementing a FormatIdentifier plugin

As mentioned before, each {{FormatIdentifier}} implementation only has to do part of the job, so it can be very narrowly focused. It can also look at the results of previous methods to decide if it has anything to add to the overall solution. For example, a method that heuristically identifies text-based formats would only proceed if it saw that a previous method had identified the data as generic plain text.

Each method may add several {{FormatHit}}s to the result list, or none at all.
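As an illustration of such a narrowly focused method, here is a sketch of a comma-separated-values heuristic that proceeds only when an earlier method has already reported plain text. The types {{CsvHit}} and {{CsvRefiner}} are hypothetical, the Bitstream is reduced to a byte array, and the heuristic itself (every line has the same non-zero number of commas) is deliberately naive.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for FormatHit. */
final class CsvHit {
    final String format;
    final int confidence;

    CsvHit(String format, int confidence) {
        this.format = format;
        this.confidence = confidence;
    }
}

final class CsvRefiner {
    /** Adds a "CSV" hit ahead of an existing "Plain Text" hit, or adds nothing at all. */
    static void identifyFormat(byte[] content, List<CsvHit> results) {
        // Only refine an identification that a previous method already made.
        int plainIndex = -1;
        for (int i = 0; i < results.size(); i++) {
            if ("Plain Text".equals(results.get(i).format)) {
                plainIndex = i;
                break;
            }
        }
        if (plainIndex < 0) {
            return;
        }
        String[] lines = new String(content, StandardCharsets.UTF_8).split("\r?\n");
        if (lines.length < 2) {
            return;
        }
        int expected = countCommas(lines[0]);
        if (expected == 0) {
            return;
        }
        for (String line : lines) {
            if (countCommas(line) != expected) {
                return;   // ragged rows: leave the plain-text hit alone
            }
        }
        // Assumption: the more specific hit inherits the plain-text confidence
        // and is placed directly ahead of it.
        CsvHit plain = results.get(plainIndex);
        results.add(plainIndex, new CsvHit("CSV", plain.confidence));
    }

    private static int countCommas(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == ',') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<CsvHit> results = new ArrayList<>();
        results.add(new CsvHit("Plain Text", 2));
        identifyFormat("id,name\n1,alpha\n2,beta\n".getBytes(StandardCharsets.UTF_8), results);
        for (CsvHit h : results) {
            System.out.println(h.format);
        }
    }
}
```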

...

  • A signature-based format identifier notices that the file starts with {{"<?xml"}} and adds a hit for the generic format "XML", and MIME type {{text/xml}}.
  • The text heuristic identifier methods don't bother running because there is already a higher-confidence positive identification (of the generic XML format).
  • A table-driven specific XML identifier method notices the hit for the generic XML format, and parses enough of the file to match one of the XPath specifications in its configuration. This identifies the file as an IMS Content Package manifest, MIT OCW version 1 profile.
    • If the XML parse had failed, the plugin could add a warning to the generic "XML" hit since it was obviously not well-formed XML.
  • Results include the "OCW-IMSCP" format first, "XML" next, and perhaps other generic hits after it.

...

A conflict arises when the automatic identification process returns hits for incompatible formats. This is commonly caused by contradictory clues in a Bitstream, for example, a filename extension that indicates a different format than the one indicated by internal signature matches. Consider a Bitstream containing a well-formed XML document; the "internal signature" method correctly identifies it as XML. However, its name ends with {{".txt"}}, which is only listed as an external signature for other kinds of formats.

...

  1. Displaying a human-readable description of the format to the end-user.
  2. Choosing from among all available data formats:
    • Just the formats in the {{BitstreamFormat}} table.
    • All formats available in selected external registries.
  3. Selecting a {{BitstreamFormat}} from among the results of an automatic format identification, with the option of choosing freely as in Case #2.
  4. Administrative interface to add and update {{BitstreamFormat}} objects.
  5. Administrative interface to modify the chosen format of a {{Bitstream}} (existing UI can be used with minimal changes).

...

The UI needs to display the data format of a Bitstream to the user in a meaningful way. Historically, this has been accomplished with the {{name}} (formerly "short description") property of the BSF, which is a short human-readable label such as "Adobe PDF 1.2". In some contexts it may be helpful to also cite the confidence property of the Bitstream to indicate how the format was discovered so the user can tell how much to trust it.

...

First, what range of formats do you want to be able to choose from?

  1. All {{BitstreamFormat}} names? This typically includes only formats of objects that have already been imported into the archive.
  2. All of the formats in a given external registry, or set of them? This is likely to be more complete, but brings with it the problem of handling a very large list, perhaps thousands of choices.

In the first case, the problem is easily solved, but not so useful in most applications; if you are choosing a new format for a Bitstream, why limit your choice only to formats of Bitstreams already in the archive? It is mostly useful when your purpose is to choose among {{BitstreamFormat}}s, e.g. picking one to edit.

...

Given the complexity of implementing solutions for the cross product of format registries and DSpace UIs, we think it is more productive to let each UI negotiate with the format registry of its choice to produce a navigable display of formats. For example, the UI can transfer control to a popup or dialog encapsulating the registry's UI or a registry-specific extension. All it has to do is return a format identifier that can be resolved or imported to a {{BitstreamFormat}}.

The alternative is to force all registries into a common model, which would probably deprive them of the metadata most helpful to generating a good navigation interface. Each registry has unique features in its data model to facilitate browsing.

...

  • Indicate default choice of format and its MIME type.
  • Ordering is significant: hits closer to the head of the list take precedence.
  • The confidence metric on each hit is highly significant in helping the user evaluate them, so include it prominently and in a way that makes its values easy for naive users to understand.
  • Offer an "escape route" to choose a format from all available formats. This becomes the default if automatic format identification failed.

Editing BitstreamFormat Table

See the next section for details; this is classed as an administrative operation.

Administrative Operations Relating to {{BitstreamFormat}}s

The DSpace administrator manages data formats with these operations:

...

The external registries are chosen by adding their names to a plugin interface as shown here. Note that the plugin name, which is also the namespace it covers, gets supplied by the plugin itself through the {{getPluginNames()}} method. The order is not significant. This configuration example includes registries implementing the PRONOM, DSpace, and Provisional namespaces (guessing from the classnames).

...

The report includes, for each {{BitstreamFormat}} in use,

  • BSF's name
  • External identifier(s)
  • Count of Bitstreams referring to it.

Maintenance

Edit BitstreamFormat Metadata

Some administrators will undoubtedly have a need to make local customizations to the descriptive and technical metadata for data formats. These attributes of a {{BitstreamFormat}} may all be customized by overriding the values imported from the remote registry – and the overrides persist even when the BSF is updated from its external registry.

...

A timestamp of last update is maintained for all BSFs. When performing a group update, use the timestamp farthest in the past as the limit when searching for changed formats in the remote registry. After a group update, set the time of last update of all relevant BSFs to the time of this operation.
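The timestamp rule for a group update can be sketched as below. This is an illustration only: the BSFs are modeled as a map from a format name to its last-update time, and the class and method names ({{GroupUpdateSketch}}, {{searchLimit()}}, {{markUpdated()}}) are hypothetical.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class GroupUpdateSketch {
    /** The oldest last-update time in the group bounds the search for changed formats. */
    static Instant searchLimit(Map<String, Instant> lastUpdatedByFormat) {
        // Collections.min throws NoSuchElementException for an empty group.
        return Collections.min(lastUpdatedByFormat.values());
    }

    /** After the group update, every BSF in the group carries the time of this operation. */
    static void markUpdated(Map<String, Instant> lastUpdatedByFormat, Instant operationTime) {
        lastUpdatedByFormat.replaceAll((name, previous) -> operationTime);
    }

    public static void main(String[] args) {
        Map<String, Instant> group = new HashMap<>();
        group.put("Adobe PDF 1.2", Instant.parse("2007-03-01T00:00:00Z"));
        group.put("XML", Instant.parse("2007-01-15T00:00:00Z"));
        System.out.println("Search for changes since " + searchLimit(group));
        markUpdated(group, Instant.now());
    }
}
```

Using the oldest timestamp may re-fetch formats that were updated more recently, but it guarantees that no change in the remote registry is missed for any BSF in the group.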

Edit Bitstream Technical Metadata

Here are the cases where a Bitstream's format technical metadata must be modified:

...

Manually (interactively) force the choice of a new data format, chosen from either:

  • The existing set of {{BitstreamFormat}} entries.
  • An explicitly specified namespace and identifier referencing an external format, which is imported if necessary.
    • This may be a simple text-entry box since it doesn't have to be user-friendly.

...