This page proposes a set of changes
to improve the representation of file formats in the content model, in
order to better support preservation activities.
The design applies to DSpace 1.5, or later.
This work is to be done
as part of the
FACADE project but it is designed and intended as
a generally useful extension to the platform.

Please use the Discussion tab
for your comments on and reactions to this proposal, since comments
mingled with this much text would be too hard to find.

...

In <cite>Automatic Format Identification using PRONOM and DROID</cite>,
Adrian Brown defines a "data format" as:

...

Note that this implies more than just knowing the common name of a Bitstream's
format, e.g. "Adobe PDF". That name actually describes a family of formats.
In order to know exactly how to recover the
intelligence in a particular Bitstream, you'd want to know which specific version of PDF it is: later versions have features not found in earlier ones.
The "internal structure and encoding" imposed by a data format is
usually defined in exacting detail by a format specification document,
and/or by the software applications that produce and consume that format.

For extensive additional background, see
About Data Formats, which serves as a
manifesto of sorts for this project.

...

  • The way data formats are represented in the DSpace content model.
  • Clarification and rationalization of the use of <tt>BitstreamFormat</tt>.
  • Mechanisms for identifying the data format of a <tt>Bitstream</tt>.
  • Integration of standards-based technical metadata about data formats which can be effectively shared with other applications.

The original DSpace design intentionally avoided the issue of describing
data formats in such detail because there were already other efforts underway
to thoroughly catalog data formats – and DSpace would eventually leverage
their work. As of June, 2007, the most sophisticated
data format registries are still in development, but some usable systems are operating in production.
We propose
to integrate external data format intelligence through
a flexible plugin-based architecture to take advantage of what is
currently available but leave a clear path for future upgrades and changes.
It also lets each DSpace installation choose an appropriate level of
complexity and detail in their format support.

...

  • Enable accurate, meaningful, fine-grained, and globally-understood identification of a <tt>Bitstream</tt>'s data format.
  • Maintain backward compatibility with most existing code, and existing archives.
  • Introduce the binding of persistent, externally-assigned data format identifiers to <tt>BitstreamFormats</tt>.
  • Integrate tightly with "standard" data format registries, using a plugin framework for flexible configuration:
    • Anticipate that the Global Digital Format Registry (GDFR) will be the registry of choice, but allow free choice of other metadata sources.
    • Recognize references to entries in "standard" data format registries in ingested content (e.g. technical MD in SIPs) to facilitate exchange of SIPs and DIPs.
    • The DSpace data model directly includes only the subset of format metadata it has an immediate use for, and references entries in an external format registry for the rest.
    • Refer to formats by the external data format registry's identifiers so format technical metadata is recognized outside of DSpace.
  • Improve the automatic identification of data formats in batch and non-interactive content ingestion.
  • Help interactive users identify formats easily and with accuracy during interactive submission.
  • Rationalize use of the <tt>BitstreamFormat</tt> object:
    • Eliminate the overloaded use of the "License" format and "Internal" flag in BitstreamFormats to mark and hide deposit license bitstreams.
    • Attempt to accurately describe the data format of every Bitstream, even the ones created for internal use.
  • Create pluggable interface to external data format registries, to encourage experimentation and track developments in this highly active field.
  • Add a separate pluggable format-identification interface to allow a "stack" of methods to identify the format of a Bitstream by various techniques.

...

See BitstreamFormat Renovation for the sketches of the
anticipated use cases that drove this design. The text grew too
large for one page.

Overview of Changes to DSpace Core and API

...

Each <tt>Bitstream</tt> still refers to a
<tt>BitstreamFormat</tt> object to identify its data format.
In addition, the <tt>Bitstream</tt> gains two new properties:

...

Although outwardly similar and largely backward-compatible, the
<tt>BitstreamFormat</tt> has been completely gutted and re-implemented.
It now serves as a local "cache" of format technical metadata and holds
one or more external format identifiers, each of which refers to a
complete technical metadata record in an external data format registry.

These external registries (as described below) are the authoritative
source of technical and descriptive metadata about data formats.
This fundamental change lets DSpace take advantage of the extensive
work and recognized standards offered by external format registries
such as
PRONOM
and the
GDFR.

Core API

External Format Registry

We add a plugin interface to provide access to external data format
registries. Each registry is modeled as an
implementation of the <tt>FormatRegistry</tt> interface. It is
fairly simple; it only supports "importing" a format description
into the local <tt>BitstreamFormat</tt> cache, updating an existing format,
and a few queries.

A single DSpace archive may be configured to access many external format
registries. It usually will be, since no one registry currently has all
the answers.

Backward-compatibility is provided by a built-in "DSpace" registry
which contains all of the DSpace-1.4 formats.

...

The old <tt>org.dspace.content.FormatIdentifier</tt> is replaced by
a configurable, extensible, plugin-based format identification framework.
It is not part of the format registry plugin, because while some
format recognition services live in a registry's software suite,
others are independent of any registry.

Format identification is one of the most important improvements
in this project. It is explained in complete detail below.

...

There are additions to the configuration and administration APIs
to support configuration and maintenance of
the new registry and format-identification frameworks.

...

Here are detailed explanations of the planned changes to content model
classes.

Exceptions

The following exceptions are thrown when a fatal error is encountered
in the format registry and identification framework. They are similar
in meaning to existing exceptions in the DSpace API, such as
<tt>AuthorizeException</tt> – signalling a fatal error with
enough context and explanation to communicate the cause to the
user or administrator.

<tt>FormatRegistryException</tt>

Sent when there is a fatal error while accessing an external format
registry or updating the local cache of format metadata in the DSpace
RDBMS. Can be caused by incorrect or missing configuration entries,
network problems, filesystem problems, etc.

...

A subclass of <tt>FormatRegistryException</tt>, this exception
is sent in particular cases when looking up an external identifier
fails although it should have been found (e.g. since it had been
found before). In the common case of looking up an identifier for the
first time, e.g. through <tt>BitstreamFormat.findByIdentifier()</tt>,
no exception gets thrown because failure is a possibly-expected result.

...

Thrown when a format identification method encounters a fatal error which
would cause it to return a false negative result. For example, if its
configuration is missing or incorrect, the method throws this exception
rather than silently failing. Simply failing to identify a format is
in the realm of expected results and does not cause an exception.

...

Thrown by the format identification method when it fails because of
a "temporary" problem, e.g. when a network resource is not available.
This subclass of FormatIdentifierException tells the identifier manager
that it may succeed when retried later.
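The relationship between the two identification exceptions can be sketched roughly as follows. The class names come from the text above; the constructors, the choice of checked exceptions, and the `RetrySketch` helper with its `Attempt` interface are assumptions made only for illustration, not part of the proposed API.

```java
// Sketch only: constructors and the retry helper are assumptions, not the DSpace API.
class FormatIdentifierException extends Exception {
    FormatIdentifierException(String message) { super(message); }
}

// Signals a "temporary" failure, e.g. an unavailable network resource.
class FormatIdentifierTemporaryException extends FormatIdentifierException {
    FormatIdentifierTemporaryException(String message) { super(message); }
}

class RetrySketch {
    // Hypothetical stand-in for one identification attempt.
    interface Attempt {
        String run() throws FormatIdentifierException;
    }

    // Retries only temporary failures; any other failure propagates at once.
    static String identifyWithRetry(Attempt attempt, int maxTries)
            throws FormatIdentifierException {
        FormatIdentifierTemporaryException last = null;
        for (int i = 0; i < maxTries; i++) {
            try {
                return attempt.run();
            } catch (FormatIdentifierTemporaryException e) {
                last = e; // may succeed when retried later
            }
        }
        if (last == null) {
            throw new FormatIdentifierException("no attempts were made");
        }
        throw last;
    }
}
```

A real identifier manager would more likely defer the retry to a later run than loop immediately; the loop here only shows how the subclass lets a caller tell the two failure modes apart.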

...

The most significant change is that the Bitstream now remembers the
confidence of its format identification, an enumerated value which
indicates the certainty and source of its format identification.
There is also a convenience method to access the automatic format
identification: since it is almost always used to set the Bitstream's
format anyway, this improves code clarity.

...

The <tt>BitstreamFormat</tt> class, which we will abbreviate as BSF,
is essentially gutted and replaced with a new implementation.
As described above, it now serves as a local "cache" of technical
metadata that comes from external data format registries.
Every BSF is bound to at least one format identifier in an external registry
so its format technical metadata can be expressed in a way that is
recognized outside of DSpace.

...

In the current (version 1.4.x) codebase, the
<tt>BitstreamFormat</tt> object has acquired uses and meanings
beyond simply describing a Bitstream's data format – but these interfere
with its intended purpose in preservation activities.
For example,
the BSF has an internal flag which directs the UI
to hide Bitstreams of that format
from casual view.
In an unmodified DSpace installation, the
<tt>"License"</tt> BSF
is the only one for which internal is true, to
keep deposit license files from appearing in the Web UI.
Unfortunately, this usage cripples <tt>"License"</tt> as an actual
format descriptor, since it gets applied to all sorts of Bitstreams
that contain licensing information no matter what their actual format.
XML, plain-text, and RDF files are all tagged with the "License" BSF
to make them disappear, yet it says nothing about the format of their contents.

We rationalize the function of the BSF so that it only describes the
data format of a Bitstream's contents.
There are other ways to mark Bitstreams as "internal", e.g.
Bundle membership, which is a better fit with the semantics of the
content model anyway.

Formal Definition

...

In the content model, a <tt>BitstreamFormat</tt> aggregates all
the format-related technical metadata for Bitstreams of its type.
Not only does this save space, it lets an administrator make changes and
adjustments to that metadata easily in one place.

It also holds DSpace-specific format metadata, which currently consists only
of the one administrative metadata item, the "support-level", which
controls the preservation
support policy for all Bitstreams of that format.

This new implementation makes the <tt>BitstreamFormat</tt>
a local cache for the relevant format metadata, but mainly it acts
as a reference to the full technical metadata found in
one or more external format registries (like
the GDFR).
It only caches the metadata immediately needed by DSpace, such as
MIME type, name, description. This is adequate for
everyday operation of the archive;
DSpace never has to go to the external format registry for metadata.

This organization makes it much easier to take advantage of developments
in the rapidly evolving field of data format technical metadata.
Instead of
trying to import all of the various schemas and data models
of every external data format
registry, we can just maintain a reference into the external registry.

...

External data format identifiers are introduced in this architecture.
They link DSpace's internal BSF objects to entries in external format
registries such as the
GDFR.
Format identifiers consist of a namespace name, and
the identifier, an arbitrary string.
They possess these properties:

...

External identifiers are the key to naming data formats in a way that
can be recognized outside of the DSpace system; they
allow the BSF to be meaningfully crosswalked to
and from technical
metadata schemas such as
PREMIS.

The Namespace

We add a DSpace-specific namespace to the external format identifier to:

...

The namespace is a short string belonging to a
controlled vocabulary (represented by a table in the DSpace
configuration or the RDBMS). Each namespace describes a source of data
format information.
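As a minimal sketch, a namespaced identifier could be modeled as a small immutable value type. The class name `NamespacedFormatId` and its field names are hypothetical; the proposal only states that an identifier pairs a namespace name with an arbitrary string.

```java
import java.util.Objects;

// Hypothetical value type; the class and field names are assumptions for illustration.
final class NamespacedFormatId {
    final String namespace;   // a controlled-vocabulary name such as "DSpace" or "Provisional"
    final String identifier;  // arbitrary string assigned within that namespace

    NamespacedFormatId(String namespace, String identifier) {
        this.namespace = Objects.requireNonNull(namespace);
        this.identifier = Objects.requireNonNull(identifier);
    }

    // Two identifiers are equal only if both the namespace and the string match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NamespacedFormatId)) return false;
        NamespacedFormatId other = (NamespacedFormatId) o;
        return namespace.equals(other.namespace) && identifier.equals(other.identifier);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespace, identifier);
    }
}
```

Making the namespace part of equality means the same string can appear in two registries without the two identifiers being confused.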

Initial Set of Well-Known Namespaces

Here are the suggested initial namespace names for existing registries.
This is not an exhaustive set; since the registry is a plugin,
a new registry can always be added as a DSpace add-on.

The DSpace-Internal, DSpace, and Provisional registries
are implemented in the core. We expect a PRONOM registry
plugin to be available with the initial reference implementation.

...

The standard namespace values are available as public static fields
on the <tt>FormatRegistryManager</tt>
class. The LOC namespace
is not really a registry yet but it makes sense to reserve
the namespace since it is a significant source of format technical metadata.

MIME Types excluded

Note that
MIME types
cannot be BSF identifiers because they violate the rule
that only one BSF may be bound to each identifier.
MIME types are imprecise:
many BSFs have the same MIME type; e.g. a lot of XML-based
formats are tagged <tt>"text/xml"</tt>.

...

Apple Computer has developed what is essentially an alternative to MIME types
called
Uniform Type Identifiers (UTIs).
It is an interesting development, although not directly relevant.
Although the UTI database is, in a sense, a registry of format identifiers,
it is
not a good candidate for use in DSpace for several reasons:

...

This renovation removes <tt>BitstreamFormat</tt>'s "internal"
flag, which was originally intended to
guide UI applications in hiding certain classifications of Bitstreams.

Was the "Internal" flag ever truly necessary?
Apparently it was only ever used to
make the Bitstreams
containing deposit-license and Creative Commons license invisible in the
Web UI.

Its presence is actually harmful: not only does it
have nothing to do with describing the format of the data, it
actually encouraged usage that obscures
the data format.
The one <tt>"License"</tt> BSF was applied to all Bitstreams containing
an Item's deposit license and Creative Commons licenses, no matter
what their actual data formats. The Creative Commons license consists
of three Bitstreams of distinct actual formats – e.g. one is RDF.
It is misnamed with the <tt>"License"</tt> format so it will not be properly preserved.

In the DSpace@MIT
registry,
I have determined that <tt>"License"</tt> is the only
<tt>BitstreamFormat</tt> for which
"internal" is true, and that Bitstreams whose format is "License"
appear only in Bundles named "LICENSE" and "CC_LICENSE".
Therefore,
we can determine the internal-ness (i.e. invisibility) of a Bitstream by
its owning Bundle as accurately as by the bogus BSF.
It makes no practical difference which technique is used, but
the Bundle-name cue is a better fit with the current content model.
It works just as well under the DSpace 2.0 content model, since
bundles evolve into "manifestations".

By modifying UI code to judge "visibility" of a Bitstream by the Bundle
it belongs to rather than a cryptic property of its format, we can
get rid of the "internal" bit without any user-visible changes. The
content model and API are improved, since license Bitstreams may now have
meaningful data formats assigned so they can be preserved and disseminated
correctly.
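The Bundle-based visibility test might look something like the sketch below. The two Bundle names come from the text above; the `BundleVisibility` class, its method, and the assumption that these two names are the complete set are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class BundleVisibility {
    // Bundle names taken from the text; treating them as the complete set is an assumption.
    private static final Set<String> INTERNAL_BUNDLES =
            new HashSet<>(Arrays.asList("LICENSE", "CC_LICENSE"));

    // Hypothetical helper: a UI would pass the name of the Bitstream's owning Bundle.
    static boolean isInternal(String owningBundleName) {
        return INTERNAL_BUNDLES.contains(owningBundleName);
    }
}
```

With a check like this, the UI no longer consults the format at all, so a license Bitstream can carry its true format (e.g. RDF or plain text) and still stay hidden.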

BitstreamFormat Properties

Here is a list of all the BSF's properties, i.e.
fields of the object.
Source is where the property originally comes from; Mod
means whether or not it may be changed by the archive administrator.

...

Following DSpace coding conventions, the factory and static class
for a service is named with the suffix <tt>-Manager</tt>. The
<tt>FormatRegistryManager</tt> class gives access to
instances of <tt>FormatRegistry</tt>. Since a format identifier
is directed to a <tt>FormatRegistry</tt> implementation by its namespace,
the Manager also takes care of selecting the right instance for a
namespaced identifier. This lets applications use namespaced identifiers
without worrying about taking them apart to choose a registry instance.

Since all of the <tt>FormatRegistryManager</tt>'s state is effectively managed
by the Plugin Manager, it does not need any state itself and only has
static methods.
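The namespace-based dispatch described above can be illustrated with a rough sketch. Everything here is hypothetical: `RegistrySketch` is a one-query stand-in for a registry, and the static map merely stands in for the Plugin Manager, which in the actual design would hold the registry instances so that the manager itself stays stateless.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry interface, reduced to a single query for this sketch.
interface RegistrySketch {
    String getNamespace();
    boolean contains(String identifier);
}

class RegistryManagerSketch {
    // Stand-in for the Plugin Manager, which would hold the real instances.
    private static final Map<String, RegistrySketch> REGISTRIES = new HashMap<>();

    static void register(RegistrySketch registry) {
        REGISTRIES.put(registry.getNamespace(), registry);
    }

    // Routes the query to whichever registry owns the identifier's namespace,
    // so callers never have to pick a registry instance themselves.
    static boolean isKnown(String namespace, String identifier) {
        RegistrySketch registry = REGISTRIES.get(namespace);
        return registry != null && registry.contains(identifier);
    }
}
```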

API

Here is a sketch of the API:

...

The <tt>FormatRegistry</tt> interface models an external data format
registry.
We define data format registry as any formally organized and administered
collection of technical metadata about data formats.
This may include a collection published mainly for human consumption
such as the
Library of Congress Sustainability of Digital Formats
format catalog, as well as those accessible through public APIs such
as the
GDFR
and
DROID.
The only requirement is that the data formats are named by unchanging,
unique identifiers.

A format registry becomes available to DSpace through a Named Plugin
implementing the
<tt>FormatRegistry</tt> interface.

...

Format registry implementations are tightly coupled with DSpace.
By this we mean they must be able to respond to frequent queries quickly,
with low latency and high reliability. The format registry must
be available to complete some common operations such as ingestion
and selection of applications like <tt>MediaFilter</tt>.

This is only likely to be an issue at all with registries that are
attached through the network. Registries that exist in local data files
or RDBMS tables can share server resources with the DSpace archive.

Network-based registries might use a local "cache" server sharing the
DSpace host to increase reliability. The
GDFR
architecture explicitly encourages this sort of configuration.
Otherwise, it might be necessary for the <tt>FormatRegistry</tt>
implementation to add caching of its own to increase performance.

...

Although some of the tools to automatically identify formats are tied to
format registries, this registry interface does not have anything to do
with format identification. The identification tools are accessed through
a separate plugin interface, discussed below.

...

Here is the API of the <tt>FormatRegistry</tt>.
The plugin's name is also the DSpace string value representing its namespace.
It is implemented as a self-named plugin, so that the instance itself
knows its namespace without depending on each DSpace administrator to get
it right. The namespaces must be consistent between DSpace installations
so that format technical metadata (i.e. PREMIS elements in AIPs) can
be meaningfully exchanged.

...

Typically the name of the registry is bound to some well-known public
constant so it can be referred to in a program without a "magic string"
that is easily misspelled to disastrous effect. E.g.:

FormatRegistryManager.INTERNAL_NAMESPACE ... "Internal"
DSpaceFormatRegistry.NAMESPACE .... "DSpace"
ProvisionalFormatRegistry.NAMESPACE .... "Provisional"

...

Importing an entry from an external format registry creates a new
local BSF. In order to create a new BSF, there must not be an existing
BSF already bound to that entry's identifier (or any identifiers it
lists as synonyms).

So, first check for a BSF bound to any of the namespaced identifiers
attached to the external registry entry. If one is found, add the
new identifier to that BSF and return it instead of creating a new one.

If no existing BSF is found, create a new one, and
initialize its properties
from appropriate values of registry entry's technical metadata:

...
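The import steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: identifiers are treated as opaque strings, the `Bsf` class holds nothing but its bound identifiers, and the case where synonyms are already bound to different BSFs is not handled. None of these names are part of the proposed API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ImportSketch {
    // Hypothetical stand-in for a BSF: just the identifiers bound to it.
    static class Bsf {
        final List<String> identifiers = new ArrayList<>();
    }

    // Index from identifier to its BSF; each identifier is bound to exactly one BSF.
    private final Map<String, Bsf> bound = new HashMap<>();

    // newId is the identifier being imported; synonyms are the other
    // identifiers listed by the external registry entry.
    Bsf importFormat(String newId, List<String> synonyms) {
        List<String> all = new ArrayList<>(synonyms);
        all.add(0, newId);

        // First look for a BSF already bound to any of the identifiers.
        Bsf target = null;
        for (String id : all) {
            target = bound.get(id);
            if (target != null) break;
        }
        // Only create a new BSF when nothing is bound yet.
        if (target == null) target = new Bsf();

        // Bind every identifier that is not bound already.
        for (String id : all) {
            if (!bound.containsKey(id)) {
                bound.put(id, target);
                target.identifiers.add(id);
            }
        }
        return target;
    }
}
```

Because all synonyms are bound on the first import, a later import that starts from one of the synonyms finds the existing BSF instead of creating a duplicate.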

Binding all equivalent (synonym) identifiers in the remote entry
ensures that a scenario like this does not create extra local BSFs:

...

This describes the way a single BSF is updated from its corresponding
entry in an external data format registry – the choice of which BSF
to update and which registry to query are separate issues covered in
the section on administrative actions.

...

Supplied with the identifiers of two entries in the registry, this
predicate function returns true if the first format conforms to the second. That is, any Bitstream identified as the
first format would pass the tests to be identified as the second as well.
For example, if the first format is a specific version of a format while
the second identifier names a format family which includes it,
conformsTo would be true.

The registry plugin should implement this operation efficiently, since it
may be called many times, e.g. when choosing applications to match the
format of a Bitstream.
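One way a registry plugin might answer this question is by walking up a chain from a specific format to the families that contain it. The sketch below assumes such a single-parent chain exists; the `ConformsToSketch` class and its data layout are hypothetical, and a real registry may model format relationships quite differently.

```java
import java.util.HashMap;
import java.util.Map;

class ConformsToSketch {
    // Hypothetical data: each identifier maps to the broader format it belongs to.
    private final Map<String, String> parent = new HashMap<>();

    void addFormat(String id, String broaderId) {
        parent.put(id, broaderId);
    }

    // True if 'second' is 'first' itself or one of its broader families.
    boolean conformsTo(String first, String second) {
        String current = first;
        int steps = 0;
        // The step limit guards against accidental cycles in the data.
        while (current != null && steps++ <= parent.size()) {
            if (current.equals(second)) return true;
            current = parent.get(current);
        }
        return false;
    }
}
```

Each lookup is a hash-map access, so the cost is proportional to the depth of the chain, which suits an operation that may be called many times.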

The DSpace-Internal "Registry"

The unknown BSF is installed with the system, but for consistency, it
is also derived from an
entry in an "external" format registry. Since it is the only
BSF which is absolutely mandatory, this registry must always be available,
so it is a hard-coded registry that is always configured.

The <tt>FormatRegistryManager</tt> maps the namespace
<tt>DSpace-Internal</tt> to a special registry object
which only recognizes the "Unknown" format identifier.
The first reference to that format identifier, e.g. by the method
<tt>BitstreamFormat.findUnknown()</tt>,
"imports" it to create the unknown format BSF.

Although it might appear that the "Unknown" format really belongs
in the DSpace namespace and registry, that would force the
DSpace registry to be configured all the time.
Putting "Unknown" in a separate
built-in registry lets the administrator remove the DSpace
registry from the configuration if she wishes to.

...

The initial implementation also includes built-in format registries for the
DSpace and Provisional registry namespaces.
Unlike the DSpace-Internal registry, they are optional.
By itself, the
DSpace registry reproduces the release 1.4.x behavior to offer
the option of backward-compatibility.
The Provisional registry offers a separate place to put formats local
to the archive, safe from namespace collisions and future updates
to the DSpace registry. (It is not always the recommended way to handle new formats; more on this later.)

These two registries share an implementation class since their operation
is exactly the same. The plugin manager creates one instance for each
registry (i.e. namespace). They use the plugin/namespace name to select
a configuration file. The registry's contents come from that file,
which is read on startup.

To add entries to the Provisional format registry, the DSpace administrator
edits its configuration file (in a documented XML format similar to the
current <tt>bitstream-formats.xml</tt> initialization file) and restarts
any relevant DSpace processes. Since changes should be very infrequent
this should not be a burden.

...

The "DSpace" registry includes most of the traditional,
loosely-defined, format names, like <tt>"Text", "Adobe PDF", "HTML"</tt>.
It offers
a simple solution for DSpace
administrators who do not need precise and detailed format
identification, nor the digital preservation tools that require it.
Since it includes most of the formats from previous DSpace
releases under their same
names, it also gives a degree of backward-compatibility.

It is not necessary to include this registry in the configuration.
It can be left out
if, e.g., the administrator only wants to use PRONOM formats.

The contents of the "DSpace" registry are controlled by DSpace
source code releases and must not be altered locally. See the
next section to add formats to your archive.

...

The Provisional format registry lets a DSpace administrator add data
formats which are not available in any other external registry to her
DSpace archive. The contents of the "Provisional" registry are strictly under
the control of the administrator. It starts out empty.

Using formats from the Provisional namespace carries some risks:
the format identifiers
are meaningless (and useless for preservation) outside of
their own DSpace archive. Even another DSpace might not have the same
Provisional formats configured. Of course, a Provisional format should only
be added when it is not available in any shared registry, anyway.

As soon as a new format does become available
in some external registry, you can add the new external identifier to its
<tt>BitstreamFormat</tt>, perhaps updating the BSF's local metadata from
its external registry.

Ideally, you will only employ Provisional formats when there will eventually
be an entry in a globally-recognized registry for the format.
For example,
if you are adding a format to the GDFR but need to apply it to a Bitstream
immediately,
before the GDFR editorial process accepts it, you could create it
in the Provisional registry to have it available immediately.
Later, once the
GDFR has an entry for it, add the GDFR identifier to
the <tt>BitstreamFormat</tt> you already created.
Then, DIPs of objects in that format will bear the GDFR format identifier
that is recognizable to other archives, and your Bitstreams will also
have linkage to any preservation metadata in the GDFR.

This registry is implemented the same way as the "DSpace"
registry, reading format information from an XML document that lists
all the "Provisional" formats at startup.
However, its identifiers occupy a separate namespace so there is no chance
of collisions with the data formats provided by the DSpace release.

...

Experience has shown that even the most knowledgeable submitters
rarely understand or care about identifying the data formats of
materials they upload.
Also, many submissions are done in batch and
non-interactive transactions where human intervention is not possible.
Thus, we promote automatic format identification as the primary method
of assigning formats to Bitstreams, and strive to make it accurate,
reliable and efficient.

We propose a configurable and extensible framework for
integrating external automatic format identification services.
This is the best approach because:

...

The framework is a common API to which format identification services
conform. This lets DSpace treat them as a "stack" of plugin
implementations, trying each one in turn and choosing the best of
all their results. The API consists of:

...

A <tt>FormatHit</tt> is a record of the results of one
format-identification match.
It contains the following fields:

...

Each attempt at automatic identification of a Bitstream's format
returns a Collection of <tt>FormatHit</tt> objects, representing the
possible matches. The list is sorted by accuracy and confidence of hit.

...

The <tt>FormatHit</tt> includes a confidence metric, which represents
the accuracy and certainty of the
identification. It is an enumerated type of ordered, symbolic values
implemented as a Java 5
enumeration.

The specific values are described above, under the description of the
<tt>FormatConfidence</tt> object.

<tt>FormatHit</tt> includes a
confidence rating so hits can be compared on the basis of confidence, and so
it can be stored in the <tt>Bitstream</tt> object whose format was
identified.
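To illustrate how hits can be compared by confidence, here is a minimal sketch. The enum values `GUESS`, `PROBABLE`, and `CERTAIN` are placeholders invented for this example and are not the actual <tt>FormatConfidence</tt> values; `HitSketch` is likewise a hypothetical reduction of <tt>FormatHit</tt> to two fields.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Placeholder values only; the real FormatConfidence values are defined elsewhere.
enum ConfidenceSketch { GUESS, PROBABLE, CERTAIN }

class HitSketch {
    final String formatId;              // external format identifier
    final ConfidenceSketch confidence;

    HitSketch(String formatId, ConfidenceSketch confidence) {
        this.formatId = formatId;
        this.confidence = confidence;
    }

    // Orders hits from most to least confident, using the enum's declaration order.
    static final Comparator<HitSketch> BEST_FIRST = new Comparator<HitSketch>() {
        @Override
        public int compare(HitSketch a, HitSketch b) {
            return b.confidence.compareTo(a.confidence);
        }
    };

    // Returns the most confident hit, or null when there are none.
    static HitSketch best(List<HitSketch> hits) {
        List<HitSketch> sorted = new ArrayList<>(hits);
        sorted.sort(BEST_FIRST); // stable sort keeps earlier hits first on ties
        return sorted.isEmpty() ? null : sorted.get(0);
    }
}
```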

The confidence values have a greater range and granularity than seems
possible given DSpace's simple format model; i.e. DSpace does not distinguish
between "generic" and "specific" formats. However, the actual automatic
format identification is done by plugin implementations, some of which are
driven by
external format registries. These have access to more sophisticated
format models and data, including notions of format granularity,
so the confidence metrics reflect that.

...

Automatic format identification is accomplished by plugins implementing
the <tt>FormatIdentifier</tt> interface.
Each plugin applies its own technique
toward identifying the format of the
Bitstream. There is no direct
relationship between external data format registries and format
identifying plugins: a single plugin can utilize several registries or
none, and different plugins can use the same external registry.

Note that the <tt>FormatHit</tt> returned by the identification process
contains an external format identifier, not a <tt>BitstreamFormat</tt>.
The archive administrator is responsible for ensuring that all
external format identifiers returned by automatic identification methods can
be imported, i.e. that the relevant registries are configured.

...

The <tt>identifyFormat()</tt> method attempts to identify the data
format of the given Bitstream, and delivers its results by adding a new
<tt>FormatHit</tt> at the appropriate point on
the results list.
It returns the resulting list (possibly either modified or replaced)
as its value; the caller must anticipate that it may be a different Object.

An identifier method can add results anywhere in the result list, or
use the default algorithm implemented by <tt>FormatHit.addToResults()</tt>.
It is described in the next
section, Implementing Automatic Format Identification.

The identifier method should throw <tt>FormatIdentifierException</tt>
when it encounters a fatal error that prevents it from properly
identifying the format of the Bitstream. Otherwise, there would be no
way to tell the difference between a Bitstream that does not match any
of the formats this method identifies, and a fatal error in the
identifying code (e.g. configuration problem), since the results list
is simply returned unmodified in both cases.

When the identifier cannot return valid results because of a temporary
condition that may be cleared up later – e.g. a network resource that
is temporarily unavailable – it can throw
<tt>FormatIdentifierTemporaryException</tt> to indicate that
the results may change in the future.

A note on the object lifecycle: One instance of <tt>FormatIdentifier</tt>
is created per JVM; it gets cached
and reused. The <tt>identifyFormat()</tt> method is assumed to be
thread-safe. If it is not, the implementing class should have it
call an internal method which is synchronized on itself.

Typically, only the internal <tt>FormatIdentifierManager</tt>
code ever calls identification methods.

...

This is a static class to operate the plugin stack and return a format
identification verdict. Applications use it instead of calling the
<tt>FormatIdentifier</tt> plugin stack directly.

...

The <tt>identifyFormat</tt> method always returns a hit. If the
Bitstream was not successfully identified, it makes up a hit containing
the unknown format.
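The behavior of the manager can be sketched roughly as below. This is a simplified illustration: hits are reduced to identifier strings, the `Identifier` interface is a hypothetical stand-in for <tt>FormatIdentifier</tt>, and the `UNKNOWN_ID` value is a placeholder rather than the real identifier of the unknown format.

```java
import java.util.ArrayList;
import java.util.List;

class StackSketch {
    // Hypothetical, simplified plugin: it receives the results so far
    // and returns the list to use from then on.
    interface Identifier {
        List<String> identify(byte[] content, List<String> results);
    }

    static final String UNKNOWN_ID = "Unknown"; // placeholder value

    // Runs every plugin in order and always returns some verdict.
    static String identifyFormat(byte[] content, List<Identifier> stack) {
        List<String> results = new ArrayList<>();
        for (Identifier plugin : stack) {
            // A plugin may return a different list object, so keep its return value.
            results = plugin.identify(content, results);
        }
        // Make up an "unknown" verdict when nothing was identified.
        return results.isEmpty() ? UNKNOWN_ID : results.get(0);
    }
}
```

Because callers always get a verdict back, they do not need a separate code path for the unidentified case.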

Controlling the Format Identification Process

The format identification framework is based on a
sequence plugin, which
gives administrators complete freedom to
add and rearrange identification methods.

The <tt>FormatIdentifier.identifyFormat()</tt> method is very powerful;
it actually controls the entire process of automatic format
identification, even though it is called from deep within the framework.
The <tt>FormatIdentifierManager</tt> only calls the stack of
identifier methods in order and collects the results they provide.
Each DSpace administrator has complete control of the methods run
and the order of their execution, and the methods determine the results.
The format identification API was designed to be very flexible, and also
to make it easy to implement new identification methods.

Each implementation of <tt>FormatIdentifier.identifyFormat()</tt>
can do whatever it wants with the Bitstream and list of results it is
given. It might be a "filter" method that prunes the results of
any below a certain level of confidence. It could look at other results and
try to refine them, or reorder them.

An archive administrator can even insert a different framework into
this one by configuring just one method that calls out to the other
framework, and translates its results.
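As an example of the "filter" style of method mentioned above, the following sketch prunes hits below a minimum confidence. The `Level` values and the `Hit` class are placeholders invented for this illustration; a real plugin would work with <tt>FormatHit</tt> and the actual confidence enumeration.

```java
import java.util.ArrayList;
import java.util.List;

class PruneSketch {
    enum Level { LOW, MEDIUM, HIGH } // placeholder confidence values

    static class Hit {
        final String formatId;
        final Level level;

        Hit(String formatId, Level level) {
            this.formatId = formatId;
            this.level = level;
        }
    }

    // A "filter" method: returns a new list without the hits that fall
    // below the threshold, preserving the order of the rest.
    static List<Hit> prune(List<Hit> results, Level minimum) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : results) {
            if (hit.level.compareTo(minimum) >= 0) kept.add(hit);
        }
        return kept;
    }
}
```

Placed last in the plugin sequence, such a filter would let an administrator discard weak guesses without touching the other methods.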

...

Many different techniques and software products have
been developed to identify the format of a data file; it is
still a somewhat
mysterious art.
The best choice of methods and their ordering depends on your archive's
users and the materials they most often submit.

Plugins with the best chance of precisely identifying a format should
usually be first on the list, if they
use the default result-ordering method that gives priority to the first hits
among others of equal confidence.
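
A sketch of such a default ordering follows, assuming that only a numeric confidence is compared (the full comparison may consider more than that). <tt>OrderedHit</tt> and <tt>OrderedResults</tt> are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hit with a numeric confidence (higher is more certain).
class OrderedHit {
    final String format;
    final int confidence;

    OrderedHit(String format, int confidence) {
        this.format = format;
        this.confidence = confidence;
    }
}

class OrderedResults {
    // Inserts the hit after every existing hit of equal or greater
    // confidence, so earlier hits keep priority among equals.
    static void addToResults(List<OrderedHit> results, OrderedHit hit) {
        int i = 0;
        while (i < results.size() && results.get(i).confidence >= hit.confidence) {
            i++;
        }
        results.add(i, hit);
    }

    public static void main(String[] args) {
        List<OrderedHit> results = new ArrayList<>();
        addToResults(results, new OrderedHit("A", 2));
        addToResults(results, new OrderedHit("B", 2));
        addToResults(results, new OrderedHit("C", 3));
        for (OrderedHit hit : results) {
            System.out.println(hit.format + " " + hit.confidence);
        }
    }
}
```

With this rule, a plugin that runs earlier wins ties, which is why the most precise plugins belong at the front of the sequence.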

Note that some plugins
may depend on other
identification methods running before they do because
they refine an identification
already found on the results list. Special relationships
like that must be well documented so the administrator is aware of them.

Each <tt>FormatIdentifier</tt> plugin applies its special knowledge or
resources to attempt to identify the format of the Bitstream; it is not
responsible for solving the whole problem. For example, take a plugin
that executes a heuristic to detect comma-separated-values files. It might
collaborate with another method that detects plain-text files, so that it only
applies its algorithm to refine the format identification
if it sees from the results that the file is plain text.
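
The collaboration could be sketched as follows. The heuristic here is deliberately crude, and the format labels and class name are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class CsvRefiner {
    // Only acts when an earlier method already reported plain text;
    // in that case a more specific hit is placed ahead of it.
    static void identifyFormat(String text, List<String> results) {
        if (!results.contains("text/plain")) {
            return;
        }
        if (looksLikeCsv(text)) {
            results.add(0, "text/csv");
        }
    }

    // Crude heuristic: at least two non-empty lines, each containing the
    // same non-zero number of commas.
    static boolean looksLikeCsv(String text) {
        int expected = -1;
        int counted = 0;
        for (String line : text.split("\r?\n")) {
            if (line.isEmpty()) {
                continue;
            }
            int commas = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == ',') {
                    commas++;
                }
            }
            if (commas == 0) {
                return false;
            }
            if (expected == -1) {
                expected = commas;
            } else if (commas != expected) {
                return false;
            }
            counted++;
        }
        return counted >= 2;
    }

    public static void main(String[] args) {
        List<String> results = new ArrayList<>();
        results.add("text/plain");
        identifyFormat("a,b,c\n1,2,3\n", results);
        System.out.println(results);
    }
}
```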

...

One problem that has not yet been completely addressed by this
design is that many format-identification methods require random access to the contents of a Bitstream, but the Bitstream API only offers
serial access through a Java <tt>InputStream</tt>. Random access
means reading a sequence of bytes from the Bitstream starting at
any point in its extent; this is very helpful when looking for an
internal signature to identify the file, since the signature may
be located relative to the end of the file or at some large offset into it.

There are techniques to compensate for the lack of random access,
although they sacrifice efficiency. It may also be necessary to
add a method to <tt>Bitstream</tt> to retrieve a random-access stream
when the underlying storage implementation supports it.
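
One such technique is to spool the serial stream to a temporary file and open that file for random access. The sketch below uses only standard Java I/O; it is one possible workaround rather than part of the proposed API, and it pays for a full copy of the content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class RandomAccessSketch {
    // Returns the last n bytes of a serial stream (or the whole content
    // if it is shorter) by copying the stream to a temporary file first.
    static byte[] tail(InputStream in, int n) throws IOException {
        Path tmp = Files.createTempFile("bitstream", ".tmp");
        try {
            // The temp file already exists, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                long length = raf.length();
                int count = (int) Math.min(n, length);
                byte[] buffer = new byte[count];
                raf.seek(length - count); // jump relative to the end
                raf.readFully(buffer);
                return buffer;
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 body %%EOF".getBytes(StandardCharsets.US_ASCII);
        byte[] end = tail(new ByteArrayInputStream(data), 5);
        System.out.println(new String(end, StandardCharsets.US_ASCII));
    }
}
```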

...

Follow these steps when comparing two format identification hits
to determine which has priority. This is implemented as
the method <tt>FormatHit.compareTo()</tt>.

...

This is the default algorithm that is implemented by
<tt>FormatIdentifier.identifyFormat()</tt> methods that simply
call the <tt>FormatHit</tt>'s <tt>addToResults()</tt> method on each
hit they develop.

<blockquote>NOTE: It is not necessary to use this algorithm.
As described above, the format identification process is completely
under the control of the identifyFormat() method implementations.
</blockquote>

...

To select a <tt>BitstreamFormat</tt> from the results,
follow these steps:

  1. Starting with the first result, take the first format identifier and namespace that can successfully be resolved into a <tt>BitstreamFormat</tt> (importing a new one if necessary).
  2. If no <tt>BitstreamFormat</tt> can be resolved, the result is the unknown format, and the confidence is set to <tt>UNIDENTIFIED</tt>.
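
The two steps above can be sketched as follows. A plain map stands in for resolving or importing a <tt>BitstreamFormat</tt>, and the class names, identifiers and confidence labels other than <tt>UNIDENTIFIED</tt> are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hit: an external identifier, its namespace, a confidence.
class ResolvableHit {
    final String namespace;
    final String identifier;
    final String confidence;

    ResolvableHit(String namespace, String identifier, String confidence) {
        this.namespace = namespace;
        this.identifier = identifier;
        this.confidence = confidence;
    }
}

// The chosen format name together with the confidence to record.
class Selection {
    final String format;
    final String confidence;

    Selection(String format, String confidence) {
        this.format = format;
        this.confidence = confidence;
    }
}

class FormatSelector {
    // 'known' maps "namespace:identifier" to a local format name; it is a
    // stand-in for resolving or importing a BitstreamFormat.
    static Selection select(List<ResolvableHit> results, Map<String, String> known) {
        for (ResolvableHit hit : results) {
            String format = known.get(hit.namespace + ":" + hit.identifier);
            if (format != null) {
                return new Selection(format, hit.confidence); // step 1
            }
        }
        return new Selection("Unknown", "UNIDENTIFIED"); // step 2
    }

    public static void main(String[] args) {
        Map<String, String> known = new HashMap<>();
        known.put("PRONOM:example/1", "Example Format");
        List<ResolvableHit> results = Arrays.asList(
                new ResolvableHit("Provisional", "x-local", "HEURISTIC"),
                new ResolvableHit("PRONOM", "example/1", "POSITIVE"));
        Selection s = select(results, known);
        System.out.println(s.format + " (" + s.confidence + ")");
    }
}
```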

...

If an application wants to generate a dialog showing
all of the results of an automatic format identification
(e.g.
to give an interactive
user the chance to second-guess the automatically-chosen format)
it could call the plugins and process the results according to the
algorithm above. We don't anticipate anyone wanting such
a service, but if it comes up, we can always add another method to
<tt>FormatIdentifierManager</tt>.

...

The properties of a <tt>Bitstream</tt> describing its format and the
confidence of its identification have analogues within the
<tt>FormatHit</tt> structure. The logic to map between them is encapsulated
within
the <tt>FormatHit.applyToBitstream()</tt> method, so there
is only one piece of code to update if
either of those objects changes in the future.

...

As mentioned before, each <tt>FormatIdentifier</tt> implementation
only has to do part of the job, so it can be very narrowly focused.
It can also look at the results of previous methods to decide if it
has anything to add to the overall solution. For example,
a method that heuristically identifies text-based formats would
only proceed if it saw that a previous method had identified the data
as generic plain text.

Each method may add several <tt>FormatHit</tt>s to the result list,
or none at all.

As an example of the power and flexibility of this approach, imagine
a site that accepts many different kinds of XML formats, and needs precise
identification of a few of them (e.g. METS documents of various profiles).

...

A conflict arises when the automatic identification process returns
hits for incompatible formats. This is commonly caused by
contradictory clues in a Bitstream, for example, a filename extension
that suggests a different format than the one indicated by internal signature matches.
Consider a Bitstream
containing a well-formed XML document; the "internal signature" method
correctly identifies it as XML. However, its name ends with <tt>".txt"</tt>
which is only listed as an external signature for other kinds of formats.

We may wish to record a warning (e.g. in the server log) when the results
include such a conflict. There is no place to record it in the data model,
but since warnings are mainly of interest to an archive administrator
tuning identification or diagnosing problems, the log should be good enough.
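
A sketch of such a warning, using <tt>java.util.logging</tt> and treating any two distinct format names as a conflict. That definition of "incompatible" is a simplification made for the example, and <tt>ConflictCheck</tt> is a hypothetical class.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

class ConflictCheck {
    private static final Logger LOG = Logger.getLogger(ConflictCheck.class.getName());

    // Logs a warning and returns true when the hits name more than one
    // distinct format. A real check would compare format compatibility.
    static boolean warnOnConflict(String bitstreamName, List<String> hitFormats) {
        Set<String> distinct = new LinkedHashSet<>(hitFormats);
        if (distinct.size() > 1) {
            LOG.warning("Conflicting format hits for " + bitstreamName + ": " + distinct);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // An XML document named with a ".txt" extension.
        warnOnConflict("report.txt", Arrays.asList("XML", "Plain Text"));
    }
}
```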

...

The UI needs to display the data format of a Bitstream to the user in a
meaningful way.
Historically, this has been accomplished with
the <tt>name</tt> (formerly "short description") property
of the BSF, which is a short human-readable label such as "Adobe PDF 1.2".
In some contexts it may be helpful to also cite the confidence property
of the Bitstream to indicate how the format was discovered so the user
can tell how much to trust it.

There are also some contexts - e.g. when offering a selection among Bitstreams
to download - when it is helpful to include the MIME type of each
Bitstream as well.
Although MIME types are not meaningful to all end-users, they
do tell the more technically sophisticated ones what the browser (or other HTTP
client) is likely to do with a Bitstream.

...

In the first case, the problem is easily solved, but not so useful in
most applications; if you are choosing a new format for a Bitstream, why
limit your choice only to formats of Bitstreams already in the archive?
It is mostly useful when your purpose is to choose among
<tt>BitstreamFormat</tt>s, e.g. picking one to edit.

The second case is more helpful when actually selecting a format
for a real Bitstream, but
brings with it other problems - e.g. how to obtain and navigate
formats from an external registry:

...

For good reasons, we propose to sidestep the whole problem of
interfacing DSpace internals to an external registry to choose a format,
and leave it to be settled between the DSpace user interface
application and the
external registry.
Here is why:

Current data format registries are large and growing: PRONOM has over 400 entries already,
the TrID proprietary registry
has over 2,600 entries,
and the GDFR stands to acquire
thousands when it becomes operational.
They each have distinctive taxonomy and
relationship metadata to help navigate the format space; for example, GDFR
is developing a faceted classification system. Although GDFR may emerge
as the standard, there is currently no effective standard for data format
metadata.

Also, there are potentially several interchangeable DSpace UIs, each
with differing capabilities and styles.

Given the complexity of implementing solutions for the cross product of
format registries and DSpace UIs,
we think it is more productive to let each UI negotiate with the format
registry of its choice to produce a navigable display of formats.
For example, the UI can transfer control to a popup or dialog encapsulating
the registry's UI or a registry-specific extension. All it has to do is
return a format identifier that
can be resolved or imported to a
<tt>BitstreamFormat</tt>.

The alternative is to force all registries into a common model, which
would probably deprive them of the metadata most helpful to generating
a good navigation interface. Each registry has unique features
in its data model to facilitate browsing.

...

Since automatic format identification is powerful and fairly
reliable, it makes sense to use it to assist users in identifying the
format of their submissions, at least by narrowing down the choices.
The automatic process yields a list of hits, which should be
presented differently from a list of available formats:

...

The external registries are chosen by adding their names to a plugin interface
as shown here. Note that the plugin name, which is also the namespace it
covers, is supplied by the plugin itself through the <tt>getPluginNames()</tt>
method. The order is not significant.
This configuration example includes registries implementing
the PRONOM, DSpace, and Provisional namespaces (guessing from the classnames).

http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.asp?status=listReport#

plugin.selfnamed.org.dspace.content.format.FormatRegistry = \
org.dspace.content.format.PRONOMFormatRegistry, \
org.dspace.content.format.DSpaceFormatRegistry, \
org.dspace.content.format.ProvisionalFormatRegistry
#

# initialization files configured as "contact URIs":
    formatRegistry.DSpace.document = /dspace/config/registries/dspace-formats.xml
    formatRegistry.DSpace.validate = true
    formatRegistry.DSpace.schema = /dspace/config/registries/formats.xml
    formatRegistry.Provisional.document = /dspace/config/registries/provisional-formats.xml
    formatRegistry.Provisional.validate = true
    formatRegistry.Provisional.schema = /dspace/config/registries/formats.xml
    formatRegistry.PRONOM.contact =

Format identifiers are configured in a sequence plugin, as in this example:

...

The conversion process alters the archive as little as possible. If the
backward-compatible DSpace namespace is configured, existing formats
are simply mapped to that registry. Otherwise, every Bitstream has to be
automatically re-identified.

...

The following reports are helpful to administrators to check and validate
format-related configuration options, and to plan preservation activities.
They are generated by command-line administrative applications.

...

This report is intended to let administrators see the effect of changes
to the format-identification configuration by testing how existing
Bitstreams would now be identified. Given a group of DSpace objects
to work against, this test runs the automatic format identification
against each member Bitstream but does not change anything.

It reports the number of cases where the re-identification reaches
a different result than the existing identification, and optionally
shows each one in a detailed report. The report includes:

...

Operating over a range of selected Bitstreams, this report shows
the number of Bitstreams identified as each format. This lets the
administrator see what formats are in use, and the relative proportion
of each one, for the selected Bitstreams. It can be helpful when
planning for preservation, since it immediately shows the number of
Bitstreams in problematic formats.
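
The core of this report is a simple tally. Here is a minimal sketch, assuming the format name of each selected Bitstream has already been collected into a list (how the Bitstreams are selected is left out):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FormatHistogram {
    // Counts how many Bitstreams carry each format name. A TreeMap keeps
    // the report sorted by format name.
    static Map<String, Integer> count(List<String> formats) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String format : formats) {
            counts.merge(format, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> formats = Arrays.asList("PDF", "TIFF", "PDF", "Unknown");
        for (Map.Entry<String, Integer> entry : count(formats).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```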

...

Some administrators will undoubtedly have a need to make local
customizations to the descriptive and technical metadata for data formats.
These attributes of a <tt>BitstreamFormat</tt> may all be customized
by overriding the values imported from the remote registry – and the
overrides persist even when the BSF is updated from its external registry.
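
One way to get that behavior is to store local overrides separately from the imported values, so that an update replaces only the imported copy. The sketch below illustrates the idea with a hypothetical <tt>LocalFormat</tt> class and made-up property names; it is not the proposed data model.

```java
import java.util.HashMap;
import java.util.Map;

class LocalFormat {
    private final Map<String, String> imported = new HashMap<>();
    private final Map<String, String> overrides = new HashMap<>();

    // Replaces the imported values; local overrides are left untouched.
    void updateFromRegistry(Map<String, String> fresh) {
        imported.clear();
        imported.putAll(fresh);
    }

    // Records a local customization for one property.
    void override(String property, String value) {
        overrides.put(property, value);
    }

    // A local override, when present, wins over the imported value.
    String get(String property) {
        if (overrides.containsKey(property)) {
            return overrides.get(property);
        }
        return imported.get(property);
    }

    public static void main(String[] args) {
        LocalFormat format = new LocalFormat();
        Map<String, String> fromRegistry = new HashMap<>();
        fromRegistry.put("name", "Registry Name");
        format.updateFromRegistry(fromRegistry);
        format.override("name", "Local Name");
        format.updateFromRegistry(fromRegistry); // a later update
        System.out.println(format.get("name"));
    }
}
```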

...

Update the local copies of technical metadata from the originals in
remote format registries. Options are:

...

A timestamp of last update is maintained for all BSFs. When performing a
group update, use the timestamp farthest in the past as the limit when
searching for changed formats in the remote registry. After a group update,
set the time of last update of all relevant BSFs to the time of this
operation.
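
The timestamp handling can be sketched as below, assuming a non-empty map from format name to time of last update. The query against the remote registry is omitted, and <tt>GroupUpdate</tt> is a hypothetical name.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class GroupUpdate {
    // Returns the limit for the remote search (the oldest last-update
    // time in the group), then stamps every format with the time of
    // this operation. The map must not be empty.
    static Instant update(Map<String, Instant> lastUpdated, Instant now) {
        Instant limit = Collections.min(lastUpdated.values());
        // ... search the remote registry for formats changed since
        // 'limit' and apply the changes (omitted in this sketch) ...
        lastUpdated.replaceAll((format, previous) -> now);
        return limit;
    }

    public static void main(String[] args) {
        Map<String, Instant> lastUpdated = new HashMap<>();
        lastUpdated.put("A", Instant.parse("2008-01-01T00:00:00Z"));
        lastUpdated.put("B", Instant.parse("2007-06-01T00:00:00Z"));
        Instant limit = update(lastUpdated, Instant.parse("2008-03-01T00:00:00Z"));
        System.out.println("search for changes since " + limit);
    }
}
```

Using the oldest timestamp may fetch changes that some formats already have, but it guarantees that no format in the group misses an update.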

Edit <tt>Bitstream</tt> Technical Metadata

Here are the cases where a Bitstream's format technical
metadata must be modified:

...

Rerun the automatic format identification, perhaps after configuration
changes or improvements to a remote registry. This is the same
operation as the automatic format identification performed on ingest.

It can be done either interactively, for a single Bitstream, or in batch
for a set of selected Bitstreams. The interactive UI should offer
a choice of viewing and accepting the new identification choice.

...

The DSpace registry is updated by modifying or replacing its XML
configuration file. The new contents will be loaded automatically when
DSpace is next started.

Similarly, the Provisional registry is maintained by editing its
XML configuration file. Its contents depend entirely on the
local administrator, however.

...

Before removing an external format registry from the
configuration, all references to it must be removed from BSFs.
An administrative utility manages this automatically,
with proper checks, i.e. it does not delete the last (primary) external
format identifier from a BSF which is in use.
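
The check could work along these lines. This is a sketch under assumptions: identifiers are modeled as "namespace:id" strings, formats as map keys, and <tt>RegistryCleanup</tt> is a hypothetical class rather than the actual tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RegistryCleanup {
    // Removes every identifier in the given namespace from each format.
    // A format that is in use and would lose its last identifier is left
    // unchanged and reported in the returned list instead.
    static List<String> removeNamespace(Map<String, List<String>> identifiersByFormat,
                                        Set<String> inUse, String namespace) {
        List<String> skipped = new ArrayList<>();
        String prefix = namespace + ":";
        for (Map.Entry<String, List<String>> entry : identifiersByFormat.entrySet()) {
            List<String> ids = entry.getValue();
            List<String> remaining = new ArrayList<>();
            for (String id : ids) {
                if (!id.startsWith(prefix)) {
                    remaining.add(id);
                }
            }
            boolean losesLast = remaining.isEmpty() && !ids.isEmpty();
            if (losesLast && inUse.contains(entry.getKey())) {
                skipped.add(entry.getKey());
            } else {
                entry.setValue(remaining);
            }
        }
        return skipped;
    }
}
```

The administrator would resolve the skipped formats by hand, for example by assigning them an identifier from another registry, before removing the plugin.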

In a typical scenario, the archive administrator decides to get
rid of one of the external format registry plugins, e.g. the "DSpace"
registry. By removing it from all of the BSF entries first, she ensures
there will not be any reference failures after it is removed from
the plugin configuration – although theoretically, the only normal DSpace
operation that would fail is an update from the remote format registry.

...


...

Although adequate support for data format representation is a necessary
foundation for preservation activities, this work does not include any
actual preservation functions. These are all subjects for other projects:

...

This proposal does not include any explicit support for features of
container formats, that is, data formats such as "tar" and "Zip" which
serve primarily as "wrappers" for other data objects. Containers
typically implement some or all of the following functions:

...

Attaching thorough documentation about the interpretation of a data format
(i.e. standards documents) is an important
preservation tool. We do not need to make any provisions for this
within DSpace if we trust external data format
registries to maintain it as format technical metadata.

In particular, the GDFR architecture allows for a locally-administered
GDFR node,
with full local copies of all data, to be
integrated into the DSpace software. It includes provisions for
documentation on interpreting each format. We believe we can rely on
the GDFR to solve this problem when it is fully developed.

...


...

...