...
Here is a summary of all of the proposed changes, by section.
Content Model
...
Bitstream
Each <tt>Bitstream</tt> still refers to a <tt>BitstreamFormat</tt> object to identify its data format. In addition, the <tt>Bitstream</tt> gains two new properties:
- A format confidence metric, which indicates (on a coarse symbolic scale) the certainty of the identification of its format, reflecting both accuracy and precision.
- The source of the format identification, indicating the tool or mechanism responsible for the format technical metadata in this Bitstream.
...
BitstreamFormat
Although outwardly similar and largely backward-compatible, the <tt>BitstreamFormat</tt> has been completely gutted and re-implemented. It now serves as a local "cache" of format technical metadata and holds one or more external format identifiers, each of which refers to a complete technical metadata record in an external data format registry.
...
We add a plugin interface to provide access to external data format registries. Each registry is modeled as an implementation of the <tt>FormatRegistry</tt> interface. It is fairly simple; it only supports "importing" a format description into the local <tt>BitstreamFormat</tt> cache, updating an existing format, and a few queries.
...
Automatic Format Identification
The old <tt>org.dspace.content.FormatIdentifier</tt> is replaced by a configurable, extensible, plugin-based format identification framework. It is not part of the format registry plugin, because while some format recognition services live in a registry's software suite, others are independent of any registry.
...
The following exceptions are thrown when a fatal error is encountered in the format registry and identification framework. They are similar in meaning to existing exceptions in the DSpace API, such as <tt>AuthorizeException</tt> – signalling a fatal error with enough context and explanation to communicate the cause to the user or administrator.
...
FormatRegistryException
Sent when there is a fatal error while accessing an external format registry or updating the local cache of format metadata in the DSpace RDBMS. Can be caused by incorrect or missing configuration entries, network problems, filesystem problems, etc.
...
FormatRegistryNotFoundException
A subclass of <tt>FormatRegistryException</tt>, this exception is sent in particular cases when looking up an external identifier fails although it should have been found (e.g. since it had been found before). In the common case of looking up an identifier for the first time, e.g. through <tt>BitstreamFormat.findByIdentifier()</tt>, no exception gets thrown because failure is a possibly-expected result.
...
FormatIdentifierException
Thrown when a format identification method encounters a fatal error which would cause it to return a false negative result. For example, if its configuration is missing or incorrect, the method throws this exception rather than silently failing. Simply failing to identify a format is in the realm of expected results and does not cause an exception.
...
FormatIdentifierTemporaryException
Thrown by the format identification method when it fails because of a "temporary" problem, e.g. when a network resource is not available. This subclass of <tt>FormatIdentifierException</tt> tells the identifier manager that it may succeed when retried later.
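The relationships among these four exceptions can be sketched as follows. Only the two subclass relationships come from the descriptions above; that the base classes are checked exceptions extending <tt>Exception</tt> directly, and the constructor signatures, are assumptions made for illustration.

```java
// Sketch of the exception hierarchy described above.
// Assumption: both base classes extend java.lang.Exception directly.

/** Fatal error accessing an external registry or the local format cache. */
class FormatRegistryException extends Exception {
    FormatRegistryException(String message) {
        super(message);
    }
}

/** An identifier that should have been found (e.g. it was found before) was not. */
class FormatRegistryNotFoundException extends FormatRegistryException {
    FormatRegistryNotFoundException(String message) {
        super(message);
    }
}

/** Fatal error inside an identification method, e.g. bad configuration. */
class FormatIdentifierException extends Exception {
    FormatIdentifierException(String message) {
        super(message);
    }
}

/** Failure that may clear up on retry, e.g. an unreachable network resource. */
class FormatIdentifierTemporaryException extends FormatIdentifierException {
    FormatIdentifierTemporaryException(String message) {
        super(message);
    }
}
```

With this shape, a caller that catches <tt>FormatIdentifierException</tt> also catches the temporary variant, and can test for the subclass first when it wants to schedule a retry.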
...
Bitstream
The most significant change is that the <tt>Bitstream</tt> now remembers the confidence of its format identification, an enumerated value which indicates the certainty and source of its format identification. There is also a convenience method to access the automatic format identification: since it is almost always used to set the Bitstream's format anyway, this improves code clarity.
Here is an interface view of the API additions:
...
FormatConfidence
Panel |
---|
// Ordered symbolic values of format-identification confidence: |
...
Panel |
---|
public class Bitstream extends DSpaceObject |
...
BitstreamFormat
The <tt>BitstreamFormat</tt> class, which we will abbreviate as BSF, is essentially gutted and replaced with a new implementation. As described above, it now serves as a local "cache" of technical metadata that comes from external data format registries. Every BSF is bound to at least one format identifier in an external registry so its format technical metadata can be expressed in a way that is recognized outside of DSpace.
...
In the current (version 1.4.x) codebase, the <tt>BitstreamFormat</tt> object has acquired uses and meanings beyond simply describing a Bitstream's data format – but these interfere with its intended purpose in preservation activities. For example, the BSF has an internal flag which directs the UI to hide Bitstreams of that format from casual view. In an unmodified DSpace installation, the <tt>"License"</tt> BSF is the only one for which internal is true, to keep deposit license files from appearing in the Web UI. Unfortunately, this usage cripples <tt>"License"</tt> as an actual format descriptor, since it gets applied to all sorts of Bitstreams that contain licensing information no matter what their actual format. XML, plain-text, and RDF files are all tagged with the "License" BSF to make them disappear, yet it says nothing about the format of their contents.
...
This is the new formal definition of a <tt>BitstreamFormat</tt>:
- Each BSF represents a description of a single, unique data format; there is exactly one BSF for each distinct data format referenced by Bitstreams in the DSpace archive.
- A BSF is bound to one or more entries in external data format registries. The identifiers are logically all peers, although the metadata cached in the BSF is only imported (or updated) from one of them.
- All external format identifiers which describe the equivalent format must be bound to the same DSpace BSF – in other words, there should never be two BSFs describing the same conceptual format, such as "PDF Version 1.2"; one BSF encompasses all synonym external identifiers.
- The BSF's function is to describe the data format of the contents of a Bitstream, and nothing more.
- Application code must not "overload" a BSF with additional implicit meanings, such as marking Bitstreams invisible in a UI or indicating a function such as the deposit license.
- One special BSF, the unknown format, represents the unknown or unidentified data format.
- Every Bitstream refers to exactly one BSF:
- If its format has not been assigned or identified, it is the unknown format.
- This allows an application to assume every Bitstream has a valid BSF with all of its attendant properties, so e.g. it can get a valid MIME type.
In the content model, a <tt>BitstreamFormat</tt> aggregates all the format-related technical metadata for Bitstreams of its type. Not only does this save space, it lets an administrator make changes and adjustments to that metadata easily in one place.
...
This new implementation makes the <tt>BitstreamFormat</tt> a local cache for the relevant format metadata, but mainly it acts as a reference to the full technical metadata found in one or more external format registries (like the GDFR). It only caches the metadata immediately needed by DSpace, such as MIME type, name, and description. This is adequate for everyday operation of the archive; DSpace never has to go to the external format registry for metadata.
...
The standard namespace values are available as public static fields on the <tt>FormatRegistryManager</tt> class. The LOC namespace is not really a registry yet but it makes sense to reserve the namespace since it is a significant source of format technical metadata.
...
Note that MIME types cannot be BSF identifiers because they violate the rule that only one BSF may be bound to each identifier. MIME types are imprecise: many BSFs have the same MIME type, e.g. a lot of XML-based formats are tagged <tt>"text/xml"</tt>.
Apple UTIs
Apple Computer has developed what is essentially an alternative to MIME types called Uniform Type Identifiers (UTIs). It is an interesting development, although not directly relevant. Although the UTI database is, in a sense, a registry of format identifiers, it is not a good candidate for use in DSpace for several reasons:
...
Removing the "Internal" Flag
This renovation removes <tt>BitstreamFormat</tt>'s "internal" flag, which was originally intended to guide UI applications in hiding certain classifications of Bitstreams.
...
Its presence is actually harmful: not only does it have nothing to do with describing the format of the data, it actually encouraged usage that obscures the data format. The one <tt>"License"</tt> BSF was applied to all Bitstreams containing an Item's deposit license and Creative Commons licenses, no matter what their actual data formats. The Creative Commons license consists of three Bitstreams of distinct actual formats – e.g. one is RDF. Each is mislabeled with the <tt>"License"</tt> format, so it will not be properly preserved.
In the DSpace@MIT registry, I have determined that <tt>"License"</tt> is the only <tt>BitstreamFormat</tt> for which "internal" is true, and that Bitstreams whose format is "License" appear only in Bundles named "LICENSE" and "CC_LICENSE". Therefore, we can determine the internal-ness (i.e. invisibility) of a Bitstream by its owning Bundle as accurately as by the bogus BSF. It makes no practical difference which technique is used, but the Bundle-name cue is a better fit with the current content model. It works just as well under the DSpace 2.0 content model, since bundles evolve into "manifestations".
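A minimal sketch of the Bundle-name cue, assuming a UI only needs a yes/no answer per Bundle. The class and method names here are hypothetical; the two Bundle names are the ones observed above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: decides invisibility from the owning Bundle's name. */
final class BundleVisibilitySketch {
    // The Bundle names found above to hold "License" Bitstreams.
    private static final Set<String> INTERNAL_BUNDLES =
            Collections.unmodifiableSet(new HashSet<String>(
                    Arrays.asList("LICENSE", "CC_LICENSE")));

    private BundleVisibilitySketch() {
    }

    /** True when Bitstreams in the named Bundle should be hidden from casual view. */
    static boolean isInternal(String bundleName) {
        return bundleName != null && INTERNAL_BUNDLES.contains(bundleName);
    }
}
```

A UI would call <tt>BundleVisibilitySketch.isInternal(...)</tt> with the name of a Bitstream's owning Bundle instead of consulting the BSF.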
...
...
BitstreamFormat API
Here is the new API for <tt>BitstreamFormat</tt>.
Add the following methods:
...
...
FormatRegistryManager
Following DSpace coding conventions, the factory and static class for a service is named with the suffix <tt>-Manager</tt>. The <tt>FormatRegistryManager</tt> class gives access to instances of <tt>FormatRegistry</tt>. Since a format identifier is directed to a <tt>FormatRegistry</tt> implementation by its namespace, the Manager also takes care of selecting the right instance for a namespaced identifier. This lets applications use namespaced identifiers without worrying about taking them apart to choose a registry instance.
Since all of the <tt>FormatRegistryManager</tt>'s state is effectively managed by the Plugin Manager, it does not need any state itself and only has static methods.
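The dispatch-by-namespace idea can be sketched as a stateless static method. This is an illustration only: the reduced registry interface, the method names, and the lookup function standing in for the Plugin Manager are all assumptions, and the namespace and identifier are passed separately because the sketch does not assume any particular syntax for namespaced identifiers.

```java
import java.util.function.Function;

/** Hypothetical, much-reduced stand-in for the FormatRegistry interface. */
interface NamedRegistrySketch {
    /** Returns a short description of the format with the given identifier. */
    String describe(String identifier);
}

/** Stateless manager: every call is handed the plugin lookup it should use. */
final class RegistryManagerSketch {
    private RegistryManagerSketch() {
        // static methods only, so no instances
    }

    /**
     * Routes the request to whichever registry covers the namespace.
     * pluginLookup stands in for the Plugin Manager and returns null
     * when no registry is configured for a namespace.
     */
    static String describe(Function<String, NamedRegistrySketch> pluginLookup,
                           String namespace, String identifier) {
        NamedRegistrySketch registry = pluginLookup.apply(namespace);
        if (registry == null) {
            throw new IllegalArgumentException(
                    "No registry configured for namespace: " + namespace);
        }
        return registry.describe(identifier);
    }
}
```

Because the manager keeps no fields of its own, swapping the set of configured registries is entirely a matter of what the lookup function returns.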
...
...
FormatRegistry
The <tt>FormatRegistry</tt> interface models an external data format registry. We define data format registry as any formally organized and administered collection of technical metadata about data formats. This may include a collection published mainly for human consumption such as the Library of Congress Sustainability of Digital Formats format catalog, as well as those accessible through public APIs such as the GDFR and DROID. The only requirement is that the data formats are named by unchanging, unique identifiers.
A format registry becomes available to DSpace through a Named Plugin implementing the <tt>FormatRegistry</tt> interface.
A registry has these functions, explained in detail below:
- Resolve a reference to a format identifier, and create a new <tt>BitstreamFormat</tt> representing the external format.
- Update the local cache of metadata in a <tt>BitstreamFormat</tt> to the current state of the external registry.
- Answer "conformance" queries - judge whether a Bitstream in one format would also conform to another format (i.e. former is a subtype of latter).
...
Format registry implementations are tightly coupled with DSpace. By this we mean they must be able to respond to frequent queries quickly, with low latency and high reliability. The format registry must be available to complete some common operations such as ingestion and selection of applications like <tt>MediaFilter</tt>.
This is only likely to be an issue at all with registries that are attached through the network. Registries that exist in local data files or RDBMS tables can share server resources with the DSpace archive.
Network-based registries might use a local "cache" server sharing the DSpace host to increase reliability. The GDFR architecture explicitly encourages this sort of configuration. Otherwise, it might be necessary for the <tt>FormatRegistry</tt> implementation to add caching of its own to increase performance.
...
Although some of the tools to automatically identify formats are tied to format registries, this registry interface does not have anything to do with format identification. The identification tools are accessed through a separate plugin interface, discussed below.
...
FormatRegistry API
Here is the API of the <tt>FormatRegistry</tt>. The plugin's name is also the DSpace string value representing its namespace. It is implemented as a self-named plugin, so that the instance itself knows its namespace without depending on each DSpace administrator to get it right. The namespaces must be consistent between DSpace installations so that format technical metadata (i.e. PREMIS elements in AIPs) can be meaningfully exchanged.
...
The unknown BSF is installed with the system, but for consistency, it is also derived from an entry in an "external" format registry. Since it is the only BSF which is absolutely mandatory, this registry must always be available, so it is a hard-coded registry that is always configured.
The <tt>FormatRegistryManager</tt> maps the namespace <tt>DSpace-Internal</tt> to a special registry object which only recognizes the "Unknown" format identifier. The first reference to that format identifier, e.g. by the method <tt>BitstreamFormat.findUnknown()</tt>, "imports" it to create the unknown format BSF.
...
To add entries to the Provisional format registry, the DSpace administrator edits its configuration file (in a documented XML format similar to the current <tt>bitstream-formats.xml</tt> initialization file) and restarts any relevant DSpace processes. Since changes should be very infrequent, this should not be a burden.
...
The "DSpace" registry includes most of the traditional, loosely-defined, format names, like <tt>"Text", "Adobe PDF", "HTML"</tt>. It offers a simple solution for DSpace administrators who do not need precise and detailed format identification, nor the digital preservation tools that require it. Since it includes most of the formats from previous DSpace releases under their same names, it also gives a degree of backward-compatibility.
...
As soon as a new format does become available in some external registry, you can add the new external identifier to its <tt>BitstreamFormat</tt>, perhaps updating the BSF's local metadata from its external registry.
Ideally, you will only employ Provisional formats when there will eventually be an entry in a globally-recognized registry for the format. For example, if you are adding a format to the GDFR but need to apply it to a Bitstream immediately, before the GDFR editorial process accepts it, you could create it in the Provisional registry to have it available immediately. Later, once the GDFR has an entry for it, add the GDFR identifier to the <tt>BitstreamFormat</tt> you already created. Then, DIPs of objects in that format will bear the GDFR format identifier that is recognizable to other archives, and your Bitstreams will also have linkage to any preservation metadata in the GDFR.
...
The framework is a common API to which format identification services conform. This lets DSpace treat them as a "stack" of plugin implementations, trying each one in turn and choosing the best of all their results. The API consists of:
- <tt>FormatHit</tt>: A class encapsulating the information returned by a single potential format identification "hit".
- <tt>FormatConfidence</tt>: A set of enumerated values that quantifies the "confidence" (certainty, accuracy) of format identifications for comparison on a common basis.
- <tt>FormatIdentifier</tt>: Interface of the plugin class that actually identifies formats.
- <tt>FormatIdentifierManager</tt>: Static class to operate the plugin stack and return a format identification verdict.
The FormatHit Object
A <tt>FormatHit</tt> is a record of the results of one format-identification match. It contains the following fields:
...
Each attempt at automatic identification of a Bitstream's format returns a Collection of <tt>FormatHit</tt> objects, representing the possible matches. The list is sorted by accuracy and confidence of hit.
Confidence of Format Identification Hits
The <tt>FormatHit</tt> includes a confidence metric, which represents the accuracy and certainty of the identification. It is an enumerated type of ordered, symbolic values implemented as a Java 5 enumeration.
The specific values are described above, under the description of the <tt>FormatConfidence</tt> object.
<tt>FormatHit</tt> includes a confidence rating so hits can be compared on the basis of confidence, and so it can be stored in the <tt>Bitstream</tt> object whose format was identified.
The confidence values have a greater range and granularity than seems possible given DSpace's simple format model; i.e. DSpace does not distinguish between "generic" and "specific" formats. However, the actual automatic format identification is done by plugin implementations, some of which are driven by external format registries. These have access to more sophisticated format models and data, including notions of format granularity, so the confidence metrics reflect that.
...
FormatIdentifier Interface
Automatic format identification is accomplished by plugins implementing the <tt>FormatIdentifier</tt> interface. Each plugin applies its own technique toward identifying the format of the Bitstream. There is no direct relationship between external data format registries and format identifying plugins: a single plugin can utilize several registries or none, and different plugins can use the same external registry.
Note that the <tt>FormatHit</tt> returned by the identification process contains an external format identifier, not a <tt>BitstreamFormat</tt>. The archive administrator is responsible for ensuring that all external format identifiers returned by automatic identification methods can be imported, i.e. that the relevant registries are configured.
...
Panel |
---|
package org.dspace.content.format; |
The <tt>identifyFormat()</tt> method attempts to identify the data format of the given Bitstream, and delivers its results by adding a new <tt>FormatHit</tt> at the appropriate point on the results list. It returns the resulting list (possibly either modified or replaced) as its value; the caller must anticipate that it may be a different Object.
An identifier method can add results anywhere in the result list, or use the default algorithm implemented by <tt>FormatHit.addToResults()</tt>. It is described in the next section, Implementing Automatic Format Identification.
The identifier method should throw <tt>FormatIdentifierException</tt> when it encounters a fatal error that prevents it from properly identifying the format of the Bitstream. Otherwise, there would be no way to tell the difference between a Bitstream that does not match any of the formats this method identifies, and a fatal error in the identifying code (e.g. configuration problem), since the results list is simply returned unmodified in both cases.
When the identifier cannot return valid results because of a temporary condition that may be cleared up later – e.g. a network resource that is temporarily unavailable – it can throw the <tt>FormatIdentifierTemporaryException</tt> to indicate that the results may change in the future.
A note on the object lifecycle: One instance of <tt>FormatIdentifier</tt> is created per JVM; it gets cached and reused. The <tt>identifyFormat()</tt> method is assumed to be thread-safe. If it is not, the implementing class should have it call an internal method which is synchronized on itself.
Typically, only the internal <tt>FormatIdentifierManager</tt> code ever calls identification methods.
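The synchronized-delegate pattern mentioned above might look like the sketch below. The class is hypothetical and its <tt>identifyFormat</tt> signature is deliberately simplified (a file name in, a format name out) rather than the real plugin signature; the point is only that the public method delegates to a private method that locks on the shared instance.

```java
/** Hypothetical identifier wrapping a tool that is not thread-safe. */
final class SynchronizedIdentifierSketch {
    // Shared mutable state, standing in for the non-thread-safe tool.
    private int calls = 0;

    /** Public entry point; many threads may call it on the one cached instance. */
    String identifyFormat(String bitstreamName) {
        return identifySerially(bitstreamName);
    }

    /** Runs in one thread at a time because it synchronizes on this instance. */
    private synchronized String identifySerially(String bitstreamName) {
        calls++;
        // Placeholder logic: a real plugin would inspect the Bitstream here.
        return bitstreamName.endsWith(".xml") ? "XML" : null;
    }

    /** Number of identifications performed so far. */
    synchronized int callCount() {
        return calls;
    }
}
```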
...
FormatIdentifierManager
This is a static class to operate the plugin stack and return a format identification verdict. Applications use it instead of calling the <tt>FormatIdentifier</tt> plugin stack directly.
Panel |
---|
package org.dspace.content.format; |
The <tt>identifyFormat</tt> method always returns a hit. If the Bitstream was not successfully identified, it makes up a hit containing the unknown format.
...
The format identification framework is based on a sequence plugin, which gives administrators complete freedom to add and rearrange identification methods.
The <tt>FormatIdentifier.identifyFormat()</tt> method is very powerful; it actually controls the entire process of automatic format identification, even though it is called from deep within the framework. The <tt>FormatIdentifierManager</tt> only calls the stack of identifier methods in order and collects the results they provide. Each DSpace administrator has complete control of the methods run and the order of their execution, and the methods determine the results. The format identification API was designed to be very flexible, and also to make it easy to implement new identification methods.
Each implementation of <tt>FormatIdentifier.identifyFormat()</tt> can do whatever it wants with the Bitstream and list of results it is given. It might be a "filter" method that prunes the results of any below a certain level of confidence. It could look at other results and try to refine them, or reorder them.
...
Note that some plugins may depend on other identification methods running before they do because they refine an identification already found on the results list. Special relationships like that must be well documented so the administrator is aware of them.
Each <tt>FormatIdentifier</tt> plugin applies its special knowledge or resources to attempt to identify the format of the Bitstream; it is not responsible for solving the whole problem. For example, take a plugin that executes a heuristic to detect comma-separated-values files. It might collaborate with another method that detects plain-text files, so that it only applies its algorithm to refine the format identification if it sees from the results that the file is plain text.
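The comma-separated-values example could be sketched like this. Everything here is an assumption for illustration: the results list is reduced to a list of format identifier strings, the two identifiers are made up, and the heuristic (every line has the same non-zero number of commas) is deliberately crude.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plugin that refines a plain-text hit into a CSV hit. */
final class CsvRefinerSketch {
    static final String PLAIN_TEXT = "text-plain"; // made-up identifier
    static final String CSV = "text-csv";          // made-up identifier

    private CsvRefinerSketch() {
    }

    /** Returns the results, with a CSV hit placed first when warranted. */
    static List<String> identifyFormat(String content, List<String> results) {
        if (!results.contains(PLAIN_TEXT)) {
            return results; // no plain-text hit to refine, so do nothing
        }
        if (!looksLikeCsv(content)) {
            return results;
        }
        // Return a new list: callers must expect a different object.
        List<String> refined = new ArrayList<String>(results);
        refined.add(0, CSV);
        return refined;
    }

    /** Crude heuristic: two or more lines, all with the same non-zero comma count. */
    static boolean looksLikeCsv(String content) {
        String[] lines = content.split("\r?\n");
        if (lines.length < 2) {
            return false;
        }
        int expected = countCommas(lines[0]);
        if (expected == 0) {
            return false;
        }
        for (String line : lines) {
            if (countCommas(line) != expected) {
                return false;
            }
        }
        return true;
    }

    private static int countCommas(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == ',') {
                count++;
            }
        }
        return count;
    }
}
```

Note that such a plugin only makes sense when placed after the plain-text detector in the plugin sequence.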
...
One problem that has not yet been completely addressed by this design is that many format-identification methods require random access to the contents of a Bitstream, but the Bitstream API only offers serial access through a Java <tt>InputStream</tt>. Random access means reading a sequence of bytes from the Bitstream starting at any point in its extent; this is very helpful when looking for an internal signature to identify the file, since the signature may be located relative to the end of the file or at some large offset into it.
There are techniques to compensate for the lack of random access, although they sacrifice efficiency. It may also be necessary to add a method to <tt>Bitstream</tt> to retrieve a random-access stream when the underlying storage implementation supports it.
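One such compensating technique, sketched here under the assumption that only a signature near the end of the file is needed, is to read the stream serially while keeping the most recent bytes in a ring buffer. The class name is hypothetical; the cost is that the whole stream must be read.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: fetches the tail of a stream without random access. */
final class TailReaderSketch {
    private TailReaderSketch() {
    }

    /** Returns up to the last n bytes of the stream, read once from start to end. */
    static byte[] tail(InputStream in, int n) throws IOException {
        if (n <= 0) {
            return new byte[0];
        }
        byte[] ring = new byte[n];
        long total = 0;
        int b;
        // Byte-at-a-time for clarity; wrap the stream in a buffer in real use.
        while ((b = in.read()) != -1) {
            ring[(int) (total % n)] = (byte) b;
            total++;
        }
        int length = (int) Math.min(total, n);
        // Once the ring has wrapped, the oldest retained byte sits at total % n.
        int start = total <= n ? 0 : (int) (total % n);
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            out[i] = ring[(start + i) % n];
        }
        return out;
    }
}
```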
...
Follow these steps when comparing two format identification hits to determine which has priority. This is implemented as the method <tt>FormatHit.compareTo()</tt>.
- If the Namespace of the identifier cannot be resolved (looked up), i.e. because there is no FormatRegistry configured for it, that hit loses. See <tt>FormatRegistryManager.find()</tt>.
- Order hits by their FormatConfidence index. Thus, hits based on the content of the Bitstream rate more highly than ones based on external attributes like the name.
- Between two equal hits, if one has a non-null warning it is ranked lower.
- When a hit has a conflict (that is, there is a lower-ranked hit which disagrees with it because the MIME type is different or similar), it is ranked below hits of the same confidence.
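One reading of these steps, expressed as a comparator, is sketched below. The hit class is a reduced stand-in for <tt>FormatHit</tt>, and the confidence values other than <tt>UNIDENTIFIED</tt> are invented for the example; the real <tt>FormatConfidence</tt> scale is defined elsewhere in this proposal.

```java
import java.util.Comparator;

/** Invented scale; only UNIDENTIFIED is a name taken from this proposal. */
enum ConfidenceSketch { UNIDENTIFIED, BY_NAME, BY_CONTENT }

/** Reduced stand-in for FormatHit, holding only what the ordering needs. */
final class RankedHit {
    final boolean resolvable;          // is a registry configured for its namespace?
    final ConfidenceSketch confidence; // higher ordinal means more certain
    final String warning;              // null when no warning was raised
    final boolean conflict;            // does a lower-ranked hit disagree with it?

    RankedHit(boolean resolvable, ConfidenceSketch confidence,
              String warning, boolean conflict) {
        this.resolvable = resolvable;
        this.confidence = confidence;
        this.warning = warning;
        this.conflict = conflict;
    }
}

final class HitOrderSketch {
    private HitOrderSketch() {
    }

    /** Sorts best hit first: a negative result means a outranks b. */
    static final Comparator<RankedHit> BEST_FIRST = (a, b) -> {
        // Step 1: a hit whose namespace cannot be resolved loses.
        if (a.resolvable != b.resolvable) {
            return a.resolvable ? -1 : 1;
        }
        // Step 2: higher confidence ranks first.
        int byConfidence = b.confidence.compareTo(a.confidence);
        if (byConfidence != 0) {
            return byConfidence;
        }
        // Step 3: between equal hits, one with a warning ranks lower.
        boolean aWarns = a.warning != null;
        boolean bWarns = b.warning != null;
        if (aWarns != bWarns) {
            return aWarns ? 1 : -1;
        }
        // Step 4: a hit with a conflict ranks below its equals.
        if (a.conflict != b.conflict) {
            return a.conflict ? 1 : -1;
        }
        return 0;
    };
}
```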
...
This is the default algorithm that is implemented by <tt>FormatIdentifier.identifyFormat()</tt> methods that simply call the <tt>FormatHit</tt>'s <tt>addToResults()</tt> method on each hit they develop.
...
- Start with an empty results list.
- Call the <tt>FormatIdentifier.identifyFormat()</tt> method of each plugin in the sequence in turn:
  - Passing it the Bitstream and list of accumulated results so it can add new results.
  - If it has a better-confidence match than the current head of the list, that hit becomes the new head of the list.
  - Otherwise the hit gets appended to the end of the list.
- When finished, the head of the list is the best format match.
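The head-or-append rule in the steps above can be sketched as follows. The hit class is a hypothetical stand-in with an integer confidence in place of the <tt>FormatConfidence</tt> enumeration, and the method is a plain static function rather than the real <tt>addToResults()</tt>.

```java
import java.util.List;

/** Hypothetical stand-in for FormatHit: an identifier plus a numeric confidence. */
final class ScoredHit {
    final String formatId;
    final int confidence; // larger means more confident

    ScoredHit(String formatId, int confidence) {
        this.formatId = formatId;
        this.confidence = confidence;
    }
}

final class AddToResultsSketch {
    private AddToResultsSketch() {
    }

    /**
     * Default placement rule: a hit that beats the current head becomes
     * the new head; any other hit is appended to the end of the list.
     */
    static List<ScoredHit> addToResults(ScoredHit hit, List<ScoredHit> results) {
        if (results.isEmpty() || hit.confidence > results.get(0).confidence) {
            results.add(0, hit);
        } else {
            results.add(hit);
        }
        return results;
    }
}
```

This rule guarantees only that the head is the best match; the rest of the list stays in arrival order unless a plugin reorders it.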
To select a <tt>BitstreamFormat</tt> from the results, follow these steps:
- Starting with the first result, take the first format identifier and namespace that can successfully be resolved into a <tt>BitstreamFormat</tt> (importing a new one if necessary).
- If no <tt>BitstreamFormat</tt> is available, the result is the unknown one, and the confidence is set to <tt>UNIDENTIFIED</tt>.
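A minimal sketch of that selection, assuming the results arrive best-first and that resolvability can be asked as a yes/no question. The <tt>Verdict</tt> class, the string-valued confidence, and the predicate standing in for registry lookup are all hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical result pair: the chosen format and the confidence to record. */
final class Verdict {
    final String formatId;
    final String confidence;

    Verdict(String formatId, String confidence) {
        this.formatId = formatId;
        this.confidence = confidence;
    }
}

final class SelectFormatSketch {
    static final String UNKNOWN = "Unknown";
    static final String UNIDENTIFIED = "UNIDENTIFIED";

    private SelectFormatSketch() {
    }

    /**
     * Walks the hits from the head and returns the first one whose
     * identifier can be resolved; falls back to the unknown format.
     * canResolve stands in for resolving or importing a BitstreamFormat.
     */
    static Verdict select(List<Verdict> hits, Predicate<String> canResolve) {
        for (Verdict hit : hits) {
            if (canResolve.test(hit.formatId)) {
                return hit;
            }
        }
        return new Verdict(UNKNOWN, UNIDENTIFIED);
    }
}
```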
This logic is encapsulated in the <tt>FormatIdentifierManager.identifyFormat()</tt> method.
If an application wants to generate a dialog showing all of the results of an automatic format identification (e.g. to give an interactive user the chance to second-guess the automatically-chosen format) it could call the plugins and process the results according to the algorithm above. We don't anticipate anyone wanting such a service, but if it comes up, we can always add another method to <tt>FormatIdentifierManager</tt>.
Applying the results to a Bitstream
The properties of a <tt>Bitstream</tt> describing its format and the confidence of its identification have analogues within the <tt>FormatHit</tt> structure. The logic to map between them is encapsulated within the <tt>FormatHit.applyToBitstream()</tt> method, so there is only one piece of code to update if either of those objects changes in the future.
Implementing a FormatIdentifier plugin
As mentioned before, each <tt>FormatIdentifier</tt> implementation only has to do part of the job, so it can be very narrowly focused. It can also look at the results of previous methods to decide if it has anything to add to the overall solution. For example, a method that heuristically identifies text-based formats would only proceed if it saw that a previous method had identified the data as generic plain text.
Each method may add several <tt>FormatHit</tt>s to the result list, or none at all.
...
- A signature-based format identifier notices that the file starts with <tt>"<?xml"</tt> and adds a hit for the generic format "XML", and MIME type <tt>text/xml</tt>.
- The text heuristic identifier methods don't bother running because there is already a higher-confidence positive identification (of the generic XML format).
- A table-driven specific XML identifier method notices the hit for the generic XML format, and parses enough of the file to match one of the XPath specifications in its configuration. This identifies the file as an IMS Content Package manifest, MIT OCW version 1 profile.
- If the XML parse had failed, the plugin could add a warning to the generic "XML" hit since it was obviously not well-formed XML.
- Results include the "OCW-IMSCP" format first, "XML" next, and perhaps other generic hits after it.
...
A conflict arises when the automatic identification process returns hits for incompatible formats. This is commonly caused by contradictory clues in a Bitstream, for example, a filename extension that suggests a different format than the one indicated by internal signature matches. Consider a Bitstream containing a well-formed XML document; the "internal signature" method correctly identifies it as XML. However, its name ends with <tt>".txt"</tt>, which is only listed as an external signature for other kinds of formats.
...
- Displaying a human-readable description of the format to the end-user.
- Choosing from among all available data formats:
  - Just the formats in the <tt>BitstreamFormat</tt> table.
  - All formats available in selected external registries.
- Selecting a <tt>BitstreamFormat</tt> from among the results of an automatic format identification, with the option of choosing freely as in Case #2.
- Administrative interface to add and update <tt>BitstreamFormat</tt> objects.
- Administrative interface to modify the chosen format of a <tt>Bitstream</tt> (existing UI can be used with minimal changes).
...
The UI needs to display the data format of a Bitstream to the user in a meaningful way. Historically, this has been accomplished with the <tt>name</tt> (formerly "short description") property of the BSF, which is a short human-readable label such as "Adobe PDF 1.2". In some contexts it may be helpful to also cite the confidence property of the Bitstream to indicate how the format was discovered so the user can tell how much to trust it.
...
First, what range of formats do you want to be able to choose from?
- All <tt>BitstreamFormat</tt> names? This typically includes only formats of objects that have already been imported into the archive.
- All of the formats in a given external registry, or set of them? This is likely to be more complete, but brings with it the problem of handling a very large list, perhaps thousands of choices.
In the first case, the problem is easily solved, but not so useful in most applications; if you are choosing a new format for a Bitstream, why limit your choice only to formats of Bitstreams already in the archive? It is mostly useful when your purpose is to choose among <tt>BitstreamFormat</tt>s, e.g. picking one to edit.
...
Given the complexity of implementing solutions for the cross product of format registries and DSpace UIs, we think it is more productive to let each UI negotiate with the format registry of its choice to produce a navigable display of formats. For example, the UI can transfer control to a popup or dialog encapsulating the registry's UI or a registry-specific extension. All it has to do is return a format identifier that can be resolved or imported to a <tt>BitstreamFormat</tt>.
The alternative is to force all registries into a common model, which would probably deprive them of the metadata most helpful to generating a good navigation interface. Each registry has unique features in its data model to facilitate browsing.
...
- Indicate default choice of format and its MIME type.
- Ordering is significant: hits closer to the head of the list take precedence.
- The confidence metric on each hit is highly significant in helping the user evaluate them, so include it prominently and in a way that makes its values easy for naive users to understand.
- Offer an "escape route" to choose a format from all available formats. This becomes the default if automatic format identification failed.
Editing BitstreamFormat Table
See the next section for details; this is classed as an administrative operation.
Administrative Operations Relating to BitstreamFormats
The DSpace administrator manages data formats with these operations:
...
The external registries are chosen by adding their names to a plugin interface as shown here. Note that the plugin name, which is also the namespace it covers, gets supplied by the plugin itself through the <tt>getPluginNames()</tt> method. The order is not significant. This configuration example includes registries implementing the PRONOM, DSpace, and Provisional namespaces (guessing from the classnames).
...
The report includes, for each <tt>BitstreamFormat</tt> in use,
- BSF's name
- External identifier(s)
- Count of Bitstreams referring to it.
Maintenance
Edit BitstreamFormat Metadata
Some administrators will undoubtedly have a need to make local customizations to the descriptive and technical metadata for data formats. These attributes of a <tt>BitstreamFormat</tt> may all be customized by overriding the values imported from the remote registry – and the overrides persist even when the BSF is updated from its external registry.
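One way such persistent overrides could work, sketched for a single attribute, is to store the imported value and the local override separately and let the override win on reads. The class and its method names are hypothetical.

```java
/** Hypothetical holder for one BSF attribute with a local override. */
final class CachedValueSketch {
    private String imported; // value last fetched from the external registry
    private String override; // local customization, or null when there is none

    CachedValueSketch(String imported) {
        this.imported = imported;
    }

    /** Records a local customization; pass null to clear it. */
    void setOverride(String value) {
        this.override = value;
    }

    /** Refreshes the imported value and leaves any override untouched. */
    void updateFromRegistry(String fresh) {
        this.imported = fresh;
    }

    /** The effective value: the override when present, else the imported one. */
    String get() {
        return override != null ? override : imported;
    }
}
```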
...
A timestamp of last update is maintained for all BSFs. When performing a group update, use the timestamp farthest in the past as the limit when searching for changed formats in the remote registry. After a group update, set the time of last update of all relevant BSFs to the time of this operation.
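The timestamp bookkeeping for a group update can be sketched as below, assuming the last-update times are held in a map keyed by BSF name. The class and method names are hypothetical, and the actual query against the remote registry is left out.

```java
import java.time.Instant;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;

/** Hypothetical helper for the timestamps of a group update. */
final class GroupUpdateSketch {
    private GroupUpdateSketch() {
    }

    /**
     * The oldest last-update time in the group: changes made in the
     * remote registry since this moment are the ones to search for.
     * Throws NoSuchElementException when the collection is empty.
     */
    static Instant searchLimit(Collection<Instant> lastUpdated) {
        return Collections.min(lastUpdated);
    }

    /** After the update, stamps every BSF in the group with the operation time. */
    static void markUpdated(Map<String, Instant> lastUpdatedByName, Instant now) {
        lastUpdatedByName.replaceAll((name, previous) -> now);
    }
}
```

Using the oldest timestamp may re-fetch changes that some BSFs in the group already have, but it ensures none are missed.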
Edit Bitstream Technical Metadata
Here are the cases where a Bitstream's format technical metadata must be modified:
...
Manually (interactively) force the choice of a new data format, chosen from either:
- The existing set of <tt>BitstreamFormat</tt> entries.
- An explicitly specified namespace and identifier referencing an external format, which is imported if necessary.
- This may be a simple text-entry box since it doesn't have to be user-friendly.
...