<?xml version="1.0" encoding="utf-8"?>
<html>
Procedure to Upgrade a DSpace 1.5 archive to use the 1.6
BitstreamFormat+Renovation code.
You have to do a little planning before undertaking the conversion.
Review the appropriate documentation (e.g. About Data Formats and
BitstreamFormat Renovation)
and be able to answer the questions below.
Your most important decision is the choice of
which external format registries to configure.
This choice reflects the purpose of your DSpace archive, whether you are
concerned with digital preservation or just storage and dissemination.
Although the new DSpace architecture allows any
number of registries to be configured, it really only makes sense to
list one primary registry, and the Provisional registry
for locally-added (and hopefully temporary) format identifiers not covered
by the primary registry.
First, decide how you want data formats to behave in your archive:
Here are all the planning questions, in detail:
Here are the minimal steps you will follow to convert your archive
to the new BitstreamFormat model. We recommend keeping this list
handy and checking off each step as you complete it:
"cleanup -c" |
Upgrade15To16 -0 |
Upgrade15To16 -p <i>file</i> |
Upgrade15To16 -1 |
Upgrade15To16 -2 |
Upgrade15To16 -r -v > report.out |
Upgrade15To16 -3 |
The entire process may take several hours, depending on your choice of
format registry. We recommend converting a production DSpace system on
separate test server first, working with a copy of your archive. It can
share a read-only copy of the asset store with another system, since no
Bitstreams are actually written or changed.
The following instructions refer to two filesystem paths, so be sure you
know the correct locations for these:
$\{DSPACE_INSTALL\} |
build.xml |
$\{DSPACE_HOME\} |
dspace.dir |
Download Patch to DSpace 1.5 Source\\
$\{DSPACE_INSTALL\} |
ant update |
$\{DSPACE_HOME\} |
config/formats |
cp -rp config/formats $\{DSPACE_HOME\}/config |
Add these entries to your DSpace Configuration for the case you chose:
|
If you configure PRONOM as your primary format registry, you need
the DROID identifier. We also recommend using the Text Heuristic,
CSS, and TextSubtypeHelper identifiers to make up for shortcomings in DROID.
|
Before executing the BitstreamFormat conversion, you must ensure the
Bitstream information in the database and the asset store are in agreement.
Otherwise, if the DB refers to a Bitstream with a missing asset file,
the conversion process will fail and waste all the time spent on it so far.
We intentionally designed the conversion to fail when an asset is missing or
unreadable because this is an unacceptable flaw in the archive, anyway.
If DSpace cannot access some Bitstreams, it cannot fulfil its primary
mission. It is reasonable to force the administrator to fix such problems.
We provide a testing and repair command, a variation on the
"cleanup" |
script, to find all Bitstreams with missing asset
files. It repairs them by marking the Bitstream deleted, and logging
it. The DSpace administrator should check the log for failures and
decide whether to fix or accept each one, since the next cleanup run
will wipe out the "deleted" Bitstreams.
Run the command:
${DSPACE_HOME}/bin/dsrun org.dspace.storage.bitstore.Cleanup -c |
If all goes well, it will display:
Checking for missing/unreadable files in asset store |
The number in the second line will be nonzero if it finds any problems.
See the server logs for ERROR records giving details about each Bitstream, e.g:
(the
bitstream_id |
column value appears as
id=209 |
)
2008-01-09 21:31:00,913 ERROR org.dspace.storage.bitstore.BitstreamStorageManager |
Note that it takes approximately 30 minutes to run on a fairly slow
server with about 160,000 Bitstreams. Runtime is a factor of the number
of Bitstreams, not their overall size.
You can also add the
-n |
option to the
Cleanup |
command to prevent it from marking broken Bitstreams as deleted. This
means, if there are any problems, it will NOT fix them, so you must
take care of them manually (or run it again). It may be helpful to use
the
-n |
option the first time you run this command, so you can
check whether any of the Bitstreams reported belong to Items.
This phase modifies your database to the new 1.6 schema, while preserving
old data to help in the conversion.
IMPORTANT: Ensure there is no interactive access to the archive
while you perform these steps, since any other DSpace process accessing
Bitstreams may receive incorrect data. Also, another process making
concurrent changes to the RDBMS may corrupt your data model.
For the first phase you must
cd |
to the directory containing
your DSpace installation, i.e. from where you run
ant update |
.
Ensure there is a file in the relative path
etc/database_schema_15-16.sql |
.
THERE IS NO WAY TO "UNDO" THIS STEP.
cd ${DSPACE_INSTALL} |
NOTE: You can run the
org.dspace.administer.Upgrade15To16 |
with various other options to try a "dry run" first, get verbose or
debugging output, generate a report, etc. Run it with the
--help |
option
for details.
At this point, all Bitstreams have an undefined format.
The two conversion Phases will
assign format values to them.
The purpose of the Provisional registry is to provide a separate
place for file format
entries added by the local DSpace administrator. It is named "Provisional"
to remind you that its entries are supposed to be a temporary measure,
until the format can be described in a shared external registry such as PRONOM
or the GDFR, or even the DSpace built-in registry.
The "Provisional" registry also gives you a separate, distinct, home for all
of your local changes and extensions to the format technical metadata.
Even for an archive configured with the old, simple "DSpace" format registry,
it is helpful to have your local additions in a separate place so
when the "DSpace" registry is modified in updates, the
changes won't affect or collide with your Provisional registry.
In the conversion process, you will generate a configuration file for the
Provisional registry based on the formats used in your archive. The
upgrade process will examine the old format registry and create Provisional
entries for any formats it finds there which were not part of the original
DSpace complement. You must still check and edit this list, to
ensure the data is correct and no undesireable formats are included.
For example, if you are using the PRONOM registry, then you should exclude
any Provisional versions of formats already known to PRONOM.
The content of the Provisional registry is dictated entirely by its
configuration file. This has the same XML-based format as the DSpace
registry's configuration file so you may consult that as an example,
see
config/formats/dspace-formats.xml |
under the runtime directory.
To populate the Provisional registry:
$\{DSPACE_HOME\}/bin/dsrun org.dspace.administer.Upgrade15To16 -p new.xml |
new.xml |
<entry> |
cp new.xml $\{DSPACE_HOME\}/config/formats/provisional-formats.xml |
The next upgrade phase will automatically check the validity of your
configuration file. It will stop if there is a problem.
This phase converts Bitstreams whose format has a direct analogue in
either the DSpace (if configured) or Provisional registry. If those
registries are not configured (or the Provisional registry is empty), it
does nothing.
You can run Phase 1 more than once; in fact, you'll have to
if you change the Provisional registry after the first run.
It does no harm to run Phase 1 again, since it does nothing
more if the registries have not changed.
Run this command:
${DSPACE_HOME}/bin/dsrun org.dspace.administer.Upgrade15To16 -1 |
Typical Phase 1 results (this example took about 4 minutes):
Phase One Summary: Bitstream Formats Converted: |
At this point you may make changes to the contents of the Provisional
registry, if, for example, fewer Bitstreams were converted than you
expected.
If you have no changes to make to the Provisional format registry you
may proceed to Phase 2.
This phase
assigns new BitstreamFormats to all Bitstreams not already converted,
by automatically identifying their formats.
It will not touch
Bitstreams which have already been identified by conversion.
After running Phase 2 you will not be able to run Phase 1 again.
You may wish to test this Phase first by adding the dry run and verbose
options
-n -v |
to the command and watching the output for a while:
${DSPACE_HOME}/bin/dsrun org.dspace.administer.Upgrade15To16 -2 -n -v |
Be sure you'll get adequate results when you run Phase 2 for real,
because it assigns some format to each as-yet unidentified Bitstream,
which means it will ignore them on subsequent runs.
Don't worry too much about getting everything perfect the first time, though.
You can always re-identify the formats of specific Bitstreams and
groups of them with the
BitstreamFormat Workbench
utility, even after the conversion is finished.
To invoke Phase 2:
${DSPACE_HOME}/bin/dsrun org.dspace.administer.Upgrade15To16 -2 |
Typical phase 2 results: it takes about 60 minutes to process
process 64,000 Bitstreams on a fast server,
(almost 5 hours for 155,00 Bitstreams on a slow server),
and the summary looks like:
Phase Two Summary: Bitstream Formats Guessed: |
You should only run Phase 2 once, since it does not leave any
Bitstreams unidentified, and it will not touch a Bitstream which has
already been identified.
You should produce a detailed report and save it just before finishing the conversion.
You can generate a report of the state of the conversion, and optionally
details about each Bitstream,
at any time before Phase 3 has run.
Invoke the the command:
${DSPACE_HOME}/bin/dsrun org.dspace.administer.Upgrade15To16 -r |
This produces a simple summary report like:
SUMMARY Total upgradeable Bitstreams = 64269 |
To add the details of how each
Bitstream was converted from the old format to its new one, add
the verbose (
-v |
) option, for example:
${DSPACE_HOME}/bin/dsrun org.dspace.administer.Upgrade15To16 -r -v |
This adds a line for each Bitstream in the following format:
|
We strongly recommend archiving a copy of the detailed report.
Run the reporting command above with the output directed into a file.
This gives you a record of the original old-style format of each Bitstream,
and the vital information about the initial conversion. You can also
analyze the report as shown in the next section to check the accuracy
of the conversion.
Here is how to analyze the report to verify the format conversions
make sense.
"report.out" |
awk -F, '/^BITSTREAM/ \{print $4,",",$5,",",$6\}' < report.out | sort -u |
Postscript |
PostScript 2.0 |
Adobe PDF |
ASCII Text |
For example, suppose you notice an anomaly and want to take a closer
look at the affected Bitstreams – perhaps examine a Bitstream itself to see
what format it really seems to be. The anomalous line from the summary report
is:
Adobe PDF , Windows Portable Executable , POSITIVE_SPECIFIC |
Use awk to pick out the relevant lines from the original report by
matching the old BSF name in the fourth column and the new BSF name in
the fifth column. It should be sufficient to match against a regular
expression representing part of each name rather than match the whole thing,
for example:
% awk -F, '$4 ~ /Adobe/ && $5 ~ /Executable/ {print}' < report.out |
To print only the bitstream DB ID number, modify the awk script:
% awk -F, '$4 ~ /Adobe/ && $5 ~ /Executable/ {print $2}' < report.out |
Now you can examine the Bitstream by visiting it through the DSpace web
server, e.g.
If corrections are necessary, use the BitstreamFormat Workbench
(
org.dspace.administer.Formats |
) to re-guess or manually
set the format based on that Bitstream number. You can do this after
Phase Three just as well as now, so it works even if you only discover
the anomaly long after finishing this conversion.
If you discover you've made a terrible mistake and mis-identified a lot
of Bitstreams, you can return to the state just before Phase One and
re-do all the steps (populating the Provisional registry, etc.).
Execute the following SQL statement in your database environment:
UPDATE Bitstream SET bitstream_format_id = NULL; |
Another way to return your system to that same state is to restore your
database from the backup you made before Phase 0 (right?!) and then re-run
Phase 0 and the subsequent steps.
Also see the last section about making repairs after finishing the conversion
procedure.
ONLY begin this phase when you are satisfied with the results of
Phase 2, and all of the Bitstreams in the archive have been converted.
This phase removes the remaining DSpace 1.5 BitstreamFormat data
structures from the RDBMS since they are no longer needed.
This phase checks that all Bitstreams have been converted and will only
run if they have been.
${DSPACE_HOME}/bin/dsrun org.dspace.administer.Upgrade15To16 -3 |
Typical output: (after a couple seconds of runtime)
Phase 3 finished, done with BitstreamFormat conversion. |
If a problem or mistake in the BitstreamFormat conversion only becomes
apparent after you finish the conversion process (Phase 3) and thus
cannot go back to it, you can still put it right. The
BitstreamFormat Workbench tool,
org.dspace.administer.Formats |
is a multi-purpose application
for examining and changing the formats of Bitstreams. See its documentation
for complete instructions on how to use it.
PRONOM/DROID is currently missing (or has a bug preventing it from
matching) some filename extensions, also known as
external signatures.
For example, though the PRONOM entry for an HTML format
includes both the common
three-letter (MS-DOS)
.htm |
extension and the conventional
.html |
extension,
DROID fails to recognize "
.htm |
" files.
Here is a temporary workaround procedure you can use until DROID is fixed,
and for possible future cases where PRONOM/DROID lacks signatures:
TextSubtypeHelper |
Here are detailed instructions for an example that adds the
.htm |
filename extension to
PRONOM:fmt/96 |
Create an entry for the Provisional registry like the one shown below:
<dsr:entry> |
PRONOM:fmt/96 |
Provisional:HTM |
If the PRONOM format was a subtype of Text, for the purpose of
TextSubtypeHelper |
, then you must add the Provisional
external identifier to the subtype list too.
Locate the configuration entry
formatIdentifier.TextSubtypeHelper.subtypes |
, and add
the external identifier
Provisional:HTM |
to it.
Find a Bitstream with the
.htm |
filename extension that
was not identified correctly before, and retry identifying it, e.g.
with the BitstreamFormat Workbench utility.
</html>