Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: fix typo

...

DSpace uses a relational database to store all information about the organization of content, metadata about the content, information about e-people and authorization, and the state of currently-running workflows. The DSpace system also uses the relational database in order to maintain indices that users can browse.

Warning
title1.8 Schema

This Database schema is not fully up to date with DSpace 4.0. 2 additional table to store item level versioning information were added to DSpace 3 are currently not represented in this diagram.

DSpace 5 database schema (Postgres). Click on the thumbnail, then right-click the image and choose "Save as" to save in full resolution. Instructions on updating this schema diagram are in How to update database schema diagram.

DSpace 5 database schemaImage AddedImage Removed

Most of the functionality that DSpace uses can be offered by any standard SQL database that supports transactions. Presently, the browse indices However at this time, DSpace APIS use some features specific to PostgreSQL and Oracle, so some modification to the code would be needed before DSpace would function fully with an alternative database back-end.

The org.dspace.storage.rdbms package provides access to an SQL database in a somewhat simpler form than using JDBC directly. The main primary class is DatabaseManager, which executes SQL queries and returns TableRow or TableRowIterator objects. The InitializeDatabase class is used to load SQL into the database via JDBC, for example to set up the schema.

The database schema used by DSpace is initialized and upgraded automatically using Flyway DB. The DatabaseUtils class manages all Flyway API calls, and executes the SQL migrations under the org.dspace.storage.rdbms.sqlmigration package and the Java migrations under All calls to the Database Manager require a DSpace Context object. Example use of the database manager API is given in the org.dspace.storage.rdbms.migration package Javadoc.

The database schema used by DSpace is created by SQL statements stored in a directory specific to each supported RDBMS platform:

  • PostgreSQL schemas are in [dspace-source]/dspace/etc/postgres/
  • Oracle schemas are in [dspace-source]/dspace/etc/oracle/
    The SQL (DDL) statements to create the tables for the current release, starting with an empty database, aer in database_schema.sql. The schema SQL file also creates the two required e-person groups (Anonymous and Administrator) that are required for the system to function properly.

Also in [dspace-source]/dspace/etc/[database] are various SQL files called database_schema_1x_1y. These contain the necessary SQL commands to update a live DSpace database from version 1.x to 1.y. Note that this might not be the only part of an upgrade process: see Updating a DSpace Installation for details.

The DSpace database code uses an SQL function getnextid to assign primary keys to newly created rows. This SQL function must be safe to use if several JVMs are accessing the database at once; for example, the Web UI might be creating new rows in the database at the same time as the batch item importer. The PostgreSQL-specific implementation of the method uses SEQUENCES for each table in order to create new IDs. If an alternative database backend were to be used, the implementation of getnextid could be updated to operate with that specific DBMS.

The etc directory in the source distribution contains two further SQL files. clean-database.sql contains the SQL necessary to completely clean out the database, so use with caution! The Ant target clean_database can be used to execute this. update-sequences.sql contains SQL to reset the primary key generation sequences to appropriate values. You'd need to do this if, for example, you're restoring a backup database dump which creates rows with specific primary keys already defined. In such a case, the sequences would allocate primary keys that were already used.

Versions of the .sql files for Oracle are stored in [dspace-source]/dspace/etc/oracle. These need to be copied over their PostgreSQL counterparts in [dspace-source]/dspace/etc prior to installation.

Maintenance and Backup

When using PostgreSQL, it's a good idea to perform regular 'vacuuming' of the database to optimize performance. This is performed by the vacuumdb command which can be executed via a 'cron' job, for example by putting this in the system crontab:

Code Block
# clean up the database nightly
40 2 * * * /usr/local/pgsql/bin/vacuumdb --analyze dspace > /dev/null 2>&1

The DSpace database can be backed up and restored using usual methods, for example with pg_dump and psql. However when restoring a database, you will need to perform these additional steps:

  • The fresh_install target loads up the initial contents of the Dublin Core type and bitstream format registries, as well as two entries in the epersongrouptable for the system anonymous and administrator groups. Before you restore a raw backup of your database you will need to remove these, since they will already exist in your backup, possibly having been modified. For example, use:

    Code Block
    DELETE FROM dctyperegistry;
    DELETE FROM bitstreamformatregistry;
    DELETE FROM epersongroup;
    
  • After restoring a backup, you will need to reset the primary key generation sequences so that they do not produce already-used primary keys. Do this by executing the SQL in [dspace-source]/dspace/etc/update-sequences.sql, for example with:

    Code Block
    psql -U dspace -f  [dspace-source]/dspace/etc/update-sequences.sql
    

    Future updates of DSpace may involve minor changes to the database schema. Specific instructions on how to update the schema whilst keeping live data will be included. The current schema also contains a few currently unused database columns, to be used for extra functionality in future releases. These unused columns have been added in advance to minimize the effort required to upgrade.

Configuring the RDBMS Component

The database manager is configured with the following properties in dspace.cfg:

...

db.url

...

The JDBC URL to use for accessing the database. This should not point to a connection pool, since DSpace already implements a connection pool.

...

db.driver

...

JDBC driver class name. Since presently, DSpace uses PostgreSQL-specific features, this should be org.postgresql.Driver.

...

db.username

...

Username to use when accessing the database.

.  While Flyway is automatically initialized and executed during the initialization of DatabaseManager, various Database Utilities are also available on the command line.

New in DSpace 5.x: Metadata for all DSpace objects

There were several changes between the DSpace 4 and 5 database schema, related to the new "Metadata for all Dspace objects" improvements. Full detail can be found here:

Metadata for all DSpace objects 

Maintenance and Backup

When using PostgreSQL, it's a good idea to perform regular 'vacuuming' of the database to optimize performance. By default, PostgreSQL performs automatic vacuuming on your behalf.  However, if you have this feature disabled, then we recommend scheduling the vacuumdb command to run on a regular basis.

Code Block
# clean up the database nightly
40 2 * * * /usr/local/pgsql/bin/vacuumdb --analyze dspace > /dev/null 2>&1

 

Backups: The DSpace database can be backed up and restored using usual PostgreSQL Backup and Restore methods, for example with pg_dump and psql. However when restoring a database, you will need to perform these additional steps:

  • After restoring a backup, you will need to reset the primary key generation sequences so that they do not produce already-used primary keys. Do this by executing the SQL in [dspace]/etc/postgres/update-sequences.sql, for example with:

    Code Block
    psql -U dspace -f [dspace]/etc/update-sequences.sql
    


Configuring the RDBMS Component

The database manager is configured with the following properties in dspace.cfg:

db.url

The JDBC URL to use for accessing the database. This should not point to a connection pool, since DSpace already implements a connection pool.

db.driver

JDBC driver class name. Since presently, DSpace uses PostgreSQL-specific features, this should be org.postgresql.Driver.

db.username

Username to use when accessing the database.

db.password

Corresponding password ot use when accessing the database.

Custom RDBMS tables, colums or views

When at all possible, we recommend creating custom database tables or views within a separate schema from the DSpace database tables. Since the DSpace database is initialized and upgraded automatically using Flyway DB, the upgrade process may stumble or throw errors if you've directly modified the DSpace database schema, views or tables.  Flyway itself assumes it has full control over the DSpace database schema, and it is not "smart" enough to know what to do when it encounters a locally customized database.

That being said, if you absolutely need to customize your database tables, columns or views, it is possible to create custom Flyway migration scripts, which should make your customizations easier to manage in future upgrades.  (Keep in mind though, that you may still need to maintain/update your custom Flyway migration scripts if they ever conflict directly with future DSpace database changes. The only way to "future proof" your local database changes is to try and make them as independent as possible, and avoid directly modifying the DSpace database schema as much as possible.)

If you wish to add custom Flyway migrations, they may be added to the following locations:

  • Custom Flyway SQL migrations may be added anywhere under the org.dspace.storage.rdbms.sqlmigration package (e.g. [src]/dspace-api/src/main/resources/org/dspace/storage/rdbms/sqlmigration or subdirectories)
  • Custom Flyway Java migrations may be added anywhere under the org.dspace.storage.rdbms.migration package (e.g. [src]/dspace-api/src/main/java/org/dspace/storage/rdbms/migration/ or subdirectories)
  • Additionally, for backwards support, custom SQL migrations may also be placed in the [dspace]/etc/[db-type]/ folder (e.g. [dspace]/etc/postgres/ for a PostgreSQL specific migration script)

Adding Flyway migrations to any of the above location will cause Flyway to auto-discover the migration. It will be run in the order in which it is named. Our DSpace Flyway script naming convention follows Flyway best practices and is as follows:

  •  SQL script names: V[version]_[date]__[description].sql
    • E.g. V5.0_2014.09.26__DS-1582_Metadata_For_All_Objects.sql is a SQL migration script created for DSpace 5.x (V5.0) on Sept 26, 2014 (2014_09_24).  Its purpose was to fulfill the needs of ticket DS-1582, which was to migrate the database in order to support adding metadata on all objects.
    • More examples can be found under the org.dspace.storage.rdbms.sqlmigration package
  • Java migration script naming convention: V[version]_[date]__[description].java
    • E.g. V5_0_2014_09_25__DS_1582_Metadata_For_All_Objects_drop_constraint.java is a Java migration created for DSpace 5.x (V5_0) on Sept 25, 2014 (2014_09_25).  Its purpose was to fulfill the needs of ticket DS-1582, specifically to drop a few constraints.
    • More examples can be found under the org.dspace.storage.rdbms.migration package
  • Flyway will execute migrations in order, based on their Version and Date.  So, V1.x (or V1_x) scripts are executed first, followed by V3.0 (or V3_0), etc.  If two migrations have the same version number, the date is used to determine ordering (earlier dates are run first)

...

db.password

...

  • .

Bitstream Store

DSpace offers two means for storing content. The first is

  1. Storage in the file system on the server

...

  1. Storage using SRB (Storage Resource Broker)

. Both are achieved using a simple, lightweight API.

...

The BitstreamStorageManager provides low-level access to bitstreams stored in the system. In general, it should not be used directly; instead, use the Bitstream object in the content management API since that encapsulated encapsulates authorization and other metadata to do with a bitstream that are not maintained by the BitstreamStorageManager.

The bitstream storage manager provides three methods that store, retrieve and delete bitstreams. Bitstreams are referred to by their 'ID'; that is the primary key bitstream_id column of the corresponding row in the database.

As of DSpace version 1.1, there There can be multiple bitstream stores. Each of these bitstream stores can be traditional storage or SRB storage. This means that the potential storage of a DSpace system is not bound by the maximum size of a single disk or file system and also that traditional and SRB storage can be combined in one DSpace installation. Both traditional and SRB storage are specified by configuration parameters. Also see Configuring the Bitstream Store below.

...

For example, a bitstream with the internal ID 12345678901234567890123456789012345678 is stored in the directory:

Code Block
(assetstore dir)[dspace]/assetstore/12/34/56/12345678901234567890123456789012345678

...

  • Using a randomly-generated 38-digit number means that the 'number space' is less cluttered than simply using the primary keys, which are allocated sequentially and are thus close together. This means that the bitstreams in the store are distributed around the directory structure, improving access efficiency.
  • The internal ID is used as the filename partly to avoid requiring an extra lookup of the filename of the bitstream, and partly because bitstreams may be received from a variety of operating systems. The original name of a bitstream may be an illegal UNIX filename.
    When storing a bitstream, the BitstreamStorageManager DOES set the following fields in the corresponding database table row:
    • bitstream_id
    • size
    • checksum
    • checksum_algorithm
    • internal_id
    • deleted
    • store_number
  • The remaining fields are the responsibility of the Bitstream content management API class.

...

This cleanup can be invoked from the command line via the Cleanup class, which can in turn be easily executed from a shell on the server machine using /dspace/bin/dspace cleanup. You might like to have this run regularly by cron, though since DSpace is read-lots, write-not-so-much it doesn't need to be run very often. cleanup command, which can in turn be easily executed from a shell on the server machine using [dspace]/bin/dspace cleanup. You might like to have this run regularly by cron, though since DSpace is read-lots, write-not-so-much it doesn't need to be run very often.

Code Block
# Clean up any deleted files from local storage on first of the month at 2:40am
40 2 1 * * [dspace]/bin/dspace cleanup > /dev/null 2>&1

Backup

The bitstreams (files) in traditional storage may be backed up very easily by simply 'tarring' or 'zipping' the assetstore directory [dspace]/assetstore/ directory (or whichever directory is configured in dspace.cfg). Restoring is as simple as extracting the backed-up compressed file in the appropriate location.

...