...

DSpace uses a relational database to store all information about the organization of content, metadata about the content, information about e-people and authorization, and the state of currently-running workflows. The DSpace system also uses the relational database in order to maintain indices that users can browse. 

DSpace 6 database schema (PostgreSQL). Right-click the image and choose "Save as" to save it in full resolution. Instructions on updating this schema diagram are in How to update database schema diagram.

Most of the functionality that DSpace uses can be offered by any standard SQL database that supports transactions. However, at this time, the DSpace APIs use some features specific to PostgreSQL and Oracle, so some modification to the code would be needed before DSpace would function fully with an alternative database back-end.


  • DSpace uses FlywayDB to perform automated database initialization and upgrades.  Flyway's role is to initialize the database tables (and default content) prior to Hibernate initialization.
    • The org.dspace.storage.rdbms.DatabaseUtils class manages all Flyway API calls, and executes the SQL migrations under the org.dspace.storage.rdbms.sqlmigration package and the Java migrations under the org.dspace.storage.rdbms.migration package.
    • Once all database migrations have run, a series of Flyway Callbacks are triggered to initialize the (empty) database with required default content.  For example, callbacks exist for adding default DSpace Groups (GroupServiceInitializer), default Metadata & Format Registries (DatabaseRegistryUpdater), and the default Site object (SiteServiceInitializer). All Callbacks are under the org.dspace.storage.rdbms package.
    • While Flyway is automatically initialized and executed during startup, various Database Utilities are also available on the command line.  These utilities allow you to manually trigger database upgrades or check the status of your database.
  • DSpace uses Hibernate ORM as the object relational mapping layer between the DSpace database and the DSpace code.
    • The main Hibernate configuration can be found at [dspace]/config/hibernate.cfg.xml
    • Hibernate initialization is triggered via Spring (beans) defined [dspace]/config/spring/api/core-hibernate.xml.  This Spring configuration pulls in some settings from DSpace Configuration, namely all Database (db.*) settings defined there.
    • All DSpace Object Classes provide a DAO (Data Access Object) implementation class that extends a GenericDAO interface defined in org.dspace.core.GenericDAO class.  The default (abstract) implementation is in org.dspace.core.AbstractHibernateDAO class.
    • The DSpace Context object (org.dspace.core.Context) provides access to the configured org.dspace.core.DBConnection (Database Connection), which is HibernateDBConnection by default.  The org.dspace.core.HibernateDBConnection class provides access to the Hibernate Session interface (org.hibernate.Session) and its Transactions.
      • Each Hibernate Session opens a single database connection when it is created, and holds onto it until the Session is closed.  A Session may consist of one or more Transactions. Sessions are NOT thread-safe (so individual objects cannot be shared between threads).
      • Hibernate will intelligently cache objects in the current Hibernate Session (on object access), allowing for optimized performance.  
      • DSpace provides methods on the Context object to specifically remove (Context.uncacheEntity()) or reload (Context.reloadEntity()) objects within Hibernate's Session cache.
      • DSpace also provides special Context object "modes" to optimize Hibernate performance for read-only access (Mode.READ_ONLY) or batch processing (Mode.BATCH_EDIT).  These modes can be specified when constructing a new Context object.
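For example, a long-running read-only task might open its Context in READ_ONLY mode and evict objects from the Hibernate session cache as it iterates. The sketch below is illustrative only: it assumes the DSpace kernel/service manager is already running, and uses the standard ContentServiceFactory / ItemService lookups, which are not described on this page.

Code Block
// Illustrative sketch only: assumes a running DSpace kernel and the standard service factories
import java.util.Iterator;

import org.dspace.content.Item;
import org.dspace.content.factory.ContentServiceFactory;
import org.dspace.content.service.ItemService;
import org.dspace.core.Context;

public class ReadOnlyIterationExample {

    public void touchAllItems() throws Exception {
        // Open the Context in read-only mode so Hibernate can optimize for read access
        Context context = new Context(Context.Mode.READ_ONLY);
        try {
            ItemService itemService = ContentServiceFactory.getInstance().getItemService();
            Iterator<Item> items = itemService.findAll(context);
            while (items.hasNext()) {
                Item item = items.next();
                // ... read whatever metadata is needed from the item ...
                // Evict the item from the Hibernate session cache to keep memory usage flat
                context.uncacheEntity(item);
            }
        } finally {
            // Commit and close the Context (and its underlying Hibernate session/connection)
            context.complete();
        }
    }
}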

Most of the functionality that DSpace uses can be offered by any standard SQL database that supports transactions. However, at this time, DSpace only provides Flyway migration scripts for PostgreSQL and Oracle (and has only been tested with those database backends). Additional database backends should be possible, but would minimally require creating custom Flyway migration scripts for that database backend. While Flyway is automatically initialized and executed during startup, various database utilities are also available on the command line.
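For example, the following commands report on and apply pending migrations (the exact set of subcommands may vary slightly between DSpace versions):

Code Block
# Show the current database connection settings and Flyway migration status
[dspace]/bin/dspace database info

# Manually trigger any pending Flyway migrations
[dspace]/bin/dspace database migrate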

Maintenance and Backup

When using PostgreSQL, it's a good idea to perform regular 'vacuuming' of the database to optimize performance. By default, PostgreSQL performs automatic vacuuming on your behalf.  However, if you have this feature disabled, then we recommend scheduling the vacuumdb command to run on a regular basis.
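For example, a crontab entry along the following lines would vacuum the database weekly (the schedule, user and database name shown are placeholders for your own values):

Code Block
# Vacuum and analyze the 'dspace' database every Sunday at 03:00
0 3 * * 0   vacuumdb --analyze -U dspace dspace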

...

  • After restoring a backup, you will need to reset the primary key generation sequences so that they do not produce already-used primary keys. Do this by executing the SQL in [dspace]/etc/postgres/update-sequences.sql, for example with:

    Code Block
    psql -U dspace -f [dspace]/etc/postgres/update-sequences.sql
    


Configuring the Database Component

The database component is configured with the following properties in dspace.cfg:

db.url

The JDBC URL to use for accessing the database. This should not point to a connection pool, since DSpace already implements a connection pool.

db.driver

JDBC driver class name. Since DSpace presently uses PostgreSQL-specific features, this should be org.postgresql.Driver.

db.username

Username to use when accessing the database.

db.password

Corresponding password to use when accessing the database.
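Taken together, a typical PostgreSQL configuration in dspace.cfg looks something like this (host, database name and credentials are illustrative):

Code Block
db.url = jdbc:postgresql://localhost:5432/dspace
db.driver = org.postgresql.Driver
db.username = dspace
db.password = dspace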

Custom RDBMS tables, columns or views

When at all possible, we recommend creating custom database tables or views within a separate schema from the DSpace database tables. Since the DSpace database is initialized and upgraded automatically using Flyway DB, the upgrade process may stumble or throw errors if you've directly modified the DSpace database schema, views or tables.  Flyway itself assumes it has full control over the DSpace database schema, and it is not "smart" enough to know what to do when it encounters a locally customized database.
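For example, on PostgreSQL you could keep any locally added tables or views in a schema of their own, along these lines (the schema and table names are purely illustrative):

Code Block
-- Keep local additions out of the Flyway-managed DSpace schema
CREATE SCHEMA local_extras;

CREATE TABLE local_extras.download_log (
    bitstream_id  UUID      NOT NULL,
    downloaded_at TIMESTAMP NOT NULL DEFAULT now()
);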

...

Bitstreams also have a 38-digit internal ID, different from the primary key ID of the bitstream table row. This is not visible or used outside of the bitstream storage manager. It is used to determine the exact location (relative to the relevant store directory) that the bitstream is stored in traditional or SRB storage. The first three pairs of digits are the directory path that the bitstream is stored under. The bitstream is stored in a file with the internal ID as the filename.
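For example, a bitstream with the (hypothetical) internal ID below would be stored, relative to the relevant store directory, at:

Code Block
# Hypothetical 38-digit internal ID
12345678901234567890123456789012345678

# Resulting storage path (the first three pairs of digits form the directory path)
12/34/56/12345678901234567890123456789012345678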

...

  • Using a randomly-generated 38-digit number means that the 'number space' is less cluttered than simply using the primary keys, which are allocated sequentially and are thus close together. This means that the bitstreams in the store are distributed around the directory structure, improving access efficiency.
  • The internal ID is used as the filename partly to avoid requiring an extra lookup of the filename of the bitstream, and partly because bitstreams may be received from a variety of operating systems. The original name of a bitstream may be an illegal UNIX filename.
  • When storing a bitstream, the BitstreamStorageService DOES set the following fields in the corresponding database table row:
    • bitstream_id
    • size
    • checksum
    • checksum_algorithm
    • internal_id
    • deleted
    • store_number

...

  1. A database connection is created, separately from the currently active connection in the current DSpace context.
  2. A unique internal identifier (separate from the database primary key) is generated.
  3. The bitstream DB table row is created using this new connection, with the deleted column set to true.
  4. The new connection is committed, so the 'deleted' bitstream row is written to the database.
  5. The bitstream itself is stored in a file in the configured 'asset store directory', with a directory path and filename derived from the internal ID.
  6. The deleted flag in the bitstream row is set to false. This will occur (or not) as part of the current DSpace Context.

This means that should anything go wrong before, during or after the bitstream storage, only one of the following can be true:

  • No bitstream table row was created, and no file was stored
  • A bitstream table row with deleted=true was created, but no file was stored
  • A bitstream table row with deleted=true was created, and a file was stored

None of these affect the integrity of the data in the database or bitstream store.

...

The above techniques mean that the bitstream storage manager is transaction-safe. Over time, the bitstream database table and file store may contain a number of 'deleted' bitstreams. The cleanup method of BitstreamStorageService goes through these deleted rows, and actually deletes them along with any corresponding files left in the storage. It only removes 'deleted' bitstreams that are more than one hour old, just in case cleanup is happening in the middle of a storage operation.
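The cleanup can also be run manually (or scheduled via cron) from the command line:

Code Block
# Physically remove bitstreams that were marked deleted more than an hour ago
[dspace]/bin/dspace cleanup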

...

Configuring the Bitstream Store

BitStores (aka assetstores) are configured in [dspace]/config/spring/api/bitstore.xml.

Configuring Traditional Storage

By default, DSpace uses a traditional filesystem bitstore located at [dspace]/assetstore/.

To configure a traditional filesystem bitstore at a specific directory, configure it like this:

...

This would configure store number 0, named localStore, which is a DSBitStore (filesystem) at the filesystem path ${dspace.dir}/assetstore (i.e. [dspace]/assetstore/).
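A minimal sketch of that single-store configuration, modeled on the multi-store example further below (your installed bitstore.xml may contain additional comments or defaults):

Code Block
<bean name="org.dspace.storage.bitstore.BitstreamStorageService" class="org.dspace.storage.bitstore.BitstreamStorageServiceImpl">
    <property name="incoming" value="0"/>
    <property name="stores">
        <map>
            <entry key="0" value-ref="localStore"/>
        </map>
    </property>
</bean>
<bean name="localStore" class="org.dspace.storage.bitstore.DSBitStoreService" scope="singleton">
    <property name="baseDir" value="${dspace.dir}/assetstore"/>
</bean>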

 

It is also possible to use multiple local filesystems. In the example below, key #0 is localStore at ${dspace.dir}/assetstore, and key #1 is localStore2 at /data/assetstore2. Note that incoming is set to store "1", which in this case refers to localStore2. That means that any new files (bitstreams) uploaded to DSpace will be stored in localStore2, but some existing bitstreams may still exist in localStore.

Code Block
<bean name="org.dspace.storage.bitstore.BitstreamStorageService" class="org.dspace.storage.bitstore.BitstreamStorageServiceImpl">
    <property name="incoming" value="1"/>
    <property name="stores">
        <map>
            <entry key="0" value-ref="localStore"/>
            <entry key="1" value-ref="localStore2"/>
        </map>
    </property>
</bean>
<bean name="localStore" class="org.dspace.storage.bitstore.DSBitStoreService" scope="singleton">
    <property name="baseDir" value="${dspace.dir}/assetstore"/>
</bean>
<bean name="localStore2" class="org.dspace.storage.bitstore.DSBitStoreService" scope="singleton">
    <property name="baseDir" value="/data/assetstore2"/>
</bean>

...

Configuring Amazon S3 Storage

To use Amazon S3 as a bitstore, add a bitstore entry s3Store, using S3BitStoreService, and configure it with awsAccessKey, awsSecretKey, and bucketName. NOTE: Before you can specify these settings, you will have to create an account in the Amazon AWS console, and create an IAM user with credentials and privileges to an existing S3 bucket.

Code Block
<bean name="org.dspace.storage.bitstore.BitstreamStorageService" class="org.dspace.storage.bitstore.BitstreamStorageServiceImpl">
    <property name="incoming" value="1"/>
    <property name="stores">
        <map>
            <entry key="0" value-ref="localStore"/>
            <entry key="1" value-ref="s3Store"/>
        </map>
    </property>
</bean>
<bean name="localStore" class="org.dspace.storage.bitstore.DSBitStoreService" scope="singleton">
    <property name="baseDir" value="${dspace.dir}/assetstore"/>
</bean>
<bean name="s3Store" class="org.dspace.storage.bitstore.S3BitStoreService" scope="singleton">
    <!-- AWS Security credentials, with policies for specified bucket -->
    <property name="awsAccessKey" value=""/>
    <property name="awsSecretKey" value=""/>
    <!-- S3 bucket name to store assets in. example: longsight-dspace-auk -->
    <property name="bucketName" value=""/>
    <!-- AWS S3 Region to use: {us-east-1, us-west-1, eu-west-1, eu-central-1, ap-southeast-1, ... } -->
    <!-- Optional, sdk default is us-east-1 -->
    <property name="awsRegionName" value=""/>
    <!-- Subfolder to organize assets within the bucket, in case this bucket is shared  -->
    <!-- Optional, default is root level of bucket -->
    <property name="subfolder" value=""/>
</bean>

...

The incoming property specifies which assetstore receives incoming assets (i.e. when new files are uploaded, they will be stored in the "incoming" assetstore). This defaults to store 0. Note: SRB (Storage Resource Broker) bitstore support was removed in DSpace 6.

Configuring S3 Storage

S3BitStore has parameters for awsAccessKey, awsSecretKey, bucketName, awsRegionName (optional), and subfolder (optional).

  • awsAccessKey and awsSecretKey are created from the Amazon AWS console. You'll want to create an IAM user and generate a Security Credential, which provides you the accessKey and secret. Since you need permission to use S3, you could give this IAM user a quick & dirty policy of AmazonS3FullAccess (for all S3 buckets that you own), or for finer-grained control, you can assign an IAM user certain permissions to certain resources, such as read/write to a specific subfolder within a specific S3 bucket.
  • bucketName is a globally unique name that identifies your S3 bucket. It has to be unique among all other S3 buckets in the world.
  • awsRegionName is the AWS region in which the S3 content will be stored. The default is US East (us-east-1). Consider distance to your primary users, and pricing, when choosing the region.
  • subfolder is a folder (prefix) within the S3 bucket under which to organize the assets. This is useful if you want to re-use a bucket for multiple purposes (e.g. bucketname/assets vs bucketname/backups) or for multiple DSpace instances (e.g. bucketname/XYZDSpace, bucketname/ABCDSpace, bucketname/ABCDSpaceProduction).

Migrate BitStores

There is a command line migration tool to move all the assets within a bitstore to another bitstore: bin/dspace bitstore-migrate

Code Block
/[dspace]/bin/dspace bitstore-migrate
usage: BitstoreMigrate
 -a,--source <arg>        Source assetstore store_number (to lose content). This is a number such as 0 or 1
 -b,--destination <arg>   Destination assetstore store_number (to gain content). This is a number such as 0 or 1.
 -d,--delete              Delete file from losing assetstore. (Default: Keep bitstream in old assetstore)
 -h,--help                Help
 -p,--print               Print out current assetstore information
 -s,--size <arg>          Batch commit size. (Default: 1, commit after each file transfer)

/[dspace]/bin/dspace bitstore-migrate -p
store[0] == DSBitStore, which has 2 bitstreams.
store[1] == S3BitStore, which has 2 bitstreams.
Incoming assetstore is store[1]

/[dspace]/bin/dspace bitstore-migrate -a 0 -b 1


/[dspace]/bin/dspace bitstore-migrate -p
store[0] == DSBitStore, which has 0 bitstreams.
store[1] == S3BitStore, which has 4 bitstreams.
Incoming assetstore is store[1]

...