Unreleased Documentation

This documentation is unreleased and still in development. It may describe features which are not yet released in DSpace.

Overview

This feature adds basic duplicate detection to DSpace by comparing normalised item titles in Solr, using a configurable Levenshtein edit distance to allow fuzzy matching of potential duplicates.

Duplicates can be searched in two ways: via a submission step, which warns submitters or editors of potential duplicates while they are editing metadata for an in-progress item, and via a new /duplicates REST item link, which searches for and retrieves a paged list of potential duplicates for any item.
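As a rough illustration of the REST item link, the sketch below builds a paged request URL for an item's potential duplicates. The exact path (core/items/{uuid}/duplicates), the page and size query parameters, the server address and the UUID are all assumptions made for this example and should be checked against the REST contract of your DSpace version.

```python
from urllib.parse import urlencode


def duplicates_url(rest_base: str, item_uuid: str, page: int = 0, size: int = 20) -> str:
    """Build a URL for the (assumed) /duplicates link of a single item.

    rest_base is the root of the REST API; the path layout and the
    page/size parameters are assumptions, not taken from the REST contract.
    """
    query = urlencode({"page": page, "size": size})
    return f"{rest_base.rstrip('/')}/core/items/{item_uuid}/duplicates?{query}"


# Hypothetical server and item identifier, for illustration only.
print(duplicates_url("https://repo.example.org/server/api", "1234", page=1, size=5))
```

The URL could then be requested with any HTTP client, supplying whatever authentication the repository requires.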

Workflow reviewers / editors will also get a warning for claimed and pooled tasks indicating the total number of potential duplicates.

The feature must be enabled in configuration (see below). It is disabled by default.

When enabling this feature for the first time, a full discovery reindex must be performed with ${dspace.dir}/bin/dspace index-discovery -b.

Examples of default duplicate display in the DSpace frontend:

Preview of a potential duplicate in item submission

Warning of 1 potential duplicate for a pooled task.

Configuration

Configuring Basic Duplicate Detection

To enable Basic Duplicate Detection and configure its parameters, edit ${dspace.dir}/config/modules/duplicate-detection.cfg and uncomment or adjust the default properties accordingly.

The available properties are described below, each with an example value and its default.

Property:

duplicate.enabled

Example Value:

duplicate.enabled = true

Informational Note:

This setting enables or disables the entire duplicate detection feature. When changing the value, you MUST reindex the site (./dspace index-discovery -b).

If the value is not true, any request to the duplicate detection REST endpoints or section data will return an empty list (the search will not be performed), and item signatures will not be indexed.

Default: false

Property:

duplicate.signature.normalise.lowercase

Example Value:

duplicate.signature.normalise.lowercase = false

Informational Note:

Specifies whether the metadata used in the fuzzy match for duplicates should be lowercased at index and query time.

This is recommended to help keep the edit distance used in fuzzy search predictable and in line with typical user expectations.

Default: true

Property:

duplicate.signature.normalise.whitespace

Example Value:

duplicate.signature.normalise.whitespace = false

Informational Note:

Specifies whether the metadata used in the fuzzy match for duplicates should have all whitespace stripped at index and query time.

This is recommended to help keep the edit distance used in fuzzy search predictable and in line with typical user expectations.

Default: true
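The two normalisation options above can be pictured with a minimal sketch. It is an illustration of the idea only, not the DSpace implementation; the function name and its parameters are hypothetical.

```python
def normalise_signature(title: str, lowercase: bool = True, strip_whitespace: bool = True) -> str:
    """Return a comparison signature for a title (illustrative only)."""
    signature = title
    if lowercase:
        signature = signature.lower()
    if strip_whitespace:
        # str.split() with no arguments splits on any run of whitespace,
        # so joining the pieces removes spaces, tabs and newlines alike.
        signature = "".join(signature.split())
    return signature


print(normalise_signature("The  Great Gatsby"))                   # thegreatgatsby
print(normalise_signature("The  Great Gatsby", lowercase=False))  # TheGreatGatsby
```

With both options enabled, differences in case or spacing no longer count as edits, so the configured edit distance is spent only on genuine spelling differences.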

Property:

duplicate.signature.distance

Example Value:

duplicate.signature.distance = 2

Informational Note:

Specifies the maximum edit distance allowed between two item "signatures" (normalised titles). This value is appended to the Solr term query with the ~ operator. Solr fuzzy queries accept a maximum edit distance of 2.

For more information see https://en.wikipedia.org/wiki/Levenshtein_distance

A distance of 0 is an exact match (ignoring any case or whitespace differences, as per the normalisation rules above).

Default: 0
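To give a feel for what a given distance allows, here is a minimal sketch of the classic Levenshtein distance applied to two already-normalised signatures. The matching itself is done by Solr, whose fuzzy search may count some edits slightly differently (for example, swapped adjacent characters), so treat this as an approximation; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def is_potential_duplicate(sig_a: str, sig_b: str, max_distance: int = 0) -> bool:
    """True if the two signatures are within max_distance edits of each other."""
    return levenshtein(sig_a, sig_b) <= max_distance


print(levenshtein("test", "txst"))  # 1
print(levenshtein("test", "txxt"))  # 2
print(is_potential_duplicate("thegreatgatsby", "thegreetgatsby", max_distance=1))  # True
```

With the default distance of 0, only the first signature of each pair would match itself; raising the distance to 1 or 2 admits progressively looser matches and therefore more false positives.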

Property:

duplicate.signature.field

Example Value:

duplicate.signature.field = item_signature

Informational Note:

Specifies the Solr field name to use when indexing the normalised value for fuzzy duplicate matching. This field name should end in signature to ensure that the expected Solr schema field type and rules are used.

It is not recommended to change this field name.

Default: item_signature
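The way the field name, the normalised signature and the distance come together can be sketched as a Solr fuzzy term of the form field:value~distance. This is an illustrative reconstruction, not the query DSpace actually builds; the function names are hypothetical, and the escaping covers the characters that Solr's query syntax treats as special.

```python
import re

# Characters with special meaning in the Solr/Lucene query syntax.
_SOLR_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_solr_term(value: str) -> str:
    """Backslash-escape query syntax characters in a term."""
    return _SOLR_SPECIAL.sub(r"\\\1", value)


def build_fuzzy_term(field: str, signature: str, distance: int) -> str:
    """Build a fuzzy term query such as item_signature:thegreatgatsby~2."""
    if distance not in (0, 1, 2):
        # Solr fuzzy queries accept edit distances up to 2.
        raise ValueError("distance must be 0, 1 or 2")
    return f"{field}:{escape_solr_term(signature)}~{distance}"


print(build_fuzzy_term("item_signature", "thegreatgatsby", 2))
# item_signature:thegreatgatsby~2
```

Because the signature is a single whitespace-free term, the ~ operator here acts as a fuzzy (edit distance) match rather than a phrase proximity match.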

Property:

duplicate.preview.metadata.field

Example Value:

duplicate.preview.metadata.field = dc.title

duplicate.preview.metadata.field = dc.date.issued

Informational Note:

Specifies the item metadata field(s) to include in the duplicate match object, which will be displayed to users. Customise this list of fields to suit your preferences and metadata privacy requirements.

Default: 

duplicate.preview.metadata.field = dc.title

duplicate.preview.metadata.field = dc.date.issued

duplicate.preview.metadata.field = dc.type

duplicate.preview.metadata.field = dspace.entity.type
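The effect of the preview field list can be sketched as a simple filter: only the configured fields of a matched item are copied into the preview shown to users. The data shapes and the function name below are assumptions chosen for illustration and do not reflect the actual DSpace classes.

```python
from typing import Dict, List

# Mirrors the default list of preview fields documented above.
PREVIEW_FIELDS = ["dc.title", "dc.date.issued", "dc.type", "dspace.entity.type"]


def build_preview(item_metadata: Dict[str, List[str]],
                  preview_fields: List[str]) -> Dict[str, List[str]]:
    """Keep only the metadata fields that are configured for preview."""
    return {
        field: item_metadata[field]
        for field in preview_fields
        if field in item_metadata
    }


# Hypothetical item metadata; the provenance field is not in the preview list.
item = {
    "dc.title": ["The Great Gatsby"],
    "dc.date.issued": ["1925"],
    "dc.description.provenance": ["Internal note, not for display"],
}
print(build_preview(item, PREVIEW_FIELDS))
# {'dc.title': ['The Great Gatsby'], 'dc.date.issued': ['1925']}
```

Fields left out of the list never reach the preview, which is why the list should be reviewed against your metadata privacy requirements.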

To display previews of potential duplicates in item submission, you will also need to enable the submission step, as described below.

Configuring Basic Duplicate Detection in Item Submission

To include a submission section that displays a list of potential duplicates to item submitters and editors, add the "Basic Duplicate Detection" step to your submission process.

By default, this step is disabled. To enable it, update your item-submission.xml to include this tag in your <submission-process>:

<submission-process name="traditional">
   <!-- This step enables preview of potential duplicates for the in-progress item -->
   <step id="duplicates"/>

   ...

</submission-process>

After making this update, you will need to restart your backend (REST API) for the changes to take effect.

You will also need to enable the overall Basic Duplicate Detection feature in DSpace configuration as per above.
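Before restarting, it can be useful to confirm which submission processes actually include the step. The following sketch parses a trimmed, hypothetical item-submission.xml fragment and lists the processes that contain a step with id "duplicates"; the sample XML and the function name are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample; a real item-submission.xml contains much more.
SAMPLE_XML = """
<item-submission>
  <submission-definitions>
    <submission-process name="traditional">
      <step id="duplicates"/>
      <step id="collection"/>
    </submission-process>
    <submission-process name="other">
      <step id="collection"/>
    </submission-process>
  </submission-definitions>
</item-submission>
"""


def processes_with_step(xml_text: str, step_id: str = "duplicates") -> list:
    """Return the names of submission processes that include the given step."""
    root = ET.fromstring(xml_text.strip())
    names = []
    for process in root.iter("submission-process"):
        if any(step.get("id") == step_id for step in process.findall("step")):
            names.append(process.get("name"))
    return names


print(processes_with_step(SAMPLE_XML))  # ['traditional']
```

To check a real installation, the same function could be given the contents of your own item-submission.xml instead of the sample string.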

