The functionality is largely inspired by the official SOLR de-duplication approach: for each item, one or more signatures are computed using pluggable implementations.
A signature is a value that summarizes the information in the item using a pluggable transformation (case-insensitive comparison, ASCII transcription, identifier normalisation, etc.). Out-of-the-box implementations are included, based on the normalisation of a single metadata field (such as an identifier or the title) or of a combination of metadata fields (such as title + year).
Two items are flagged as potential matches if they share at least one signature.
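The signature idea can be illustrated with a minimal Python sketch. This is not the DSpace-CRIS implementation (which is Java-based and backed by SOLR); the function names, the dictionary-based item layout, and the three signature types below are assumptions chosen only to show how normalised signatures lead to potential matches.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def normalize(text):
    """Transliterate to ASCII, lowercase, and keep only letters and digits."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


# Hypothetical pluggable signature functions: each returns a string or None.
def title_signature(item):
    title = item.get("title")
    return "title:" + normalize(title) if title else None


def title_year_signature(item):
    title, year = item.get("title"), item.get("year")
    if title and year:
        return "title_year:" + normalize(title) + "|" + str(year)
    return None


def doi_signature(item):
    doi = item.get("doi")
    if not doi:
        return None
    doi = doi.strip().lower()
    # Identifier normalisation: drop common resolver prefixes.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return "doi:" + doi


SIGNATURE_FUNCTIONS = [title_signature, title_year_signature, doi_signature]


def signatures(item):
    """Return the set of all signatures that could be computed for an item."""
    return {s for s in (f(item) for f in SIGNATURE_FUNCTIONS) if s}


def potential_matches(items):
    """Return pairs of item ids that share at least one signature."""
    by_signature = defaultdict(set)
    for item_id, item in items.items():
        for sig in signatures(item):
            by_signature[sig].add(item_id)
    pairs = set()
    for ids in by_signature.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


items = {
    1: {"title": "Déjà Vu in Repositories", "year": 2015},
    2: {"title": "deja vu in repositories", "year": 2016, "doi": "10.1000/xyz"},
    3: {"title": "Another paper", "doi": "https://doi.org/10.1000/XYZ"},
    4: {"title": "Unrelated"},
}
print(sorted(potential_matches(items)))  # [(1, 2), (2, 3)]
```

In this sketch, items 1 and 2 match on the normalised title even though their years differ, and items 2 and 3 match on the normalised identifier. Indexing by signature, rather than comparing every pair of items, is what keeps the lookup cheap.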
Feedback on potential matches (reject or duplicate flag) is stored in the database table dedup_reject.
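One plausible use of stored feedback is to keep already reviewed pairs out of the list still awaiting a decision. The sketch below assumes exactly that, and uses an in-memory collection of id pairs as a hypothetical stand-in for the dedup_reject table; the real table layout and the way DSpace-CRIS consumes it are not described here.

```python
def canonical_pair(a, b):
    """Order a pair so that (a, b) and (b, a) compare equal."""
    return (a, b) if a <= b else (b, a)


def pending_matches(candidate_pairs, reviewed_pairs):
    """Return candidate pairs that have no stored feedback yet.

    `reviewed_pairs` is a hypothetical stand-in for rows read from the
    feedback table; here it is simply an iterable of (id, id) tuples.
    """
    reviewed = {canonical_pair(a, b) for a, b in reviewed_pairs}
    candidates = {canonical_pair(a, b) for a, b in candidate_pairs}
    return sorted(candidates - reviewed)


# Pair (2, 1) was already reviewed, so only (2, 3) is still pending.
print(pending_matches([(1, 2), (3, 2)], [(2, 1)]))  # [(2, 3)]
```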
Signatures and matched groups are computed when an item is updated and are stored in a dedicated SOLR core, which makes checking for potential duplicates extremely fast and lightweight. This SOLR core is maintained by DedupEventConsumer; a script, DedupClient, is provided to rebuild the index, or to build it for the first time if you are migrating from a previous version.
The functionality has two points of interaction with users.
Since DSpace-CRIS 5.10
A batch script is provided to merge different instances of a CRIS object into a single one. The script works on any kind of entity (researcher profiles, organisation units, projects, etc.) with the following rules:
usage: ScriptMergeCrisObject
 -d,--delete                   delete merged objects; the default (without the -d option) is to disable them
 -h,--help                     help
 -m,--merge <arg>              CRIS ID(s) to merge into the target (use multiple -m if needed; merge occurs respecting the order from left to right)
 -p,--replace_notempty <arg>   properties to override in the target with the values from the merged objects IF NOT EMPTY
 -r,--replace <arg>            properties to override in the target with the values from the merged objects
 -s,--skip                     properties to ignore during the merge
 -t,--target <arg>             CRIS ID to retain (merge target)
 -x,--exclude                  Don't merge properties, only move link from the merged object to the target

USAGE: ScriptMergeCrisObjects -t <crisID> -m <toMergeCRIS-ID1> -m <toMergeCRIS-ID2> .. -m <toMergeCRIS-IDn> [-r propR1 -r propR2... -r propRN] [-p prop1 -p prop2... -p propN] [-s]