
Introduction

This page presents recommendations towards “Updating the Qualified Dublin Core registry in DSpace to the latest standards of the DCMI,” a priority identified in the October 2011 community survey on improving metadata support. It also seeks to comply with the proposal to Standardize the Default Namespace.

 

Note: In addition to the child page of mappings linked below, see the grandchild pages "Samples and decision points for mappings" and "Proposed phased schemas".

Glossary of Terms

Main goal of these recommendations

  • The ultimate goal of these recommendations is to implement DCTERMS as the default metadata schema, thus ensuring compliance with current standards endorsed by DCMI.

Possible Phases of Update

Ultimate goal = 'dcterms' schema as metadata registry default schema

For proposed phased schema changes see: "Proposed phased schemas"

Phase One: Add schema "dcterms" to the DSpace 4.0 default registry as an added schema (not the default schema)

  • Schema "dcterms" is added to DSpace 4.0 for testing.
  • Provide documentation.
  • Curation task for metadata migration added to 4.0.
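
A minimal sketch of what adding the "dcterms" schema could look like through the DSpace Java API of the period (class and method names as in the pre-4.0 org.dspace.content API; the example field and class name are illustrative, and loading a registry XML file through the existing registry loader would work equally well):

    import org.dspace.content.MetadataField;
    import org.dspace.content.MetadataSchema;
    import org.dspace.core.Context;

    public class AddDctermsSchema
    {
        public static void main(String[] args) throws Exception
        {
            // Register the "dcterms" schema and one of its fields programmatically.
            Context context = new Context();
            context.turnOffAuthorisationSystem();

            MetadataSchema dcterms = new MetadataSchema("http://purl.org/dc/terms/", "dcterms");
            dcterms.create(context);

            // One example term; the full DCTERMS vocabulary would be loaded the same way.
            MetadataField abstractField = new MetadataField(dcterms, "abstract", null,
                    "A summary of the resource.");
            abstractField.create(context);

            context.complete();
        }
    }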

Phase Two: Change default schema from "dc" to "dcterms". The "dc" schema is updated. "local" and "dspace" schemas are added to default registry

  • Develop and implement flat "dcterms" schema as the default DSpace schema.
  • Develop and implement "local" schema.
  • Develop and implement a DSpace admin/internal metadata schema ("dspace").
  • Ship DSpace with "dc," "dcterms," "dspace," and "local" schemas in metadata registry.
  • Create new Application Profile for DSpace.

  • Address all areas of code affected (search/browse, import/export, crosswalks, hard-coded 'dc' elements/qualifiers, etc.). Resolve issues with features that rely on metadata solutions (e.g., Creative Commons, RequestCopy, embargo). Consider plugins, add-ons, interfaces, etc. as well.
  • Lock down the "dcterms" schema from the UI.
  • Provide tools for existing DSpace repositories to migrate to these schemas (i.e., to edit their metadata registry and data) where desirable (e.g., tools for migrating elements not compliant with DCTERMS to the "local" registry); see the illustrative mapping after this list.
  • Add a DSpace add-on that can be remotely queried to profile the metadata usage of DSpace repositories.
  • Provide documentation.
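
For illustration only, such a migration might move DSpace-internal and non-compliant fields out of 'dc' while mapping compliant ones onto DCTERMS. The field mappings below are hypothetical examples in the mapping notation used later on this page, not an agreed mapping:

    dc.contributor.author     -> dcterms.creator
    dc.description.provenance -> dspace.description.provenance
    dc.contributor.advisor    -> local.contributor.advisor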

 

Outstanding issues for committers and community

  • Is it possible to ultimately implement DCTERMS with full functionality (vocabularies, etc.)? What changes to the data model will be necessary?
  • How will this proposal integrate with other suggested changes to DSpace metadata, including Proposal for Metadata Enhancement? How might it affect integration with Fedora? How might it affect other desired changes to metadata in DSpace, including implementing functional structured metadata such as MODS, METS, and PREMIS?
  • What challenges will this proposal present—or solve—for harvesting?
  • To enable repositories to migrate existing metadata to the DCTERMS schema, we will need to develop robust tools for repositories to deploy. (Note: A curation task has been added to 4.0.)
  • Should DSpace admin/internal metadata (not including DIM) have its own schema ("dspace"), or use 'local' schema?

Recommendation background

The original DCAT Discussion forum topic that led to this proposal can be found at "Updating the Qualified Dublin Core registry in DSpace."

  • Update current default 'dc' schema in DSpace metadata registry to current standards

  • Add DCTERMS as new, parallel schema in the default metadata registry

    • Background:
      • DCMI has not updated its Qualified Dublin Core standard since 2005. The community standard has shifted towards DCMI Metadata Terms which, unlike QDC, is not a flat schema based on the schema.element.qualifier format. DCTERMS includes range and domain values, and a particular term may link to another term that it refines or is refined by (for example, the DCTERMS term "hasPart" refines "relation", and "created" refines "date").

    • Rationale:
      • DCTERMS is the currently maintained DCMI standard.
        • As Sarah Shreeves recently commented:
          "I want to strongly urge the group to look at conforming with DCMI terms (http://dublincore.org/documents/dcmi-terms/) - even if we can't conform to the vocabulary, etc, this is the most up to date and current form of the namespace. If we use the dc qualifiers document we will be perpetuating the same problem, IMO. I think we can, as Tim suggests, have a graceful path forward. I will admit that a real part of my fear of just moving to DC Qualified is that DSpace--in terms of metadata--will continue to be seen as out of touch with where much of the metadata world is headed."

        • Also, from http://dublincore.org/documents/dces/:
          "Since 1998, when these fifteen elements [dc: namespace] entered into a standardization track, notions of best practice in the Semantic Web have evolved to include the assignment of formal domains and ranges in addition to definitions in natural language. Domains and ranges specify what kind of described resources and value resources are associated with a given property. Domains and ranges express the meanings implicit in natural-language definitions in an explicit form that is usable for the automatic processing of logical inferences. When a given property is encountered, an inferencing application may use information about the domains and ranges assigned to a property in order to make inferences about the resources described thereby.Since January 2008, therefore, DCMI includes formal domains and ranges in the definitions of its properties. So as not to affect the conformance of existing implementations of "simple Dublin Core" in RDF, domains and ranges have not been specified for the fifteen properties of the dc: namespace (http://purl.org/dc/elements/1.1/). Rather, fifteen new properties with "names" identical to those of the Dublin Core Metadata Element Set Version 1.1 have been created in the dcterms: namespace (http://purl.org/dc/terms/). These fifteen new properties have been defined as subproperties of the corresponding properties of DCMES Version 1.1 and assigned domains and ranges as specified in the more comprehensive document "DCMI Metadata Terms" [DCTERMS].Implementers may freely choose to use these fifteen properties either in their legacy dc: variant (e.g., http://purl.org/dc/elements/1.1/creator) or in the dcterms: variant (e.g., http://purl.org/dc/terms/creator) depending on application requirements. The RDF schemas of the DCMI namespaces describe the subproperty relation of dcterms:creator to dc:creator for use by Semantic Web-aware applications. Over time, however, implementers are encouraged to use the semantically more precise dcterms: properties, as they more fully follow emerging notions of best practice for machine-processable metadata."
       

  • Lock down schemas, offering migration tools to pull out local customizations and push them into a new local schema. Make it possible, but not easy, to delete or edit elements in the DCTERMS schema. Continue to enable the addition of qualifiers in the 'dc' schema.

  • For staging purposes, we recommend that DSpace ship with four registries in Phase 2, to support the ultimate migration to DCTERMS and to standardize namespaces by pushing local customizations not compliant with DC or DCTERMS into a local schema:
    • 1) 'dcterms' (DCTERMS) - the default metadata schema
    • 2) 'dc' schema
    • 3) 'dspace' schema - for system/admin metadata
    • 4) 'local' schema - ships with some elements migrated out of 'dc' because they are not compliant with QDC, and is enabled for local customizations

Relevant JIRA tickets

(please add any JIRA tickets that could be affected by this proposal!)

Would be RESOLVED


RELATED/Would be AFFECTED


  • Not sure whether this is related, but if DCMI requires some properties to be unique (for example, identifiers), a generator would presumably be needed to ensure that unique identifiers are created.


Areas/processes that will be affected by registry update

What areas and processes will be affected by these shifts? Is there any documentation of what features in DSpace are making use of certain fields? Where will the code be affected? Where are metadata elements hardcoded?

(pulled from September 4, 2012, DCAT discussion)

  • Any processes that create new metadata in DSpace:
    • submission forms
    • spreadsheet importer
    • command line import
    • SWORD
    • built-in OAI Harvester
  • Any process that displays metadata in the web user interface:
    • item pages
    • search, browse, DSpace discovery
  • Any process that delivers the metadata (potentially via crosswalks) to other applications:
    • OAI server
    • REST API

 

 


16 Comments

  1. As a repository manager (I'm technically proficient but not a programmer), one of the issues we have with the existing DSpace metadata schema is that DC isn't sufficient for describing publication items (i.e., research outputs). We needed to provide a solution for a complex project that involved melding the existing repository content, which is mainly theses, into the institutional collection for research outputs. We implemented a metadata crosswalk between our DSpace repository and the research outputs management system to transfer data from one system to the other. Because Dublin Core doesn't support metadata at the article level (e.g., start page, end page), we had to create a local schema for the crosswalk to achieve interoperability with other research management systems, which is not ideal.

    We need something more granular and beyond the idea of ‘core metadata’ for simple and generic resource description.

    Can we please have some feedback/comments on the above from the committers?

    Many thanks. 

    Yanan 

    P.S. current “DCMI Terms” metadata registry definitely needs updating. 

  2. I agree with Yanan. DSpace as it is does not support granular metadata. At the same time, the simple structure of element, qualifier and authority makes it easy to extend the metadata set and adapt it to our own needs.

    The customization of the metadata format has been done by the community in different ways, in most cases aiming at the same goal: the extension of the existing qualified DC, or the creation of a new schema with two levels (element and qualifier). This extension is necessary for defining granular information, like conference name, location, start and end page, etc. This granularity can then be easily used for exporting to other formats like MODS and MARC, while it is also available for import from existing databases or through reference managers.

    The actual metadata format is simple and flexible. However, it is based on old DC definitions, which do not work for harvestable standards beyond DC. Internally the simplicity should be preserved, but there is still a need to apply richer metadata standards. All the extra elements (see examples above), which have so far been defined in many different ways, should be standardized. Tools to rework the granular elements should be available to create different metadata formats (in the first place for harvesting), not only as a translation of qualified DC as it is now. All the existing values should be available for harvesting: not only the elements, but also the authority and language values. In my opinion, the implementation of authority values that contain unique identifiers (ISSN, DOI and surely URI – related to linked open data) could turn out to be the most important development of metadata in DSpace, but it is at the moment not translated to harvestable metadata.

    The main functions of a repository should be the use of a submission module to collect content and the delivery of quality metadata that can be completely harvested with all the meaningful values. Type is an important structuring element for metadata, which should be better supported in the submission interface. There is generic metadata, but besides that, different types (book, book section, journal contribution, interview, …) have specific elements. That should help to define the necessary granularity. There is also metadata available from databases in different formats (e.g., RIS, BibTeX), which are more granular than the qualified DC used in DSpace. This should be resolved too.

    These ideas are based on our experience with the development of OceanDocs and AgriOcean DSpace. Gradually, we became convinced that we needed better handling of metadata than a basic DSpace can offer. We therefore worked on three levels:

    • Adaptation of the submission module, using a type-based submission interface: for every type, only the relevant fields are shown.
    • Creation of extra elements to refine the metadata, e.g., for a journal reference: journal name, volume, issue, start and end page. We simply extended the existing qualified DC, not bothering to create a new schema for our extra elements. For us, it is simply an internal presentation which should be translatable to standards. For consistency, we concatenated some of the fields into existing qualified DC fields.
    • Extension of the crosswalk tools for exposing metadata formats over OAI. First, all the values in the metadata value table can be used. Reformatting tools make it possible to create rich metadata. AgriOcean DSpace supports metadata formats like MODS, VOA3R AP and AGRIS AP. We have also started to use authority values containing URIs for AGROVOC terms, as attributes in MODS and as element values in VOA3R AP. It is on this level that we have to follow standards which go beyond DC and DC translated to other formats.

    For me, this proves that a simple model can work for internal use. I agree that updating is necessary. This update should provide for a more standardized granular approach, where adaptation and extension are still possible. DSpace can only provide good-quality metadata by using a good submission module. Finally, crosswalk tools are needed to translate internal metadata to rich standard formats (in the first place for OAI harvesting and, as a second stage, for exposing Linked Open Data).

  3. Not sure I understand all of the discussion (I'm sure that I don't), but I support the move to DCTERMS. Not sure why an intermediate step for qualified DC is needed; it seems like a retro move.

    Thanks everyone for their work on this,

    Robin Rice

  4. Hi DCAT:

    During the discussion of this agenda at OR2013 at the DSpace 'committers' meeting on Monday, I volunteered to provide some tool assistance to facilitate the program. I have completed a draft of the first tool, but before I offer it as a patch to the codebase, I wanted to make sure it addressed the basic needs (mostly phase 1 stuff, but could be generally useful). Please let me know if there is functionality not described here that would be valuable. Here's a description of the 'MetadataMapper' tool:

    Basics: it is written as a curation task, so it can be deployed to any DSpace version 1.8 or later, i.e., without waiting to upgrade the DSpace instance. It might make sense to bundle it with 4.0, however, so that it comes 'right out of the box'.

    Functions: The user defines a set of desired metadata transformations in a simple map:

        dc.contributor.author -> dc.creator
        dc.embargo.terms -> dspace.embargo.terms
        ....

    This map is placed in a config file read by the curation task, which will then take all the metadata values found (if any) on the left side and move them to the right side. Move means that they are copied from the source to the target, then deleted from the source. As with all curation tasks, these move operations can be done to a single Item or Items (one by one), to all Items in a collection, to all Items in a community, or to the whole repository. You can run the task as many times as you like, either in the Admin UI (Manakin only) or from a command line.

    The tool can add some special handling to these operations, depending on how the metadata has been set up. There are 3 cases:

    (1) Replacement - this means that whatever is on the right side is removed and replaced with what is on the left

    (2) Merge - this means that the left-side values are added to the right side, but any existing right-side values are preserved

    (3) Assignment - this means that the left side is moved to the right only if there is nothing on the right side
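
    A minimal sketch of how these three modes might look in code (a hypothetical helper, not the actual MetadataMapper source; the Item API is the DSpace 1.8-era one):

        import java.sql.SQLException;
        import org.dspace.authorize.AuthorizeException;
        import org.dspace.content.DCValue;
        import org.dspace.content.Item;

        public class MoveModesSketch
        {
            enum Mode { REPLACE, MERGE, ASSIGN }

            // Move all values of the source field (sSchema.sElement.sQualifier) into the
            // target field under the given mode. "Move" = copy to target, delete from source.
            static void move(Item item, String sSchema, String sElement, String sQualifier,
                             String tSchema, String tElement, String tQualifier, Mode mode)
                    throws SQLException, AuthorizeException
            {
                DCValue[] from = item.getMetadata(sSchema, sElement, sQualifier, Item.ANY);
                if (from.length == 0)
                {
                    return; // nothing to move
                }
                DCValue[] to = item.getMetadata(tSchema, tElement, tQualifier, Item.ANY);
                // Assignment keeps an occupied target untouched; the source is then discarded.
                boolean keepTarget = (mode == Mode.ASSIGN && to.length > 0);
                if (!keepTarget)
                {
                    if (mode == Mode.REPLACE)
                    {
                        // Replacement: whatever is on the right side is removed first.
                        item.clearMetadata(tSchema, tElement, tQualifier, Item.ANY);
                    }
                    for (DCValue v : from)
                    {
                        item.addMetadata(tSchema, tElement, tQualifier, v.language, v.value);
                    }
                }
                // In every mode the source field is cleared afterwards.
                item.clearMetadata(sSchema, sElement, sQualifier, Item.ANY);
                item.update();
            }
        }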

    Using these in combination, I think you can do most things you intuitively want to, like combining two fields into one new one, etc. As a safeguard, you can run the task in 'preview' mode, which will display what operations it would perform, but not update the Item. As with any task, you can (if run from the command line) capture all the specific changes to a file for later reference. The info provided looks like this (one line for each item):

        1721.1/123 dc.contributor.author (3) merged with dc.creator (4)

    This means the tool copied 3 values from contributor.author into creator, which previously already had 4 values.

    Let me know if this sounds like it will cover what we need as far as Item metadata (I realize there are a lot of other issues, like input-forms, crosswalks, etc)

    Thanks,

    Richard



  5. Thank you, Richard. This sounds really well thought out to me, between the levels at which the curation task might be applied, the option of previewing, and the capture of changes. 

    Two questions:

    1. Does this process assume that, prior to deployment, repository managers will add and enable any new metadata elements included in the mappings? Or is that somehow built into the curation task (I'm assuming not)?
    2. Would it be possible to include, in addition to the number of values copied, the values themselves? i.e., 1721.1/123 dc.contributor.author (3) [x, y, z] merged with dc.creator (4) [a, b, c, d] ?

    Thanks again,

    Sarah

     

     

  6. Hi Sarah:

    Thanks for the review. Re questions:

    1. That's right - it expects the metadata fields to have been defined. I considered automatically creating them from the mappings, but thought that would make it too easy for typos to accidentally create unintended fields. The task does, though, verify before running that the right-hand fields exist, and it complains if it doesn't find them. (The left-hand fields are OK in the sense that if they contain typos, the task will never find data to move, so they are innocuous.) I'm imagining that as part of the schema migration(s) we will publish new registry XML files, and there is already a loader for them.
    2. It would be possible, although the output could be rather sizable if we are moving abstracts, etc., in large collections. I thought that since the field values were being preserved, we didn't need to record them. However, come to think of it, there is one case where that isn't true: assignment where the right-hand field is already occupied (the left-hand data is then essentially discarded). I'll look into capturing those field values in that case.

       Another reason one might want to log the values is if they are changed in any way. I didn't mention it before, but the task also has a simple transformation capability, meaning that before adding to the right-hand field, one can twiddle with the value a bit. An example would be:

         10.5004/dwt.2010.1079 -> http://dx.doi.org/10.5004/dwt.2010.1079

    That is, turning a DOI into a canonical URL (or vice versa, for that matter).

    In this case, one might want to record the pre-transformation value, I suppose.
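
    A minimal sketch of such a transformation (the regex and resolver prefix here are assumptions, not the task's actual configuration syntax):

        public class DoiTransformSketch
        {
            // Prefix a bare DOI (e.g. "10.5004/dwt.2010.1079") with the resolver URL;
            // any other value passes through unchanged.
            public static String toCanonicalUrl(String value)
            {
                if (value != null && value.matches("10\\.\\d{4,9}/\\S+"))
                {
                    return "http://dx.doi.org/" + value;
                }
                return value;
            }

            public static void main(String[] args)
            {
                // prints: http://dx.doi.org/10.5004/dwt.2010.1079
                System.out.println(toCanonicalUrl("10.5004/dwt.2010.1079"));
            }
        }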


    Richard

  7. Hi Richard:

    Curious what the use case might be for assignment, where left-hand data is discarded if the right-hand is occupied? Are you thinking of this as a way to run a check to ensure that the data has transferred?

    I agree with your setup wherein the registry fields are already defined rather than somehow established or created within the migration tool.

    In addition to a transformation capability, there are use cases for a validation capability in the tool - one that will alert users if they are transferring non-compliant data into a field.

    Thanks again,

    Sarah

    1. You might consider the alternative of having a separate validation task, since you might want to run that by itself in other cases.  If you happen to be mapping, you could separately validate the old MD beforehand, the new MD afterward, or both.  This seems to me like a case in which two simple tools beat one more complex tool.

      1. Mark-- I completely agree. 

    2. Hi Sarah:

      Assignment is meant to be a sort of 'safe replace' or 'choose best value' operation. If the right-hand field has been newly created, then merge and replace do the same thing - just copy values into it. (This will be by far the most common case.) If there are values present, however, one has to decide what their relationship is to the left-hand values. Should we combine them, since we are basically cross-cataloging in two fields? This may make sense sometimes, but typically only if the field is multivalued. Should we discard the right-hand side? If so, use replace. Suppose, though, that we have begun cataloging into the right-side field but not bothered to remove superseded values from the left side (not cleaned up past practice). In this case, neither merge nor replace seems right - thus assignment. It essentially means "if there is a value there, it's the one I want to keep".

      Make sense?

      BTW - I concur with Mark Wood on validation as an independent concern, thus meriting a different tool

  8. The need for additional metadata suited to specific uses of DSpace seems to me to be precisely the reason that DSpace was designed to support multiple namespaces.  Sites which archive images of pottery will have different needs than sites which archive chemical research reports or musical performances.  I think that DSpace could and should ship with additional namespaces which could be loaded by sites that need them.  It won't ever have everything that everyone wants, because people are endlessly creative in identifying new wants.

    There are several distinct needs in this area, I think:

    • DSpace needs some namespace it can rely on for basic operations without any customization.  DC has filled that role and DCTERMS may continue to do so.
    • DSpace has some concepts of its own that have been shoved into "DC".  But DCMI defines what DC means, not DSpace.  These should move to an internal namespace which is not exposed, since they have no meaning to other systems.
    • Each site may have some concepts of its own as well. They should be collected into one or more local namespaces, to help prevent leaking information which is meaningless outside the site.
    • Types of materials may have unique needs.  If one site has these needs, probably others do too.  The first thing to do is to look around and see if there is already a suitable metadata standard, and if so use it.  Otherwise ask around to see if it's feasible to hammer out a standard among sites with similar needs, to publish and share.  Otherwise, it's probably really local concepts and should go in a local namespace.

    One thing that is asked for often is article-level metadata.  Almost as frequently, someone points to PRISM as an answer.  I can't say whether it's a good answer, but it's an example showing that what you want might already have been standardized.  Don't work more than you have to!

    I feel that too much metadata customization for DSpace takes place in the dark rather than being discussed and shared.  One of the things I hope for from this metadata renovation is that that will change.  Oddly enough, DSpace arguably makes it entirely too easy to deal with (some) metadata issues by just tweaking the default namespace and moving on.  We haven't done enough to encourage reliance on the community, not just of DSpace sites but the broader community of networked information resources.

  9. Richard, do you think the tool might be ready for 4.0? We'd be interested in looking at and testing the tool whenever the code is ready (and giving feedback).

  10. Hi Sarah:

    Yes, it should. It is already written as described, but I wanted to make sure it met the basic needs before committing it to the codebase. As for testing, what environment do you have available? If you have a 1.8+ DSpace (it has to be XMLUI if you want to run the task in the admin UI), I can probably send you code to test right now.

    On a related note, I'm pondering another tool/service to assist metadata improvement. I think we are somewhat stymied by a basic lack of visibility into exactly how individual sites have customized (or not) their metadata. Without this, it's hard to devise automation tools that work for large numbers of sites. To address this knowledge gap, I'm thinking of providing a web service add-on to DSpace, which one could query remotely to profile the metadata usage. What does profile mean? It means listing all the schemas that have been defined and, within each schema, listing all the defined metadata fields and how much they are used (i.e., how many metadata values exist in each field across the entire repo, regardless of item). You could 'harvest' these profiles from all participating sites and combine the results (kind of like OAI-PMH harvesting) to get an aggregate picture of metadata usage. What do you think?
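
    A minimal sketch of the kind of query such a profile might run, using the DSpace relational schema of that era (the JDBC wrapper is illustrative; a real add-on would expose this through a web endpoint):

        import java.sql.Connection;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.SQLException;

        public class MetadataProfileSketch
        {
            // Count stored values per defined metadata field across the whole repository.
            // LEFT JOIN so that defined-but-unused fields report zero.
            public static void profile(Connection conn) throws SQLException
            {
                String sql =
                    "SELECT s.short_id, f.element, f.qualifier, " +
                    "       COUNT(v.metadata_value_id) AS uses " +
                    "FROM metadatafieldregistry f " +
                    "JOIN metadataschemaregistry s " +
                    "  ON f.metadata_schema_id = s.metadata_schema_id " +
                    "LEFT JOIN metadatavalue v " +
                    "  ON v.metadata_field_id = f.metadata_field_id " +
                    "GROUP BY s.short_id, f.element, f.qualifier " +
                    "ORDER BY uses DESC";
                PreparedStatement ps = conn.prepareStatement(sql);
                ResultSet rs = ps.executeQuery();
                while (rs.next())
                {
                    String qualifier = rs.getString("qualifier");
                    System.out.printf("%s.%s%s: %d values%n",
                            rs.getString("short_id"), rs.getString("element"),
                            qualifier == null ? "" : "." + qualifier,
                            rs.getLong("uses"));
                }
                rs.close();
                ps.close();
            }
        }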

    Richard

  11. Hi, Richard 

    The proposed profiling metadata usage web service sounds really useful (especially from repository admin perspective). 

    We had to customise our metadata (i.e., add an additional metadata schema) to integrate our DSpace repository with the University's research management system. As our repository grows bigger, it would be very useful to know which metadata fields have been used heavily and which ones have not, to understand the implications of changes to crosswalks, etc.

    Regards, 

    Yanan 

  12. Hi Richard,

    I am cheering from my desk at your suggestion of a web service to remotely query and profile metadata usage in existing DSpace repositories. Our lack of a comprehensive picture of the fields actually in use in DSpace repositories has been a stumbling block, and one that doesn't seem resolvable through intermittent self-reporting such as surveys. I love your idea.

    Sarah