Notes/Discussion on the current AIPBackupRestorePrototype Implementation

Crosswalking in AIPs

METS-based AIPs seem like a nice, neutral format (one that could let you easily move content to other systems), but is METS the "best" format for a DSpace AIP?

Richard R is concerned that the round-trip crosswalking could end up being lossy. In a normal backup & restore, we'd go through two crosswalks:

  1. Export = Crosswalk all DSpace objects into a METS-based representation
  2. Restore = Crosswalk a METS-based representation back into a DSpace Object

If the crosswalks are not kept in sync, the final restored DSpace Object may not be the same as the initial DSpace Object. This becomes even more problematic for institutions which have created their own custom metadata fields, bundles, etc. If the crosswalks don't understand how to deal with that content, it's possible some of it could be lost during the restore process.
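One way to catch this kind of loss early is a round-trip check: export an object, restore it, and compare the result with the original. The sketch below is illustrative only. It models a DSpace object as a flat dictionary of metadata fields and a crosswalk as the set of field names it understands; the function names are hypothetical and do not reflect the prototype's actual API.

```python
def export_crosswalk(obj, known_fields):
    """Hypothetical export step: only fields the crosswalk understands survive."""
    return {name: value for name, value in obj.items() if name in known_fields}


def restore_crosswalk(package, known_fields):
    """Hypothetical restore step: unknown fields are silently dropped again."""
    return {name: value for name, value in package.items() if name in known_fields}


def lost_fields(original, export_fields, restore_fields):
    """Return the names of fields that are missing or changed after a round trip."""
    package = export_crosswalk(original, export_fields)
    restored = restore_crosswalk(package, restore_fields)
    return sorted(
        name
        for name in original
        if name not in restored or restored[name] != original[name]
    )


# Made-up item with one standard field and one institution-specific field.
item = {"dc.title": "Annual Report", "local.reviewer": "jdoe"}

# The restore crosswalk knows the custom field, but the export crosswalk does not,
# so the two are out of sync and the custom field never reaches the package.
print(lost_fields(item, {"dc.title"}, {"dc.title", "local.reviewer"}))
```

A check like this could run against a sample of objects whenever either crosswalk changes.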

Perhaps in the end, we need to determine if there's a better way to serialize this DSpace content. Or, maybe you can "choose your serializer" and decide whether you'd rather serialize your AIPs using METS or a different packaging format (TBD).

Inter-dependencies between AIPs

Current AIPs have too much interdependency. Parent objects (e.g. Collections) enumerate all of their children (e.g. Items). This means that every time a new child object (e.g. Item) is added/removed, it also must be added/removed from all of its parents' AIPs.

Based on the discussions below, we have come up with four options so far (at least for the short term). Feel free to add to these if you think of other options or pros/cons:

  1. Allow Collections/Communities to enumerate their children (this is how the AIPs are currently formed in the prototype)
    • Pros
      • Makes partial-restores (restoring a Collection/Community) a bit easier – just restore the Collection/Community AIP and it then tells you what child AIPs are necessary to restore
    • Cons
      • Adding a new child object also changes the parent AIP. AIPs are not as independent.
  2. No enumeration of children in AIPs + local AIP parser
    • Pros
      • AIPs are independent.
      • Would work fine when restoring an entire site (or just a single item).
    • Cons
      • Local AIP parser is great as long as AIPs are stored locally. If the AIPs are actually stored elsewhere (whether in DuraCloud or in any other backup solution), then restoring a single Community or Collection is more complex. If the parser is local, then nearly all AIPs may need to be copied to local storage to be parsed – so that it could be determined if the AIP belongs to the Community or Collection being restored.
  3. No enumeration of children in AIPs + remote AIP parser (in DuraCloud, etc)
    • Pros
      • Same as #2 – in addition, now the remote parser can decide which AIPs need to be pulled down locally (so that you only need to copy the AIPs to local storage that you really need).
    • Cons
      • May be DuraCloud specific? Other backup solutions (to tape, external drive, offsite storage) may not be able to take advantage of an external parser.
  4. No enumeration of children in AIPs + a site "index" (which details all relationships)
    • Pros
      • Again, relatively simple partial-restore process (like #1) – In this scenario you just pull down the site "index" file to determine which AIPs are needed to fulfill the restore.
      • AIPs remain independent of one another
    • Cons
      • Could be semi-"proprietary" to DSpace? In other words, would other systems understand this file? But, do we care? If the AIP export is used by someone to migrate to another system, e.g. Fedora or similar, then they would likely be loading all AIPs, and have no usage for the "index" file in any case.
      • Although AIPs remain independent, any changes in relationships (e.g. adding a new object, moving an item) require updates to this "index" file as well. This is probably not a big deal, but it is worth mentioning.
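To make option 4 more concrete, here is a minimal sketch of how a site "index" could drive a partial restore. The index format (a list of child/parent pairs) and the identifiers are assumptions made for illustration; no such file exists in the prototype.

```python
from collections import defaultdict


def aips_needed(index, root):
    """Return the root AIP plus every AIP below it, according to the index.

    `index` is a hypothetical list of (child_id, parent_id) pairs.
    """
    children = defaultdict(list)
    for child, parent in index:
        children[parent].append(child)

    needed, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in needed:  # guard against duplicate or cyclic entries
            continue
        needed.add(node)
        stack.extend(children[node])
    return sorted(needed)


# Made-up identifiers, for illustration only.
site_index = [
    ("collection-1", "community-1"),
    ("collection-2", "community-1"),
    ("item-1", "collection-1"),
    ("item-2", "collection-1"),
    ("item-3", "collection-2"),
]
print(aips_needed(site_index, "collection-1"))
```

Storing pairs rather than a single parent per child leaves room for an object that appears under more than one parent. Only the AIPs named in the result would need to be copied to local storage.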

-------

[15 April 2010] We (Richard R, Bill H, Tim D) decided that child objects should enumerate their parents (so you can find an Item's parent Collection from that Item's AIP), but parents should not enumerate all their children. Although this may make restoring content more complex (in order to restore a Collection, you need to look at each Item to determine if it is a child of that Collection), it will lessen inter-dependencies between AIPs.

-------

[16 April 2010 - Tim Donohue] I realized we may need to rethink this decision. If there is no way to determine children of parents easily, then you may encounter the following less-than-ideal scenario when restoring a single Collection along with all its Items:

  • Suppose all your AIPs currently take up 1TB of space. Likely, nearly 90% of that space (900GB) is for Item AIPs, as they tend to be larger and more numerous than Community or Collection AIPs.
  • Suppose you also want to restore a single Collection.
  • Since you know the Collection you need to restore, obviously you can immediately restore the Collection metadata from the Collection AIP.
  • However, if the Collection AIP does not enumerate its Items, you will be stuck having to parse 900GB of Item AIPs to determine which belong to this Collection. This becomes even more inefficient if you are using a service like DuraCloud, as it will force you to download 900GB of Item AIPs in order to unzip them and determine which belong to this Collection.

This scenario makes me think we either need Collection AIPs to continue to list all Item members, or we need another way to relatively easily "lookup" which Items belong to that Collection.
-------

[01 June 2010 - Mark Wood] It's not necessary to parse entire Item AIPs since they are ZIP archives; just read the manifests. If they are stored remotely (e.g. DuraCloud) then you need to be able to run the parser there and send back the lists of interesting items.

On the other hand, we could extract the relationships into an index for each Collection and package that separately. Relationships are not part of the things related – the difficulty lies in trying to shove the relationship inside any one of the related entities.
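The manifest-only approach suggested above can be sketched with Python's standard `zipfile` module, which locates members through the archive's central directory and decompresses only the member requested. The manifest member name `mets.xml` and the `belongs_to` callback are assumptions; how a manifest records its parent is left to the caller.

```python
import zipfile


def read_manifest(aip_path, manifest_name="mets.xml"):
    """Read only the manifest member of a ZIP-packaged AIP.

    The bitstream members are not decompressed. The default member name
    is an assumption about how the AIP is packaged.
    """
    with zipfile.ZipFile(aip_path) as aip:
        return aip.read(manifest_name)


def aips_matching(aip_paths, belongs_to):
    """Return the paths whose manifest satisfies the caller-supplied test.

    `belongs_to` is a hypothetical callback that takes the manifest bytes
    and returns True if the AIP is a child of the object being restored.
    """
    return [path for path in aip_paths if belongs_to(read_manifest(path))]
```

This only avoids the download cost if the code runs next to the stored AIPs, or if the storage service supports reading part of a file.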
-------

[01 June 2010 - Mark Diggory] I recommend considering this from the ORE aggregation style "standpoint". What we vaguely concluded a couple of years ago is that a DSpace Collection is not an ORE Aggregation because it is open-ended. ORE Aggregations are finite, thus a DSpace Item as an ORE Aggregation will enumerate its children while a DSpace Collection will not. I support the idea of not listing all the child Items in a parent Collection AIP, or the Collection AIPs within the parent Community AIP. The AIP prototype's original approach to reconstituting a repository's Community/Collection/Item hierarchy from its contents did require fully traversing the repository to discover the ancestry of any one Community, Collection, Item, Bundle or Bitstream AIP. Being able to traverse the manifests without actually having to read the gzip archives of bitstream content will give us the capability to do this efficiently. Perhaps there should be a means within the asset-store to separate the AIP manifests from the rest of the bitstreams so that they may be traversed quickly.

This very much makes me think of both the Fedora Store and the Semantic Store project and how we will address the subject of Entities for DSpace Communities, Collections, Items and Bitstreams. IMO, DSpace 2.0 Entities and AIPs are highly correlated, where an AIP is an Archival Representation of all or part of an Entity. Likewise, Services in the DSpace Service framework may be seen as different views/subsets of data/state of the content you refer to as an AIP.
-------

What content goes in an AIP?

There are several open questions about what content should really be stored in an AIP.

Does an AIP include derivatives (e.g. thumbnails, extracted text files) or just the DSpace CONTENT Bundle?

Decision: We need to have an includeBundle and an excludeBundle option on the AIP generation process. That way, individual institutions can choose which derivatives (or other content) they feel should be in AIPs. By default, we will just export the CONTENT and LICENSE bundles.
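A minimal sketch of how such options might select bundles is shown below. The exact semantics are an assumption, since the decision above does not define them: here an include list replaces the default of CONTENT and LICENSE, and an exclude list is applied afterwards. The function name `select_bundles` is hypothetical.

```python
DEFAULT_BUNDLES = ("CONTENT", "LICENSE")


def select_bundles(available, include=None, exclude=None):
    """Pick which of an Item's bundles go into its AIP.

    Assumed semantics: `include` replaces the defaults when given,
    and `exclude` always removes bundles from the result.
    """
    wanted = set(include) if include else set(DEFAULT_BUNDLES)
    unwanted = set(exclude or ())
    return [name for name in available if name in wanted and name not in unwanted]


bundles = ["CONTENT", "LICENSE", "THUMBNAIL", "TEXT"]
print(select_bundles(bundles))                                    # defaults only
print(select_bundles(bundles, include=["CONTENT", "THUMBNAIL"]))  # keep thumbnails
print(select_bundles(bundles, exclude=["LICENSE"]))               # drop licenses
```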

Is there a way to link to repetitive CC licenses (and similar repeated content), and avoid storing them within many AIPs at once?

Obviously, if you have the same CC License attached to 300 items, it'd be better to have those 300 Item AIPs link to a single CC License file (to save on storage space), rather than repeating that CC License in all 300 Item AIPs.

Although this is a nice concept – it will require further investigation.
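As a starting point for that investigation, one possible approach is to store each distinct license once, keyed by its checksum, and have each Item AIP record only the checksum. The sketch below uses SHA-256 from Python's standard library; the function and variable names are hypothetical, and nothing like this exists in the prototype.

```python
import hashlib


def deduplicate(licenses_by_item):
    """Split license files into a shared store and per-item links.

    Returns (store, links): `store` maps a checksum to the license bytes,
    kept once, and `links` maps each item id to the checksum it refers to.
    """
    store, links = {}, {}
    for item_id, data in licenses_by_item.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)
        links[item_id] = digest
    return store, links


# 300 made-up items that all carry the same license text.
licenses = {"item-%d" % n: b"Example CC license text" for n in range(300)}
store, links = deduplicate(licenses)
print(len(store), len(links))
```

The trade-off is that Item AIPs would no longer be fully self-contained, which cuts against the independence goal discussed earlier.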

Do we store EPeople, Groups and ResourcePolicies in AIPs?

Tim feels that ResourcePolicies (DSpace access rights) might need to be stored eventually in some way. Otherwise, when restoring an Item, you'll always have to default to making it publicly available (or default to Collection access rights).

Mark feels that restoring an object without its rights is too surprising to contemplate, but that this is separate from the question of whether and how to export roles.

Do we store Embargoes in AIPs, so that we can restore them as needed?

These are currently stored as metadata in DSpace 1.6. So, in a way, we're already storing them by default.

However, perhaps the ingest process for an AIP will need to check for embargoes if we feel they are worth restoring.

Restore from AIPs

We need to review the current restore process more closely, to ensure it is doing what we expect it to do. Previously it was built to support the AipPrototype (with internal AIPs).
