Transparent persistence, or human-readable persistence, is the practice of keeping a copy of repository contents as files on disk.

Rationale

Users have different reasons for wanting to access repository content as files on disk, such as:

  • Making it easier to use disk-based tools and workflows
  • Reducing the technology stack and skills required to recover repository content

Scenarios

There are a few different scenarios for keeping a copy of repository content on disk and keeping it in sync with the repository:

  • The copy on disk is the only copy of the data, used by the repository as the primary storage
  • The copy on disk is an additional copy of the data, updated synchronously during request processing
  • The copy on disk is an additional copy of the data, updated asynchronously, e.g., by receiving JMS events and retrieving repository content
  • A disk-like API is provided using FUSE or a similar tool that allows disk-based tools to work with the repository directly
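
The difference between the synchronous and asynchronous scenarios can be illustrated with a minimal sketch. It is not Fedora code: the class DiskMirror, its methods, and the fetch callable are all hypothetical, and an in-process queue.Queue stands in for a JMS event stream.

```python
import queue
import threading
from pathlib import Path


class DiskMirror:
    """Keeps an additional on-disk copy of repository content (illustrative only)."""

    def __init__(self, root, fetch):
        self.root = Path(root)
        self.fetch = fetch           # hypothetical callable: resource path -> bytes
        self.events = queue.Queue()  # stand-in for a JMS event stream
        self.errors = []             # (resource path, exception) for failed async updates
        threading.Thread(target=self._run, daemon=True).start()

    def _write(self, resource_path):
        # Retrieve the current content and write it below the mirror root.
        target = self.root / resource_path.strip("/")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(self.fetch(resource_path))

    def update_sync(self, resource_path):
        # Synchronous scenario: the request does not return until the copy is written.
        self._write(resource_path)

    def update_async(self, resource_path):
        # Asynchronous scenario: only record the event; a worker updates the copy later.
        self.events.put(resource_path)

    def _run(self):
        while True:
            resource_path = self.events.get()
            try:
                self._write(resource_path)
            except Exception as exc:
                self.errors.append((resource_path, exc))
            finally:
                self.events.task_done()
```

In the synchronous case a failed disk write surfaces in the request that caused it. In the asynchronous case the request returns first, so the copy on disk can lag behind the repository and failures have to be collected and handled separately (here in the errors list).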

Role in preservation

Having a copy of repository content on disk may enable a preservation workflow, but it is not a preservation strategy by itself. Transparent persistence is therefore "preservation-enabling": it lets a disk-based preservation workflow access the repository content easily.

Existing functionality

  • fcrepo-serialization can be configured to serialize metadata as RDF files on disk.
  • With the default configuration, files larger than 4KB are stored on disk named after their SHA-1 digest. These files are therefore already on disk and can be matched with their associated metadata records using the digest.
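
Matching a metadata record to its file by digest could look roughly like the sketch below. It assumes only what is stated above (the file name is the SHA-1 hex digest). The directory layout of the binary store is not described here, so the sketch searches recursively; the function names sha1_hex and find_binary are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_hex(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_binary(storage_dir, sha1):
    """Find the file named after the given digest and verify its content.

    Assumption: the file name equals the hex digest; the directory layout
    is unknown, so every subdirectory of storage_dir is searched.
    """
    for candidate in Path(storage_dir).rglob(sha1):
        if candidate.is_file() and sha1_hex(candidate) == sha1:
            return candidate
    return None
```

Recomputing the digest before returning the path doubles as a fixity check: a file whose content no longer matches its name is not returned.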

Requirements

  1. Fedora 4 resources shall be persisted as exploded BagIt bags, in a directory tree separate from the repository's primary storage
  2. The directory structure of the Bags shall have a discoverable and predictable relationship with the resource's repository URL
  3. RDF resources shall be persisted to disk in a client-defined RDF serialization, from the following options: application/ld+json, text/rdf+n3, application/rdf+xml, or text/turtle
  4. NonRDF resources shall be associated with their respective RDF resources by one of the following optional modes:
    1. copying the NonRDF resource to the Bag's data directory
    2. hard-linking from the Bag's data directory to the NonRDF resource in the repository's primary storage (requires the Bag and repository storage to be on the same filesystem)
    3. sym-linking from the Bag's data directory to the NonRDF resource in the repository's primary storage 
    4. creating a manifest with NonRDF repository URLs (holey bags)
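
Requirements 2 to 4 can be sketched as follows. This is one possible reading, not a specified design: the names bag_dir_for, attach_binary, and RDF_EXTENSIONS are hypothetical, as is the layout (one directory per URL path segment below the repository base URL). The file extensions are the conventional ones for each media type, and the holey-bag mode uses the fetch.txt file defined by the BagIt specification.

```python
import os
import shutil
from pathlib import Path
from urllib.parse import urlsplit

# Conventional file extensions for the serializations in requirement 3.
RDF_EXTENSIONS = {
    "text/turtle": ".ttl",
    "application/ld+json": ".jsonld",
    "application/rdf+xml": ".rdf",
    "text/rdf+n3": ".n3",
}


def bag_dir_for(resource_url, base_url, bag_root):
    """Map a resource URL to a bag directory (hypothetical layout).

    Each URL path segment below base_url becomes one directory level,
    left percent-encoded so the mapping stays reversible.
    """
    base_path = urlsplit(base_url).path.rstrip("/")
    path = urlsplit(resource_url).path
    if not path.startswith(base_path + "/"):
        raise ValueError("resource URL is not below the repository base URL")
    segments = [s for s in path[len(base_path):].split("/") if s]
    if any(s in (".", "..") for s in segments):
        raise ValueError("unsafe path segment in resource URL")
    return Path(bag_root).joinpath(*segments)


def attach_binary(source, bag_dir, name, mode="copy", url=None):
    """Associate a NonRDF file with a bag using one of the four modes."""
    data_dir = Path(bag_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / name
    if mode == "copy":
        shutil.copy2(source, target)
    elif mode == "hardlink":
        os.link(source, target)  # fails if source and bag are on different filesystems
    elif mode == "symlink":
        os.symlink(os.path.abspath(source), target)
    elif mode == "fetch":
        if url is None:
            raise ValueError("fetch mode needs the repository URL")
        # Holey bag: record URL, length, and payload path instead of the bytes.
        size = os.path.getsize(source)
        with open(Path(bag_dir) / "fetch.txt", "a", encoding="utf-8") as fh:
            fh.write(f"{url} {size} data/{name}\n")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return target
```

With an assumed base URL of http://localhost:8080/rest, the resource http://localhost:8080/rest/a/b maps to the directory a/b under the bag root, so the path can be derived from the URL and the URL from the path. A complete bag would also need bagit.txt and payload manifests, which this sketch leaves out.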

7 Comments

  1. For requirement #2 - serialization format, I would personally prefer if the Fedora team chose one preferred serialization format and provided that as a default.  As individuals try to collaborate across institutional boundaries, I think it's going to be easier if everyone can make some assumptions about how they look at and talk about serialized data.  If you can add hooks to support client-defined serializations, fantastic, but start with one and get as much energy behind it as you can.  (NOTE: since they're all just serializations of an underlying RDF graph, I'm assuming it wouldn't be too hard to convert between them - so maybe you only need to support a single serialization, period?)

    1. Yes, we are talking about different RDF serializations between which conversion should be readily possible.
      Thoughts, Unknown User (escowles@ucsd.edu) and Michael J. Giarlo?

      1. I am ambivalent on this one and defer to Unknown User (escowles@ucsd.edu) and others.

        1. Unknown User (escowles@ucsd.edu)

          I agree that the RDF serializations are all equivalent, but people still have strong feelings about them.  I think it would be fine to standardize on Turtle as the default format (since that's what F4 uses by default), but let people configure JSON-LD or RDF/XML or whatever if they really want to.

  2. For requirement #3 - directory structure - I'm not sure if you're using "logical" in the connotative sense or the strict algebraic sense. Was this requirement met by FC3? And therefore should it be a minimum requirement of FC3?  There needs to be a "discoverable" or "predictable" relationship, i.e., given resource URL x, I can call function f(x), which yields path y to the resource serialization on disk, but "logical" seems to initially imply something grander than that.

    1. Wording of #3 has been updated.

  3. For requirement #5 - non-RDF resources - it seems like 5d (or 5c) functionally matches what I could get out of FC3.  I'd personally prefer not to have two copies of non-RDF resources floating around, as in 5a (especially if I start trying to version them...)