Design Principles

  1. Minimize change to the user via the API
  2. Retain URLs of migrated Fedora resources
  3. Compliance with OCFL
  4. Prevent OCFL-isms from bleeding into the Fedora API
  5. Rebuildability
  6. Performance
  7. Reduce complexity of implementation

Issues being addressed

Based on feedback from users of Fedora 4 and 5, the design for the next major release of Fedora will address the following issues:

Preservation persistence

The notions of "completeness" and "transparency" are important when it comes to how a preservation repository persists its resources (metadata and binaries) to storage. The resources in storage should be "complete" in the sense that Fedora should be able to rebuild its indexes based on what is stored on disk as files. The documented transparency of those persisted files also allows for other applications to consume those resources. See section below: OCFL Persistence.

Query service

The ability to query Fedora for basic information regarding the contents of the repository has been a missing feature in Fedora 4 and 5. This design will include a simple query service for inspecting all of the Fedora resources or resources based on specific attributes. See section below: Query service.

OCFL persistence

Architecture

  1. Retaining HTTP layer of existing Fedora codebase
  2. Replacing ModeShape persistence with OCFL storage
  3. Support for three interaction models: 
    1. atomistic (implicit) - every LDP resource maps to an individual OCFL Object
    2. archive group - a hierarchy of LDP resources maps into a compound OCFL Object
    3. archival-part (implicit) - an LDP resource that is a constituent part of a compound OCFL Object
  4. Eliminate "single-subject-restriction", i.e. support arbitrary RDF
  5. Fedora-specific information to be stored in the OCFL Object in a ".fcrepo/" directory
    1. e.g. which file is the description of another file
    2. e.g. which file is an ACL
  6. Optimizing reads/lookups with an internal database
    1. proposed database model: https://docs.google.com/document/d/1MsMfhae3thmNdoFtnTUnII3mr_-OkllRs9PvgnY1fDY/edit
  7. Support for both OCFL storage hierarchies:
    1. created by Fedora
    2. created by another application (pre-existing)

Mapping between LDP and OCFL

Opt-in model

  1. Fedora resources may be created with an optional "archive group" interaction model provided via headers

  2. New resources created via POST or PUT to the archive group will be LDP contained by the archive group and will be stored within the OCFL object representing that archive

  3. If a resource is created without the "archive" model, new resources created via POST or PUT will be LDP contained by the parent resource, but will be stored as separate OCFL objects

  4. Note: user establishes interaction model at creation time. Changing the model would require additional migration tooling.
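
The opt-in rules above can be illustrated with a minimal sketch. It assumes resources are identified by slash-separated paths and that the set of archive groups is known. The function name `ocfl_object_for` and the example paths are hypothetical, and nested archive groups are not considered.

```python
def ocfl_object_for(resource_path, archive_groups):
    """Return the path of the resource whose OCFL object stores resource_path.

    archive_groups is the set of resource paths that were created with the
    archive group interaction model (an assumption of this sketch).
    """
    parts = resource_path.strip("/").split("/")
    # Walk from the top of the hierarchy down, looking for an archive group
    # that is either the resource itself or one of its ancestors.
    for depth in range(1, len(parts) + 1):
        candidate = "/".join(parts[:depth])
        if candidate in archive_groups:
            # Archive group or archival part: stored in the compound object.
            return candidate
    # Atomistic: the resource is stored as its own OCFL object.
    return "/".join(parts)


groups = {"collection/book"}
print(ocfl_object_for("collection/book/page1", groups))
print(ocfl_object_for("collection/photo", groups))
```

In both cases LDP containment follows the parent resource; only the OCFL object that holds the content differs.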

Bulk ingest

  1. Faster ingest rates can be achieved by users writing OCFL-compliant content directly to disk
    1. Would require Fedora to (re)scan OCFL storage hierarchy
  2. Optionally, user could write into OCFL-compliant storage in a way that includes Fedora optimizations (e.g. ".fcrepo/" directory)
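
A rescan of the storage hierarchy could look roughly like the sketch below. It assumes that each OCFL object root is identified by a conformance declaration file named "0=ocfl_object_1.0" (the name depends on the OCFL version in use) and that Fedora optimizations live in a ".fcrepo" directory inside the object root. The function name `scan_storage_root` is hypothetical.

```python
import os

# Assumption: object roots carry the OCFL 1.0 conformance declaration file.
OBJECT_MARKER = "0=ocfl_object_1.0"
FCREPO_DIR = ".fcrepo"


def scan_storage_root(storage_root):
    """Return a sorted list of (object_root, has_fcrepo_dir) tuples."""
    found = []
    for dirpath, dirnames, filenames in os.walk(storage_root):
        if OBJECT_MARKER in filenames:
            found.append((dirpath, FCREPO_DIR in dirnames))
            # An object root was found; do not descend into its versions.
            dirnames.clear()
    return sorted(found)
```

Objects reported without a ".fcrepo" directory would be the ones written without Fedora optimizations, and Fedora would need to derive that information itself during the scan.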

Open questions

  1. Role of OCFL storage roots
    1. Could be valuable for multi-tenancy, but client interaction model has not been detailed
  2. What is the mapping / algorithm / relationship between:
    1. Fedora URL of LDP resource
    2. OCFL Object.ID
    3. OCFL storage path for associated OCFL Object

Implementation notes

  1. Provide new implementation of fcrepo-kernel-api that interacts with OCFL persistence
  2. Interactions with OCFL persistence should initially take advantage of the JHU OCFL client
  3. For pre-existing OCFL storage hierarchies, Fedora imposes the following constraint:
    1. The OCFL storage hierarchy must have a single, consistent "ocfl_layout" (i.e. the storage path mapping algorithm must be deterministic)
  4. Many members: performance should improve significantly since list of members will be supplied by a database index (which should support a degree of in-memory caching)
  5. Deleting tombstone of OCFL Object purges the Object

  6. Deleting tombstone of "constituent part" is not supported (405)
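
To illustrate what a deterministic storage path mapping means in practice, here is one possible approach, sketched under assumptions: the OCFL Object ID is hashed with SHA-256 and the leading characters of the digest become nested directories. This is only an example of a deterministic layout, not the mapping chosen by this design (which remains an open question). The function name `storage_path` and its parameters are hypothetical.

```python
import hashlib


def storage_path(object_id, tuple_size=3, tuple_count=3):
    """Map an OCFL Object ID to a storage path relative to the storage root.

    The same ID always yields the same path, so the path can be recomputed
    from the ID alone without consulting an index.
    """
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    # Use the first tuple_count groups of tuple_size hex characters as
    # intermediate directories to keep directory fan-out bounded.
    segments = [
        digest[i * tuple_size:(i + 1) * tuple_size]
        for i in range(tuple_count)
    ]
    # The full digest names the object root directory.
    return "/".join(segments + [digest])


print(storage_path("info:fedora/collection/book"))
```

Because the mapping is a pure function of the ID, Fedora can locate any object in a pre-existing hierarchy as long as that hierarchy was written with the same algorithm throughout.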

Prototyping proposal

  1. Expose JHU OCFL client functionality with minimal HTTP endpoints
    1. Such an endpoint should implement minimal LDP interactions
  2. Use HTTP over OCFL to test:
    1. Performance bottlenecks
    2. Scale viability (e.g. NLM migration)
    3. User expectations, ergonomics

Versioning

  1. Support for two versioning models:
    1. version an object on-demand (manual versioning)
    2. version an object on-change (auto-versioning)
  2. Support for toggling auto-versioning on/off
  3. One-to-one correspondence between OCFL versions and mementos
  4. For archive groups, any new version of the OCFL Object captures current state of the entire archive group

Versioning on-demand

  1. Same as Fedora 4 and 5 version creation: POST to a resource's "/fcr:versions" endpoint to create a Memento (i.e. a new OCFL version directory)
  2. Actively edited objects are captured in a "cache/" directory that is a sibling of the OCFL version directories

Versioning on-change

  1. Every update to a Fedora resource results in a new OCFL version directory
  2. Potential downsides:
    1. Potential storage impact
    2. Potentially creates "noisy" version history
    3. Note: Transactions could mitigate "noisy" version history by grouping multiple updates in a single commit

Implementation notes

  1. Same code logic used for creation of OCFL versions / Mementos in both on-demand and on-change models
  2. LDP resources within a compound object should respond with a "Link" header pointing to the TimeMap of the Fedora "archive group" resource
  3. POST on /fcr:versions of part resources returns a 400 response

  4. GET on /fcr:versions returns a version of the "constituent part"
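
The first implementation note, that both versioning models share the same version-creation logic, can be sketched with a small in-memory model. The class `VersionedObject` is hypothetical. Its `staged` dictionary stands in for the proposed "cache/" directory, and each entry in `versions` stands in for one OCFL version directory (and therefore one Memento).

```python
class VersionedObject:
    """In-memory stand-in for one OCFL object (atomistic or archive group)."""

    def __init__(self, auto_versioning=False):
        self.auto_versioning = auto_versioning
        self.versions = []  # one snapshot per OCFL version / Memento
        self.staged = {}    # mutable head state, analogous to "cache/"

    def update(self, path, content):
        """Apply a change; in the on-change model this also versions."""
        self.staged[path] = content
        if self.auto_versioning:
            self.commit()

    def commit(self):
        """Snapshot the entire staged state as a new version.

        Used directly for on-demand versioning (e.g. a POST to
        /fcr:versions) and indirectly by update() for on-change versioning.
        """
        self.versions.append(dict(self.staged))
        return "v%d" % len(self.versions)


manual = VersionedObject(auto_versioning=False)
manual.update("page1", "draft")
manual.update("page1", "final")
print(manual.commit(), len(manual.versions))

auto = VersionedObject(auto_versioning=True)
auto.update("page1", "draft")
auto.update("page1", "final")
print(len(auto.versions))
```

Because each snapshot copies the whole staged state, a new version of an archive group captures every part it contains, and the on-change model produces one version per update, which is the source of the "noisy" history noted above.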

Migration from lower versions of Fedora to higher

  1. Design
    1. Release import/export tool for each version of Fedora (4, 5, 6)
      1. Import/Export tool for a given release is able to round-trip content for that release
    2. If necessary, transform the serialization exported from one Fedora version into the serialization expected by the importing Fedora version
      1. May be able to transform F3(?), 4, and 5 directly to the F6-OCFL on-disk serialization

  2. Fedora resource URLs must remain unchanged during migrations
  3. Persistence model of Fedora 6 should be stable enough to eliminate the need for a content migration to Fedora 7

Fixity service

  1. Requirements:
    1. Check fixity of binary resource(s) by comparing computed value with stored value
    2. Check fixity of binary resource(s) given a specific set of Fedora object rdf:types
    3. Persist results of fixity check
      1. In log file?
      2. In database?
      3. In Fedora?
  2. Scheduled fixity service:
    1. Probably not part of the core
    2. Run as a separate service (see: Riprap)
    3. Potentially implemented as a circular queue of Fedora resources, ordered by "last fixity check" date property on Fedora resource
  3. Retain "fedora:hasFixityService" triple or header
  4. At the OCFL-level, interest in providing fixity over an OCFL storage hierarchy
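
The first requirement and the circular-queue idea can be sketched together. The sketch assumes each binary has a stored digest in hexadecimal and that resources are held in a simple dictionary. The names `check_fixity` and `run_fixity_pass` are hypothetical, and persisting the results is left open, as it is in the requirements.

```python
import hashlib
from collections import deque


def check_fixity(content, stored_digest, algorithm="sha512"):
    """Compare the digest computed from content with the stored value."""
    computed = hashlib.new(algorithm, content).hexdigest()
    return computed == stored_digest.lower()


def run_fixity_pass(queue, resources, batch_size):
    """Check the least recently checked resources, then requeue them.

    queue is a deque of resource IDs ordered by last fixity check (oldest
    first); resources maps each ID to a (content, stored_digest) pair.
    """
    results = {}
    for _ in range(min(batch_size, len(queue))):
        resource_id = queue.popleft()
        content, stored_digest = resources[resource_id]
        results[resource_id] = check_fixity(content, stored_digest)
        # Just checked, so it moves to the back of the circular queue.
        queue.append(resource_id)
    return results
```

A scheduled service running outside the core could call `run_fixity_pass` at a fixed interval and write the returned results to whichever store is eventually chosen.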

Query service

  1. Should also consider this "Query Service Specification"
  2. Proposal: Query service / endpoint should support the following queries:
    1. List all resources
    2. List resources by mimetype
    3. List resources by parent
    4. List resources by mimetype, parent, and modified date (<, >, =)
    5. List resources where the modified date is before or after a given date
  3. Open questions around scope of resources to be searchable
    1. Fedora resources?
    2. Resources defined in RDF documents within the repository?
    3. Hash URIs?
  4. Open questions around properties to support
    1. Server-managed triples?
    2. All properties?
  5. Triplestore not necessarily required

Implementation notes

  1. Index of all Fedora resources would be needed to support the query service
  2. Messaging model (synchronous or asynchronous) would likely be used to populate the index
  3. Full-text search would be a bonus
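
As a rough illustration of how a simple index could serve the proposed queries without a triplestore, the sketch below uses an in-memory SQLite table. The table layout, column names, and the `query` function are assumptions made for the example and are not the proposed database model. Modified dates are assumed to be ISO 8601 strings in a single consistent format so that string comparison matches chronological order.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory index from (id, parent, mimetype, modified) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resources ("
        "id TEXT PRIMARY KEY, parent TEXT, mimetype TEXT, modified TEXT)"
    )
    conn.executemany("INSERT INTO resources VALUES (?, ?, ?, ?)", rows)
    return conn


def query(conn, parent=None, mimetype=None,
          modified_after=None, modified_before=None):
    """List resource IDs, optionally filtered by any combination of criteria."""
    clauses, params = [], []
    if parent is not None:
        clauses.append("parent = ?")
        params.append(parent)
    if mimetype is not None:
        clauses.append("mimetype = ?")
        params.append(mimetype)
    if modified_after is not None:
        clauses.append("modified > ?")
        params.append(modified_after)
    if modified_before is not None:
        clauses.append("modified < ?")
        params.append(modified_before)
    sql = "SELECT id FROM resources"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]
```

Calling `query(conn)` with no filters lists all resources; the messaging model mentioned above would be responsible for keeping the table in step with the repository.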

Transaction service

  1. Proposal: no change to the Fedora API spec in 6
  2. We will either:
    1. align code with the (as-yet-to-be-ratified) side-car specification
    2. leave HTTP API unchanged while introducing the possibility of auto-versioning on transaction completion
  3. Potentially store updates within a transaction in a "txn/" directory that is a sibling of the OCFL version directories
  4. Support actions on multiple OCFL objects within a single transaction
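
The combination of transaction staging and auto-versioning on completion might behave like the following in-memory sketch. The class `Transaction` and the dictionary-based store are hypothetical. `pending` stands in for the proposed "txn/" directories, and the sketch does not attempt real atomicity or failure recovery across objects.

```python
class Transaction:
    """Stage updates to several OCFL objects and version them on commit."""

    def __init__(self, store):
        # store maps an object ID to its list of version snapshots.
        self.store = store
        # pending maps an object ID to its staged changes ("txn/" stand-in).
        self.pending = {}

    def put(self, object_id, path, content):
        """Stage a change without touching the committed versions."""
        self.pending.setdefault(object_id, {})[path] = content

    def commit(self):
        """Create exactly one new version per object touched."""
        for object_id, changes in self.pending.items():
            versions = self.store.setdefault(object_id, [])
            head = dict(versions[-1]) if versions else {}
            head.update(changes)
            versions.append(head)
        self.pending = {}

    def rollback(self):
        """Discard all staged changes."""
        self.pending = {}
```

Several updates to the same object inside one transaction therefore collapse into a single OCFL version, which is how transactions could reduce noise in the version history.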

Raw notes

  1. General VA Beach Meeting notes
  2. Design summary notes
  3. Migration notes
  4. Object modeling notes
  5. Versioning notes
  6. Fixity notes
  7. Bulk ingest notes
  8. Query service notes

8 Comments

  1. In regard to OCFL persistence, as I commented elsewhere, I am not convinced that making OCFL the primary storage platform for binaries and metadata is a good choice.

    First off, there is no separation between binaries and metadata. This may be a problem. E.g. I may want to store metadata in a fast SSD and binaries on a more capable SAN or cloud store, and as I understand it, I can't do that the way OCFL data are structured.

    Second, while I have no concrete proof for this, I am doubtful that any implementation writing metadata to a filesystem would be as efficient as using a database (but feel free to surprise me!). I understand that a database would be used for indexing, but that will be yet another write that needs to be done (and verified, lest the indices get out of sync).

    Finally, while I think implementing OCFL is important, I don't think it's advisable to impose it on all Fedora adopters by making it the primary and mandatory storage method.

    My suggestion would be to make the database that is currently proposed as the "index" database the primary metadata store, and a plain filesystem hierarchy similar to the current one the primary binary store. A good DB and filesystem implementation would provide plenty of assurance for mid-term preservation as well as performance. The OCFL layer should be an optional, additional, asynchronous store for long-term preservation. Mechanisms can be developed to notify if an OCFL object update fails, to help ensure integrity. The primary database and binaries can be backed up with different schedules and strategies, and the OCFL layer could be rebuilt from these two, AND the other way around.

    By keeping this clear separation we would keep Fedora modular and more adaptable to users' needs. Other than that, I welcome the effort to implement OCFL if that addresses preservation concerns of some of our stakeholders.

    1. You make a fair point about potential performance impacts with a file-based persistence layer. The approach we are targeting is to start with OCFL persistence and optimize from there. As you suggest, the optimization could quite likely stand on its own... and we will try to design in that direction.

      Regarding the co-location of binaries and metadata, would "external content" meet your use case of storing binaries in a different location?

      1. It could, but does that not void the purpose of OCFL by having content outside of Fedora's control and external to the OCFL framework?

        1. The Fedora API will still be implemented, along with external content.

          As you are suggesting, not everyone is necessarily concerned with the OCFL sensibilities.

          1. But then if you have no option outside of OCFL and you have externally stored content, you end up stuck with a broken OCFL implementation and you get the worst of both worlds: you are forced to write all metadata to disk, and you can't interact with the OCFL layer because it is incomplete.

            1. OCFL defines a way of persisting digital objects to disk/cloud that is transparent and supports versioning. If there are pointers to binaries in another location, I don't think that equates to a "broken OCFL implementation", but rather one possible design that prioritizes a particular set of use cases. No matter what Fedora were to implement as its persistence layer, external content has the same effect of separating elements of a digital object across storage systems.

              1. If OCFL allows pointing to non-local files and considers them managed under its structure, including for sanity check purposes, I guess that would work.

                My main, broader question is, however: do we have enough feedback from our constituents to be confident that they want to be locked into this technology? Will this setup improve trust in Fedora for most of them, or will it be perceived as a burden?

  2. I would be wary of letting other systems write to the OCFL layer directly. My main concern is that, as we know, no two implementations of a standard are exactly the same. There are always variations and dialects.

    So, if I have other OCFL-compliant software writing to the same structure that Fedora will read, there is a chance that at some point Fedora may interpret some MAY or SHOULD in the specs differently than the other implementations, resulting in resources not being visible, misrepresented or even overwritten.

    Read-only services should be safe to access OCFL directly, so that could be encouraged.