Dial In Details

Date: Monday, August 17, 11am EDT (UTC-4)

Attendees

Items for Discussion

  1. Use Case for Indiana University Libraries: 
    1. Asynchronous Storage
    2. HPSS-based tape storage for digital preservation (Scholarly Data Archive, SDA)
    3. Fedora 4 Federation / Projection across SDA
    4. HPSS ModeShape Connector for SDA
    5. Items loaded directly into SDA then available/ingested into Fedora 4
  2. Other issues:
    1. Fixity Checks and other preservation actions
      1. Fixity checking in Fedora 4 vs in SDA
      2. hooks into fixity checking and hooks into SDA
    2. Efficiency of single file requests vs. batch requests
    3. Batch processing:
      1. ‘intent to get’ followed by the ‘get’
      2. ‘intent to stage’
      3. ‘intent to purge’
    4. Use Glacier? DPN? other sources?
  3. Software Development for Asynchronous Storage

Minutes

  • Indiana Goals for this call
    • Define parameters of development and parts of Fedora we’ll need to modify to implement asynchronous storage
    • Preliminary scoping of a focused development effort around asynchronous storage that IU would contribute to
      • Would be good to scope this as a community effort - broad participation
  • Amherst: needs regarding storage in F4
    • Putting together F4 with a number of different storage backends
      • Projected storage over local network filesystems
        • Already supported
      • Need large segments of files stored in Amazon S3
        • Need to segment portions of different repos to different S3 buckets
          • Will make it possible to charge back for storage use
        • Can already connect to an S3 bucket in ModeShape/Infinispan
        • Primarily a synchronous interaction model (ingest/retrieval)
        • Files would all be binaries; when pushed into Fedora, they would be replicated out to S3
        • There is a size limit on single uploads to S3 (5 GB per PUT) - large files would need to go in as a series of part uploads and then be recomposed into a single object (see the multipart upload sketch after the minutes)
      • Cold-storage backup
        • Assemble an object into a bundle and push it into Amazon Glacier as an individual archive
        • This would happen asynchronously, and Fedora would not be able to retrieve the object directly
        • Could retrieve and process through an external processing chain (see the Glacier sketch after the minutes)
    • Policy-driven storage use case - a resource coming into Fedora provides a hint that determines which back-end storage system it is sent to
      • For GET requests, a property on the resource or something similar would provide the hint (rather than the client, which may not know where the resource is stored)
  • Indiana use case
    • Based on HPSS tape storage (Scholarly Data Archive - SDA)
    • Large media digitization and preservation initiative will put many large files into SDA
    • Need federation/projection to work across SDA
    • Need to build HPSS ModeShape connector
    • Access via F4 - either ingested or projected
    • Some resources (containers with metadata, access binaries) would live in F4, while the large binaries themselves would be stored in HPSS
  • Other issues
    • Fixity checking
      • Already doing fixity checking within SDA - how will this work with F4's fixity checking (see the fcr:fixity sketch after the minutes)?
    • HPSS offers greater efficiency with batch requests vs. single-file requests
      • Might want to indicate desire to GET a batch of files and then execute the GET as a batch
  • Other technologies
    • Glacier, DPN
    • Ideally the connector would be adaptable to a variety of technologies
  • Need to use the Servlet 3.1 specification to support asynchronous interactions (see the async servlet sketch after the minutes)
  • Implementation
    • Multiple implementations may draw out additional community support
    • Intent to GET, followed by GET
      • Could be done with API extension architecture
      • Possible to find out when the file is available and alert Fedora - Fedora is only involved when the file is available
      • A property on the container indicates that the binary is only available asynchronously
      • May not need a connector, though we would need to account for access control
    • Indiana favours a design where the client is not aware of the serving destination of the file - this would be handled by middleware
    • Best way forward: API extension architecture?
      • Bring this use case to the next meeting
    • Reach out to the community to see who else has a similar use case
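
The sketches below are non-normative illustrations of the approaches discussed above; all bucket, vault, path, and class names are placeholders. The first is a minimal sketch of the S3 multipart upload pattern from the Amherst discussion, using the AWS SDK for Java: a large binary is uploaded as a series of parts and then recomposed into a single S3 object. The bucket name, key, and part size are assumptions.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class LargeBinaryUpload {

    // Hypothetical bucket name and part size, for illustration only.
    private static final String BUCKET = "fedora-binaries";
    private static final long PART_SIZE = 100L * 1024 * 1024; // 100 MB parts

    public static void upload(File file, String key) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Start the multipart upload and remember its id.
        InitiateMultipartUploadResult init =
                s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(BUCKET, key));

        // Upload the file as a series of parts.
        List<PartETag> partETags = new ArrayList<>();
        long position = 0;
        for (int partNumber = 1; position < file.length(); partNumber++) {
            long size = Math.min(PART_SIZE, file.length() - position);
            UploadPartRequest part = new UploadPartRequest()
                    .withBucketName(BUCKET)
                    .withKey(key)
                    .withUploadId(init.getUploadId())
                    .withPartNumber(partNumber)
                    .withFileOffset(position)
                    .withFile(file)
                    .withPartSize(size);
            partETags.add(s3.uploadPart(part).getPartETag());
            position += size;
        }

        // Recompose the parts into a single S3 object.
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                BUCKET, key, init.getUploadId(), partETags));
    }
}
```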
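
A sketch of the Glacier cold-storage idea, assuming the high-level ArchiveTransferManager from the AWS SDK for Java; the vault and file names are placeholders. It illustrates why retrieval has to be asynchronous: the download call only completes after Glacier has staged the archive, typically hours later, so Fedora cannot serve the object directly.

```java
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.glacier.AmazonGlacierClient;
import com.amazonaws.services.glacier.transfer.ArchiveTransferManager;
import com.amazonaws.services.sns.AmazonSNSClient;
import com.amazonaws.services.sqs.AmazonSQSClient;

import java.io.File;

public class GlacierColdStorage {

    // Hypothetical vault name, for illustration only.
    private static final String VAULT = "fedora-cold-storage";

    public static void main(String[] args) throws Exception {
        ProfileCredentialsProvider credentials = new ProfileCredentialsProvider();
        AmazonGlacierClient glacier = new AmazonGlacierClient(credentials);
        ArchiveTransferManager atm = new ArchiveTransferManager(
                glacier, new AmazonSQSClient(credentials), new AmazonSNSClient(credentials));

        // Push a pre-assembled bundle (e.g. a zipped Fedora object) into the vault.
        String archiveId = atm.upload(VAULT, "object bundle", new File("object-bundle.zip"))
                              .getArchiveId();

        // Retrieval is asynchronous: Glacier stages the archive before download()
        // can complete, so this would run in an external processing chain.
        atm.download(VAULT, archiveId, new File("restored-bundle.zip"));
    }
}
```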
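
Fedora 4 exposes on-demand fixity checking for a binary through its fcr:fixity endpoint, which recomputes the digest and reports the result as RDF. The sketch below simply requests a fixity result over HTTP so it could be compared with the checks IU already runs inside SDA; the repository URL and resource path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FixityCheck {

    public static void main(String[] args) throws Exception {
        // Hypothetical repository URL and binary path, for illustration only.
        URL fixity = new URL("http://localhost:8080/rest/objects/demo/master-file/fcr:fixity");

        HttpURLConnection conn = (HttpURLConnection) fixity.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "text/turtle");

        // Print the RDF fixity result (status, computed digest, size).
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```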
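
A sketch of the "intent to GET" idea using the asynchronous request processing available in Servlet 3.0/3.1 containers: the first request records the intent and returns 202 Accepted, and once the storage layer has staged the file a later request streams it back while releasing the original request thread. The StagingService interface is a hypothetical stand-in for the HPSS/SDA middleware, not an existing API.

```java
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.io.OutputStream;

@WebServlet(urlPatterns = "/async-binary/*", asyncSupported = true)
public class IntentToGetServlet extends HttpServlet {

    /** Hypothetical interface to the asynchronous storage middleware (e.g. HPSS/SDA). */
    public interface StagingService {
        boolean isStaged(String path);
        void requestStaging(String path);               // the 'intent to get'
        void copyTo(String path, OutputStream out) throws IOException;
    }

    private StagingService staging;                     // wired in by the application

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String path = req.getPathInfo();

        if (!staging.isStaged(path)) {
            // Record the intent to GET and tell the client to come back later.
            staging.requestStaging(path);
            resp.setStatus(HttpServletResponse.SC_ACCEPTED);
            resp.setHeader("Retry-After", "3600");
            return;
        }

        // The file has been staged: stream it back via async processing so the
        // original request thread is released while the copy runs.
        AsyncContext ctx = req.startAsync();
        ctx.start(() -> {
            try {
                staging.copyTo(path, ctx.getResponse().getOutputStream());
            } catch (IOException e) {
                // Error handling omitted in this sketch.
            } finally {
                ctx.complete();
            }
        });
    }
}
```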