Dial-In Details
Date: Monday, August 17, 11:00am EDT (UTC-4)
- URL: http://connect.iu.edu/dam2/
- Phone: 812-856-7060
- Passcode: 227815#
Attendees
Items for Discussion
- Use Case for Indiana University Libraries:
  - Asynchronous Storage
    - HPSS-based tape storage for digital preservation (Scholarly Data Archive, SDA)
    - Fedora 4 Federation / Projection across SDA
    - HPSS ModeShape Connector for SDA
    - Items loaded directly into SDA, then available/ingested into Fedora 4
- Other issues:
  - Fixity Checks and other preservation actions
    - Fixity checking in Fedora 4 vs. in SDA
    - Hooks into fixity checking and hooks into SDA
  - Efficiency of single-file requests vs. batch requests
    - Batch processing:
      - ‘intent to get’ followed by the ‘get’
      - ‘intent to stage’
      - ‘intent to purge’
  - Use Glacier? DPN? Other sources?
- Software Development for Asynchronous Storage
Minutes
- Indiana goals for this call
  - Define the parameters of development and the parts of Fedora we’ll need to modify to implement asynchronous storage
  - Preliminary scoping of a focused development effort around asynchronous storage that IU would contribute to
    - Would be good to scope this as a community effort, with broad participation
- Amherst: needs re: storage in F4
  - Putting together F4 with a number of different storage backends
    - Projected storage over local network filesystems
      - Already supported
    - Need to store large sets of files in Amazon S3
      - Need to segment portions of different repos into different S3 buckets
        - Will make it possible to charge back for storage use
      - Can already connect to an S3 bucket in ModeShape/Infinispan
        - Primarily a synchronous interaction model (ingest/retrieval)
      - Files would all be binaries; when pushed into Fedora, they would be replicated out to S3
      - There is a size limit for single uploads to S3; large files would need to go in as a series of single operations and then be recomposed into a single object
    - Cold-storage backup
      - Assemble an object into a bundle and push it into Amazon Glacier as an individual archive
      - This would happen asynchronously; Fedora would not be able to retrieve the object directly
      - Could retrieve and process it through an external processing chain
    - Policy-driven storage use case: a resource coming into Fedora provides a hint that determines which back-end storage system it is sent to
      - For GET requests, a property on the resource (or something similar) would provide the hint, rather than the client, which may not know where the resource is stored
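The S3 upload limit raised above (a single PUT is capped, currently at 5 GB) is the problem S3's multipart upload API addresses. A minimal local sketch of the split-and-recompose idea, with an in-memory part list standing in for S3 (the function names and the tiny part size are illustrative, not from the call):

```python
import hashlib

MAX_PART = 8  # bytes here, purely for illustration; real S3 parts are 5 MB-5 GB

def split_into_parts(payload: bytes, max_part: int = MAX_PART):
    """Split a binary into sequentially ordered parts, as multipart upload does."""
    return [payload[i:i + max_part] for i in range(0, len(payload), max_part)]

def recompose(parts):
    """Reassemble the parts into a single object (S3 does this on 'complete')."""
    return b"".join(parts)

payload = b"a large preservation master file"
parts = split_into_parts(payload)
assert all(len(p) <= MAX_PART for p in parts)
assert recompose(parts) == payload
# Fixity can be re-verified after recomposition:
assert hashlib.sha256(recompose(parts)).digest() == hashlib.sha256(payload).digest()
```

In a real implementation each part would be a separate upload operation, so a failed part can be retried without resending the whole file.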
- Indiana use case
  - Based on HPSS tape storage (Scholarly Data Archive, SDA)
  - Large media digitization and preservation initiative: many large files going into SDA
  - Need federation/projection to work across SDA
  - Need to build an HPSS ModeShape connector
  - Access via F4, either ingested or projected
    - Some resources (containers with metadata, access binaries) would be in F4; the binaries would be stored in HPSS
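The split just described (metadata and access copies in F4, master binaries in HPSS) amounts to routing each incoming resource by a storage hint, much like the policy-driven case Amherst raised. A hypothetical sketch of that routing decision (the hint names and backend labels are illustrative, not a proposed Fedora property):

```python
# Known backends; "local" stands in for F4's own binary store.
BACKENDS = {"local", "s3", "hpss"}

def route(resource: dict) -> str:
    """Pick a backend from a hint that policy (or the depositor) set on the resource."""
    hint = resource.get("storage_hint", "local")
    if hint not in BACKENDS:
        raise ValueError(f"unknown storage hint: {hint}")
    return hint

assert route({"path": "/obj/1/master.tiff", "storage_hint": "hpss"}) == "hpss"
assert route({"path": "/obj/1/access.jpg"}) == "local"
```

Storing the hint as a property on the resource keeps it available at GET time, matching the point above that the client may not know where the bytes live.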
- Other issues
  - Fixity checking
    - Already doing fixity checking within SDA; how will this work with F4 fixity?
  - HPSS offers greater efficiency with batch requests vs. single-file requests
    - Might want to indicate the desire to GET a batch of files and then execute the GET as a batch
  - Other technologies
    - Glacier, DPN
    - Ideally the connector would be adaptable to a variety of technologies
  - Need to use the Servlet 3.1 specification to support asynchronous interactions
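The F4-vs-SDA fixity question above is, at bottom, a comparison of independently computed digests over the same bytes. A minimal sketch, assuming both systems can expose a SHA-256 of the stored content (the function names are illustrative):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Digest as recorded at ingest time (or reported by the storage system)."""
    return hashlib.sha256(data).hexdigest()

def fixity_ok(recorded_digest: str, current_bytes: bytes) -> bool:
    """Recompute the digest from the current bytes and compare with the record."""
    return recorded_digest == sha256_hex(current_bytes)

original = b"preservation master"
recorded = sha256_hex(original)
assert fixity_ok(recorded, original)
assert not fixity_ok(recorded, b"bit-rotted master")
```

If SDA already records digests, F4-side fixity could compare against that record instead of recomputing on tape, avoiding a costly staging operation just to re-hash.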
- Implementation
  - Multiple implementations may draw out additional community support
  - Intent to GET, followed by GET
    - Could be done with the API extension architecture
    - Possible to find out when the file is available and alert Fedora, so that Fedora is only involved once the file is available
    - A property on the container indicates that the binary is only available asynchronously
    - May not need a connector, though we would need to account for access control
  - Indiana favours a design where the client is not aware of the serving destination of the file; this would be handled by middleware
- Best way forward: API extension architecture?
  - Bring this use case to the next meeting
  - Reach out to the community to see who else has a similar use case
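The "intent to GET, then GET" flow discussed above can be sketched as a small state machine: an intent queues asynchronous staging, a batch job moves content off tape, and only then does the GET succeed. All names below are illustrative, not a proposed Fedora or HPSS API:

```python
class AsyncStore:
    """Toy model of asynchronous (tape-backed) storage behind a repository."""

    def __init__(self, tape: dict):
        self.tape = tape      # archived content, not directly readable
        self.staged = {}      # content staged to disk, readable
        self.pending = set()  # intents awaiting the next staging batch

    def intent_to_get(self, path: str):
        """Record the intent; a later batch job stages the file."""
        if path in self.tape:
            self.pending.add(path)

    def run_staging_batch(self):
        """Stage all pending files at once (batch requests suit HPSS-like systems)."""
        for path in self.pending:
            self.staged[path] = self.tape[path]
        self.pending.clear()

    def get(self, path: str) -> bytes:
        if path not in self.staged:
            raise LookupError(f"{path} not staged; issue an intent first")
        return self.staged[path]

store = AsyncStore({"/sda/obj1": b"bytes"})
store.intent_to_get("/sda/obj1")
try:
    store.get("/sda/obj1")  # too early: content is still only on tape
except LookupError:
    pass
store.run_staging_batch()
assert store.get("/sda/obj1") == b"bytes"
```

In the middleware design Indiana favours, the staging and the "now available" notification would sit in front of Fedora, so the client and Fedora only ever see the final successful GET.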