Technical Working Group

  • Ben Armintor - Columbia University
  • Chris Beer - Stanford University
  • Esme Cowles - University of California, San Diego
  • Dan Davis - Smithsonian Institution
  • Declan Fleming - University of California, San Diego
  • Neil Jefferies - Oxford University
  • Adam Soroka - University of Virginia
  • Andrew Woods - DuraSpace - Technical Team Lead
  • Zhiwu Xie - Virginia Tech
The working group's charter.

Initial Objective

Given the areas of assessment enumerated below, the Technical Working Group has decided to prioritize and select the top four areas for initial review. The plans for each of these four areas and their assessment outcomes can be found here:

Areas of assessment

  1. REST API
    • Are immediate updates required?
    • We should version the API independently
      1. This allows for multiple backend implementations/optimizations
      2. A. Soroka: I think this requires a stronger definition of the API than currently exists in the form of user documentation. I suggest defining the API as ontology extensions to LDP.
      3. Clarifying and publicizing (formally and informally) the relationship between the Fedora API and LDP (see the sketch below).
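    A minimal sketch of what treating the API as LDP could look like from a client's point of view, using Python's requests library. The base URL, the Slug value, and the Turtle payload are assumptions for illustration, not the defined API.

      # Sketch: interacting with the repository purely through LDP conventions.
      # The base URL and payload below are assumptions for illustration.
      import requests

      FEDORA = "http://localhost:8080/rest"   # assumed base URL of the REST API

      # Create a child container via LDP POST; Slug is a naming hint.
      resp = requests.post(
          FEDORA,
          headers={"Slug": "collection1", "Content-Type": "text/turtle"},
          data='<> <http://purl.org/dc/terms/title> "A test collection" .',
      )
      resp.raise_for_status()
      new_uri = resp.headers["Location"]

      # Retrieve the resource as RDF and inspect the LDP types advertised in the Link header.
      get = requests.get(new_uri, headers={"Accept": "text/turtle"})
      print(get.headers.get("Link"))
      print(get.text)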
  2. Performance
    1. Reads
    2. Writes
      1. Many small files (see the sketch below)
      2. Large files
      3. High throughput
    3. Scalable serialization to disk
      • Need to measure the scale of load that async serialization can meet
      • Need to clarify async approaches: messaging and sequencers
    4. Replication of objects to another repository instance
    5. Full re-indexing
    6. Full integrity checks
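    A rough sketch of how the "many small files / high throughput" write case could be measured against the REST API; the base URL, container path, payload size, and single-client loop are assumptions, and a real assessment would also cover large files and concurrent clients.

      # Sketch: time the ingest of many small binaries through the HTTP API.
      # Base URL, container path, and payload size are assumptions for illustration.
      import time
      import requests

      FEDORA = "http://localhost:8080/rest/benchmark"   # assumed target container
      N = 1000
      PAYLOAD = b"x" * 1024                             # 1 KB "small file"

      start = time.time()
      for i in range(N):
          r = requests.put(
              f"{FEDORA}/file-{i}",
              headers={"Content-Type": "application/octet-stream"},
              data=PAYLOAD,
          )
          r.raise_for_status()
      elapsed = time.time() - start

      print(f"{N} small writes in {elapsed:.1f}s ({N / elapsed:.1f} objects/s)")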
  3. Multi-node / Clustered configurations / Federation Capable
    1. High availability
    2. Bulk ingest
    3. High read loads
    • Note: we generally need to define what clustering provides (DWD: I suggest that a cluster acts like a single installation in which system state is closely shared among the members; clusters usually imply a common implementation)
    • Federation - nodes have a common definition for identifiers, interfaces, formats, protocols, business semantics, and policies that permit them to interoperate, but otherwise act like independent installations that do not closely share system state. Federation does not require a common implementation but does imply common governance.
  4. ModeShape
    1. Assess persistence approach (i.e. bit-level object and datastream persistence)
      1. Some backup/restore details: Backup and Restore (see the sketch below)
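    A minimal sketch of exercising backup and restore over HTTP, assuming the ModeShape-backed fcr:backup and fcr:restore endpoints; the base URL is an assumption, and the backup directory lives on the server's filesystem rather than the client's.

      # Sketch: backup and restore via the repository's HTTP endpoints.
      # Endpoint names and behavior are assumptions for illustration.
      import requests

      FEDORA = "http://localhost:8080/rest"   # assumed base URL

      # Request a repository backup; the response body names the server-side directory.
      backup = requests.post(f"{FEDORA}/fcr:backup")
      backup.raise_for_status()
      backup_dir = backup.text.strip()
      print("backup written to", backup_dir)

      # Later, restore the repository from that directory.
      restore = requests.post(f"{FEDORA}/fcr:restore", data=backup_dir)
      restore.raise_for_status()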
  5. Evolution-capability - The system permits graceful (incremental) changes without having to perform replacement of large parts of the system in one step
    1. The software permits the graceful replacement of old technology with new technology
    2. The software permits the integration of new technology gracefully
    3. New content formats can be added easily, and the system permits gracefully delivering new representations for existing content
    4. New capabilities can be added or old ones replaced gracefully
    5. Underlying hardware and software infrastructures can be replaced gracefully, and the system can use advances in technology or special characteristics of its technical infrastructure without changing the core Fedora software
    6. How does the content move forward in time?
    7. How do the interface contracts move forward in time?
    8. How does the implementation move forward in time?
  6. Ability to use in various integration patterns
    1. Inbound and outbound transformation
      1. Permits ingested information to be transformed so it matches the supported ingest contracts, and the same in reverse for delivery
      2. Also used internally to support interoperation with back-end integrations, particularly storage (for example, S3 or DuraCloud)
      3. Overlaps the Content Enrichment pattern for feature extraction, for example loading Search and Discovery indices
    2. Content Enrichment pattern for ingest at least
      1. For example, extra meta-information can be added to newly ingested content
      2. Another example is extraction of meta-information from inside the ingested content
      3. A third example is connecting content or meta-information to other related items
    3. Internal and external event-driven (notification) patterns (especially external notification that an asynchronous operation is complete)
      1. Internal event-driven operation is likely to be well set up
      2. A classic external case is a front-end system that needs to know when all internal or delegated operations are finished so that it can behave in a post-ingest fashion, for example updating its indexes, removing staged content, and possibly removing the original content. The alternative is a polling approach (both could be used).
    4. Idempotent receiver pattern - Identical ingests could be received, but it should be possible to ignore duplicates (see the sketch below)
    5. Message Bridge pattern - Permits inbound messages (all RESTful HTTP API calls are messages) to signal back-end integrations, possibly outside the repository, to perform functions
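    A small sketch of the idempotent receiver and external notification items above: a consumer that processes repository event messages once, ignores duplicate deliveries, and signals a front-end when post-ingest work can start. The message fields and the notify_frontend helper are hypothetical, not the repository's actual event format.

      # Sketch: an idempotent receiver for repository event messages.
      # Message fields and downstream actions are assumptions for illustration.
      seen_event_ids = set()   # in production this would be durable (e.g. a database)

      def notify_frontend(resource_uri: str) -> None:
          # Hypothetical hook: the front-end can now update indexes, remove staged content, etc.
          print(f"post-ingest work can start for {resource_uri}")

      def handle_event(event: dict) -> None:
          """Process one repository event exactly once, ignoring duplicates."""
          event_id = event["id"]
          if event_id in seen_event_ids:
              return                      # duplicate delivery: safe to ignore
          seen_event_ids.add(event_id)
          if event.get("type") == "ingest-complete":
              notify_frontend(event["resource"])

      # Duplicate deliveries of the same message are processed only once.
      msg = {"id": "evt-42", "type": "ingest-complete", "resource": "/rest/collection1/item-1"}
      handle_event(msg)
      handle_event(msg)   # ignored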
  7. Storage Options
    1. Tiered-storage
      1. Support having all or part of the content on low-performance storage, including copies in near-offline storage
      2. Support having all or part of the content on offline storage (like tape - where items are not available until after staging)
      3. Support having meta-information stored on offline or near-offline storage
    2. Support storage other than file systems and using that storage's special features
      1. Bytestream-based object stores like S3, DuraCloud, or Isilon (see the sketch below)
      2. Streaming stores for low latency, low dropout functions such as audio and video delivery
      3. Tape
    3. Support having specialized indices, particularly for locating copies, metadata, or discovery data, and also for reducing latency
      1. Direct queries to an appropriate index
      2. Marshal results from multiple indices
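    One concrete reading of the object-store item above: pushing a serialized copy of an object to S3 as a lower storage tier, using boto3. The bucket name, key layout, and local path are assumptions for illustration.

      # Sketch: copying a serialized object to an S3 bucket as a lower storage tier.
      # Bucket name, key layout, and file path are assumptions for illustration.
      import boto3

      s3 = boto3.client("s3")
      BUCKET = "repository-tier2"   # assumed bucket

      # Push the serialized form of an object to the object store ...
      s3.upload_file("/data/export/object-1.zip", BUCKET, "objects/object-1.zip")

      # ... and confirm it landed; a specialized index could record this location.
      head = s3.head_object(Bucket=BUCKET, Key="objects/object-1.zip")
      print("stored", head["ContentLength"], "bytes in tier-2 storage")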
  8. Preservation-worthiness 
    1. These comments are based on the assumption that the only form we currently know how to preserve is a serialized form (also, some of these features overlap); if this is not true, propose an alternative
    2. Permit copies to be made, maintained and validated at one or more geographically remote locations
    3. All archivally significant data is, at some point, stored in a serialized form
      1. A. Soroka: What is "archivally significant data"?
    4. No notification that results in the destruction of the original source materials is issued until all steps of the preservation policy have been executed and verified (A. Soroka: This is as much to say that Fedora's performance will always be terrible.)
      1. e.g. content progresses from a (possibly) non-serialized form to a serialized form, n copies are made, and the essential characteristics are then checked
      2. There is some definition of the essential characteristics of the representations that can be delivered for the unit of preservation
      3. There is some definition of the unit of preservation
    5. Bitstream-level fixity of "preserved" representations can be verified (see the sketch below)
    6. Fixity of meta information can be verified
    7. Some approach to authenticity is selected and used including at least lifecycle records (one kind of audit record)
    8. Records of system operations including configuration changes are kept (a second kind of audit record)
      1. A. Soroka: This is not feasible in the current implementation and making it feasible would require bringing configuration into the repository, a massively-non-trivial task.
    9. Repeated from Evolution above since the subjects overlap: How does the content move forward in time?
    10. How do the interface contracts move forward in time?
    11. How does the implementation move forward in time?
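    A minimal sketch of the bitstream-level fixity item: recompute checksums for serialized copies and compare them against a stored manifest. The manifest format ("<sha256>  <relative path>" per line) and the paths are assumptions; the repository's own fixity service could perform the same check server-side.

      # Sketch: verifying bitstream-level fixity of serialized copies against a manifest.
      import hashlib
      from pathlib import Path

      def sha256_of(path: Path) -> str:
          h = hashlib.sha256()
          with path.open("rb") as f:
              for chunk in iter(lambda: f.read(1 << 20), b""):
                  h.update(chunk)
          return h.hexdigest()

      def verify(manifest: Path, root: Path) -> list:
          """Return the relative paths whose current checksum does not match the manifest."""
          failures = []
          for line in manifest.read_text().splitlines():
              expected, rel = line.split(maxsplit=1)
              if sha256_of(root / rel) != expected:
                  failures.append(rel)
          return failures

      bad = verify(Path("/preservation/manifest-sha256.txt"), Path("/preservation/copies"))
      print("fixity failures:", bad or "none")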
  9. Support for graphs of related stuff (carefully avoiding saying what kind of stuff yet; see the sketch after this list)
    1. Linked data
    2. Semantic databases
    3. Specific representations
    4. Named graphs
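    A small sketch of what a graph of related resources can look like at the RDF level, using rdflib; the resource URIs and the choice of Dublin Core terms are illustrative assumptions, not a modeling recommendation.

      # Sketch: expressing relationships between repository resources as linked data.
      # URIs and predicates are assumptions for illustration.
      from rdflib import Graph, URIRef, Literal
      from rdflib.namespace import DCTERMS

      g = Graph()
      collection = URIRef("http://localhost:8080/rest/collection1")
      item = URIRef("http://localhost:8080/rest/collection1/item-1")

      g.add((collection, DCTERMS.title, Literal("A test collection")))
      g.add((collection, DCTERMS.hasPart, item))
      g.add((item, DCTERMS.isPartOf, collection))
      g.add((item, DCTERMS.title, Literal("An item in the collection")))

      # The same triples could live in a named graph in a semantic database,
      # or be exposed as a specific representation (Turtle shown here).
      print(g.serialize(format="turtle"))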

Meetings

 


2 Comments

  1. Daniel Davis - It's not totally clear to what "Evolution-capability" refers. Perhaps you can unpack that a bit?

    1. I added to the doc.