January 2012 Meeting, Temple University

Components that DuraSpace will build

Provisioning Process

Steps and notes
  • Obtain a DfR account
    • Associated with a DuraCloud account (multitenancy).
    • Belongs to the institution or funder at the departmental level.
    • May need to be enabled / streamlined with InCommon accounts for Internet2 customers.
    • Standard DuraCloud setup: user profiles, billing, spinning up instances, initializing the instances.
    • Must spin up the Fedora instances and start the autouser.
      • The autouser runs a scheduler to start the conversion and CloudSync services.
      • Design question: should the conversion and CloudSync services be placed on the Fedora instances (conversion) or at the DuraCloud level (CloudSync)?
    • Smallest config is 2 instances (DuraCloud and Fedora); largest config is one instance per service, plus elastic DuraStore and Fedora with components (triplestore, Lucene) split onto their own instances.
      • Could also provide a Fedora per researcher account / profile to help with scaling and division of content.
How to Build

As an add-on to the management console. Note that spinning up DSpace and Fedora in the cloud is a similar model. We should possibly apply the REST API model to the management console (see Jira issue) as part of this process.

  • Next Steps
    • The DuraCloud Feature Requests and UI Designs
  • DuraCloud Feature Requests:
    • Autouser implementation
    • Create DfR accounts
    • Management console extensions or plugins for DfR users; Upgrade Management console to REST?
  • UI Design Needs:
    • DfR Account creation

Sync utilities

Not covered yet: We need to report on the results of the sync from the DfR server side. This means collecting the messaging from the DuraStore side to generate a log for later reporting. It may be possible to add a tag identifying the watcher and the transaction to the DuraStore calls for later filtering.

Specification for sync services / interfaces

How do folks write new sync tools? Candidate specs for a sync tool:

  • Put: Use DuraCloud APIs to push the data up to DfR; Authenticated by Shib?
  • Provide a manifest file in XX format that provides extra metadata
  • Analyze/Configure: Accept selection or filter criteria in XX format that guide where to look on the filesystem, what type of files to upload, what metadata to add.
  • Trigger: Accept recommendations on monitoring / trigger frequency.
  • Support SSL connections.
  • Support accepting a key and encrypting the data before it leaves.
  • Provide a "watcher" with Heartbeat support that calls the Monitor and accepts a recommendation on frequency.
  • Store authentication locally and send it via Shib or other.
  • TBD: outline of how to push the data back when needed.
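The format for the selection and filter criteria is still open ("XX" above), so the following is only a sketch of what the Analyze/Configure step might do. It assumes a hypothetical criteria structure with include patterns, exclude patterns, and metadata to attach; the key names and the `matches` / `select_files` helpers are illustrative and not part of any DuraCloud API.

```python
import fnmatch
import os

# Hypothetical criteria; the real format ("XX" above) has not been chosen.
CRITERIA = {
    "include": ["*.csv", "*.tif"],       # file name patterns to upload
    "exclude": ["tmp/*", "*/tmp/*"],     # relative path patterns to skip
    "metadata": {"project": "example"},  # metadata added to every upload
}


def matches(rel_path, criteria):
    """True if rel_path (using '/' separators) should be uploaded."""
    name = rel_path.rsplit("/", 1)[-1]
    included = any(fnmatch.fnmatchcase(name, p) for p in criteria["include"])
    excluded = any(
        fnmatch.fnmatchcase(rel_path, p) for p in criteria.get("exclude", [])
    )
    return included and not excluded


def select_files(root, criteria):
    """Walk root and return (relative path, metadata) pairs for matching files."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if matches(rel, criteria):
                selected.append((rel, dict(criteria.get("metadata", {}))))
    return sorted(selected, key=lambda item: item[0])
```

A sync tool would then push each selected file with its metadata through the DuraStore calls; that upload step is deliberately left out here because the authenticated API is still under discussion.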
How to Build
  • Next Steps
    • Design / specify an initial manifest file
    • Design / specify the configuration format (selection and filter criteria).
  • DuraCloud Feature Requests:
    • Shib authenticated REST API for DuraStore
    • Logging of activity
  • UI Design Needs:
    • Design of Client Interface - file system watcher (configuration - what directories on what schedule; operational status; results)

Write the specifications!

Central Monitor Server / Sync Service Settings

Need a REST endpoint for the heartbeat call.
Decide if we add a distinct call to get the settings, or use a convention and the standard existing endpoint.
Provides notification if the heartbeats don't arrive as expected.
N.B. calls to move data go directly to the DuraStore APIs.
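As a rough illustration of the heartbeat bookkeeping described above, here is a minimal in-memory sketch. The `HeartbeatMonitor` class, the `grace_factor` parameter, and the watcher IDs are all assumptions made for the example; the actual REST endpoint and notification mechanism are still to be designed.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per watcher and reports the overdue ones."""

    def __init__(self, interval_seconds, grace_factor=2.0):
        self.interval_seconds = interval_seconds
        # Assumption: a watcher is overdue after grace_factor missed intervals.
        self.grace_factor = grace_factor
        self._last_seen = {}

    def record(self, watcher_id, now=None):
        """Record a heartbeat; return the recommended seconds until the next."""
        self._last_seen[watcher_id] = time.time() if now is None else now
        return self.interval_seconds

    def overdue(self, now=None):
        """Return the sorted IDs of watchers whose heartbeat is late."""
        now = time.time() if now is None else now
        limit = self.interval_seconds * self.grace_factor
        return sorted(
            watcher_id
            for watcher_id, seen in self._last_seen.items()
            if now - seen > limit
        )
```

The return value of `record` stands in for the "recommendation on frequency" that the watcher accepts; a REST handler for the heartbeat call could return it in the response body, and a periodic job could send notifications for whatever `overdue` reports.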

How to Build

As a Duracloud service.  This might be deferred until a later development cycle.

Shibbolize the DuraCloud and DuraStore APIs.

UI for sync service settings
  • Generate selection / filter criteria
  • Specify automatic metadata to add
How to Build
  • Next Steps
  • DuraCloud Feature Requests:
  • UI Design Needs:
    • Design of Client Interface
Researcher Workstation Sync Utility
  • Write an SDK
  • Provide a set of interfaces and concrete implementations of same.
  • Needs to support encryption and Shib.
  • Expect that we will allow multiple machines to sync to the same space; we will not provide guaranteed semantics around resolving conflicts.
  • For more general sync capabilities, use another tool for now. We'll pull from one of the nodes with access to all files.
How to Build

As a command line Java application?

  • Next Steps
  • DuraCloud Feature Requests:
    • Refactor the existing sync tool to create an SDK.
  • UI Design Needs:
    • Design of Client Interface - file system watcher (configuration - what directories on what schedule; operational status; results)
Remote Data Sync (box.net)

Open question: Should this be box.net or Dropbox, or something else?

How to Build

TBD; Possible implementations include:

  • Could be via REST calls to box.net to explore what the researcher has placed (do gets and sync up). This pattern is easier to replicate for other providers. Possibly as an extension of CloudSync?
  • Could be DfR writes a plugin for box.net that appears in the box.net UI login and can be pressed by the user.
  • Have box.net sync to a node that also syncs to DuraCloud? No software development required. This is preferred for now.

Conversion service

(N.B. Apply some actual use cases to this).
Assume there is a pipeline/chain of converters that know how to convert various specific incoming formats; if none in the chain knows how to handle the input, fall back to a default converter that makes opaque Fedora objects. It is this default converter that we will focus on initially. We want to hang onto implicit data and relationships (like the directory structure) and bring them in via the conversion service.
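The chain-with-default idea can be sketched as follows. This is only an illustration under stated assumptions: the `CsvConverter` class, the `can_handle` / `convert` method names, and the dictionary used to describe an object are hypothetical, and no real Fedora objects are created. The directory structure is kept by recording each file's parent directory.

```python
import posixpath


def describe(path, object_type):
    """Build a plain description of the object, keeping the directory context."""
    return {
        "type": object_type,
        "source": path,
        "parent": posixpath.dirname(path),  # implicit relationship to keep
        "label": posixpath.basename(path),
    }


class CsvConverter:
    """Hypothetical format-specific converter, used only for illustration."""

    def can_handle(self, path):
        return path.lower().endswith(".csv")

    def convert(self, path):
        return describe(path, "tabular")


class OpaqueConverter:
    """Default converter: accepts anything and makes an opaque object."""

    def can_handle(self, path):
        return True

    def convert(self, path):
        return describe(path, "opaque")


def convert(path, chain, default):
    """Use the first converter in the chain that can handle the path."""
    for converter in chain:
        if converter.can_handle(path):
            return converter.convert(path)
    return default.convert(path)
```

For example, `convert("projA/run1/image.raw", [CsvConverter()], OpaqueConverter())` falls through to the default converter and still records `projA/run1` as the parent.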

Specification format for conversion Rules

Should be informed by / based on the Islandora / Smithsonian models. Should take a look at Aaron Birkland's conversion pipeline, existing ETL tools, the FAST search and access tool, the Harvard work, DROID, etc.

How to Build

Explore examples of likely directory structures we would want to import, examine how Smithsonian would map and handle, and propose the criteria / heuristics that could be applied.

  • Next Steps
    • Research the prior art above
UI for Conversion rules
How to Build
  • UI Design needs
    • Analyze the models for expressing the conversion rules

TBD, after we understand the conversion Rules.

UI for conversion services management (job status, resource allocation, rerun after settings change)
How to Build

UI connected to the REST services provided by the webapp specified below.

  • UI Design needs
    • Work out what this should look like.
Duracloud space to fedora object conversion service
How to Build

Write as a Webapp, deployed as a DuraCloud service. There is no prior art in DuraCloud for this type of service, so we need to think hard about it.

  • Next Steps
    • Design the conversion pipeline
    • Write the webapp shell for the conversion service

Cloudsync enhanced for DfR

Need to add a sync task instead of copy -> make a "one-way" sync that determines changes and sends them.
Need a scheduling capability, either internal or external.
Needs access to credentials for the locations it reads from and writes to.
Needs a way to export / import configuration information (or use existing JSON calls).
Need to capture the state in case a restart needs to happen.
Possibly place on EBS to help with upgrades and hold state.
Shibbolize?
This is a sysadmin function / role.
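To make the "one-way" sync task concrete, here is a small sketch of the change-detection step, assuming each side can be listed as a mapping from content ID to checksum. The `plan_one_way_sync` function and the `delete_missing` flag are hypothetical; whether a one-way sync should ever remove items from the destination is not decided in these notes, so deletion is off by default.

```python
def plan_one_way_sync(source, destination, delete_missing=False):
    """Compare two {content_id: checksum} listings and plan what to send.

    Items are sent when they are missing from the destination or when the
    checksums differ. Nothing is ever copied back to the source.
    """
    to_send = sorted(
        content_id
        for content_id, checksum in source.items()
        if destination.get(content_id) != checksum
    )
    to_delete = []
    if delete_missing:
        # Assumption: removal of destination-only items is opt-in.
        to_delete = sorted(cid for cid in destination if cid not in source)
    return {"send": to_send, "delete": to_delete}
```

Because the plan is a plain dictionary, it could also be written out as JSON and reloaded, which is one possible way to capture state for a restart.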

How to Build
  • Write additional modules for Cloudsync. (sync and scheduling capabilities)
  • Write the wrappers for Duracloud to install cloudsync and send in the credentials (provision).
  • Write the wrappers for DfR to control / report on cloudsync.
  • Explore Shibbolizing / ??
  • UI Design
    • Integrate the Cloudsync design into the overall DfR interface design
  • DuraCloud Feature Requests
    • Persist state on services (for cloudsync running within DuraCloud on a restart)
    • Autouser for scheduling cloudsync runs.

DuraCloud with Shib ( & OAuth/OpenId?)

Ideally we add Shib to both the REST and user interfaces in DuraCloud as a systemic cross-cutting concern. We will need a linkage from Shib identities to DuraCloud user profiles for auth. An open question is what to do when there is no Shib: should DuraCloud run its own Shib rather than the existing identity mechanism?

Overall, solve the question of CAS or Shib or Spring Security or all / none of the above.

Steps
  • Research options for packages & libraries
  • Develop a model for DuraCloud UIs
  • Develop a model for DuraCloud REST interfaces
  • Stand up a shib dev environment or choose external test servers.
  • Slot into the DuraCloud priority list
  • Upgrade existing known REST clients
How to Build

TBD

Access control and Accounts

On the incoming side, we will have multiple "space A" for each Fedora to allow for distinct read/write controls on each incoming area. There will be a single "space B" for each Fedora; each Fedora represents a research project owned by a researcher.

Need to look into central control at two endpoints: the Islandora interface, mapping the Drupal user / profile / capabilities to the Shib ID, and the Fedora level, using XACML tied to profiles tied to the Shib ID.

We would like to have a solution that provides identity as a centralized cross-cutting concern across all DuraCloud and DfR interfaces, both user and REST. This solution should be easy to integrate (e.g. Spring Security), and yet have a compatibility path for InCommon (and/or other Shibboleth environments). It is highly desirable to be compatible with public systems as well (OpenID or OAuth).

Sub components to possibly develop:

  • Specification for access control / sharing
  • UI for access control and account creation
Steps
  • Work with Fluid to develop the user experience of the access control.
  • Research the Islandora Drupal to Fedora XACML security model.
  • Discuss possibilities with Unicon and evaluate solutions.
  • Develop the Duracloud use of that model or other model.
  • Layer on additional DfR requirements on the model
  • ???
  • Profit!!!!
How to Build
  • Next Steps
    • Do the research
  • UI Design:
    • Fluid to investigate

TBD

Encryption (point of creation, Duracloud, islandora, elsewhere)

N.B. All models support wrapping transport in SSL encryption. The design assumption is asymmetric keys (public/private), but symmetric keys (shared secret) could be used.

Model 0: Not encrypted
Model 1: Encryption by the user, opaque to system
Model 2: Encryption at the clients (beyond endpoints) and opaque to the core system. Keys provided by user to our tool.
Model 3: Encryption at endpoints within DfR; DfR has the key available
Model 4: Encryption at storage; DfR has the key available.

Assuming Model 2 or Model 3 - (Space A is Encrypted and we have the key):

  • Sync needs to handle encrypted and unencrypted files; can do this by storing checksums of the unencrypted files for comparison.
  • How do we safely ingest encrypted files? By decrypting a temporary copy inside the object creation service
  • How do we safely index encrypted files?
  • What is the model for decrypting on legitimate access?
  • Full text indexing is a trade off with encryption security.
  • Having an unencrypted staging area is a tradeoff between performance and security and needs consideration.
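The checksum idea in the first bullet above can be sketched like this. It assumes the checksum of the unencrypted content is stored as metadata next to the encrypted object, under a hypothetical `plaintext-sha256` key; SHA-256 is chosen only for the example. Comparing checksums of the encrypted bytes would not work if encryption is randomized, since the same file then encrypts to different bytes each time.

```python
import hashlib

# Hypothetical metadata key; not an existing DuraCloud property name.
CHECKSUM_KEY = "plaintext-sha256"


def plaintext_checksum(data):
    """Checksum of the unencrypted bytes, computed before encryption."""
    return hashlib.sha256(data).hexdigest()


def needs_upload(local_data, remote_metadata):
    """Decide whether a local file must be (re)encrypted and uploaded.

    remote_metadata is the metadata stored with the remote object, or None
    when the object does not exist remotely yet.
    """
    if remote_metadata is None:
        return True
    return remote_metadata.get(CHECKSUM_KEY) != plaintext_checksum(local_data)
```

One trade-off to weigh: a stored checksum of the unencrypted content lets anyone who can read the metadata confirm whether an object matches a file they already hold, so the metadata may need the same protection as the key.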

Space B (nothing in Fedora would be encrypted).

How to build
  • Client tool to have encryption library linked in, and facility for receiving the encryption key from the user or the core system.
  • Fedora Object creation service to have encryption library linked in, and facility for receiving the decryption key from the core system.
  • UI Needs
    • ?????

Unreviewed Component List

  • UI to manage Duracloud services and push results / errors to a visible location - sysadmin and possibly user (condition green vs red)
  • Restore of ODS from DfR - as a function of the sync tool?

Collaborative components

  • Common data models across Smithsonian / Islandora/ DfR
  • (UI) Islandora for DfR visualizations and annotations and ...
  • How do we tell Islandora about new objects and how do we make them readable - Islandora will just find them.
  • How do we share authentication (shib?)

Future requirements

  • Multi-tenancy in Fedora
  • Multi-instance scalability of Fedora

Follow up actions

  • Find some researchers and sample data
  • Work out multitenancy in DuraCloud