Fedora Architecture Summit

Cornell University, Ithaca, New York
March 14-16, 2007

A meeting with leading members of the Fedora community to discuss the future of Fedora's architecture.

Presentations

All presentations are available online here

Action Items

NOTE: Prioritization is indicated by percent of attendees that voted an item to be: Essential, Highly Desirable, or Nice-to-Have. For example, "90/10/0" translates to ninety percent said it was "Essential," ten percent said it was "Highly Desirable" and zero percent said it was "Nice-to-Have." The items below are roughly ordered from highest to lowest priority based on this scheme.

Transaction Management for Create/Read/Update/Delete (i.e., CRUD) Operations Upon a Graph of Related Digital Objects

90/10/0

This requirement makes the observation that increasingly the "entity of management" for Fedora is a graph of related digital objects, as opposed to a single digital object. This is due to the reality of how digital objects are being modeled by organizations - deconstructing entities into constituent parts and letting each part be an independent entity. We see more examples of compound entities made up of interrelated digital objects (i.e., "atomistic approach," "graph-oriented content models," "networks of objects"). The current Fedora APIs and management modules were designed to manage a transaction for a single operation on a single digital object (e.g., ingest an object, modify a datastream in an object).

A simple example of this requirement exists around ingesting a compound entity such as an article object, where the text is one digital object, each figure is a separate "image" digital object, and accompanying data is its own digital object. The requirement is to ingest all constituent parts as one composition, and if something fails on one component part, we want to roll back the ingest of the whole thing (to avoid having an incomplete/broken object). Another example is found in cases where humanities scholars are working on a set of digital objects representing a text, associated annotations and interpretive analysis. In such cases, a scholarly task can target several related objects. To support this, we must ensure that a scholar's modifications to multiple objects in a network are committed together (i.e., to prevent the work from existing in an unintelligible state) and, also, we must prevent two or more scholars from committing incompatible changes to the same objects (i.e., to enable reliable editing of the work).

Currently, Fedora implementers deal with this via custom middleware that is designed to manage specific transactions around pre-defined "content models" or information models for their applications. The summit group put forth a new requirement that Fedora provide native support for CRUD operations upon digital object compositions that are a graph of related objects. There are different ways to accomplish this that should be investigated. A simple way forward might be to enable Fedora to receive a message/request that encodes a "batch" of API-M service requests to be executed as one transaction. In this scenario, you can imagine submitting a set of API-M requests that interleave read and write operations that pertain to the creation or updating of a compound entity made up of several related digital objects. If something fails in executing the set of operations, the whole set of operations will be rolled back. The idea is to create a "transaction boundary" that encompasses the whole composition of related objects. Other approaches might be built around the notion of a content model (needs more investigation) and have a new API operation that accepts a SIP that describes a graph of related objects. (Note relationship to ORE work here). Note that this scheme could be reduced to operate on a single digital object, since a graph could just have one node.
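
The "batch of requests as one transaction" idea can be sketched as follows. This is a minimal in-memory illustration in Python, not Fedora code: the class InMemoryRepository, the method execute_batch, the exception BatchFailed, and the (operation, arguments) request encoding are all hypothetical names chosen for this sketch, and rollback is done with a simple snapshot rather than any particular transaction mechanism.

```python
import copy


class BatchFailed(Exception):
    """Raised when an operation in a batch fails and the batch is rolled back."""


class InMemoryRepository:
    """Hypothetical stand-in for a repository: maps PID -> {datastream ID: content}."""

    def __init__(self):
        self.objects = {}

    def ingest(self, pid, datastreams):
        if pid in self.objects:
            raise ValueError(f"object already exists: {pid}")
        self.objects[pid] = dict(datastreams)

    def modify_datastream(self, pid, ds_id, content):
        if pid not in self.objects:
            raise KeyError(f"no such object: {pid}")
        self.objects[pid][ds_id] = content

    def purge(self, pid):
        if pid not in self.objects:
            raise KeyError(f"no such object: {pid}")
        del self.objects[pid]

    def execute_batch(self, requests):
        """Execute (operation, args) pairs as one unit of work.

        The transaction boundary is the whole batch: if any request fails,
        the repository is restored to the snapshot taken before the batch.
        """
        allowed = {
            "ingest": self.ingest,
            "modify_datastream": self.modify_datastream,
            "purge": self.purge,
        }
        snapshot = copy.deepcopy(self.objects)
        try:
            for operation, args in requests:
                allowed[operation](*args)
        except Exception as exc:
            self.objects = snapshot  # roll back every change made by the batch
            raise BatchFailed(f"batch rolled back: {exc}") from exc


repo = InMemoryRepository()
# A compound "article" entity: text, one figure, and a data object.
repo.execute_batch([
    ("ingest", ("demo:article", {"TEXT": "article body"})),
    ("ingest", ("demo:figure1", {"IMAGE": "figure bytes"})),
    ("ingest", ("demo:data1", {"DATA": "table"})),
])
```

If a later batch ingests a second figure and then fails on a request against a missing object, execute_batch raises BatchFailed and the second figure is not left behind. A single-object operation is simply a batch with one request.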

Faster Performance for Retrieving Graphs of Digital Objects

70/30/0

Devise ways to facilitate faster access/retrieval of digital object graphs, including the retrieval of datastream content (byte streams). One aspect of the problem is to avoid an explosion of Fedora API callback requests such as "getDatastream." An example of where this arises is in Topaz when a digital object retrieval requires lots of Fedora API-A calls to put together a "unit of viewing" or "unit of management" which is a graph of objects - and it is desirable/necessary to pre-fetch all content, as opposed to a lazier approach of calling back for content as needed. In such a case it is desirable to avoid lots of SOAP/REST requests, each returning independently. One idea is to have a way to return a "DIP" of the whole graph with all the binaries in it (SOAP with attachment or multipart MIME). From a pure API perspective, there is the possibility of new Fedora API operations:

GET OBJECT (enhance the current API operation). The reduction of the problem is to push the "graph interpretation" outside of the Fedora repository but to improve the implementation of the existing getObject operation. This operation could be enhanced to enable getting one digital object with all of its datastreams as attachments.

GET AN OBJECT GRAPH: possible implementation options:
- New verb for Fedora API-A: getObjectGraph(cmodelType, RootPID); returns some encoding of the graph with binaries attached. Alternatively, there can be an option to retrieve the binaries by reference. Assumes some notion of a content model to define the boundaries of the graph. How else? This approach hides the implementation strategy of how to gather the graph according to desired boundaries. Could we return an ORE object (with attachments; by reference)?
- getObjectGraph(RootPID, filter1, filter2, filter3). Pseudo-query-oriented but returns the graph with binaries based on some filter constraints. This is really the same as getObjectGraph with a series of constraints as arguments.
- A disseminator could be used to return some expression of a graph of related objects (presumably based on a content model definition, which would define graph boundaries). The disseminator could be attached to the root/parent object. The disseminator could reflect on a content model and determine how many levels of object-to-object relationships to traverse to put together the dissemination results. Note, the format of the dissemination result (the expression of the graph of objects) is up for grabs. It could be custom, or ORE, or something else?
- GET SETS OF OBJECTS: there can also be a new API operation that will return a set of objects (given a list of PIDs as input). The same concept can be done for datastreams (given a list of datastreamIDs). The caller decides what objects to request, and what to do with the set of returned objects (e.g., it may introspect on relationships (RELS-EXT) and do something graph-like with the set; it may just pump the set into some service for processing; whatever). The main benefit is avoiding the overhead of calling back to the repository to get each object.
  - getObjectSet(arrayOfPIDS); returns set of objects with binary attachments
  - getDatastreamSet( ...same concept)
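
To make the getObjectSet and getObjectGraph options more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the repository is a plain dictionary, relationships are a list of target PIDs per object, and the graph boundary is a simple max_depth hop count standing in for whatever a content model or filter would define. The function names get_object_set and get_object_graph mirror the proposed verbs but are not existing Fedora operations.

```python
from collections import deque


def get_object_set(repository, pids):
    """Return the requested objects in one call, keyed by PID, in request order."""
    return {pid: repository[pid] for pid in pids}


def get_object_graph(repository, root_pid, max_depth):
    """Collect the root object and everything reachable within max_depth hops."""
    ordered = [root_pid]
    seen = {root_pid}
    queue = deque([(root_pid, 0)])
    while queue:
        pid, depth = queue.popleft()
        if depth == max_depth:
            continue  # graph boundary reached; do not follow further relationships
        for target in repository[pid]["rels"]:
            if target in repository and target not in seen:
                seen.add(target)
                ordered.append(target)
                queue.append((target, depth + 1))
    # One bulk fetch replaces one callback per object.
    return get_object_set(repository, ordered)


# Hypothetical repository content: each object lists the PIDs it relates to.
REPOSITORY = {
    "demo:article": {
        "rels": ["demo:figure1", "demo:data1"],
        "datastreams": {"TEXT": b"article body"},
    },
    "demo:figure1": {
        "rels": ["demo:thumb1"],
        "datastreams": {"IMAGE": b"figure bytes"},
    },
    "demo:data1": {
        "rels": [],
        "datastreams": {"DATA": b"a,b\n1,2\n"},
    },
    "demo:thumb1": {
        "rels": ["demo:article"],  # cycle back to the root
        "datastreams": {"IMAGE": b"thumbnail bytes"},
    },
}

graph = get_object_graph(REPOSITORY, "demo:article", max_depth=1)
```

With max_depth=1 the result holds the article, its figure, and its data object; raising the depth to 2 also pulls in the thumbnail. The seen set keeps the cycle back to the root from causing repeated fetches.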

Note: We should also think about the interplay of querying a triplestore (e.g., SPO on MPTstore; XQuery on DbXML; ITQL on Mulgara) and calls to Fedora APIs to get datastream content.

Note: create/update/delete operations are addressed in requirements for CRUD transactions on graph of objects (described in #1 above).

CMDA: Better Service-to-Object Mapping in Fedora

70/30/0

The CMDA design specification maintains the basic concept of a Fedora disseminator, but implements it differently. For one, it provides an indirect binding of behavior definitions/mechanisms to digital objects. It also opens up more possibilities for dynamic service-to-object mapping/binding in Fedora. The core Fedora development team already has a prototype of the CMDA design and will begin moving towards a production release for Fedora 2.3 or 3.0. See: http://www.cs.cornell.edu/payette/fedora/designs/cmda/

(NOTE: similar work is being done by others, such as OWL-S and Semantic Web Services.)

Better Support for Storing Very Large Datastreams in Fedora

50/50/0

Enable a new type of datastream known as Managed External (Matthias)
Improvements to SRB/Fedora integration (Andrew)

JMS Messaging in Fedora Framework

80/10/10

JMS Message Broker deployed with Fedora (e.g., ActiveMQ)
Enable the Fedora repository service to be a message provider, sending out "event" messages for all API-M operations.
Enable existing Fedora services (GSearch, Journaling) to consume Fedora repository event messages. Once consumed, the services will do something with the information - for GSearch an index update via callback to Fedora; for Journaling a replay of the API-M operation.
Look towards the Resource Index as a service, with messaging used to update it.
ESB pattern beyond messaging in service framework
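
The event flow described in this list can be sketched in a few lines of Python. This is an in-memory illustration only and does not use the JMS API or a real broker such as ActiveMQ: EventBroker, Repository, SearchIndexer, and Journal are hypothetical classes, the event dictionary layout is an assumption, and delivery here is synchronous whereas a real message broker would typically deliver asynchronously.

```python
class EventBroker:
    """Hypothetical in-memory broker: delivers each event to every subscriber."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        for handler in self.subscribers:
            handler(event)


class Repository:
    """Toy repository that announces each management operation as an event."""

    def __init__(self, broker):
        self.broker = broker
        self.objects = {}

    def ingest(self, pid, content):
        self.objects[pid] = content
        self.broker.publish({"operation": "ingest", "pid": pid, "args": (pid, content)})

    def purge(self, pid):
        del self.objects[pid]
        self.broker.publish({"operation": "purge", "pid": pid, "args": (pid,)})

    def get(self, pid):
        return self.objects[pid]


class SearchIndexer:
    """Consumer that updates an index by calling back to the repository."""

    def __init__(self, repository):
        self.repository = repository
        self.index = {}

    def on_event(self, event):
        if event["operation"] == "purge":
            self.index.pop(event["pid"], None)
        else:
            self.index[event["pid"]] = self.repository.get(event["pid"]).lower()


class Journal:
    """Consumer that records operations so they can be replayed elsewhere."""

    def __init__(self):
        self.entries = []

    def on_event(self, event):
        self.entries.append((event["operation"], event["args"]))

    def replay(self, target):
        for operation, args in self.entries:
            getattr(target, operation)(*args)


broker = EventBroker()
repo = Repository(broker)
indexer = SearchIndexer(repo)
journal = Journal()
broker.subscribe(indexer.on_event)
broker.subscribe(journal.on_event)

repo.ingest("demo:1", "First Object")
repo.ingest("demo:2", "Second Object")
repo.purge("demo:2")

# Replaying the journal against a fresh repository reproduces the same state.
replica = Repository(EventBroker())
journal.replay(replica)
```

The repository knows nothing about its consumers; adding another service (for example, a Resource Index updater) would only require another subscribe call.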

Alternative Interfaces on Core Fedora Repository Service

30/70/0

This is motivated by being able to improve performance by not having to do SOAP requests. Also, it can make for easier integration of Fedora with certain types of applications. The identified interface possibilities are:

JMS: While we already mentioned the repository as a JMS provider (sending out API-M event messages), we should also consider using JMS as an alternate interface for API-M operations themselves (i.e., send messages to request that an API-M operation be performed). To do this, we must define appropriate message types and payloads to map to API-M operations.
Native Java: Consider exposing a native Java interface of existing API-M. Might also consider exposing a simplified Java interface that maps to existing API-M operations (not sure what this simplified concept means).
Java Content Repository (JSR170): develop a JSR170 interface on Fedora to enable Fedora to play according to this standard. JSR-170 is a JDBC-like API for content repositories. Note that this may have significant benefits in positioning Fedora in the CMS market. (See: http://www.onjava.com/pub/a/onjava/2006/10/04/what-is-java-content-repository.html)
Also, think about other possible approaches, and also what is necessary to better accommodate an inversion of control architectural pattern with Fedora (meeting with Citeseer in May)

Fedora Content Model Registry to Facilitate Sharing/Reuse

20/70/10

Create a registry of Fedora content models to facilitate sharing and re-use. The registry can be implemented as a Fedora repository that stores Fedora "content model objects." Each content model object will have one or more datastreams, each representing a different way of expressing content model constraints. A registry of content models would facilitate a bottom-up, organic, Darwinian approach to the sharing of content models (community defines them vs. top-down promulgation of models). The idea is to let the community create their "best offerings" and share them. If others like the models, they will adopt them.

There is the question as to whether there is one hosted content model registry for the Fedora community, or whether we just provide a reference model for a Content Model Registry (implemented with a running instance of a Fedora repository). The first thing to be done is that the Fedora development team will put out a reference content model object that will include two datastreams - a datastream with an RDFS or OWL expression of the sample content model, and a datastream with a CMDA XML-based expression of the sample content model. The next move is to determine whether to distribute some reference model for a content model registry, or to actually run a content model registry for the community (using Fedora as the registry repository).

Validation of Fedora Objects Based on Content Models

20/60/20

Facilitate a standard means to enforce the constraints of a content model in support of digital object validation and integrity checking. This entails having one or more logical expression languages to describe patterns and constraints for Fedora digital objects that are useful for enforcement of constraints. The favorite means to date is a shared logical expression language (meta-modeling language) describing the patterns and constraints of Fedora digital objects (per the Fedora object model), and this includes the notion of being able to describe the container-to-container relationships (object type to object type rels). This gets at the original idea of a content model. (This could wind up being subsumed in #5 as it matures. It may merge.) This is the Fedora-specific view of the problem, and it implies some sort of cmodel expression support by Fedora.
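
As a rough illustration of what constraint enforcement could look like, the sketch below checks an object against a content model expressed as a plain Python dictionary. The summit did not settle on an expression language, so the model layout (required_datastreams, allowed_rels), the relationship names hasFigure and hasPart, and the validate function are all assumptions made for this example.

```python
def validate(obj, content_model, object_types):
    """Return a list of constraint violations; an empty list means the object is valid."""
    problems = []
    # Constraint 1: every datastream the model requires must be present.
    for ds_id in content_model["required_datastreams"]:
        if ds_id not in obj["datastreams"]:
            problems.append(f"missing datastream: {ds_id}")
    # Constraint 2: relationships must be known to the model and point at
    # objects of an allowed type (object type to object type rels).
    for relation, target_pid in obj["rels"]:
        allowed_types = content_model["allowed_rels"].get(relation)
        if allowed_types is None:
            problems.append(f"relationship not allowed: {relation}")
        elif object_types.get(target_pid) not in allowed_types:
            problems.append(f"{relation} target has wrong type: {target_pid}")
    return problems


# Hypothetical content model for an "article" object type.
ARTICLE_MODEL = {
    "required_datastreams": ["DC", "TEXT"],
    "allowed_rels": {"hasFigure": {"Image"}},
}

# Hypothetical lookup of each object's type by PID.
OBJECT_TYPES = {"demo:figure1": "Image", "demo:data1": "Dataset"}

good = {
    "datastreams": {"DC": "<dc/>", "TEXT": "body"},
    "rels": [("hasFigure", "demo:figure1")],
}
bad = {
    "datastreams": {"DC": "<dc/>"},
    "rels": [("hasFigure", "demo:data1"), ("hasPart", "demo:figure1")],
}

good_problems = validate(good, ARTICLE_MODEL, OBJECT_TYPES)
bad_problems = validate(bad, ARTICLE_MODEL, OBJECT_TYPES)
```

Here good_problems is empty, while bad_problems reports the missing TEXT datastream, the hasFigure relationship pointing at a non-image object, and the hasPart relationship that the model does not define.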

Manage a distributed transaction which entails having to update multiple services in the framework (multiple repos; repo plus several services)

0/60/40

Reduce barriers for application developers with Fedora by offering alternate interfaces for Fedora. Enable better community process (developers).

0/40/60

Native Java interface exposed for API-M - is this the core enabler to developing other interfaces and libraries for Fedora?
JSR-170 (Java Content Repository)?

Note: the alternate interfaces requirement was discussed already in requirement #6 above. This motivates the same requirement from a different standpoint, that being making it easier for developers who are writing Java-based middleware or applications upon Fedora. They can write directly to Java interfaces, and bypass the SOAP interfaces.

GetObject (PID, versionDate)

10/0/90

Provide the option to get the entire digital object (FOXML) as of a certain version date. You only get back one version of each datastream, and it's the version at the provided version date.
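
The selection rule (one version per datastream, the one in effect at the given date) can be sketched in Python as follows. The data layout and the function name get_object_as_of are hypothetical; the sketch also assumes that a datastream with no version on or before the date is simply left out of the result, which the summit notes do not specify.

```python
from datetime import datetime


def get_object_as_of(datastream_versions, version_date):
    """For each datastream, pick the latest version created on or before version_date."""
    snapshot = {}
    for ds_id, versions in datastream_versions.items():
        eligible = [v for v in versions if v["created"] <= version_date]
        if eligible:  # assumption: datastreams that did not exist yet are omitted
            snapshot[ds_id] = max(eligible, key=lambda v: v["created"])
    return snapshot


# Hypothetical version history of one digital object.
OBJECT_HISTORY = {
    "DC": [
        {"created": datetime(2006, 1, 10), "content": "dc v1"},
        {"created": datetime(2006, 9, 1), "content": "dc v2"},
    ],
    "TEXT": [
        {"created": datetime(2006, 3, 5), "content": "text v1"},
        {"created": datetime(2007, 2, 20), "content": "text v2"},
    ],
    "IMAGE": [
        {"created": datetime(2007, 1, 15), "content": "image v1"},
    ],
}

as_of = get_object_as_of(OBJECT_HISTORY, datetime(2006, 12, 31))
```

For the end of 2006 this yields the second DC version and the first TEXT version, and no IMAGE datastream, since the image was added later.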

Versioning enhancements

Whole object versioning (Matthias and eSciDoc already have work in this area; see Matthias' OR07 presentation)
Versioning of a whole graph of objects - this is gnarly

To enable "virtual distributed repository" ("fedororation")

0/0/100

Other Points Discussed

Sharing and re-use

The community must focus on the means of sharing and re-using objects from the standpoint of shared content models. What are the best ways to promote shared logical expressions/descriptions of the object types (i.e., "content models;" templates; application profiles; whatever) and shared vocabularies? Note that the ORE effort proposes a standard way to share "named sub-graphs." The ORE expression will be a generic graph, but we anticipate that the generic graph expressions can be sub-typed with community semantics. A means of this can be a shared registry of "entity types." We should think about our proposal for "content models" and content model registries, and how much we can generalize this outside of Fedora. Note that registries can enable discovery of content models (or entity types). A use case for sharing is developers trying to write apps to CRUD the units of management. Note: facilitate a bottom-up, organic, Darwinian approach where the community creates, rather than top down. Note: the different communities working in this area include W3C semantic web, ORE, Topaz-Mulgara, Fedora, and others. Fedora Commons can position itself to be a key player/leader in this area.

Hourglass design

When evolving Fedora keep this principle in mind. Aim for the "slim" IFaP (Interface, Formats, and Protocols) on core repository service (API-A/M). Think about this when we are thinking about adding more verbs to the APIs.

Enduring/Reliable System

WE ALREADY CONSIDER THIS ESSENTIAL

Core system must be preservation-enabling
File-centric system (vs. index-centric system)
Enable technology refresh - same data transferred to new software systems
Provide a system that is not vulnerable to unrecoverable disaster
Avoid features or changes that would get in the way of preservation
Easy disaster recovery (e.g., rebuilder based on files only)
Federations for replication and availability

Better Extensibility schemes for digital object APIs

Disseminators are an extension mechanism for API-A. Can we use the same pattern as a way to extend API-M?
Drive toward CMDA in next release of Fedora. CMDA is a better way to do disseminators on Fedora objects.

XACML outside the repository

Chi from the Australian RAMP project has done extensive work in this area. We must review and figure out how to deploy his work as an alternative configuration for XACML enforcement.