Comment: Migrated to Confluence 5.3

...

Given that IngestContexts may be read and written frequently, most file-based serializations (e.g. XML) do not seem attractive. The obvious alternative is the RDBMS, but recall that CGI is an add-on, so we are reluctant to graft tables or columns onto the standard DSpace database_schema.sql. Worse still, we have no Oracle licenses (or expertise), so we cannot ensure that our database logic is portable across vendors. In fact, there is no existing practice for add-ons using DB tables at all. Is there a way forward? Fortunately, considerable work has been done on virtualizing access to SQL data sources since DSpace was first written, and industry-standard, widely supported, performant, open source tools exist. The current Java enterprise standard is JPA (Java Persistence API), which we can use as the persistence layer for CGI data. These so-called 'ORM' (object-relational mapping) tools combine native Java OO semantics with SQL constructs (e.g. query languages) to provide familiar but powerful programming idioms. Add to this convenient Java 5-style annotations on POJOs, and the result is concise but readable database code. As an illustration, here is a complete implementation of the IngestContext service class (remember again, this is merely demo code):

...

The situation with Ingest Resources is quite different, as noted above: current DSpace practice (which essentially all derives from configurable submission) is to encode resource data in an XML file, which is parsed at runtime (typically by an affiliated 'Reader' helper class) to create immutable (read-only) access objects. In this sense, there really is no resource persistence problem, since XML disk files are quite persistent. Rather, resource access is the real problem: the CGI implementation provides resource objects, and would prefer a uniform way of accessing them. It is also worth noting that the same XML files typically contain what CGI would call a resource map: that is, a set of values mapped to resource instances. These facts suggest a number of strategies for working with XML-based resources. We summarize each below, noting some costs and benefits. It is important to understand, first, that these strategies can be combined opportunistically and, second, that any strategy can be revisited or overturned as developer resources or time permit. Past experience suggests that constraints on developer time and availability will require low-barrier strategies to be adopted initially, with more robust ones pursued later.
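To make the current pattern concrete, the following is a minimal sketch of a 'Reader' that parses an XML file once, exposes immutable resource objects, and also holds the co-resident resource map. It is not DSpace code: the element and attribute names (resources, form, field, map, key) and the classes FormResource and FormReader are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical immutable (read-only) access object for one resource.
final class FormResource {
    private final String name;
    private final int fieldCount;

    FormResource(String name, int fieldCount) {
        this.name = name;
        this.fieldCount = fieldCount;
    }

    String getName() { return name; }
    int getFieldCount() { return fieldCount; }
}

// Hypothetical 'Reader' helper: parses the XML once and serves lookups.
final class FormReader {
    private final Map<String, FormResource> forms = new LinkedHashMap<>();
    // The resource map: a value (here a collection key) mapped to a resource name.
    private final Map<String, String> formMap = new LinkedHashMap<>();

    FormReader(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList formNodes = doc.getElementsByTagName("form");
            for (int i = 0; i < formNodes.getLength(); i++) {
                Element e = (Element) formNodes.item(i);
                int fields = e.getElementsByTagName("field").getLength();
                forms.put(e.getAttribute("name"),
                        new FormResource(e.getAttribute("name"), fields));
            }
            NodeList mapNodes = doc.getElementsByTagName("map");
            for (int i = 0; i < mapNodes.getLength(); i++) {
                Element e = (Element) mapNodes.item(i);
                formMap.put(e.getAttribute("key"), e.getAttribute("form"));
            }
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse resource XML", ex);
        }
    }

    // Resolve a mapped value to its resource, or null if nothing is mapped.
    FormResource resolve(String key) {
        String formName = formMap.get(key);
        return formName == null ? null : forms.get(formName);
    }
}

class FormReaderDemo {
    static final String SAMPLE =
            "<resources>"
          + "<form name='traditional'><field/><field/></form>"
          + "<map key='collection-1' form='traditional'/>"
          + "</resources>";

    public static void main(String[] args) {
        FormResource form = new FormReader(SAMPLE).resolve("collection-1");
        System.out.println(form.getName() + " has " + form.getFieldCount() + " fields");
    }
}
```

Note that the resource definitions and the map live in the same file and are loaded by the same class, which is why the map data is hard to separate from the resources under any of the strategies discussed.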

...

In this case, we do (almost) nothing to the resource implementation itself. The existing Readers are used to fetch the objects identified by the service. The main cost is loss of uniform access: for each such resource, CGI would have to have hard-coded knowledge of how to obtain it. Presumably this will not be intolerable for a small number of resource types, and today DSpace has perhaps four to six at most (metadata templates, input forms, submission configuration, curation task sets, etc.). But even if resources are grandfathered, the map data co-resident in the XML files cannot be. One would have to parse and load this data into persistent resource-mapping objects, and synchronization of changes would be clumsy.

Shimming (DAOs)

If the expectation is that there will likely always be heterogeneous resource persistence solutions (some XML files, some DB-resident, some flat file, etc.), we could still achieve uniform resource access for CGI at the cost of adding a layer of indirection: a shim, or interface, for resources. Essentially, this would entail a new family of DAO-like objects that abstract the actual means of obtaining resource data (e.g. parsing an XML file vs. running a database query). The CGI service code would work only with these DAO surrogates. The cost here is a fair amount of glue code, which could in the end be discarded if a single resource persistence method is adopted.
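The shape of such a shim might look like the sketch below. Every name in it (ResourceDao, XmlBackedDao, TableBackedDao, ResourceRegistry) is hypothetical and not part of DSpace or CGI. The XML-backed variant delegates to a supplied function standing in for an existing Reader, and the DB-backed variant uses an in-memory map as a stand-in for a real query.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical uniform access interface: service code depends only on this.
interface ResourceDao<T> {
    Optional<T> find(String name);
}

// Shim over an XML-backed source. The function stands in for a call to an
// existing Reader; it may return null when the resource is not defined.
final class XmlBackedDao<T> implements ResourceDao<T> {
    private final Function<String, T> reader;

    XmlBackedDao(Function<String, T> reader) {
        this.reader = reader;
    }

    @Override
    public Optional<T> find(String name) {
        return Optional.ofNullable(reader.apply(name));
    }
}

// Shim over a DB-resident source. An in-memory map stands in for a query.
final class TableBackedDao<T> implements ResourceDao<T> {
    private final Map<String, T> rows;

    TableBackedDao(Map<String, T> rows) {
        this.rows = new HashMap<>(rows);
    }

    @Override
    public Optional<T> find(String name) {
        return Optional.ofNullable(rows.get(name));
    }
}

// Registry keyed by resource type, so callers never see the storage mechanism.
final class ResourceRegistry {
    private final Map<String, ResourceDao<?>> daos = new HashMap<>();

    void register(String type, ResourceDao<?> dao) {
        daos.put(type, dao);
    }

    Optional<?> lookup(String type, String name) {
        ResourceDao<?> dao = daos.get(type);
        if (dao == null) {
            return Optional.empty();
        }
        return dao.find(name);
    }
}

class ShimDemo {
    public static void main(String[] args) {
        ResourceRegistry registry = new ResourceRegistry();
        registry.register("input-form", new XmlBackedDao<String>(
                name -> "traditional".equals(name) ? "form:traditional" : null));
        Map<String, String> rows = new HashMap<>();
        rows.put("nightly", "taskset:nightly");
        registry.register("task-set", new TableBackedDao<String>(rows));

        System.out.println(registry.lookup("input-form", "traditional").isPresent());
        System.out.println(registry.lookup("task-set", "nightly").isPresent());
    }
}
```

The glue code is the pair of DAO implementations plus the registry. If every resource type were later moved to a single persistence method, only one implementation would remain and the registry could be dropped.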

Peer Replacement (JPA)

Finally, we could imagine beginning to replace each current resource type with an equivalent (a peer) whose state is persisted in the database in the same way as other CGI data - i.e. managed via JPA. This is architecturally fairly pure, in that each resource would be a simple POJO yet would be persisted in a portable way. It would, however, have far higher upfront costs than any of the other strategies because:

  • we could not re-use much, if any, of the XML-based code, and would instead have to re-implement the resource class
  • we would have to provide conversion services to migrate the XML-resident data to the new DB tables
  • since the resource data would now live only in a database, we would likely also have to write a UI for editing it (probably a different UI for each type, in fact)

Still, if we could eventually achieve this state for every resource type, it would answer the wishes of quite a few DSpace users who want UI-editable input forms, etc.

User Interface

It will have been observed that nothing has been said about end-user interactions with CGI services or functionality. In part, this is because CGI is really a set of infrastructure services for application code, not for the end user directly. But the design clearly implies that users will in some way have to manage the CGI service data, which is primarily the resource mappings. Also, for each resource type that is implemented as a JPA-managed bean (see Peer Replacement above), we must provide some administrative UI (or other tools), since we obviously cannot require raw RDBMS administration to manage those resources and their mappings. This requirement is an additional significant cost to be factored into the calculation of which near- and long-term strategies can be pursued. It is realistic to suppose that for some significant period, the old (XML editing) and new (DB-backed, UI-edited) methods will co-exist.