Note: This is a technical brief, and presumes knowledge of DSpace software architecture and internals. End-user documentation is forthcoming.

Introduction

The variety, nature, and complexity of the pathways along which content travels into the repository have greatly increased since DSpace was first launched, yet the basic platform mechanisms that support ingest have changed little, if at all. Originally, DSpace ingest was seen foremost as a content author's web submission process, comprising a fixed, multipage set of forms, followed by an optional, but also fixed, set of workflow stages that largely just replayed the submission steps for other actors. Since this sequence of operations was suspendable and restartable (and thus beyond what an HTTP session could manage), it was necessary to persist state information relevant to the ongoing submission or workflow object. Architecturally, this need was addressed by creating content wrapper objects (viz. WorkspaceItem and WorkflowItem) which held, in addition to a reference to the item being submitted, various specific properties used in the ingest process (last page reached, destination collection, etc.). Since the wrappers are first-class objects in the data model, they belong to both the DB schema and the content API.

However, as new means of ingest were added, and existing processes were made more flexible and configurable, the brittleness of the simple wrapper approach became clear. For example, when ingest is via SWORD (i.e. no web submission form), how should properties like 'isPublishedBefore', which are associated with the 'initial questions', be set? Or suppose we want to add a new question to the initial questions to capture a content format type: where do we store the answer? Wrapper properties are both context-specific and non-extensible (hard-coded), with the undesirable result that to change them for customization purposes we must alter the wrapper classes themselves. And, as noted above, this means changing the content API itself - not an outcome that good software design should necessitate.

CGI ('Context-Guided-Ingest') is an attempt to remedy this problem, and to improve ingest functionality generally, by providing a small set of service APIs and implementations that constitute more flexible and extensible equivalents to the content API wrappers. To ensure compatibility with previous work, as well as to enhance general system modularity, CGI will be offered as a separate add-on or module and will require no changes to the content API. Of course, to use these services, new application code will have to be written. Over time, if CGI services prove useful, it may make sense to begin to deprecate or abandon the wrapper approach, but this will not be an immediate requirement of using CGI.

Concepts

To describe the services, we introduce a few terms that have specific meanings in the CGI domain model. First, an IngestContext (or simply context, where the meaning is clear) is a container of persistent state with a determinate life-cycle, associated with a specific DSpace item. One could imagine other contexts (for Collections, etc.), but since Items are the units of ingest, IngestContexts are restricted to them. An attribute is a simple name/value pair that belongs to the persistent state of the context. Generally, the life-cycle of an IngestContext begins when an API call asserts an attribute on a DSpace item, and ends when the item is installed in the repository. The intent is that the context will live for the duration of the ingest, and then be discarded. It is important to understand that the IngestContext does not represent any kind of permanent 'extended item metadata' facility, although it can of course hold metadata for the item. Finally, an IngestResource (again, usually just resource) is an instance of any class that can be used to perform ingest functions (in UI, workflow, etc.) and that is not specific to an item. Examples of resources might be metadata templates, input forms, submission steps, curation task sets, etc. CGI places no restrictions on what could count as a resource.
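
The life-cycle described above can be sketched as a small in-memory store. This is an illustration only, not CGI code: the class name IngestContextStore, the hasContext method, and the onInstall hook are all hypothetical, and a real implementation would presumably persist the attributes rather than hold them in a map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the IngestContext life-cycle; not part of DSpace or CGI.
class IngestContextStore {
    // itemId -> (attribute name -> attribute value)
    private final Map<Integer, Map<String, String>> contexts = new HashMap<>();

    // Asserting an attribute creates the context if the item does not have one yet.
    void setAttribute(int itemId, String name, String value) {
        contexts.computeIfAbsent(itemId, id -> new HashMap<>()).put(name, value);
    }

    // Returns the value, or null when the item has no context or no such attribute.
    String getAttribute(int itemId, String name) {
        Map<String, String> attributes = contexts.get(itemId);
        return attributes == null ? null : attributes.get(name);
    }

    // True while a context exists for the item.
    boolean hasContext(int itemId) {
        return contexts.containsKey(itemId);
    }

    // Hypothetical hook invoked when the item is installed: the whole context is discarded.
    void onInstall(int itemId) {
        contexts.remove(itemId);
    }
}
```

In this sketch, an item has no context until the first setAttribute call, and nothing survives onInstall, which mirrors the point that the context is not a permanent metadata facility.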

In broad strokes then, CGI provides the following service APIs (typical clients are shown in parentheses):

  • IngestContext attribute management for items during ingest (ingest code)
  • IngestResource mapping: attaching lookup keys to resources (repository or collection managers, typically)

Combining these two services allows application code to select appropriate resources based on context attributes, so that context helps guide ingest.

Service APIs

To make the preceding descriptions a little more concrete, but also to illustrate how simple they might be, here are some prototype APIs.
NB: these are for illustrative purposes only, and are subject to change or outright abandonment.

ResourceMapService

// map key to the instance of resource class resClass identified by resId
public void map(String key, Class<?> resClass, String resId);

// remove said mapping
public void unmap(String key, Class<?> resClass, String resId);

// return all keys mapped to the resource
public Set<String> keySet(Class<?> resClass, String resId);
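
To show how little machinery these three methods need, here is a minimal in-memory sketch. It is not the CGI implementation: the class name InMemoryResourceMapService and the internal "className#resId" slot format are assumptions made for illustration, and an actual service would store its mappings persistently.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory take on the prototype ResourceMapService methods.
class InMemoryResourceMapService {
    // "className#resId" -> keys currently mapped to that resource
    private final Map<String, Set<String>> keysByResource = new HashMap<>();

    // Builds the internal slot name for a resource (format is an assumption).
    private static String slot(Class<?> resClass, String resId) {
        return resClass.getName() + "#" + resId;
    }

    // Map key to the resource of type resClass identified by resId.
    public void map(String key, Class<?> resClass, String resId) {
        keysByResource.computeIfAbsent(slot(resClass, resId), s -> new TreeSet<>()).add(key);
    }

    // Remove the mapping; unknown keys or resources are ignored.
    public void unmap(String key, Class<?> resClass, String resId) {
        String slot = slot(resClass, resId);
        Set<String> keys = keysByResource.get(slot);
        if (keys != null) {
            keys.remove(key);
            if (keys.isEmpty()) {
                keysByResource.remove(slot);
            }
        }
    }

    // Return a read-only copy of all keys mapped to the resource (empty if none).
    public Set<String> keySet(Class<?> resClass, String resId) {
        Set<String> keys = keysByResource.get(slot(resClass, resId));
        if (keys == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(new TreeSet<>(keys));
    }
}
```

A collection manager might, for instance, call map("collection:default", DCInputSet.class, "traditional") to make an input form the fallback for all collections; the resource id "traditional" is only an example value.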

IngestService

// set an attribute in the IngestContext for item
public void setAttribute(int itemId, String name, String value);

// get an attribute from the IngestContext for item
public String getAttribute(int itemId, String name);

// remove the IngestContext for item
public void clear(int itemId);

// obtain the resource we need
public <T> T findResource(int itemId, Class<T> resClass);

Finding Resources

The observant reader will notice that the IngestService findResource method returns a resource (of a given type) for an item without any additional parameters or information. How is this possible? On what basis is the selection made? CGI assumes that, in general, each resource type bears a similar relationship with each of its items, and thus that it can be expressed in a rule or formula. For reasons that will become apparent, these rules are known as expressions in a grammar called RCL (Resource Composition Language). In most cases, RCL just amounts to a very simple template. For example, suppose we wanted to express the rule that DSpace currently uses in selecting an input form for configurable submission. We would write the property containing the expression as:

org.dspace.app.util.DCInputSet=collection:?,collection:default

That is, the resource class name is the property key, and the value is the RCL expression. The expression means: select the resource mapped to the key formed by the current ingest context attribute 'collection' (if any), otherwise look for a mapping with the key 'collection:default'. More will be said about RCL when we discuss resource composition. These properties - one for each ingest resource we wish to define - can live in standard configuration locations, e.g.

{dspace.dir}/config/modules/cgi.cfg
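
The evaluation of the simple template form shown above (collection:?,collection:default) might look roughly like the following sketch. It rests on assumptions not stated in this brief: that candidates are comma-separated and tried left to right, that 'name:?' expands to 'name:' followed by the attribute's value, and that a candidate is skipped when the attribute is unset. The class RclSketch, the resolve method, and the sample keys and form names are all hypothetical; full RCL is not covered.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical resolver for the simple template form of RCL; not the CGI implementation.
class RclSketch {
    // Returns the resource mapped to the first matching candidate key, or null if none matches.
    static <T> T resolve(String expression,
                         Map<String, String> attributes,
                         Map<String, T> resourcesByKey) {
        for (String candidate : expression.split(",")) {
            String key = candidate.trim();
            if (key.endsWith(":?")) {
                // Placeholder: substitute the value of the named context attribute.
                String name = key.substring(0, key.length() - 2);
                String value = attributes.get(name);
                if (value == null) {
                    continue; // attribute not set in this context, try the next candidate
                }
                key = name + ":" + value;
            }
            T resource = resourcesByKey.get(key);
            if (resource != null) {
                return resource;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-ins for mapped resources; real code would hold e.g. DCInputSet instances.
        Map<String, String> resources = new HashMap<>();
        resources.put("collection:default", "default-form");
        resources.put("collection:123456789/7", "thesis-form");

        Map<String, String> attributes = new HashMap<>();
        String expression = "collection:?,collection:default";

        // No 'collection' attribute yet, so the default mapping is selected.
        System.out.println(resolve(expression, attributes, resources));

        // With the attribute set, the collection-specific mapping is selected.
        attributes.put("collection", "123456789/7");
        System.out.println(resolve(expression, attributes, resources));
    }
}
```

Under these assumptions, a findResource implementation could look up the expression for the requested resource class in the configuration, load the item's context attributes, and hand both to a resolver of this kind.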