MediaFilter Task Suite

DSpace has long had an application framework known as MediaFilter (aka filter-media, the name of the script used to invoke it) for enhancing DSpace content. It reads content files (Item bitstreams), transforms them into useful derivatives, then typically adds the derivatives to the item as new bitstreams. The framework was designed to be extensible, and as of DSpace 1.8 the filters include:

Thumbnail Images - small images generated from larger image bitstreams
Branded Images - image transforms with embedded text
Text Extraction - pulling text out of formatted documents for indexing (supported formats: PDF, Word, HTML, PowerPoint)

The programming model of MediaFilter is very simple: it traverses the repository item-by-item (or confines itself to a community or collection), and examines each bitstream. If there is any 'filter' installed in the framework that 'understands' the bitstream, it is run. Understanding means that the filter has registered the fact that it can operate on a given file format.
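
The traversal-and-dispatch model described above can be sketched roughly as follows. This is a minimal illustration, not the actual MediaFilter code: the `Bitstream`, `Item`, `Filter`, and `TextExtractor` types and the `filterAll` method are hypothetical stand-ins invented for the example, and the derivatives are returned in a list rather than attached to the item.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MediaFilterSketch {

    /** Hypothetical stand-in for a bitstream: a file name plus a format label. */
    static final class Bitstream {
        final String name;
        final String format;
        Bitstream(String name, String format) {
            this.name = name;
            this.format = format;
        }
    }

    /** Hypothetical stand-in for an item: a handle plus its bitstreams. */
    static final class Item {
        final String handle;
        final List<Bitstream> bitstreams;
        Item(String handle, List<Bitstream> bitstreams) {
            this.handle = handle;
            this.bitstreams = bitstreams;
        }
    }

    /** A filter registers the formats it understands and produces one derivative. */
    interface Filter {
        Set<String> supportedFormats();
        Bitstream apply(Bitstream source);
    }

    /** Toy text extractor: "understands" PDF and HTML, emits a .txt derivative. */
    static final class TextExtractor implements Filter {
        public Set<String> supportedFormats() {
            return new HashSet<>(Arrays.asList("application/pdf", "text/html"));
        }
        public Bitstream apply(Bitstream source) {
            return new Bitstream(source.name + ".txt", "text/plain");
        }
    }

    /** Visit every bitstream of every item and run each filter that understands it. */
    static List<Bitstream> filterAll(List<Item> items, List<Filter> filters) {
        List<Bitstream> derivatives = new ArrayList<>();
        for (Item item : items) {
            for (Bitstream bitstream : item.bitstreams) {
                for (Filter filter : filters) {
                    if (filter.supportedFormats().contains(bitstream.format)) {
                        derivatives.add(filter.apply(bitstream));
                    }
                }
            }
        }
        return derivatives;
    }

    public static void main(String[] args) {
        Item item = new Item("123456789/1", Arrays.asList(
                new Bitstream("thesis.pdf", "application/pdf"),
                new Bitstream("data.csv", "text/csv")));
        List<Filter> filters = Arrays.<Filter>asList(new TextExtractor());
        // Only the PDF matches a registered format, so one derivative is produced.
        for (Bitstream derivative : filterAll(Arrays.asList(item), filters)) {
            System.out.println(derivative.name + " (" + derivative.format + ")");
        }
    }
}
```

Note that the loop visits every bitstream on every run, whether or not a derivative already exists; that property is what the scalability issue discussed later refers to.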

Issues and Limitations

Although MediaFilter is quite useful over a range of cases, a number of issues may arise in using it. To cite a few significant ones:

Admin Only Operation

MediaFilter has a very narrow context of invocation. It was designed to be an 'administrative' function, meaning that it can only be run by a system administrator using a command-line tool. There is no web UI, nor any other means for non-'super-users' to access its capabilities.

Configuration Difficulty

Configuration is rigid and brittle. All pertinent configuration is contained in Java source code, so that to change, e.g., the bundle that is examined for text extraction, one has to modify the source code. This is particularly irksome in that the MediaFilter packages are in the 'dspace-api' library, which generally should not be modified.

Inflexible Data Model

An example can best illustrate this. A very common policy for image content is to offer a low-resolution version without restriction, but control access to the higher-resolution original. If the low-resolution version is created by MediaFilter from the high-resolution one, there is no way to assert different resource policies for the filtered artifact.

Traversal Scalability

As noted above, MediaFilter walks the entire repository (or container) and examines each item, even if that item has already been processed. As repositories grow, this not only becomes inefficient, but also makes it difficult to fit the MediaFilter run into any sensible maintenance 'window'.

High Code Maintenance Costs

Since each filtering operation relies on the services of a format-specific third-party library, the more capability the framework adds, the more dependencies must be added, tracked, and periodically updated.

Curation Framework

Many of these concerns can be rectified if we imagine 're-homing' the functionality of MediaFilter into a set of curation tasks. For example, the 'Admin Only Operation' limitation disappears, since all curation tasks can be invoked not only via the administrative command line, but also in the admin UI, or indeed in workflow, submission, etc. Properly rewritten curation code can also break out all configuration into modular config files.
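
As a rough illustration of the last point, a text-extraction task could take its source bundle from configuration instead of from a constant in the source code. The sketch below does not use the real DSpace curation API: the `Task` interface, the `TextExtractTask` class, and the `source.bundle` property key are all assumptions invented for this example, and the configuration is parsed from a string where a real task would read a file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class TextExtractTaskSketch {

    /** Hypothetical task contract: act on one item, report what was done. */
    interface Task {
        String perform(String itemHandle);
    }

    /** Text-extraction task whose source bundle comes from configuration. */
    static final class TextExtractTask implements Task {
        private final String sourceBundle;

        TextExtractTask(Properties config) {
            // "source.bundle" is an invented key; ORIGINAL is the fallback value.
            this.sourceBundle = config.getProperty("source.bundle", "ORIGINAL");
        }

        public String perform(String itemHandle) {
            // A real task would read the bundle's bitstreams; the sketch only reports.
            return "extract text from bundle " + sourceBundle + " of " + itemHandle;
        }
    }

    /** Parse configuration text, as if it had been read from a per-task config file. */
    static Properties loadConfig(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Task task = new TextExtractTask(loadConfig("source.bundle = ARCHIVE\n"));
        System.out.println(task.perform("123456789/1"));
    }
}
```

With this arrangement, pointing the task at a different bundle means editing one line of configuration; nothing in 'dspace-api' needs to be recompiled.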

'Outsourcing' - Tika Framework

Many core MediaFilter operations are not unique to institutional repositories. Text extraction, for example, is practiced widely by applications that retrieve content on the web and need to index it. DSpace may thus leverage existing art where appropriate. We are evaluating the Apache Tika framework in this light. Tika is part of the larger ecosystem that grew around Lucene, Nutch, Hadoop, Solr, etc., and is concerned with content analysis and data extraction from documents. It has been integrated into, e.g., Jackrabbit (the reference implementation of the Java Content Repository JSR) and other digital asset management systems. This could help address the 'High Code Maintenance Costs' issue: the Tika community can shoulder the burden of keeping the underlying components current.

Even with the current Tika 1.0 release, we could greatly expand the functionality of text extraction in MediaFilter: in addition to PDF, Word, HTML, and PowerPoint, there are Tika parsers for XML, OpenDocument, audio, video, EPUB, and many others.
