Unsupported Release
This documentation relates to DSpace 1.8.x, an old, unsupported version.
As of January 2015, DSpace 1.8.x is no longer supported. We recommend upgrading to a more recent version of DSpace. See DSpace Software Support Policy.
DSpace can apply filters or transformations to files/bitstreams, creating new content. The included filters extract text for full-text searching and create thumbnails for items that contain images. The media filters are controlled by the dspace filter-media script, which traverses the asset store, invoking all configured MediaFilter or FormatFilter classes on files/bitstreams (see Configuring Media Filters for more information on how they are configured).
Below is a listing of all currently available Media Filters, and what they actually do:
Name | Java Class | Function | Enabled by Default? |
---|---|---|---|
HTML Text Extractor | | extracts the full text of HTML documents for full text indexing | true |
JPEG Thumbnail | | creates thumbnail images of GIF, JPEG and PNG files | true |
Branded Preview JPEG | | creates a branded preview image for GIF, JPEG and PNG files (disabled by default) | false |
PDF Text Extractor | | extracts the full text of Adobe PDF documents (only if text-based or OCRed) for full text indexing | true |
Word Text Extractor | | extracts the full text of Microsoft Word or Plain Text documents for full text indexing | true |
PowerPoint Text Extractor | | extracts the full text of slides and notes in Microsoft PowerPoint and PowerPoint XML documents for full text indexing | true |
Please note that the filter-media script automatically updates the DSpace search index by default (see ReIndexing Content (for Browse or Search)). This is the recommended way to run the script. Should you wish to disable the index update, pass the -n flag to the script (see Executing (via Command Line) below).
The media filter plugin configuration filter.plugins in dspace.cfg contains a list of all enabled media/format filter plugins (see Configuring Media Filters for more information). By modifying the value of filter.plugins you can disable or enable MediaFilter plugins.
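To check which filters a given configuration enables, the filter.plugins value can be read straight from the file. The function below is a minimal sketch, assuming the value is a comma-separated list of the filter names from the table above, optionally wrapped across lines with trailing backslashes. The function name enabled_filters is hypothetical and not part of DSpace.

```shell
# Print each entry of filter.plugins from a dspace.cfg-style file, one per line.
# Assumes a comma-separated value; lines ending in a backslash are joined first.
enabled_filters() {
    awk '
        /\\$/ { sub(/\\$/, ""); buf = buf $0; next }   # continuation: accumulate
        { line = buf $0; buf = "" }                    # complete logical line
        line ~ /^[ \t]*filter\.plugins[ \t]*=/ {
            sub(/^[^=]*=/, "", line)                   # drop the key and "="
            n = split(line, parts, ",")
            for (i = 1; i <= n; i++) {
                gsub(/^[ \t]+|[ \t]+$/, "", parts[i])  # trim whitespace
                if (parts[i] != "") print parts[i]
            }
        }
    ' "$1"
}

# Usage (path is a placeholder): enabled_filters /path/to/dspace.cfg
```

Commented-out lines (those starting with #) are not matched, so a disabled value does not show up in the output.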
The media filter system is intended to be run from the command line (or regularly as a cron task):
[dspace]/bin/dspace filter-media
With no options, this traverses the asset store, applying media filters to bitstreams, and skipping bitstreams that have already been filtered.
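For a regular cron run, a small wrapper can keep the invocation and its log output in one place. The following is only a sketch: the variables DSPACE_DIR and LOG_FILE, their default paths, and the function name run_filter_media are assumptions for illustration. Only the filter-media command itself comes from this page.

```shell
#!/bin/sh
# Hypothetical cron wrapper; variable names, default paths, and the function
# name are illustrative assumptions, not part of DSpace.
DSPACE_DIR="${DSPACE_DIR:-/dspace}"                    # stands in for [dspace]
LOG_FILE="${LOG_FILE:-/tmp/dspace-filter-media.log}"   # assumed log location

# Run filter-media with any extra options, appending all output to the log.
run_filter_media() {
    launcher="$DSPACE_DIR/bin/dspace"
    if [ ! -x "$launcher" ]; then
        echo "dspace launcher not found: $launcher" >&2
        return 1
    fi
    "$launcher" filter-media "$@" >> "$LOG_FILE" 2>&1
}
```

A script that ends by calling run_filter_media, with or without extra options, can then be referenced from a crontab entry.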
Available Command-Line Options:
Help : [dspace]/bin/dspace filter-media -h
Force mode : [dspace]/bin/dspace filter-media -f
Identifier mode : [dspace]/bin/dspace filter-media -i 123456789/2
Maximum mode : [dspace]/bin/dspace filter-media -m 1000
No-Index mode : [dspace]/bin/dspace filter-media -n (skips the search index update, for when you intend to run index-update elsewhere)
Plugin mode : [dspace]/bin/dspace filter-media -p "PDF Text Extractor","Word Text Extractor"
Skip mode : [dspace]/bin/dspace filter-media -s 123456789/9,123456789/100
Skip mode, reading the identifiers from a file : [dspace]/bin/dspace filter-media -s `less filter-skiplist.txt`
Verbose mode : [dspace]/bin/dspace filter-media -v
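If a skip list is kept with one handle per line, one way to build the comma-separated form that the -s examples use is sketched below. The function name handles_to_csv and the one-handle-per-line file layout, with optional # comments, are assumptions for illustration.

```shell
# Join the non-blank, non-comment lines of a file into one comma-separated
# string, suitable as the value of the -s option.
handles_to_csv() {
    # grep -v drops blank lines and '#' comments; paste joins the rest with commas
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -s -d, -
}

# Usage, with [dspace] as the install directory placeholder used on this page:
#   [dspace]/bin/dspace filter-media -s "$(handles_to_csv filter-skiplist.txt)"
```

Quoting the command substitution passes the whole list to the script as a single argument.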
New filters implement the org.dspace.app.mediafilter.FormatFilter interface. See the Creating a new Media/Format Filter topic and the comments in the source file FormatFilter.java for more information. In theory, filters could be implemented in any programming language (C, Perl, etc.). However, they need to be invoked by the Java code in the Media Filter class that you create.