This is an alternative suite of MediaFilter plugins that offers faster and more reliable text extraction from PDF Bitstreams, as well as thumbnail image generation. It replaces the built-in default PDF MediaFilter.
If this filter is so much better, why isn't it the default? The answer is that it relies on external executable programs which must be obtained and installed for your server platform. This would add too much complexity to the installation process, so it is left out as an optional "extra" step.
Here are the steps required to install and configure the filters:
First, download the XPDF suite found at: http://www.foolabs.com/xpdf and install it on your server. The executables can be located anywhere, but make a note of the full path to each command.
You may be able to download a binary distribution for your platform, which simplifies installation. Xpdf is readily available for Linux, Solaris, MacOSX, Windows, NetBSD, HP-UX, AIX, and OpenVMS, and is reported to work on OS/2 and many other systems.
The only tools you really need are:
- pdftotext
- pdftoppm
- pdfinfo
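To make a note of the full path to each command, a small check like the following may help. It is only a sketch: it assumes the tools are already on the current user's PATH, and the helper name `report_tool` is made up for this example.

```shell
#!/bin/sh
# Print the full path of each Xpdf command the filters use,
# or a warning when a command is not on the current PATH.
report_tool() {
    # "|| true" keeps the script going when the command is missing
    tool_path=$(command -v "$1" 2>/dev/null || true)
    if [ -n "$tool_path" ]; then
        printf '%s: %s\n' "$1" "$tool_path"
    else
        printf '%s: not found on PATH\n' "$1"
    fi
}

for tool in pdftotext pdftoppm pdfinfo; do
    report_tool "$tool"
done
```

If a tool is reported as not found, it may still be installed in a directory that is not on the PATH; in that case note its location by hand.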
Fetch and install the Java Advanced Imaging Image I/O Tools.
For AIX, Sun support says the following: "JAI has native acceleration for the above but it also works in pure Java mode. So as long as you have an appropriate JDK for AIX (1.3 or later, I believe), you should be able to use it. You can download any of them, extract just the jars, and put those in your $CLASSPATH."
Download the jai_imageio library version 1.0_01 or 1.1 found at: https://jai-imageio.dev.java.net/binary-builds.html#Stable_builds .
For these filters you do NOT have to worry about the native code, just the JAR, so choose a download for any platform.
Code Block |
---|
curl -O http://download.java.net/media/jai-imageio/builds/release/1.1/jai_imageio-1_1-lib-linux-i586.tar.gz
tar xzf jai_imageio-1_1-lib-linux-i586.tar.gz
|
The preceding example leaves the JAR in jai_imageio-1_1/lib/jai_imageio.jar . Now install it in your local Maven repository, changing the path after -Dfile= if necessary, e.g.:
Code Block |
---|
mvn install:install-file \
-Dfile=jai_imageio-1_1/lib/jai_imageio.jar \
-DgroupId=com.sun.media \
-DartifactId=jai_imageio \
-Dversion=1.0_01 \
-Dpackaging=jar \
-DgeneratePom=true
|
You may have to repeat this procedure for the jai_core.jar library, as well, if it is not available in any of the public Maven repositories. Once acquired, this command installs it locally:
Code Block |
---|
mvn install:install-file -Dfile=jai_core-1.1.2_01.jar \
-DgroupId=javax.media -DartifactId=jai_core -Dversion=1.1.2_01 -Dpackaging=jar -DgeneratePom=true |
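To confirm that both JARs landed in the local Maven repository, the expected file locations can be computed from the coordinates used above. This is a sketch that assumes Maven's default repository location, ~/.m2/repository; the function name `m2_jar_path` and the `M2_REPO` override variable are hypothetical names introduced here, not Maven settings.

```shell
#!/bin/sh
# Build the path where Maven's default local repository layout stores a jar:
# <repo>/<groupId with dots as slashes>/<artifactId>/<version>/<artifactId>-<version>.jar
# M2_REPO is a made-up override for this sketch; the default is ~/.m2/repository.
m2_jar_path() {
    # $1 = groupId, $2 = artifactId, $3 = version
    group_dir=$(printf '%s' "$1" | tr '.' '/')
    printf '%s/%s/%s/%s/%s-%s.jar\n' \
        "${M2_REPO:-$HOME/.m2/repository}" "$group_dir" "$2" "$3" "$2" "$3"
}

# Report whether each jar installed by the commands above is present.
for coords in "com.sun.media jai_imageio 1.0_01" "javax.media jai_core 1.1.2_01"; do
    set -- $coords
    jar=$(m2_jar_path "$1" "$2" "$3")
    if [ -f "$jar" ]; then
        echo "installed: $jar"
    else
        echo "missing:   $jar"
    fi
done
```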
First, be sure there is a value for thumbnail.maxwidth and that it corresponds to the size you want for preview images for the UI, e.g.: (NOTE: this code doesn't pay any attention to thumbnail.maxheight but it's best to set it too so the other thumbnail filters make square images.)
Code Block |
---|
# maximum width and height of generated thumbnails
thumbnail.maxwidth = 80
thumbnail.maxheight = 80 |
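Because only thumbnail.maxwidth is consulted, a portrait page can produce a thumbnail taller than thumbnail.maxheight. The arithmetic below illustrates this under the assumption that the page's aspect ratio is preserved when scaling to the configured width; the source does not state how the filter scales, and `scaled_height` is a hypothetical helper.

```shell
#!/bin/sh
# Height of a thumbnail scaled to a fixed width, ASSUMING the page's
# aspect ratio is preserved (integer arithmetic, rounded down).
scaled_height() {
    # $1 = page width, $2 = page height, $3 = target width (thumbnail.maxwidth)
    echo $(( $2 * $3 / $1 ))
}

# US Letter page (612 x 792 points) at thumbnail.maxwidth = 80
scaled_height 612 792 80   # prints 103
```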
Now, add the absolute paths to the XPDF tools you installed. In this example they are installed under /usr/local/bin (a logical place on Linux and MacOSX), but they may be anywhere.
Code Block |
---|
xpdf.path.pdftotext = /usr/local/bin/pdftotext
xpdf.path.pdftoppm = /usr/local/bin/pdftoppm
xpdf.path.pdfinfo = /usr/local/bin/pdfinfo |
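A mistyped path here is easy to miss, so it may be worth checking that every configured xpdf.path.* value points at an executable file. The function below is a sketch, not part of DSpace; `check_xpdf_paths` is a made-up name, and it assumes each property sits on a single `key = value` line.

```shell
#!/bin/sh
# List every xpdf.path.* entry in a configuration file and report
# whether the configured path is an executable file.
check_xpdf_paths() {
    # $1 = path to the configuration file
    grep -E '^[[:space:]]*xpdf\.path\.[A-Za-z]+[[:space:]]*=' "$1" |
    while IFS='=' read -r key value; do
        # Trim the whitespace around the key and the value
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        value=$(printf '%s' "$value" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        if [ -x "$value" ] && [ ! -d "$value" ]; then
            echo "ok      $key -> $value"
        else
            echo "missing $key -> $value"
        fi
    done
}
```

After defining the function, call it with the path to your configuration file, for example `check_xpdf_paths /path/to/dspace.cfg`.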
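Change the MediaFilter plugin configuration to remove the old org.dspace.app.mediafilter.PDFFilter and add the new filters, e.g.: (The new entries are the org.dspace.app.mediafilter.XPDF2Text and org.dspace.app.mediafilter.XPDF2Thumbnail lines and their filter names, PDF Text Extractor and PDF Thumbnail.)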
Code Block |
---|
filter.plugins = \
PDF Text Extractor, \
PDF Thumbnail, \
HTML Text Extractor, \
Word Text Extractor, \
JPEG Thumbnail
plugin.named.org.dspace.app.mediafilter.FormatFilter = \
org.dspace.app.mediafilter.XPDF2Text = PDF Text Extractor, \
org.dspace.app.mediafilter.XPDF2Thumbnail = PDF Thumbnail, \
org.dspace.app.mediafilter.HTMLFilter = HTML Text Extractor, \
org.dspace.app.mediafilter.WordFilter = Word Text Extractor, \
org.dspace.app.mediafilter.JPEGFilter = JPEG Thumbnail, \
org.dspace.app.mediafilter.BrandedPreviewJPEGFilter = Branded Preview JPEG |
Then add the input format configuration properties for each of the new filters, e.g.:
Code Block |
---|
filter.org.dspace.app.mediafilter.XPDF2Thumbnail.inputFormats = Adobe PDF
filter.org.dspace.app.mediafilter.XPDF2Text.inputFormats = Adobe PDF |
Finally, if you want PDF thumbnail images, don't forget to add that filter name to the filter.plugins property, e.g.:
Code Block |
---|
filter.plugins = PDF Thumbnail, PDF Text Extractor, ... |
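Once the configuration is edited, a quick scan can confirm that the old PDFFilter is gone and the new classes are present. This is a sketch under simple assumptions: `check_pdf_filters` is a hypothetical helper, it ignores lines that start with `#`, and it only looks for the class names, not for a complete or valid plugin configuration.

```shell
#!/bin/sh
# Return non-zero if the old PDFFilter is still referenced or XPDF2Text is
# absent; a missing XPDF2Thumbnail is only noted, since thumbnails are optional.
check_pdf_filters() {
    # $1 = path to the configuration file
    status=0
    # Drop comment lines so commented-out entries are not counted
    active=$(grep -v '^[[:space:]]*#' "$1" || true)
    if printf '%s\n' "$active" | grep -q 'org\.dspace\.app\.mediafilter\.PDFFilter'; then
        echo "old PDFFilter is still referenced"
        status=1
    fi
    if ! printf '%s\n' "$active" | grep -q 'org\.dspace\.app\.mediafilter\.XPDF2Text'; then
        echo "XPDF2Text is not configured"
        status=1
    fi
    if ! printf '%s\n' "$active" | grep -q 'org\.dspace\.app\.mediafilter\.XPDF2Thumbnail'; then
        echo "note: XPDF2Thumbnail is not configured (no PDF thumbnails)"
    fi
    return $status
}
```

Call it as `check_pdf_filters /path/to/dspace.cfg`; no output and a zero exit status mean both new filters were found and the old one was not.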
Follow your usual DSpace installation/update procedure, but add -Pxpdf-mediafilter-support to the Maven invocation:
Code Block |
---|
mvn -Pxpdf-mediafilter-support package
ant -Dconfig=[dspace]/config/dspace.cfg update |