
Overview

This module acts as a toolkit for generating OCR and word coordinate information. At the moment it relies exclusively on Tesseract to generate this information.

Tesseract

Tesseract is an OCR engine that was developed at HP Labs between 1985 and 1995. It is currently being developed at Google. Recognized as one of the most accurate open source OCR engines available, Tesseract will read binary, grey, or colour images and output text.
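As a rough illustration of how Tesseract is driven from the command line, the sketch below builds the commands for a plain-text run and an hOCR run and executes them with Python's `subprocess` module. The function names `tesseract_command` and `create_derivatives`, the default language `eng`, and the output layout are assumptions made for this example; they are not part of the Islandora OCR module.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, language="eng", hocr=False):
    """Build a Tesseract command line (hypothetical helper).

    Tesseract is invoked as: tesseract <image> <output base> [options] [config].
    It appends the output file extension to the output base itself.
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", language]
    if hocr:
        # "hocr" is a Tesseract config file that switches the output to hOCR.
        cmd.append("hocr")
    return cmd


def create_derivatives(image, output_dir, language="eng"):
    """Run Tesseract twice: once for plain text, once for hOCR."""
    image = Path(image)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    base = output_dir / image.stem
    for hocr in (False, True):
        # check=True raises CalledProcessError if Tesseract exits with an error.
        subprocess.run(tesseract_command(image, base, language, hocr), check=True)
    return base
```

Calling `create_derivatives("page.tif", "out")` would leave a text file and an hOCR file next to each other under `out/`. The extension of the hOCR file depends on the Tesseract version, so it is best to check the output directory rather than hard-code a file name.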

A TIFF reader that will read uncompressed TIFF images is also included. The Islandora Book Solution Pack currently uses Tesseract version 3.02.02, which can be obtained from the project home page. Earlier versions are not supported.

The Islandora OCR module integrates Tesseract into the Islandora Paged Content module. It allows for the creation of OCR and HOCR derivatives, which can be added to a page as datastreams. Check the instructions for the OCR-compatible module you wish to use for specifics on how to create OCR derivatives.
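The HOCR derivative is where the word coordinate information lives: hOCR is an HTML-based format in which each recognized word is wrapped in an element whose `title` attribute carries a bounding box (`bbox x0 y0 x1 y1`). The following is a minimal sketch of reading those boxes with Python's standard `html.parser`. The class name `WordBoxParser` is hypothetical, and the sketch assumes words are marked with the `ocrx_word` class, which may differ between Tesseract versions.

```python
import re
from html.parser import HTMLParser


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word currently being read
        self._depth = 0     # nesting depth inside the current word span
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            # A nested tag inside a word, for example <strong>.
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._depth = 1
                self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # The word span itself has closed; store the word and its box.
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def word_boxes(hocr_markup):
    """Return every word and its bounding box found in an hOCR string."""
    parser = WordBoxParser()
    parser.feed(hocr_markup)
    parser.close()
    return parser.words
```

Passing the contents of an HOCR datastream to `word_boxes` yields a list of words with pixel coordinates, which is the kind of data needed to highlight search hits on a page image.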

Dependencies

Islandora
Tuque
Tesseract (3.02.02 or later)
ImageMagick (optional; required for OCR preprocessing)
Islandora Paged Content (optional)
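Because the Tesseract dependency has a minimum version (3.02.02), it can be useful to check the installed binary before enabling the module. The sketch below is one possible way to do that in Python and is not something the module itself provides. The helper names are hypothetical, and it assumes that `tesseract --version` prints a banner such as `tesseract 3.02.02` on either standard output or standard error.

```python
import re
import subprocess

# Minimum supported version from the dependency list: 3.02.02.
MINIMUM_VERSION = (3, 2, 2)


def parse_version(banner):
    """Extract a (major, minor, patch) tuple from a version banner."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        raise ValueError("no version number found in: %r" % banner)
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_supported(banner, minimum=MINIMUM_VERSION):
    """Return True if the banner reports at least the minimum version."""
    return parse_version(banner) >= minimum


def installed_tesseract_is_supported():
    """Ask the local tesseract binary for its version and compare it."""
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Some releases write the banner to stderr, so look at both streams.
    return is_supported(result.stdout or result.stderr)
```

Comparing integer tuples avoids the pitfall of comparing version strings directly, where `"3.10"` would sort before `"3.2"`.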

...

Release Notes and Downloads

Installation

Install as usual; see this for further information.

Configuration

Configuration options for the Islandora OCR module can be found at http://path.to.your.site/admin/islandora/ocr, and include the following options:

...
