Tesseract is an Optical Character Recognition (OCR) program that Islandora uses to extract text from images into files that can then be appended to an object as datastreams. It supports the HOCR standard, and when invoked, Islandora will use it to create both HOCR and raw OCR output. Tesseract supports multiple languages, and the Islandora OCR module recognizes whichever of them are installed.
For Linux installations: While your distribution's package manager likely offers Tesseract in one of its repositories, it is EXTREMELY unlikely that it will be the correct version. For the Islandora OCR module to create OCR derivatives, Tesseract 3.02.02 or higher is required. At the time of writing, this is the latest stable version. THIS MEANS THAT IT IS LIKELY THAT YOU WILL HAVE TO COMPILE IT FROM SOURCE.
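The version requirement above can be checked with a short script before deciding whether to compile from source. This is a minimal sketch, not part of Islandora: the function names are hypothetical, and it assumes the `tesseract` binary is on your PATH and prints a line such as `tesseract 3.02.02` when run with `--version` (some builds may expect `-v` instead).

```python
import re
import subprocess

# Minimum version named in this document: 3.02.02
MINIMUM = (3, 2, 2)


def parse_version(text):
    """Pull a (major, minor, patch) tuple out of text like 'tesseract 3.02.02'."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def installed_version(binary="tesseract"):
    """Run the binary and parse its reported version; None if it cannot be run."""
    try:
        result = subprocess.run(
            [binary, "--version"], capture_output=True, text=True
        )
    except OSError:
        return None
    # Depending on the build, the version may be printed to stdout or stderr.
    return parse_version(result.stdout + result.stderr)


def meets_minimum(version, minimum=MINIMUM):
    """True when a parsed version tuple is at least the required minimum."""
    return version is not None and version >= minimum


if __name__ == "__main__":
    found = installed_version()
    if meets_minimum(found):
        print("Tesseract version is sufficient:", found)
    else:
        print("Tesseract is missing or too old:", found)
```

If the script reports a missing or older version, the packaged build is unlikely to work with the OCR module and a source build is the probable next step.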
Tesseract is managed by a team at Google; the latest stable release can be found on the downloads page of their website, https://code.google.com/p/tesseract-ocr/downloads/list. A binary installer exists for Windows, and specific instructions for installing on a Mac through homebrew can be found in the Tesseract readme here: https://code.google.com/p/tesseract-ocr/wiki/ReadMe. Linux users, and anyone else compiling from source, will also need the Leptonica library installed, along with appropriate source building tools.
Tesseract requires little configuration out of the box. That said, Islandora supports the installation of multiple languages for OCR processing, and may even require English language support. These additional languages can be found on Tesseract's download page.
To install additional languages into Islandora, you will need to know the path to your Tesseract installation's 'tessdata' folder. On Windows, this will tend to be C:\Program Files (x86)\Tesseract OCR\tessdata, and on Mac, this will tend to be /usr/local/Cellar/tesseract/<version>/share/tessdata; in both cases, this assumes you followed the Tesseract website's own installation instructions. On Linux, the path will vary from distribution to distribution, but will often be /usr/local/share/tessdata or /usr/share/tessdata. Once you have found the correct folder,
Your new language should now be available to perform OCR on Paged Content.
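To confirm where the 'tessdata' folder lives and which languages it currently holds, something like the following can help. It is a rough sketch under stated assumptions: the candidate paths are only the typical locations mentioned above, the function names are hypothetical, and it relies on Tesseract language data being stored as `<language>.traineddata` files. It does not show how Islandora itself detects languages.

```python
from pathlib import Path

# Typical locations named in this document; your system may differ.
FIXED_CANDIDATES = [
    Path(r"C:\Program Files (x86)\Tesseract OCR\tessdata"),
    Path("/usr/local/share/tessdata"),
    Path("/usr/share/tessdata"),
]


def candidate_dirs():
    """Return the fixed candidates plus any Homebrew Cellar installs found."""
    candidates = list(FIXED_CANDIDATES)
    cellar = Path("/usr/local/Cellar/tesseract")
    if cellar.is_dir():
        # One sub-folder per installed version, each with share/tessdata.
        candidates.extend(sorted(cellar.glob("*/share/tessdata")))
    return candidates


def find_tessdata(candidates=None):
    """Return the first candidate that exists as a directory, or None."""
    if candidates is None:
        candidates = candidate_dirs()
    for path in candidates:
        if Path(path).is_dir():
            return Path(path)
    return None


def installed_languages(tessdata):
    """List language codes by looking for *.traineddata files in the folder."""
    return sorted(p.stem for p in Path(tessdata).glob("*.traineddata"))


if __name__ == "__main__":
    folder = find_tessdata()
    if folder is None:
        print("No tessdata folder found in the usual locations.")
    else:
        print("tessdata folder:", folder)
        print("languages:", ", ".join(installed_languages(folder)))
```

Running it before and after adding a language gives a quick check that the new files landed in the folder Tesseract actually reads.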