Name                    Date         Size       #Lines   LOC
----------------------  -----------  ---------  -------  ------
api/                    03-May-2024  -            3,942   2,855
ccmain/                 03-May-2024  -           22,874  16,883
ccstruct/               03-May-2024  -           17,214  11,186
ccutil/                 03-May-2024  -           18,564  11,714
classify/               03-May-2024  -           19,901  10,611
config/                 03-May-2024  -            4,159   3,427
cutil/                  03-May-2024  -            3,985   1,674
dict/                   03-May-2024  -            8,560   5,414
dlltest/                03-May-2024  -            1,381   1,213
doc/                    03-May-2024  -               11       5
image/                  03-May-2024  -            6,035   4,870
include/                03-May-2024  -            9,274   5,700
java/                   03-May-2024  -            5,185   3,730
liblept/                03-May-2024  -          151,938  91,718
tessdata/               03-May-2024  -            2,412   2,195
testing/                03-May-2024  -              710     546
textord/                03-May-2024  -           32,560  24,193
training/               03-May-2024  -            4,941   3,713
viewer/                 03-May-2024  -            2,795   1,859
wordrec/                03-May-2024  -           10,477   5,602
.gitignore              03-May-2024  40               7       6
AUTHORS                 03-May-2024  170              9       8
Android.mk              03-May-2024  10.9 KiB       540     468
COPYING                 03-May-2024  1 KiB           24      19
ChangeLog               03-May-2024  3.4 KiB         72      70
INSTALL                 03-May-2024  9 KiB          230     175
INSTALL.SVN             03-May-2024  432             16      10
MODULE_LICENSE_APACHE2  03-May-2024  0
Makefile.am             03-May-2024  827             20      10
Makefile.in             03-May-2024  20 KiB         641     553
NEWS                    03-May-2024  45               2       1
README                  03-May-2024  8.2 KiB        139     100
ReleaseNotes            03-May-2024  9.4 KiB        214     181
acinclude.m4            03-May-2024  7.1 KiB        157     156
aclocal.m4              03-May-2024  32.9 KiB       911     813
configure               03-May-2024  293.4 KiB   10,426   8,591
configure.ac            03-May-2024  10.2 KiB       366     313
makemoredists           03-May-2024  685             13      11
runautoconf             03-May-2024  2 KiB           69      13
tessdll.cpp             03-May-2024  7.9 KiB        300     190
tessdll.h               03-May-2024  5.3 KiB        141      63
tessdll.vcproj          03-May-2024  6.6 KiB        292     291
tesseract.sln           03-May-2024  12.1 KiB       175     173
tesseract.spec          03-May-2024  5.8 KiB        189     146
tesseract.vcproj        03-May-2024  6.1 KiB        268     267

README

Note that this is a text-only and possibly out-of-date version of the
wiki ReadMe, which is located at:
  http://code.google.com/p/tesseract-ocr/wiki/ReadMe

Introduction
============
This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:

** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.


Other Dependencies and Licenses:
================================
The Aspirin/MIGRAINES system is no longer required.

Tesseract can also make use of the libtiff library (www.libtiff.org). See
http://code.google.com/p/tesseract-ocr/wiki/FAQ for details.
Without libtiff, Tesseract can only read uncompressed and G3 compressed
TIFF files.

Installing and Running Tesseract
================================
All Users Do NOT Ignore!

The tarballs are split into pieces.

tesseract-2.04.tar.gz contains all the source code.

tesseract-2.00.<lang>.tar.gz contains the language data files for <lang>. You need at least one of these or tesseract will not work.

Note that tesseract-2.04.tar.gz unpacks to the tesseract-2.04 directory. tesseract-2.00.<lang>.tar.gz unpacks to the tessdata directory, which belongs inside your tesseract-2.04 directory. It is therefore best to download the language packs into your tesseract-2.04 directory, so you can use "unpack here" or equivalent. You can unpack as many of the language packs as you care to, as they all contain different files. Note that if you are using make install, you should unpack your language data to your source tree before you run make install. If you unpack them as root to the destination directory of make install, then the user ids and access permissions might be messed up.
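
The unpacking order described above can be sketched as a small shell function. This is an illustrative sketch, not part of the distribution: the function name unpack_tesseract, the "eng" language code, and the idea of keeping the tarballs in a separate download directory are all assumptions made for the example.

```sh
# Sketch only: unpack the source tarball, then one language pack inside it.
# unpack_tesseract is a hypothetical helper; "eng" is an assumed language code.
unpack_tesseract() {
  downloads=$1      # ABSOLUTE path of the directory holding the tarballs
  lang=${2:-eng}    # language pack to unpack

  # The source tarball creates ./tesseract-2.04 in the current directory.
  tar -xzf "$downloads/tesseract-2.04.tar.gz" || return 1

  # The language tarball unpacks to tessdata/, so extract it from inside
  # the source tree (as a normal user, before any "make install").
  ( cd tesseract-2.04 &&
    tar -xzf "$downloads/tesseract-2.00.$lang.tar.gz" ) || return 1

  # List the data files that are now in place.
  ls tesseract-2.04/tessdata
}
```

Calling it as, for example, unpack_tesseract "$HOME/Downloads" eng from an empty build directory would leave the language data under tesseract-2.04/tessdata, ready for ./configure, make, and make install.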

boxtiff-2.01.<lang>.tar.gz contains data that was used in training, for those who want to do their own training. Most users should NOT download these files.

Instructions for using the training tools are documented separately at TrainingTesseract and for testing at TestingTesseract.

Without Additional Libraries, Image Format Support Is Limited!

Without additional libraries, Tesseract can only read uncompressed TIFF (and some versions of BMP). Up to version 2.04, you can add libtiff-dev. See the FAQ question on compressed TIFF for installation instructions. Version 3.00 will support additional formats via Leptonica, but requires more libraries to be added.
Windows:

There is no Windows installer! (Still looking for volunteers to create one.) There are Windows executables: tesseract-2.04.exe.tar.gz (It is not for the 'exe' language.) They are built with VC++ Express 2008 and come with absolutely no warranty. If they work for you then great, otherwise get Visual C++ Express 2008 with service pack 1 and build from the source. You can also try tesseract-2.01.exe.tar.gz, which is built with VC++6, and may work better if your version of Windows is old, but note that this is an older version of Tesseract.

If you are building from the sources, there are still (up to v2.04) .dsw and .dsp files for VC++6, but the recommended build platform is now VC++ Express 2008. There are also .sln and .vcproj files for VC++ Express 2008, but these files are not backward compatible with any previous version - not even VC++ Express 2005. Note that the executables produced with the newer compiler are smaller, faster, and, believe it or not, more accurate. (See TestingTesseract.)

New with 2.04: the executables are built with static linking, so they stand more chance of working out of the box on more Windows systems.

The executable must reside in the same directory as the tessdata directory. (The Visual Studio projects build the release executable directly to the correct place!)

The command line is:

tesseract <image.tif> <output> [-l <langid>]

For interfacing to other applications, there is a DLL included with the executables, but you may be better off building it yourself. The DLL is NOT built for static C-Runtime, so you will probably need VC++ Express 2008 to run it.

The DLL has been updated to allow input of non-binary images. (Thanks to Glen of Jetsoft.)

Non-Windows (or Cygwin):

You have to tell Tesseract through a standard unix mechanism where to find its data directory. You must either:

./configure
make
make install

to move the data files to the standard place, or:

export TESSDATA_PREFIX="directory in which your tessdata resides/"

In either case the command line is:

tesseract <image.tif> <output> [-l <langid>]
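
As a rough illustration of the TESSDATA_PREFIX route, the wrapper below checks that the data directory is where the prefix says it is before invoking the command line shown above. The function name run_ocr, the default language code "eng", and the error message are assumptions for this sketch; only TESSDATA_PREFIX and the tesseract command line come from this README.

```sh
# Sketch only: run tesseract with an explicit TESSDATA_PREFIX.
# run_ocr is a hypothetical helper, not something shipped with Tesseract.
run_ocr() {
  prefix=$1         # directory in which tessdata resides, WITH trailing slash
  image=$2          # input image, e.g. page.tif
  output=$3         # output name passed through to tesseract
  lang=${4:-eng}    # language id for -l (assumed default)

  # Fail early if the prefix does not actually contain tessdata/.
  if [ ! -d "${prefix}tessdata" ]; then
    echo "no tessdata directory under ${prefix}" >&2
    return 1
  fi

  # Export the variable only for this one invocation.
  (
    export TESSDATA_PREFIX="$prefix"
    tesseract "$image" "$output" -l "$lang"
  )
}
```

For example, run_ocr "$HOME/tesseract-2.04/" page.tif page would run the engine against an uninstalled source tree, assuming the language data was unpacked there. Note the trailing slash on the prefix, matching the export line above.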

New: there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for the help.) It might work with your OS if you know how to do that.

If you are linking to the libraries, as Ocropus does, there is now a single master library called libtesseract_full.a.

Libtiff support should now be properly working via configure, but note that you need libtiff-dev, as that contains the header files required to compile the code that uses it.

History:
========
The engine was developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc 2.95 and under Windows
with VC++6. The C++ code makes heavy use of a list system using macros.
This predates STL, was portable before STL, and is more efficient than STL
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug. Another "feature" of the C/C++ split is that the C++
data structures get converted to C data structures to call the low-level C
code. This is ugly, and the C++izing of the C code is a step towards
eliminating the conversion, but it has not happened yet.

The most recent change is that Tesseract can now recognize 6 languages, is fully UTF-8 capable, and is fully trainable. See TrainingTesseract for more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. See http://www.isri.unlv.edu/downloads/AT-1995.pdf. With Tesseract 2.00, scripts are now included to allow anyone to reproduce some of these tests. See TestingTesseract for more details.


Directory Structure (ordered by dependency):
============================================
ccmain     Top-level code. The main program resides in tesseractmain.cpp.
display    An "editor" to view and operate on the internal structures.
           (Requires a working viewer - batteries not included.)
wordrec    The word-level recognizer.
textord    The module that organizes (orders) text into lines and words.
classify   The low-level character classifiers.
ccstruct   Classes to hold information about a page as it is being processed.
viewer     The client side of a client-server viewing system.
           Unfortunately, at this time, the server side is not available.
image      Image class and processing functions.
dict       Language model code.
cutil      Code for file I/O, lists, heaps etc, from the old C code.
ccutil     Somewhat newer code for lists, memory allocation etc from the
           old C++ code.


About the Engine
================
This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO OUTPUT
FORMATTING, and NO UI. It can only process an image of a single column
and create text from it. It can detect fixed pitch vs proportional text.
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 2.0, Tesseract is fully Unicode (UTF-8) enabled, and can recognize 6
languages "out of the box." Code and documentation are provided for the brave
to train in other languages. See code.google.com/p/tesseract-ocr for more
information on training.