Tesseract OCR Engine
What it is, where it came from,
where it is going.
Ray Smith, Google Inc
OSCON 2007
Contents
• Introduction & history of OCR
• Tesseract architecture & methods
• Announcing Tesseract 2.00
• Training Tesseract
• Future enhancements
A Brief History of OCR
• What is Optical Character Recognition?
My invention relates to statistical machines
of the type in which successive comparisons
are made between a character and a charac-
OCR
A Brief History of OCR
• OCR predates electronic computers!
US Patent 1915993, Filed Apr 27, 1931
A Brief History of OCR
• 1929 – Digit recognition machine
• 1953 – Alphanumeric recognition machine
• 1965 – US Mail sorting
• 1965 – British banking system
• 1976 – Kurzweil reading machine
• 1985 – Hardware-assisted PC software
• 1988 – Software-only PC software
• 1994-2000 – Industry consolidation
Tesseract Background
• Developed on HP-UX at HP between 1985
and 1994 to run in a desktop scanner.
• Came neck and neck with Caere and XIS
in the 1995 UNLV test.
(See http://www.isri.unlv.edu/downloads/AT-1995.pdf )
• Never used in an HP product.
• Open sourced in 2005. Now on:
http://code.google.com/p/tesseract-ocr
• Highly portable.
Tesseract OCR Architecture
Input: Gray or Color Image [+ Region Polygons]
  -> Adaptive Thresholding -> Binary Image
  -> Connected Component Analysis -> Character Outlines
  -> Find Text Lines and Words -> Character Outlines Organized Into Words
  -> Recognize Word Pass 1
  -> Recognize Word Pass 2
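The Connected Component Analysis stage turns the binary image into separate blobs of ink. The slides do not show how Tesseract does this, so the following is only a minimal sketch of a generic flood-fill labelling, assuming 8-connectivity and a binary image stored as a list of rows where 1 means ink. The function name `connected_components` is hypothetical.

```python
from collections import deque


def connected_components(binary):
    """Return one bounding box (left, top, right, bottom) per ink blob.

    `binary` is a list of rows of 0/1 values. Pixels that touch
    horizontally, vertically or diagonally belong to the same blob
    (8-connectivity is an assumption of this sketch).
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
]
print(connected_components(page))
```

Each box stands in for one component. A real engine would keep the outline of each blob rather than only its bounding box.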
Adaptive Thresholding is Essential
Some examples of how difficult it can be to make a binary image
Taken from the UNLV Magazine set.
(http://www.isri.unlv.edu/ISRI/OCRtk )
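The slide does not say which thresholding method Tesseract uses. As an illustration of why a single global threshold fails on unevenly lit pages, here is a minimal sketch of one generic approach, local mean thresholding. The parameters `window` and `offset` and the sample pixel values are hypothetical.

```python
def global_threshold(gray, level=128):
    """Mark a pixel as ink (1) when it is darker than one fixed level."""
    return [[1 if value < level else 0 for value in row] for row in gray]


def adaptive_threshold(gray, window=3, offset=10):
    """Mark a pixel as ink (1) when it is darker than its local mean.

    `gray` is a list of rows of 0-255 values. The local mean is taken
    over a window x window neighbourhood clipped at the image border.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            values = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(values) / len(values)
            row.append(1 if gray[y][x] < local_mean - offset else 0)
        binary.append(row)
    return binary


# Background fades from bright (left) to dim (right); two dark strokes
# sit at columns 1 and 6.
shaded_row = [220, 100, 180, 160, 140, 120, 0, 80]
shaded_page = [shaded_row, shaded_row, shaded_row]

print(global_threshold(shaded_page)[0])
print(adaptive_threshold(shaded_page)[0])
```

On this sample the global threshold also marks the dim background on the right as ink, while the local threshold keeps only the two strokes.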
Baselines are rarely perfectly straight
• Text Line Finding – skew independent –
published at ICDAR’95 Montreal.
(http://scholar.google.com/scholar?q=skew+detection+smith)
• Baselines are approximated by quadratic splines
to account for skew and curl.
• Meanline, ascender and descender lines are a
constant displacement from baseline.
• Critical value is the x-height.
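To make the baseline model concrete, here is a minimal sketch that fits a single quadratic to the bottom points of characters on one line and derives the other lines as constant displacements. It is a simplification: the slide describes quadratic splines (several pieces), while this sketch fits one quadratic by least squares. The function names, the y-up coordinate convention and the sample numbers are assumptions.

```python
def fit_quadratic(points):
    """Least-squares fit of y = a*x^2 + b*x + c to a list of (x, y) points."""
    # Normal equations (A^T A) p = A^T y, where each row of A is [x^2, x, 1].
    sums = [sum(x ** k for x, _ in points) for k in range(5)]  # sum of x^0..x^4
    rhs = [sum(y * x ** k for x, y in points) for k in (2, 1, 0)]
    matrix = [
        [sums[4], sums[3], sums[2], rhs[0]],
        [sums[3], sums[2], sums[1], rhs[1]],
        [sums[2], sums[1], sums[0], rhs[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(matrix[r][col]))
        matrix[col], matrix[pivot] = matrix[pivot], matrix[col]
        for row in range(3):
            if row != col:
                factor = matrix[row][col] / matrix[col][col]
                matrix[row] = [v - factor * p
                               for v, p in zip(matrix[row], matrix[col])]
    a, b, c = (matrix[i][3] / matrix[i][i] for i in range(3))
    return a, b, c


def text_lines_at(coeffs, x, x_height, ascender_height, descender_depth):
    """Baseline from the quadratic; other lines are constant offsets from it.

    Assumes y increases upward, so the meanline lies above the baseline.
    """
    a, b, c = coeffs
    baseline = a * x * x + b * x + c
    return {
        "baseline": baseline,
        "meanline": baseline + x_height,
        "ascender": baseline + ascender_height,
        "descender": baseline - descender_depth,
    }


# Character bottoms sampled along a slightly skewed and curled line.
bottoms = [(x, 0.001 * x * x + 0.02 * x + 50) for x in range(0, 101, 10)]
coeffs = fit_quadratic(bottoms)
print(text_lines_at(coeffs, x=40, x_height=12, ascender_height=18, descender_depth=5))
```

Because the other lines are fixed offsets, only the baseline curve and the x-height need to be estimated per text line in this sketch.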
Spaces between words are tricky too
• Italics, digits, punctuation all create
special-case font-dependent spacing.
• Fully justified text in narrow columns can
have vastly varying spacing on different
lines.
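The justified-text problem can be illustrated with a small sketch. It does not show Tesseract's word segmentation; it only assumes a list of horizontal gaps between neighbouring characters on a line and compares one fixed threshold with a hypothetical per-line threshold (the midpoint of the smallest and largest gap).

```python
def split_words(gaps, threshold):
    """Group character indices into words; a gap above `threshold` starts a new word."""
    words, current = [], [0]
    for index, gap in enumerate(gaps, start=1):
        if gap > threshold:
            words.append(current)
            current = [index]
        else:
            current.append(index)
    words.append(current)
    return words


def line_threshold(gaps):
    """Hypothetical per-line threshold: midpoint of the smallest and largest gap."""
    return (min(gaps) + max(gaps)) / 2


tight_line = [2, 2, 6, 2, 2]    # normal spacing, word gap of 6
loose_line = [7, 7, 20, 7, 7]   # justified line stretched to fill the column

fixed = 5
print(split_words(tight_line, fixed))
print(split_words(loose_line, fixed))
print(split_words(loose_line, line_threshold(loose_line)))
```

With the fixed threshold, every character of the stretched line becomes its own word. The per-line threshold recovers two words, although this simple midpoint rule would wrongly split a line that contains only one word.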
Tesseract: Recognize Word
(Flow diagram.)
Boxes: Character Chopper, Character Associator, Static Character Classifier, Adaptive Character Classifier, Dictionary, Number Parser.
Decision: Done? (No / Yes). Final box: Adapt to Word.
Outline Approximation
Panels: Original Image | Outlines of components | Polygonal Approximation
Polygonal approximation is a double-edged sword.
Noise and some pertinent information are both lost.
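The trade-off can be seen with any polygonal simplifier. The slides do not name the algorithm Tesseract uses, so this sketch swaps in the well-known Douglas-Peucker method purely as an illustration. The `tolerance` parameter and the sample outline are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Douglas-Peucker: drop points closer than `tolerance` to the chord."""
    if len(points) < 3:
        return list(points)
    distances = [point_segment_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    worst = max(range(len(distances)), key=distances.__getitem__)
    if distances[worst] <= tolerance:
        return [points[0], points[-1]]
    split = worst + 1  # index of the farthest point in `points`
    left = simplify(points[:split + 1], tolerance)
    right = simplify(points[split:], tolerance)
    return left[:-1] + right  # avoid repeating the shared split point


# A noisy L-shaped stretch of outline.
outline = [(0, 0), (1, 0.2), (2, -0.1), (3, 0.1), (4, 0),
           (4, 1), (4.1, 2), (4, 3), (4, 4)]
print(simplify(outline, tolerance=0.5))
```

Nine noisy points collapse to three. The same tolerance would also erase any genuine small detail of that size, which is the information loss the slide warns about.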
Tesseract: Features and Matching
• Static classifier uses outline fragments as
features. Broken characters are easily
recognizable by a small->large matching
process in classifier. (This is slow.)
• Adaptive classifier uses the same technique!
(Apart from normalization method.)
Figure panels: Prototype | Character to classify | Extracted Features | Match of Prototype To Features | Match of Features To Prototype
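A rough sketch of the small->large idea: many short extracted fragments are each compared against a few long prototype segments, and the match is scored in both directions, as in the figure. Everything specific here is an assumption for illustration and not the engine's actual classifier: the feature layout (x, y, angle), the prototype layout (x, y, angle, length), the tolerances and the function names.

```python
import math


def fragment_on_segment(feature, segment, max_dist=0.1, max_angle=0.3):
    """True when a short feature fragment lies on a long prototype segment."""
    fx, fy, f_angle = feature
    sx, sy, s_angle, length = segment  # segment is centred at (sx, sy)
    diff = abs((f_angle - s_angle + math.pi) % (2 * math.pi) - math.pi)
    if diff > max_angle:
        return False
    ux, uy = math.cos(s_angle), math.sin(s_angle)
    along = (fx - sx) * ux + (fy - sy) * uy     # position along the segment
    across = -(fx - sx) * uy + (fy - sy) * ux   # distance off the segment
    return abs(along) <= length / 2 and abs(across) <= max_dist


def features_to_prototype(features, prototype):
    """Fraction of extracted features explained by some prototype segment."""
    hits = sum(any(fragment_on_segment(f, s) for s in prototype) for f in features)
    return hits / len(features)


def prototype_to_features(features, prototype):
    """Fraction of prototype segments supported by at least one feature."""
    hits = sum(any(fragment_on_segment(f, s) for f in features) for s in prototype)
    return hits / len(prototype)


UP = math.pi / 2
# Hypothetical prototype for an "L": one vertical and one horizontal stroke.
proto_l = [(0.0, 0.5, UP, 1.0), (0.5, 0.0, 0.0, 1.0)]

broken_l = [(0.0, 0.2, UP), (0.0, 0.8, UP), (0.2, 0.0, 0.0), (0.8, 0.0, 0.0)]
letter_i = [(0.0, 0.2, UP), (0.0, 0.5, UP), (0.0, 0.8, UP)]

print(features_to_prototype(broken_l, proto_l), prototype_to_features(broken_l, proto_l))
print(features_to_prototype(letter_i, proto_l), prototype_to_features(letter_i, proto_l))
```

In this sketch the broken "L" still scores fully in both directions because its surviving fragments land on the long segments. The "I" explains all of its own features but supports only half of the prototype, which is why both directions are checked.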
Announcing tesseract-2.00
• Fully Unicode (UTF-8) capable
• Already trained for 6 Latin-based
languages (Eng, Fra, Ita, Deu, Spa, Nld)
• Code and documented process to train at
http://code.google.com/p/tesseract-ocr
• UNLV regression test framework
• Other minor fixes
Training Tesseract
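Training page images -> Tesseract + manual correction -> Box files
Training page images + Box files -> Tesseract -> Character Features (*.tr files)
Character Features (*.tr files) -> mfTraining -> inttemp, pffmtable
Character Features (*.tr files) -> cnTraining -> normproto
Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset
Manual Data Entry -> DangAmbigs
Word List -> Wordlist2dawg -> Word-dawg, Freq-dawg
User-words
Tesseract Data Files: Word-dawg, Freq-dawg, User-words, inttemp, pffmtable, normproto, unicharset, DangAmbigs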
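Box files pair each character on a training page with its bounding box. As a minimal sketch, assuming the plain-text layout `<character> <left> <bottom> <right> <top>` with one character per line (an assumption here, the slides do not show the format), the following parses box lines and collects the distinct characters, which is roughly the information a character-set extraction step needs. The sample lines and function names are hypothetical.

```python
def parse_box_line(line):
    """Parse one box line into (character, left, bottom, right, top)."""
    parts = line.split()
    if len(parts) < 5:
        raise ValueError(f"not a box line: {line!r}")
    character = parts[0]
    left, bottom, right, top = (int(value) for value in parts[1:5])
    return character, left, bottom, right, top


def distinct_characters(box_lines):
    """Characters in order of first appearance, ignoring blank lines."""
    seen = []
    for line in box_lines:
        if not line.strip():
            continue
        character = parse_box_line(line)[0]
        if character not in seen:
            seen.append(character)
    return seen


# Hypothetical box file content for the word "Tess".
sample = [
    "T 10 20 28 50",
    "e 30 20 44 40",
    "s 46 20 58 40",
    "s 60 20 72 40",
]
print([parse_box_line(line) for line in sample])
print(distinct_characters(sample))
```

Manual correction then amounts to editing the character or the coordinates on any line where the engine's first guess was wrong.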
Tesseract Dictionaries
Word List -> Wordlist2dawg -> Word-dawg, Freq-dawg
Tesseract Data Files:
• Word-dawg: Infrequent Word List
• Freq-dawg: Frequent Word List
• User-words: Usually Empty
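A DAWG (directed acyclic word graph) stores a word list compactly and lets the recognizer check a partial word letter by letter. The sketch below uses a plain trie, which shares prefixes but not suffixes, as a simplified stand-in. It is not the on-disk format produced by Wordlist2dawg, and the function names and sample words are hypothetical.

```python
END = ""  # marker key for "a word ends here"; cannot clash with a real character


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for character in word:
            node = node.setdefault(character, {})
        node[END] = True
    return root


def lookup(trie, text):
    """Return 'word', 'prefix' or 'reject' for a candidate string."""
    node = trie
    for character in text:
        if character not in node:
            return "reject"
        node = node[character]
    return "word" if END in node else "prefix"


frequent = build_trie(["the", "then", "text"])
print(lookup(frequent, "the"))
print(lookup(frequent, "tex"))
print(lookup(frequent, "tax"))
```

The "prefix" answer is what makes this structure useful during recognition: a candidate can be abandoned as soon as no dictionary word starts with the letters seen so far.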
Tesseract Shape Data
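Training page images -> Tesseract + manual correction -> Box files
Training page images + Box files -> Tesseract -> Character Features (*.tr files)
Character Features (*.tr files) -> mfTraining -> inttemp, pffmtable
Character Features (*.tr files) -> cnTraining -> normproto
Tesseract Data Files:
• inttemp: Prototype Shape Features
• pffmtable: Expected Feature Counts
• normproto: Character Normalization Features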
Tesseract Character Data
Training page images -> Tesseract + manual correction -> Box files
Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset
Manual Data Entry -> DangAmbigs
Tesseract Data Files:
• unicharset: List of Characters + ctype information
• DangAmbigs: Typical OCR errors, e.g. e<->c, rn<->m, etc.
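One way such an ambiguity list can be used is to generate alternative spellings of a recognized word and keep those the dictionary accepts. This sketch does not reproduce the DangAmbigs file format or the engine's logic. The pair list, the one-substitution limit and the function names are assumptions.

```python
# Confusion pairs taken from the examples on the slide, in both directions.
AMBIGUITIES = [("rn", "m"), ("m", "rn"), ("e", "c"), ("c", "e")]


def candidates(word, ambiguities=AMBIGUITIES):
    """All strings reachable from `word` by exactly one substitution."""
    results = set()
    for seen, alternative in ambiguities:
        start = word.find(seen)
        while start != -1:
            results.add(word[:start] + alternative + word[start + len(seen):])
            start = word.find(seen, start + 1)
    return results


def correct(word, dictionary):
    """Keep `word` if it is known, else return the known candidates, sorted."""
    if word in dictionary:
        return [word]
    return sorted(c for c in candidates(word) if c in dictionary)


dictionary = {"modern", "corn", "come"}
print(correct("rnodern", dictionary))
print(correct("corn", dictionary))
```

The second call shows why these are dangerous ambiguities: "corn" is already a word, so a substitution towards "com" or "eorn" must not be applied blindly.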
Accuracy Results
                  Character                    Non-stopword
Testid  Testset   Errors  Accuracy  Change     Errors  Accuracy  Change
1995    Bus.3B      5959    98.14%               1293    95.73%
1995    Doe3.3B    36349    97.52%               7042    94.87%
1995    Mag.3B     15043    97.74%               3379    94.99%
1995    News.3B     6432    98.69%               1502    96.94%
Gcc4.1  Bus.3B      6258    98.04%    5.02%      1312    95.67%    1.47%
Gcc4.1  Doe3.3B    28589    98.05%  -21.35%      6692    95.12%   -4.97%
Gcc4.1  Mag.3B     14800    97.78%   -1.62%      3123    95.37%   -7.58%
Gcc4.1  News.3B     7524    98.47%   16.98%      1220    97.51%  -18.77%
Gcc4.1  Total      57171            -10.37%     12347             -6.58%
Comparison of current results against 1995 UNLV results
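The Change figures are consistent with a relative change in error count against the 1995 run, (new - old) / old. The formula is inferred from the numbers rather than stated on the slide. A short check using the character error counts from the results:

```python
def percent_change(old_errors, new_errors):
    """Relative change in error count, in percent (negative means fewer errors)."""
    return (new_errors - old_errors) / old_errors * 100.0


# Character errors per test set: (1995 run, Gcc4.1 run).
character_errors = {
    "Bus.3B": (5959, 6258),
    "Doe3.3B": (36349, 28589),
    "Mag.3B": (15043, 14800),
    "News.3B": (6432, 7524),
}

for testset, (old, new) in character_errors.items():
    print(f"{testset}: {percent_change(old, new):+.2f}%")

total_old = sum(old for old, _ in character_errors.values())
total_new = sum(new for _, new in character_errors.values())
print(f"Total: {total_old} -> {total_new} ({percent_change(total_old, total_new):+.2f}%)")
```

The total change is computed from the summed error counts, so large test sets such as Doe3.3B dominate it.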
Commercial OCR v Tesseract
Commercial OCR:
• 100+ languages.
• Accuracy is good now.
• Sophisticated app with complex UI.
• Works on complex magazine pages.
• Windows Mostly.
• Costs $130-$500
Tesseract:
• 6 languages + growing.
• Accuracy was good in 1995.
• No UI yet.
• Page layout analysis coming soon.
• Runs on Linux, Mac, Windows, more...
• Open source – Free!
Tesseract Future
• Page layout analysis.
• More languages.
• Improve accuracy.
• Add a UI.
The End
• For more information see:
http://code.google.com/p/tesseract-ocr