TesseractOSCON


Tesseract OCR Engine
What it is, where it came from, where it is going.
Ray Smith, Google Inc
OSCON 2007

Contents
• Introduction & history of OCR
• Tesseract architecture & methods
• Announcing Tesseract 2.00
• Training Tesseract
• Future enhancements

A Brief History of OCR
• What is Optical Character Recognition?
"My invention relates to statistical machines of the type in which successive comparisons are made between a character and a charac-"
[Diagram: OCR]

A Brief History of OCR
• OCR predates electronic computers!
US Patent 1915993, Filed Apr 27, 1931

A Brief History of OCR
• 1929 – Digit recognition machine
• 1953 – Alphanumeric recognition machine
• 1965 – US Mail sorting
• 1965 – British banking system
• 1976 – Kurzweil reading machine
• 1985 – Hardware-assisted PC software
• 1988 – Software-only PC software
• 1994-2000 – Industry consolidation

Tesseract Background
• Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner.
• Came neck and neck with Caere and XIS in the 1995 UNLV test. (See http://www.isri.unlv.edu/downloads/AT-1995.pdf )
• Never used in an HP product.
• Open sourced in 2005. Now on: http://code.google.com/p/tesseract-ocr
• Highly portable.

Tesseract OCR Architecture
Input: Gray or Color Image [+ Region Polygons]
  -> Adaptive Thresholding
  -> Binary Image
  -> Connected Component Analysis
  -> Character Outlines
  -> Find Text Lines and Words
  -> Character Outlines Organized Into Words
  -> Recognize Word Pass 1
  -> Recognize Word Pass 2

Adaptive Thresholding is Essential
Some examples of how difficult it can be to make a binary image.
Taken from the UNLV Magazine set. (http://www.isri.unlv.edu/ISRI/OCRtk )

Baselines are rarely perfectly straight
• Text Line Finding – skew independent – published at ICDAR’95 Montreal. (http://scholar.google.com/scholar?q=skew+detection+smith)
• Baselines are approximated by quadratic splines to account for skew and curl.
• Meanline, ascender and descender lines are a constant displacement from baseline.
• Critical value is the x-height.
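The baseline model described above can be illustrated with a small sketch. It is not Tesseract's code: it assumes image coordinates with y growing downward, fits a single least-squares quadratic rather than a piecewise quadratic spline, and the names `fit_quadratic`, `baseline_y` and `meanline_y` are hypothetical helpers introduced here only for illustration.

```python
from typing import Sequence, Tuple

Coeffs = Tuple[float, float, float]


def fit_quadratic(points: Sequence[Tuple[float, float]]) -> Coeffs:
    """Least-squares fit of y = a*x^2 + b*x + c to (x, y) points.

    The points stand in for the bottoms of character outlines on one
    text line (an assumption of this sketch).
    """
    s = [0.0] * 5  # s[k] = sum of x^k
    t = [0.0] * 3  # t[k] = sum of y * x^k
    for x, y in points:
        xk = 1.0
        for k in range(5):
            s[k] += xk
            if k < 3:
                t[k] += y * xk
            xk *= x
    # Augmented normal equations, unknowns ordered (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("need at least three distinct x positions")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


def baseline_y(coeffs: Coeffs, x: float) -> float:
    """Height of the fitted baseline at horizontal position x."""
    a, b, c = coeffs
    return (a * x + b) * x + c


def meanline_y(coeffs: Coeffs, x: float, x_height: float) -> float:
    """Meanline as a constant displacement (the x-height) above the baseline.

    With y growing downward, "above" means a smaller y value.
    """
    return baseline_y(coeffs, x) - x_height


if __name__ == "__main__":
    # Synthetic, slightly curled line: bottoms of blobs sampled every 10 px.
    bottoms = [(float(x), 0.001 * x * x - 0.2 * x + 50.0) for x in range(0, 101, 10)]
    coeffs = fit_quadratic(bottoms)
    print("baseline at x=40:", baseline_y(coeffs, 40.0))
    print("meanline at x=40:", meanline_y(coeffs, 40.0, x_height=20.0))
```

Because the meanline, ascender and descender lines are modelled as fixed offsets from the same curve, only the baseline coefficients and the x-height need to be estimated per line in this simplified picture.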
Spaces between words are tricky too
• Italics, digits, punctuation all create special-case font-dependent spacing.
• Fully justified text in narrow columns can have vastly varying spacing on different lines.

Tesseract: Recognize Word
[Flowchart components]
• Static Character Classifier
• Dictionary
• Character Chopper
• Adaptive Character Classifier
• Number Parser
• Character Associator
• Decision: Done? (No / Yes)
• Adapt to Word

Outline Approximation
[Figure panels: Original Image, Outlines of components, Polygonal Approximation]
Polygonal approximation is a double-edged sword. Noise and some pertinent information are both lost.

Tesseract: Features and Matching
• Static classifier uses outline fragments as features. Broken characters are easily recognizable by a small->large matching process in classifier. (This is slow.)
• Adaptive classifier uses the same technique! (Apart from normalization method.)
[Figure panels: Prototype, Character to classify, Extracted Features, Match of Prototype To Features, Match of Features To Prototype]

Announcing tesseract-2.00
• Fully Unicode (UTF-8) capable
• Already trained for 6 Latin-based languages (Eng, Fra, Ita, Deu, Spa, Nld)
• Code and documented process to train at http://code.google.com/p/tesseract-ocr
• UNLV regression test framework
• Other minor fixes

Training Tesseract
[Data-flow diagram; all outputs end up in the Tesseract Data Files]
• Training page images -> Tesseract + manual correction -> Box files
• Training page images + Box files -> Tesseract -> Character Features (*.tr files)
• Character Features (*.tr files) -> mfTraining -> inttemp, pffmtable
• Character Features (*.tr files) -> cnTraining -> normproto
• Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset
• Word List -> Wordlist2dawg -> Word-dawg, Freq-dawg
• Manual Data Entry -> DangAmbigs, User-words

Tesseract Dictionaries
• Word List -> Wordlist2dawg -> Word-dawg, Freq-dawg -> Tesseract Data Files
• Word-dawg: Infrequent Word List
• Freq-dawg: Frequent Word List
• User-words: Usually Empty

Tesseract Shape Data
• Training page images -> Tesseract + manual correction -> Box files
• Training page images + Box files -> Tesseract -> Character Features (*.tr files)
• Character Features (*.tr files) -> mfTraining -> inttemp, pffmtable
• Character Features (*.tr files) -> cnTraining -> normproto
• inttemp: Prototype Shape Features
• pffmtable: Expected Feature Counts
• normproto: Character Normalization Features

Tesseract Character Data
• Training page images -> Tesseract + manual correction -> Box files
• Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset
• Manual Data Entry -> DangAmbigs
• unicharset: List of Characters + ctype information
• DangAmbigs: Typical OCR errors, e.g. e<->c, rn<->m, etc.

Accuracy Results
Comparison of current results against 1995 UNLV results

                    Character                     Non-stopword
Testid  Testset     Errors  Accuracy  Change      Errors  Accuracy  Change
1995    Bus.3B        5959    98.14%                1293    95.73%
1995    Doe3.3B      36349    97.52%                7042    94.87%
1995    Mag.3B       15043    97.74%                3379    94.99%
1995    News.3B       6432    98.69%                1502    96.94%
Gcc4.1  Bus.3B        6258    98.04%     5.02%      1312    95.67%     1.47%
Gcc4.1  Doe3.3B      28589    98.05%   -21.35%      6692    95.12%    -4.97%
Gcc4.1  Mag.3B       14800    97.78%    -1.62%      3123    95.37%    -7.58%
Gcc4.1  News.3B       7524    98.47%    16.98%      1220    97.51%   -18.77%
Gcc4.1  Total        57171             -10.37%     12347              -6.58%

Commercial OCR v Tesseract
Commercial OCR
• 100+ languages.
• Accuracy is good now.
• Sophisticated app with complex UI.
• Works on complex magazine pages.
• Windows Mostly.
• Costs $130-$500
Tesseract
• 6 languages + growing.
• Accuracy was good in 1995.
• No UI yet.
• Page layout analysis coming soon.
• Runs on Linux, Mac, Windows, more...
• Open source – Free!

Tesseract Future
• Page layout analysis.
• More languages.
• Improve accuracy.
• Add a UI.

The End
• For more information see: http://code.google.com/p/tesseract-ocr
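A note on the Accuracy Results slide: the Change figures appear to be the relative change in error count between the 1995 run and the Gcc4.1 run on the same test set, so a negative value means fewer errors now. The sketch below recomputes them from the error counts given in the slide; the function and variable names are hypothetical, and the formula is inferred from the numbers rather than stated in the talk.

```python
from typing import Dict, Tuple


def relative_change(old_errors: int, new_errors: int) -> float:
    """Percentage change in error count from the old run to the new run."""
    return 100.0 * (new_errors - old_errors) / old_errors


# (1995 errors, Gcc4.1 errors) per test set, copied from the slide.
CHAR_ERRORS: Dict[str, Tuple[int, int]] = {
    "Bus.3B": (5959, 6258),
    "Doe3.3B": (36349, 28589),
    "Mag.3B": (15043, 14800),
    "News.3B": (6432, 7524),
}
NON_STOPWORD_ERRORS: Dict[str, Tuple[int, int]] = {
    "Bus.3B": (1293, 1312),
    "Doe3.3B": (7042, 6692),
    "Mag.3B": (3379, 3123),
    "News.3B": (1502, 1220),
}


def total_change(errors: Dict[str, Tuple[int, int]]) -> float:
    """Change computed on the summed error counts over all test sets."""
    old_total = sum(old for old, _ in errors.values())
    new_total = sum(new for _, new in errors.values())
    return relative_change(old_total, new_total)


if __name__ == "__main__":
    for label, table in (("Character", CHAR_ERRORS), ("Non-stopword", NON_STOPWORD_ERRORS)):
        for testset, (old, new) in table.items():
            print(f"{label:12s} {testset:8s} {relative_change(old, new):+.2f}%")
        print(f"{label:12s} {'Total':8s} {total_change(table):+.2f}%")
```

Under this reading, the Total row is computed on the summed error counts rather than as an average of the per-set percentages, which is consistent with the totals of 57171 and 12347 reported for the Gcc4.1 run.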
