Suitability of OCR Engines in Information Extraction Systems: a Comparative Evaluation
KTH, School of Electrical Engineering and Computer Science (EECS).
2019 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]

Previous research has compared the performance of OCR (optical character recognition) engines strictly for character recognition purposes. However, the suitability of OCR engines as an intermediate tool for information extraction systems has not previously been examined thoroughly. This thesis compares the two popular OCR engines Tesseract OCR and Google Cloud Vision for use in an information extraction system that automatically extracts data from a financial PDF document. It also highlights findings on which features of an OCR engine matter most for use in an information extraction system, in terms of both output structure and recognition accuracy. The results show a statistically significant increase in accuracy for the Tesseract implementation compared to the Google Cloud Vision one, despite previous research showing that Google Cloud Vision outperforms Tesseract in terms of accuracy. This was attributed to Tesseract producing more predictable output in terms of structure, as well as to the nature of the document, which allowed smaller OCR processing mistakes to be corrected during the extraction stage. The extraction system makes use of the aforementioned OCR correction procedures as well as an ad-hoc type system based on the nature of the document and its fields in order to further increase the accuracy of the holistic system. Results for each of the extraction modes for each OCR engine are presented in terms of average accuracy across the test suite consisting of 115 documents.
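
The abstract does not describe how the extraction stage corrects small OCR mistakes, so the following is only an illustrative sketch of the general idea of type-driven correction: if a field is known to hold a decimal amount, characters that OCR engines commonly confuse with digits can be mapped back before validation. The function name `correct_amount`, the confusion table, and the assumption that a comma is the decimal separator are all hypothetical and are not taken from the thesis.

```python
import re
from typing import Optional

# Hypothetical confusion table (an assumption, not from the thesis):
# letters that OCR output may contain in place of visually similar digits.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def correct_amount(raw: str) -> Optional[str]:
    """Try to repair an OCR token that is expected to be a decimal amount.

    Returns the normalised amount as a string, or None when the token
    still does not look like a number after correction.
    """
    # Map look-alike letters to digits and drop spaces used as group separators.
    candidate = raw.strip().translate(DIGIT_LOOKALIKES).replace(" ", "")
    # Assumption: the document uses a comma as the decimal separator.
    candidate = candidate.replace(",", ".")
    # Accept only plain decimals; anything else is rejected rather than guessed.
    if re.fullmatch(r"\d+(\.\d+)?", candidate):
        return candidate
    return None
```

Because the expected type of the field constrains what counts as valid, a token such as `1O5,2S` can be repaired, while a token that is not a number at all is rejected instead of being silently accepted.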

Abstract [sv, translated to English]

Previous research has compared the performance of OCR (optical character recognition) engines solely in terms of their character recognition capabilities. Comparisons of OCR engines as tools for information extraction systems, however, have not been made before. This thesis compares the two popular OCR engines Tesseract OCR and Google Cloud Vision for use in a system for automatic extraction of data from a financial PDF document. The work also highlights observations on which properties of an OCR engine matter most for use in an information extraction system. The results showed a statistically significant increase in accuracy for the Tesseract implementation compared with Google Cloud Vision, despite previous research showing that Google Cloud Vision can perform character recognition more accurately. This is attributed to the fact that Tesseract produces more consistent output in terms of structure, and that some incorrect character readings can be corrected by the extraction system. The extraction system uses the OCR correction methods mentioned above together with an ad-hoc type system based on the content of the document to increase the accuracy of the holistic system. These methods can also be isolated into individual extraction modes. Results for each extraction mode are presented as average accuracy across the test suite, which consisted of 115 documents.

Place, publisher, year, edition, pages
2019. p. 75
Series
TRITA-EECS-EX ; 2019:490
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-255021
OAI: oai:DiVA.org:kth-255021
DiVA, id: diva2:1337270
Examiners
Available from: 2019-07-12 Created: 2019-07-12 Last updated: 2019-07-12 Bibliographically approved

Open Access in DiVA

fulltext (1948 kB), 14 downloads
File information
File name: FULLTEXT01.pdf
File size: 1948 kB
Checksum (SHA-512): f692b60499c4b53043c6677f652610965ec9b373b0e040b2a47978923de764d9a3dd12385bc7a2183674ca0d3079887699ffa37e19baf5fcabb9e5934efca0f1
Type: fulltext
Mimetype: application/pdf
