Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Automated invoice handling with machine learning and OCR
KTH, School of Technology and Health (STH), Medical Engineering, Computer and Electronic Engineering.
KTH, School of Technology and Health (STH), Medical Engineering, Computer and Electronic Engineering.
2016 (English)Independent thesis Basic level (professional degree), 10 credits / 15 HE creditsStudent thesisAlternative title
Automatiserad fakturahantering med maskininlärning och OCR (Swedish)
Abstract [en]

Companies often process invoices manually, therefore automation could reduce manual labor. The aim of this thesis is to evaluate which OCR-engine, Tesseract or OCRopus, performs best at interpreting invoices. This thesis also evaluates if it is possible to use machine learning to automatically process invoices based on previously stored data.

By interpreting invoices with the OCR-engines, it results in the output text having few spelling errors. However, the invoice structure is lost, making it impossible to interpret the corresponding fields. If Naïve Bayes is chosen as the algorithm for machine learning, the prototype can correctly classify recurring invoice lines after a set of data has been processed.

The conclusion is, neither of the two OCR-engines can interpret the invoices to plain text making it understandable. Machine learning with Naïve Bayes works on invoices if there is enough previously processed data. The findings in this thesis concludes that machine learning and OCR can be utilized to automatize manual labor.

Abstract [sv]

Företag behandlar oftast fakturor manuellt och en automatisering skulle kunna minska fysiskt arbete. Målet med examensarbetet var att undersöka vilken av OCR-läsarna, Tesseract och OCRopus som fungerar bäst på att tolka en inskannad faktura. Även undersöka om det är möjligt med maskininlärning att automatiskt behandla fakturor utifrån tidigare sparad data.

Genom att tolka text med hjälp av OCR-läsarna visade resultaten att den producerade texten blev språkligt korrekt, men att strukturen i fakturan inte behölls vilket gjorde det svårt att tolka vilka fält som hör ihop. Naïve Bayes valdes som algoritm till maskininlärningen och resultatet blev en prototyp som korrekt kunde klassificera återkommande fakturarader, efter att en mängd träningsdata var behandlad.

Slutsatsen är att ingen av OCR-läsarna kunde tolka fakturor så att resultatet kunde användas vidare, och att maskininlärning med Naïve Bayes fungerar på fakturor om tillräckligt med tidigare behandlad data finns. Utfallet av examensarbetet är att maskininlärning och OCR kan användas för att automatisera fysiskt arbete.

Place, publisher, year, edition, pages
2016. , 68 p.
Series
TRITA-STH, 2016:53
Keyword [en]
Machine learning, Naïve Bayes, OCR, OCRopus, Tesseract, Invoice handling
Keyword [sv]
Maskininlärning, Naïve Bayes, OCR, OCRopus, Tesseract, Fakturahantering
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:kth:diva-188202OAI: oai:DiVA.org:kth-188202DiVA: diva2:934351
External cooperation
DGC
Subject / course
Computer Technology, Program- and System Development
Educational program
Bachelor of Science in Engineering - Computer Engineering
Supervisors
Examiners
Available from: 2016-09-29 Created: 2016-06-08 Last updated: 2016-09-29Bibliographically approved

Open Access in DiVA

ThesisAndreasTony(1545 kB)219 downloads
File information
File name FULLTEXT01.pdfFile size 1545 kBChecksum SHA-512
ff797ac4c1ffb82769f203bbfe2d05d2c77d160e23d8b3fd046a3b80da15bff9f3e77f7f4eb076383d3ab1039648eff403a6a3f898a3d08d19e60c50df8f0e4d
Type fulltextMimetype application/pdf

By organisation
Computer and Electronic Engineering
Software Engineering

Search outside of DiVA

GoogleGoogle Scholar
Total: 219 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 723 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf