Applying Natural Language Processing to document classification
KTH, School of Electrical Engineering and Computer Science (EECS).
2022 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Tillämpning av Naturlig Språkbehandling för dokumentklassificering (Swedish)
Abstract [en]

In today's digital world, we produce and use more electronic documents than ever before, and this trend is far from slowing down. In particular, more and more companies now need to process a considerable number of documents to handle their clients' requests. Scaling this process often requires building an automatic document processing pipeline. Since the processing of a document depends on its content, such pipelines rely heavily on an automatic document classifier to route the received documents correctly. Such a classifier should be able to receive a document of any type and output its class based on the document's text content. In this thesis, we designed and implemented a machine learning pipeline for the automated classification of insurance claims documents. To find the best pipeline, we created several combinations of different classifiers (logistic regression and random forest) and embedding models (Fasttext and Doc2vec). We then compared the performance of all the pipelines using the precision and accuracy metrics. We found that a pipeline combining a Fasttext embedding model with a logistic regression classifier performed best, yielding a precision of 85% and an accuracy of 86% on our dataset.
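The comparison described in the abstract can be illustrated with a minimal sketch. It only enumerates the four embedding/classifier combinations named above and shows how accuracy and precision are computed from predictions; it does not train any model. The abstract does not say how precision was averaged across classes, so macro averaging is an assumption here, and the class names and labels below are hypothetical toy data, not the thesis dataset.

```python
from itertools import product

# Component names come from the abstract; the grid itself is an illustration.
EMBEDDINGS = ["fasttext", "doc2vec"]
CLASSIFIERS = ["logistic_regression", "random_forest"]


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision TP / (TP + FP).

    Macro averaging is an assumption; a class that is never predicted
    contributes a precision of 0.
    """
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        # True labels of every document that was predicted as class c.
        truths = [t for t, p in zip(y_true, y_pred) if p == c]
        if truths:
            per_class.append(sum(t == c for t in truths) / len(truths))
        else:
            per_class.append(0.0)
    return sum(per_class) / len(per_class)


# Every (embedding, classifier) pair that would be trained and scored.
pipelines = list(product(EMBEDDINGS, CLASSIFIERS))

# Hypothetical labels for six documents, for illustration only.
y_true = ["invoice", "invoice", "report", "report", "photo", "photo"]
y_pred = ["invoice", "report", "report", "report", "photo", "invoice"]

toy_accuracy = accuracy(y_true, y_pred)
toy_precision = macro_precision(y_true, y_pred)
```

In a real comparison, each pair in `pipelines` would be fitted on training documents and scored with these two functions on held-out documents, and the pair with the best scores would be selected.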

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2022, p. 36
Series
TRITA-EECS-EX ; 2022:745
Keywords [en]
Natural Language Processing, Document Classification, Embeddings, Classifiers
National Category
Computer Sciences; Computer Engineering; Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-321658
OAI: oai:DiVA.org:kth-321658
DiVA, id: diva2:1712026
External cooperation
Shift Technology
Subject / course
Computer Science
Educational program
Master of Science in Engineering - Computer Science and Technology
Presentation
2022-08-08, via Zoom https://kth-se.zoom.us/j/65554570355, Paris, 10:00 (English)
Supervisors
Examiners
Available from: 2023-01-21
Created: 2022-11-19
Last updated: 2025-01-27
Bibliographically approved

Open Access in DiVA

fulltext (349 kB), 1156 downloads
File information
File name: FULLTEXT01.pdf
File size: 349 kB
Checksum (SHA-512): e59bc4d747a4f177cae3a3945de928cf2563872e73474bf359c3851f948be5ae3652ede08b3ac0ac342ac74d2b55231c21d02b66d0528a07005d3c256ec2bb42
Type: fulltext
Mimetype: application/pdf

Total: 1156 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 354 hits