kth.sePublications KTH
Automatiserad nyckel-värdeextraktion från dokument med LLM:s
KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Biomedical Engineering and Health Systems, Health Informatics and Logistics.
KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Biomedical Engineering and Health Systems, Health Informatics and Logistics.
2025 (Swedish)
Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Alternative title
Automated key value extraction from documents with LLMs (English)
Abstract [sv]

Fastighetsbolag hanterar ofta stora mängder teknisk dokumentation i form av PDF-filer. Dessa dokument är ofta ostrukturerade och svåra att söka i, vilket försvårar insamling av information som behövs för till exempel hållbarhetsrapportering och digitalisering. Syftet med detta examensarbete var att utveckla en automatiserad metod för att extrahera relevanta nyckel-värde-par från sådana dokument och presentera dem i ett maskinläsbart format. Arbetet undersökte hur olika språkmodeller presterar i denna uppgift, samt hur olika promptningsstrategier för instruktioner påverkar resultaten. En prototyp utvecklades och testades, där text först extraherades ur PDF-dokument och sedan bearbetades av språkmodeller för att identifiera relevanta datapunkter. Trots vissa positiva tendenser nådde ingen av modellerna en återkallelse som kan klassas som hög, snarare låg till medelhög, särskilt när rätt kombination av modell och metod används. Arbetet visar att tekniken är lovande för att effektivisera informationshantering i branscher med mycket ostrukturerad dokumentation, även om ytterligare utveckling krävs innan metoden kan användas fullt automatiserat i praktiken.

Abstract [en]

Property management companies often handle large volumes of technical documentation in the form of PDF files. These documents are typically unstructured and difficult to search, which complicates the process of gathering information needed for sustainability reporting and digitization efforts. The aim of this thesis was to develop an automated method for extracting relevant key-value pairs from such documents and presenting them in a machine-readable format. The study explored how different language models perform in this task and how various prompting strategies affect the results. A prototype was developed and tested, where text was first extracted from PDF documents and then processed by language models to identify relevant data points. Despite some positive tendencies, none of the models reached a recall that can be classified as high; recall ranged from low to medium, with the better results obtained when the right combination of model and method was used. The work demonstrates that this approach holds promise for improving information management in industries that rely on large volumes of unstructured documentation, although further development is needed before full automation can be achieved in practice.
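As an illustration of the recall measure mentioned in the abstract, the following is a minimal sketch of how recall over key-value pairs could be computed. It is not the thesis's evaluation code: the function name `recall`, the whitespace and case normalization, the exact-match rule, and the sample property data are all assumptions made for this example.

```python
def recall(expected: dict[str, str], extracted: dict[str, str]) -> float:
    """Fraction of expected key-value pairs found in the extraction.

    Assumption: a pair counts as found only if the key exists and the
    value matches after lowercasing and collapsing whitespace.
    """
    if not expected:
        return 0.0

    def norm(text: str) -> str:
        # Lowercase and collapse runs of whitespace to a single space.
        return " ".join(text.lower().split())

    extracted_norm = {norm(k): norm(v) for k, v in extracted.items()}
    hits = sum(
        1 for k, v in expected.items() if extracted_norm.get(norm(k)) == norm(v)
    )
    return hits / len(expected)


# Hypothetical ground truth and model output, invented for illustration.
expected = {
    "Building year": "1987",
    "Heating system": "District heating",
    "Floor area": "1200 m2",
    "Energy class": "C",
}
extracted = {
    "building year": "1987",
    "Heating system": "district  heating",
    "Floor area": "1250 m2",  # wrong value, does not count as a hit
}

score = recall(expected, extracted)
print(f"recall = {score:.2f}")
```

In this invented example two of the four expected pairs are recovered, so the recall is 0.5. A stricter or looser matching rule (for instance, tolerating unit differences) would change the score, which is one reason recall figures are only comparable under the same matching rule.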

Place, publisher, year, edition, pages
2025. p. 64
Series
TRITA-CBH-GRU ; 084
Keywords [en]
automation, text extraction, large language models, technical documents, PDF, information management, key-value pairs, property data
Keywords [sv]
automatisering, textextraktion, stora språkmodeller, tekniska dokument, PDF, informationshantering, nyckel-värde-par, fastighetsdata
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:kth:diva-364273
OAI: oai:DiVA.org:kth-364273
DiVA, id: diva2:1965691
External cooperation
Property value AB
Educational program
Bachelor of Science in Engineering - Computer Engineering
Supervisors
Examiners
Available from: 2025-06-09. Created: 2025-06-09. Last updated: 2025-06-09. Bibliographically approved.

Open Access in DiVA

Automatiserad nyckel-värdeextraktion från dokument med LLM:s (816 kB), 70 downloads
File information
File name: FULLTEXT01.pdf
File size: 816 kB
Checksum (SHA-512): d3b91b382720742f712742f9ad23f9ad66fecc6b923ddd38d6a5094e5556f92b68dc39fe5dd807abcefa9e521af237285f7b2f9b5b496634dbb14e5b526bf681
Type: fulltext
Mimetype: application/pdf


Total: 70 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 256 hits