kth.se Publications KTH
AI-stöd för kvalitetssäkring och informationssökning i teknisk dokumentation: En jämförelse mellan regelbaserade och AI-baserade metoder för GDPR-detektion och semantisk sökning
KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Biomedical Engineering and Health Systems, Health Informatics and Logistics.
KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Biomedical Engineering and Health Systems, Health Informatics and Logistics.
2026 (Swedish)
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Alternative title
AI-assisted quality assurance and information retrieval in technical documentation: A comparison between rule-based and AI-based methods for GDPR detection and semantic search (English)
Abstract [sv]

I IT-organisationer växer mängden intern dokumentation kontinuerligt, vilket medför två utbredda problem: känslig information exponeras oavsiktligt i löpande text och befintliga söksystem hittar inte relevant innehåll om exakt rätt ord inte används. Traditionella verktyg klarar inte av att hantera dessa problem eftersom de saknar förmåga att tolka sammanhang. Detta examensarbete undersöker hur AI-baserade metoder kan användas för att automatisera identifiering av känslig information och förbättra informationsåtervinning i teknisk dokumentation lagrad i wiki-miljöer. Arbetet är utfört hos IT-tjänsteföretaget Axians.

Arbetet utvärderar tre metoder för identifiering av känslig information: reguljära uttryck, Microsoft Presidio och en generativ språkmodell, samt jämför nyckelordsbaserad sökning med semantisk sökning och en hybridmetod. Metoderna testades på syntetiskt genererade dokument som efterliknar teknisk IT-dokumentation på svenska och engelska. Resultaten från det syntetiska testsetet indikerar att den generativa språkmodellen Mistral 7B identifierar känslig information med hög träffsäkerhet, medan de regelbaserade metoderna missar en stor del av de kontextberoende entiteterna. För sökning uppnådde det semantiska systemet bättre resultat i det använda testsetet, särskilt på synonymfrågor och tvärspråkliga sökningar. Slutsatsen är att AI-baserade metoder har stor potential att förbättra både datasäkerhet och informationstillgänglighet i tekniska dokumentationsmiljöer.

Abstract [en]

In IT organizations, the volume of internal documentation grows continuously, giving rise to two widespread problems: sensitive information is unintentionally exposed in plain text, and existing search functionality fails to retrieve relevant content unless exact terminology is used. Traditional tools are insufficient for addressing these problems as they lack the ability to interpret context. This thesis investigates how AI-based methods can be used to automate the identification of sensitive information and improve information retrieval in technical documentation stored in wiki environments. The work was conducted at the IT services company Axians.

The study evaluates three methods for identifying sensitive information: regular expressions, Microsoft Presidio and a generative language model, and compares keyword-based search with semantic search and a hybrid method. The methods were tested on synthetically generated documents resembling technical IT documentation in Swedish and English. The results from the synthetic test set indicate that the generative language model Mistral 7B identifies sensitive information with high recall, while the rule-based methods miss a substantial portion of the context-dependent entities. For information retrieval, the semantic system achieved better results in the evaluated test set, particularly on synonym queries and cross-lingual searches. The conclusion is that AI-based methods have significant potential to improve both data security and information accessibility in technical documentation environments.
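
As an illustration of the rule-based approach mentioned in the abstract, the following is a minimal sketch of regular-expression detection. The patterns, labels, the `find_sensitive` function and the sample text are all hypothetical and chosen for illustration only; they are not taken from the thesis, and real patterns would need validation against actual data.

```python
import re

# Hypothetical patterns for illustration; not the ones used in the thesis.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Swedish personal identity number, written as YYMMDD-NNNN or YYYYMMDD-NNNN.
    "SWEDISH_PERSONAL_ID": re.compile(r"\b(?:\d{2})?\d{6}[-+]\d{4}\b"),
    # Loose IPv4 shape; does not check that each octet is at most 255.
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_sensitive(text):
    """Return (label, matched_text) pairs ordered by position in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(hits)]


sample = (
    "Contact anna.svensson@example.com about server 10.0.0.12. "
    "Her personal identity number is 850312-1234. "
    "The admin password is the name of the office dog."
)

for label, value in find_sensitive(sample):
    print(label, value)
```

In this sketch the three pattern-shaped items are found, while the last sentence of the sample, which discloses a secret only through its meaning, produces no match. That is the kind of context-dependent entity the abstract reports rule-based methods as missing.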

Place, publisher, year, edition, pages
2026. p. 58
Series
TRITA-CBH-GRU ; 2026:131
Keywords [en]
GDPR detection, semantic search, technical documentation, large language models, information retrieval, AI, wiki, rule-based methods, embeddings, Microsoft Presidio
Keywords [sv]
GDPR-detektion, semantisk sökning, teknisk dokumentation, stora språkmodeller, informationsåtervinning, AI, wiki, regelbaserade metoder, RAG, Microsoft Presidio
National Category
Artificial Intelligence
Identifiers
URN: urn:nbn:se:kth:diva-382846
OAI: oai:DiVA.org:kth-382846
DiVA, id: diva2:2064943
Subject / course
Computer Technology, Program- and System Development
Educational program
Bachelor of Science in Engineering - Computer Engineering
Supervisors
Examiners
Available from: 2026-06-03 Created: 2026-06-02 Last updated: 2026-06-03 Bibliographically approved

Open Access in DiVA

fulltext (856 kB), 17 downloads
File information
File name: FULLTEXT01.pdf
File size: 856 kB
Checksum (SHA-512): 20c0aa2c4e5da3752e2a7a98ce8c38ec9815cb7f03a7a6316a3dad3852f6e50442101d7dde7e72dcc8acad222e9e86ef7f99fb355f2d0a272683f7d1ed463492
Type: fulltext
Mimetype: application/pdf
