From Pages to Places: Using Large Language Models to Extract PERSON-PLACE Relations from Wikipedia Articles
2024 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Från sidor till platser: användning av Stora Språkmodeller (LLM:er) för att extrahera PERSON-PLATS-relationer från Wikipediaartiklar (Swedish)
Abstract [en]
This thesis explores the capabilities of Large Language Models (LLMs) for the extraction of PERSON-PLACE relations from biographic text. Compared with established methods, these models would reduce the required knowledge and training data and ease the generation of spatial data from such text sources. An easier extraction method could advance the generation of spatial datasets based on biographies in many areas.

With a newly compiled PERSON-PLACE relation dataset based on Wikipedia biography articles, the capabilities of an example LLM are compared with the performance of a baseline method based on the spaCy Python package. The baseline works by looping through extracted sentences and named entities, while the LLM extraction is based on few-shot prompting. The comparison is done by evaluating a sentence-based F1 score and a secondary document-based Fβ score. The dataset gives insight into the distributions of the types of relations and spatial entities. The results show that the baseline outperforms the Large Language Model in almost all examined metrics with regard to the F1 scores. The secondary Fβ analysis reveals more equal performance. The underperformance of the Large Language Model is likely due to the complexity of the extraction task, the strictness of the comparison, and the rudimentary interactions with the model. The specific few-shot method and comparison do not show promise for the future. However, the novel dataset and the developed methods lay the groundwork for further research in this area.
Place, publisher, year, edition, pages
2024.
Series
TRITA-ABE-MBT ; 2559
Keywords [en]
Large Language Model, Relation Extraction, Wikipedia Biographies, Geo-Text, spaCy, Few-shot prompting
National Category
Other Computer and Information Science
Other Health Sciences
Identifiers
URN: urn:nbn:se:kth:diva-362080
OAI: oai:DiVA.org:kth-362080
DiVA, id: diva2:1950021
Presentation
2024-12-04, 00:00 (English)
2025-04-04, 2025-04-04, 2025-04-07
Bibliographically approved