Multiple Entity Reconciliation
KTH, School of Information and Communication Technology (ICT).
2015 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]

Living in the age of "Big Data" is both a blessing and a curse. On the one hand, the raw data can be analysed and then used for weather predictions, user recommendations, targeted advertising and more. On the other hand, when data is aggregated from multiple sources, there is no guarantee that each source has stored the data in a format that is standardized, or even compatible with what the application requires. The available data therefore needs to be parsed and converted to the desired form. This is where problems arise: correspondences between data instances that belong to the same domain but come from different sources are often not straightforward. For example, in the film industry, information about movies (cast, characters, ratings etc.) can be found on numerous websites such as IMDb or Rotten Tomatoes. Finding and matching all the data referring to the same movie is a challenge.

The aim of this project is to select the most efficient algorithm for automatically correlating movie-related information gathered from various websites. We have implemented a flexible application that allows us to compare the performance of multiple algorithms based on machine learning techniques. According to our experimental results, a well-chosen set of rules performs on par with a neural network; these two prove to be the most effective classifiers for records containing movie information.

Place, publisher, year, edition, pages
2015, 63 p.
TRITA-ICT-EX, 2015:211
Keyword [en]
entity matching, data linkage, data quality, machine learning, text processing
National Category
Computer and Information Science
URN: urn:nbn:se:kth:diva-187010
OAI: diva2:928531
Available from: 2016-05-16 Created: 2016-05-16 Last updated: 2016-05-16
Bibliographically approved

Open Access in DiVA

fulltext (970 kB), 24 downloads
File information
File name: FULLTEXT01.pdf
File size: 970 kB
Checksum: SHA-512
Type: fulltext
Mimetype: application/pdf
