Multimodal Relation Extraction of Product Categories in Market Research Data
KTH, School of Electrical Engineering and Computer Science (EECS).
2019 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
Abstract [en]

Large amounts of unstructured data are constantly being generated and made available through websites and documents. Relation extraction, the task of automatically extracting semantic relationships between entities from such data, is therefore considered to have high commercial value. However, many websites and documents are richly formatted, i.e., they communicate information through non-textual expressions such as tabular or visual elements. This thesis proposes a framework for relation extraction from such data, in particular from documents in the market research area. The framework performs relation extraction by applying supervised learning using both textual and visual features from PDF documents. Moreover, it allows the user to train a model without any manually labeled data by implementing labeling functions.

We evaluate our framework by extracting relations from a corpus of market research documents on consumer goods. The extracted relations associate categories with products of different brands. We find that our framework outperforms a simple baseline model, although we are unable to show the effectiveness of incorporating visual features on our test set. We conclude that our framework can serve as a prototype for relation extraction from richly formatted data, although more advanced techniques are necessary to make use of non-textual features.
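
The idea of training without manually labeled data by writing labeling functions can be illustrated with a minimal sketch. It does not reproduce the framework from the thesis: the `Candidate` fields, the three heuristics, and the unweighted majority vote used to combine them are all assumptions chosen for illustration, and the product and category names are made up.

```python
from dataclasses import dataclass
from typing import Callable, List

# Label values a labeling function may return (hypothetical convention).
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


@dataclass
class Candidate:
    """A (product, category) pair that may or may not be a true relation."""
    product: str
    category: str
    text: str             # text surrounding the two mentions
    same_table_row: bool  # hypothetical layout feature from the PDF


def lf_is_a_pattern(c: Candidate) -> int:
    # Textual heuristic: "<product> is a <category>" appears in the text.
    pattern = f"{c.product} is a {c.category}".lower()
    return POSITIVE if pattern in c.text.lower() else ABSTAIN


def lf_same_table_row(c: Candidate) -> int:
    # Tabular heuristic: both mentions were found in the same table row.
    return POSITIVE if c.same_table_row else ABSTAIN


def lf_negation(c: Candidate) -> int:
    # Textual heuristic: the word "not" in the text suggests a non-relation.
    return NEGATIVE if " not " in f" {c.text.lower()} " else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[Candidate], int]] = [
    lf_is_a_pattern,
    lf_same_table_row,
    lf_negation,
]


def weak_label(c: Candidate) -> int:
    """Combine all votes by unweighted majority; ties and no votes abstain."""
    total = sum(lf(c) for lf in LABELING_FUNCTIONS)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


if __name__ == "__main__":
    c = Candidate("Acme Cola", "soft drink",
                  "Acme Cola is a soft drink sold in cans.", False)
    print(weak_label(c))
```

Candidates that receive a non-abstaining weak label could then serve as noisy training data for a supervised classifier. A majority vote is the simplest way to combine the votes; weak supervision systems typically learn how much to trust each labeling function instead.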

Abstract [sv] (English translation)

Nowadays, large amounts of unstructured data are constantly being generated and made available through websites and documents. Relation extraction, i.e. the task of extracting semantic relations from such data, is therefore considered to have high commercial value. However, many websites and documents are richly formatted, i.e. they communicate information through non-textual expressions, for example tabular or visual elements. This thesis presents a framework for relation extraction from such data, with a focus on data concerning market research. The framework performs relation extraction via supervised learning, using both textual and visual features of the documents. In addition, the framework allows the user to train a model without manually annotated data by implementing so-called labeling functions.

We evaluate our framework on a collection of market research documents concerning consumer goods. The extracted relations link categories to products of different brands. We find that our framework is better than a trivial model, but we are unable to demonstrate any major positive effect of exploiting the visual properties of the documents. We conclude that our framework can serve as a prototype for relation extraction from richly formatted data, but that more advanced methods are necessary to exploit non-textual document properties.

Place, publisher, year, edition, pages
2019. p. 58
Series
TRITA-EECS-EX ; 2019:791
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-271190
OAI: oai:DiVA.org:kth-271190
DiVA, id: diva2:1415933
Subject / course
Computer Science
Educational program
Master of Science - Machine Learning
Available from: 2020-03-20. Created: 2020-03-20. Last updated: 2020-03-20. Bibliographically approved.

Open Access in DiVA

fulltext (3191 kB), 3 downloads
File information
File name: FULLTEXT01.pdf
File size: 3191 kB
Checksum SHA-512: ade8582d00d0a408bbee835db60f8f062c1492caa4245480d917170bf4e0688af95d083e30f533de6f290a65f5baf664f57a32b26b577e6c31130dc780d27dd4
Type: fulltext
Mimetype: application/pdf

