Improving a Few-shot Named Entity Recognition Model Using Data Augmentation
KTH, School of Electrical Engineering and Computer Science (EECS).
2022 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Alternative title
Förbättring av en existerande försöksmodell för namnidentifiering med få exempel genom databerikande åtgärder (Swedish)
Abstract [en]

Labeling words of interest with a predefined set of named entity types has traditionally required a large amount of labeled in-domain data. Recently, the availability of pre-trained transformer-based language models has enabled many natural language processing problems to be addressed with transfer learning, so that machine learning models can be built from less task-specific labeled data. This thesis explores the impact of data augmentation when adapting a pre-trained transformer-based model to a named entity recognition task with few labeled sentences. The experimental results indicate that data augmentation improves the performance of the trained models, although its impact diminishes as more labeled data becomes available. In conclusion, data augmentation improves the performance of pre-trained named entity recognition models when few labeled sentences are available for training.
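
The abstract describes enlarging a small set of labeled sentences through data augmentation before fine-tuning a pre-trained transformer-based model (e.g. BERT) for named entity recognition. As an illustration only, the sketch below shows one common label-preserving augmentation for BIO-tagged NER data, entity mention replacement, in which each mention is swapped for another mention of the same type drawn from the labeled sentences. The function names, tag set, and toy sentences are hypothetical and not taken from the thesis, which may use different augmentation strategies.

import random
from typing import Dict, List, Tuple

# A labeled sentence is a list of (token, BIO-tag) pairs, e.g. ("KTH", "B-ORG").
Sentence = List[Tuple[str, str]]


def extract_mentions(sentences: List[Sentence]) -> Dict[str, List[List[str]]]:
    """Collect every entity mention, grouped by entity type, from labeled sentences."""
    mentions: Dict[str, List[List[str]]] = {}
    for sent in sentences:
        current: List[str] = []
        current_type = ""
        for token, tag in sent + [("", "O")]:  # sentinel flushes a trailing mention
            if tag.startswith("B-"):
                if current:
                    mentions.setdefault(current_type, []).append(current)
                current, current_type = [token], tag[2:]
            elif tag.startswith("I-") and current:
                current.append(token)
            else:
                if current:
                    mentions.setdefault(current_type, []).append(current)
                current, current_type = [], ""
    return mentions


def mention_replacement(sentence: Sentence,
                        mentions: Dict[str, List[List[str]]],
                        rng: random.Random) -> Sentence:
    """Return a copy of the sentence with each entity mention replaced by a
    randomly chosen mention of the same type, keeping the BIO tags consistent."""
    augmented: Sentence = []
    i = 0
    while i < len(sentence):
        token, tag = sentence[i]
        if tag.startswith("B-"):
            ent_type = tag[2:]
            j = i + 1
            while j < len(sentence) and sentence[j][1] == "I-" + ent_type:
                j += 1
            replacement = rng.choice(mentions[ent_type])
            augmented.append((replacement[0], "B-" + ent_type))
            augmented.extend((tok, "I-" + ent_type) for tok in replacement[1:])
            i = j
        else:
            augmented.append((token, tag))
            i += 1
    return augmented


if __name__ == "__main__":
    # Hypothetical toy training set; real data would come from the NER corpus.
    train: List[Sentence] = [
        [("Alice", "B-PER"), ("works", "O"), ("at", "O"), ("Findwise", "B-ORG")],
        [("Bob", "B-PER"), ("studies", "O"), ("at", "O"), ("KTH", "B-ORG")],
    ]
    pool = extract_mentions(train)
    rng = random.Random(0)
    for sent in train:
        print(mention_replacement(sent, pool, rng))

In a few-shot setting such as the one studied here, the augmented sentences would simply be added to the few labeled originals before fine-tuning the pre-trained model.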

Abstract [sv]

Categorizing words that belong to one of a set of predefined entity types has traditionally required large amounts of labeled domain-specific data. In recent years, pre-trained language models have become available that make it possible to solve language processing problems with a smaller amount of labeled domain-specific data. This thesis explores the impact of data augmentation on a machine learning model for named entity recognition. The experimental results indicate that data augmentation improves the models, but that the effect becomes smaller as more labeled data is available. In summary, data augmentation can improve named entity recognition models when few labeled sentences are available for training.

Place, publisher, year, edition, pages
2022, p. 40
Series
TRITA-EECS-EX ; 2022:211
Keywords [en]
Named Entity Recognition, Data Augmentation, Self-training, BERT, Few-shot Learning
Keywords [sv]
Identifiering av namngivna entiteter, Datautökning, Självträning, BERT, Fåförsöksinlärning
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-318546
OAI: oai:DiVA.org:kth-318546
DiVA, id: diva2:1697686
External cooperation
Findwise
Subject / course
Computer Science
Educational program
Master of Science - Computer Science
Supervisors
Examiners
Available from: 2022-09-22. Created: 2022-09-21. Last updated: 2022-09-22. Bibliographically approved.

Open Access in DiVA

fulltext (512 kB)
File information
File name: FULLTEXT01.pdf
File size: 512 kB
Type: fulltext
Mimetype: application/pdf
Checksum (SHA-512): 43ddae89a6dcd68fce834e7c6d9787b2e2994a01e556f7adb70ad56b70ecbd4eee7ff1738d8ea93e19b792943beadc916af35f402d42254b5c60ff381f35d85d

