kth.se Publications
Word Classes in Language Modelling
KTH, School of Engineering Sciences (SCI).
KTH, School of Engineering Sciences (SCI).
2024 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
Abstract [en]

This thesis concerns itself with word classes and their application to language modelling. Considering a purely statistical Markov model trained on sequences of word classes in the Swedish language, different problems in language engineering are examined. The problems considered are part-of-speech tagging, evaluating text modifiers such as translators with the help of probability measurements and matrix norms, and lastly detecting different types of text using the Fourier transform of cross entropy sequences of word classes. The results show that the word class language model is quite weak by itself but that it is able to improve part-of-speech tagging for 1- and 2-letter models. There are indications that a stronger word class model could aid 3-letter and potentially even stronger models. For evaluating modifiers, the model is often able to distinguish shuffled, and sometimes translated, text, as well as to assign a score for how much a text has been modified. Future work on this should, however, take better care to ensure sufficiently large test data. The results from the Fourier approach indicate that a Fourier analysis of the cross entropy sequence between word classes may allow the model to distinguish both AI-generated and translated text from human-written text. Future work on machine learning word class models could be carried out to gain further insight into the role of word class models in modern applications. The results could also give interesting insights for linguistic research regarding word classes.
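The pipeline named in the abstract (a Markov model over word classes, a cross entropy sequence, and its Fourier transform) can be illustrated with a minimal sketch. This is not the thesis's implementation: the first-order model, the add-alpha smoothing, the three-tag set, the toy sequences, and the use of per-transition negative log probability as a stand-in for the "cross entropy sequence" are all assumptions made for illustration, and every function name is hypothetical.

```python
import cmath
import math


def train_transitions(tag_sequences, tags, alpha=1.0):
    """Estimate a first-order transition matrix P(next | prev) over word classes.

    Add-alpha smoothing is an assumption of this sketch, not taken from the thesis.
    """
    counts = {p: {n: 0.0 for n in tags} for p in tags}
    for seq in tag_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    probs = {}
    for p in tags:
        total = sum(counts[p].values()) + alpha * len(tags)
        probs[p] = {n: (counts[p][n] + alpha) / total for n in tags}
    return probs


def surprisal_sequence(seq, probs):
    """Per-transition -log2 P(next | prev): an assumed stand-in for the cross entropy sequence."""
    return [-math.log2(probs[p][n]) for p, n in zip(seq, seq[1:])]


def dft_magnitudes(xs):
    """Magnitudes of the discrete Fourier transform, computed naively in O(n^2)."""
    n = len(xs)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(xs)))
        for k in range(n)
    ]


# Hypothetical tag set and toy training data, for illustration only.
TAGS = ["ADJ", "NOUN", "VERB"]
TRAIN = [["ADJ", "NOUN", "VERB"], ["NOUN", "VERB", "ADJ", "NOUN"]]

if __name__ == "__main__":
    model = train_transitions(TRAIN, TAGS)
    sequence = surprisal_sequence(["ADJ", "NOUN", "VERB", "ADJ"], model)
    print(dft_magnitudes(sequence))
```

Under these assumptions, the spectra of sequences from different kinds of text could then be compared; how the thesis actually defines and compares them is not stated in this record.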

Place, publisher, year, edition, pages
2024.
Series
TRITA-SCI-GRU ; 2024:152
Keywords [en]
Word class, Language Model, POS-tagging, n-gram, Markov Model, Transition Matrix, Matrix norm, Cross Entropy, Discrete Fourier Transform
National Category
Mathematics
Identifiers
URN: urn:nbn:se:kth:diva-349074
OAI: oai:DiVA.org:kth-349074
DiVA, id: diva2:1879630
Educational program
Master of Science in Engineering - Engineering Physics
Supervisors
Examiners
Available from: 2024-06-28. Created: 2024-06-28. Last updated: 2024-06-28. Bibliographically approved

Open Access in DiVA

fulltext (1393 kB), 255 downloads
File information
File name: FULLTEXT01.pdf
File size: 1393 kB
Checksum (SHA-512): 431861226264749690094ee81b4758b8ab8e623e1a3983041233d72cfd228e97c8964de381ced973cce473ecda96e4ea2561de20d7dec3961361d74435b7bdfc
Type: fulltext
Mimetype: application/pdf

Total: 255 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 304 hits