IBM Model 4 Alignment Comparison: An evaluation of how the size of training data affects the interpretation accuracy and training time for two alignment models that translates natural language
KTH, School of Computer Science and Communication (CSC).
2016 (English)
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Alternative title
Jämförelse av IBM Model 4-alignment : En jämförelse av hur storleken på träningsdata påverkar tolkningsnoggrannheten och träningstiden för två alignment-modeller som översätter naturligt språk (Swedish)
Abstract [en]

In modern society, the amount of information processed by computers is increasing every day. Computer translation has the potential to speed up communication between humans as well as human-computer interaction. For Statistical Machine Translation, word alignment is key. How large does a corpus need to be to align a natural-language sentence with a sentence in a simple, unambiguous language? We investigate this by running a simple algorithm and comparing its results with those of an industry equivalent. The results show that the simplified model needs a larger corpus when there are more words per sentence. For IBM Model 4, conversely, more words per sentence decrease the corpus size necessary to make good predictions. We therefore conclude that, for both models, the required corpus size depends on the number of terms in each sentence.

Abstract [sv] (English translation)

In our modern society, more information is processed every day. Computerized translation has the potential to speed up communication between people as well as human-computer interaction. For Statistical Machine Translation, word alignment plays a large part. How large must a corpus be for sentences to be translated correctly, with high probability, from a natural language into a simple, unambiguous language? We test this by comparing a simple algorithm with an algorithm used in industry. The results show that the more words there are in the sentence to be translated, the larger the corpus must be. With IBM Model 4, the results improve as the number of words per sentence grows, so the corpus size can be reduced. Our conclusion is that the corpus size depends on the number of arithmetic terms for both models.

Place, publisher, year, edition, pages
2016.
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-187744
OAI: oai:DiVA.org:kth-187744
DiVA: diva2:931496
Available from: 2016-05-28. Created: 2016-05-28. Last updated: 2016-05-28. Bibliographically approved.

Open Access in DiVA

fulltext (1237 kB), 99 downloads
File information
File name: FULLTEXT01.pdf
File size: 1237 kB
Checksum (SHA-512): 98673321f602e5670ab5789f949fbedc762e8c7c25b391de576177569a3bdfcf74ec3f4a9c4b0f381fd2aa149263285259d40930461dbb2b05f3ea6503fbeeec
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Siebecke, Maria; Arvidson, Tor
By organisation
School of Computer Science and Communication (CSC)
Computer Science

Total: 99 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 66 hits