Creating a reusable English-Chinese parallel corpus for bilingual dictionary construction
KTH, School of Information and Communication Technology (ICT).
2010 (English). In: Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010, European Language Resources Association (ELRA), 2010, p. 1700-1705. Conference paper, Published paper (Refereed)
Abstract [en]

This paper first describes an experiment to construct an English-Chinese parallel corpus, then applies the Uplug word alignment tool to the corpus, and finally produces and evaluates an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually translated from Chinese to English. The parallel corpus contains 104 563 Chinese characters, equivalent to 59 918 Chinese words, and the corresponding English corpus contains 75 766 English words. However, Chinese writing does not use delimiters to mark word boundaries, so we had to carry out word segmentation as a preprocessing step on the Chinese corpus. Moreover, since the parallel corpus was downloaded from the Internet, it is noisy with regard to the alignment between corresponding translated sentences. We therefore spent 60 hours of manual work aligning the sentences in the English and Chinese parallel corpus before performing automatic word alignment using Uplug. The word alignment with Uplug was carried out from English to Chinese. Nine respondents evaluated the resulting English-Chinese word list, restricted to entries with a frequency of three or above, and we obtained an accuracy of 73.1 percent.
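
The two mechanical steps named in the abstract, segmenting unspaced Chinese text into words and keeping only aligned word pairs seen at least three times, can be illustrated with a minimal sketch. The record does not say which segmenter the authors used, so forward maximum matching is swapped in here as a common baseline. The lexicon, the sample text, and the function names `segment` and `frequent_pairs` are hypothetical. The aligned pairs are hand-written stand-ins for what a word aligner such as Uplug might output; Uplug itself is not called.

```python
from collections import Counter


def segment(text, lexicon, max_len=4):
    """Forward maximum matching (assumed baseline, not the paper's stated method).

    At each position, take the longest substring found in the lexicon;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def frequent_pairs(aligned_pairs, min_freq=3):
    """Keep (english, chinese) pairs that occur at least min_freq times."""
    counts = Counter(aligned_pairs)
    return {pair: n for pair, n in counts.items() if n >= min_freq}


# Hypothetical toy lexicon and sentence, for illustration only.
TOY_LEXICON = {"法律", "规定", "合同"}
tokens = segment("合同法律规定", TOY_LEXICON)

# Hypothetical aligner output: one pair seen three times, one seen twice.
pairs = [("contract", "合同")] * 3 + [("law", "法律")] * 2
word_list = frequent_pairs(pairs)

print(tokens)
print(word_list)
```

With the threshold of three taken from the abstract, only the pair seen three times survives. A real pipeline would replace the toy lexicon with a full dictionary or a statistical segmenter.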

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2010. p. 1700-1705
Keywords [en]
Translation (languages), Bilingual dictionary, Chinese characters, Chinese writings, English-Chinese parallel corpora, Parallel corpora, Pre-processing step, Word alignment, Word segmentation, Alignment
National subject category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:kth:diva-223005
ISI: 000356879506079
Scopus ID: 2-s2.0-85037526071
ISBN: 2951740867 (print)
ISBN: 9782951740860 (print)
OAI: oai:DiVA.org:kth-223005
DiVA, id: diva2:1183398
Conference
7th International Conference on Language Resources and Evaluation, LREC 2010, 17 May 2010 through 23 May 2010
Note

QC 20180216

Available from: 2018-02-16. Created: 2018-02-16. Last updated: 2018-02-16. Bibliographically approved.

Authors
Dalianis, Hercules; Xing, Haochun
