Language Technology for the Lazy: Avoiding Work by Using Statistics and Machine Learning
KTH, School of Computer Science and Communication (CSC), Numerical Analysis and Computer Science, NADA.
2006 (English)
Doctoral thesis, monograph (Other scientific)
Abstract [en]

Language technology is the computational processing of human languages. Because human languages are irregular and hard to define in detail, this is often difficult. Good results can nevertheless often be achieved, though usually at the cost of a lot of manual work in creating the systems. While manual work usually gives good results, it is not always desirable: it tends to be time consuming and expensive, and for smaller languages the resources for it might not be available.

This thesis discusses methods for language processing where manual work is kept to a minimum and the computer does most of the work instead. This usually means basing the language processing methods on statistical information. Such methods can normally be applied to languages other than the one they were originally developed for, without requiring much manual work for the transition.

The first half of the thesis mainly deals with methods that are useful as tools for other language processing methods. Ways to improve part-of-speech tagging, an important component of many language processing systems, without manual work are examined. Statistical methods for the analysis of compound words, also useful in language processing, are also discussed.

The first part is rounded off by a presentation of methods for evaluating language processing systems. As languages are not very clearly defined, it is hard to prove that a system does anything useful, so it is important to evaluate systems to see whether they are useful. Evaluation usually entails manual work, but this thesis presents two methods that need very little. One reuses a manually developed resource to evaluate properties other than those it was originally intended for, with no extra work. The other shows how to calculate an estimate of system performance without any manual work at all.

In the second half of the thesis, language technology tools that are in themselves useful for a human user are presented. These include statistical methods for detecting errors in texts. Such methods complement traditional methods based on manually written error detection rules, for instance by detecting errors that the rule writer did not imagine writers could make.

Two methods for automatic summarization are also presented. The first compares the overall impression of the summary to that of the original text, using statistical methods for measuring the contents of a text. The second tries to mitigate the common problem of very sudden topic shifts in automatically generated summaries.

After this, a modified method is presented for automatically creating a lexicon between two languages by using lexicons to a common intermediary language. This type of method is useful because many language pairs in the world lack a lexicon, while many languages do have lexicons with translations to one of the larger languages of the world, for instance English. The modifications were intended to improve the coverage of the lexicon, possibly at the cost of lower translation quality.
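
The basic idea of building a lexicon through an intermediary language can be illustrated with a minimal sketch. This is not the modified method from the thesis, only the plain composition step it builds on; the function name `pivot_lexicon` and the toy word lists are assumptions made up for this illustration.

```python
from collections import defaultdict


def pivot_lexicon(source_to_pivot, pivot_to_target):
    """Compose two bilingual lexicons through a shared pivot language.

    Both arguments map a word to a list of translations. The result maps
    each source word to every target word reachable via some pivot word.
    (Hypothetical helper for illustration, not the thesis implementation.)
    """
    candidates = defaultdict(set)
    for source_word, pivot_words in source_to_pivot.items():
        for pivot_word in pivot_words:
            # Every target translation of the pivot word becomes a candidate,
            # including ones that belong to a different sense of the pivot word.
            for target_word in pivot_to_target.get(pivot_word, ()):
                candidates[source_word].add(target_word)
    return {word: sorted(targets) for word, targets in candidates.items()}


# Toy data, invented for this example (Swedish -> English -> romanized Japanese).
swedish_to_english = {"hund": ["dog"], "vår": ["spring", "our"]}
english_to_japanese = {
    "dog": ["inu"],
    "spring": ["haru", "bane"],  # season vs. coil spring
    "our": ["watashitachi no"],
}

lexicon = pivot_lexicon(swedish_to_english, english_to_japanese)
print(lexicon)
```

In this toy example the ambiguous pivot word "spring" gives "vår" the unwanted candidate "bane" (a coil spring) alongside the intended "haru" (the season). Accepting every reachable translation increases coverage, but it lets such errors through, which is the kind of coverage versus quality trade-off the paragraph above mentions.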

Finally, a program for generating puns in Japanese is presented. The generated puns are not very funny; the main purpose of the program is to test the hypothesis that using "bad words" makes things a little bit funnier.

Place, publisher, year, edition, pages
Stockholm: KTH, 2006. x, 150 p.
Series
Trita-CSC-A, ISSN 1653-5723 ; 2006:6
Keyword [en]
computer science
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-4023
ISBN: 91-7178-356-3 (print)
OAI: oai:DiVA.org:kth-4023
DiVA: diva2:10439
Public defence
2006-06-14, Salongen, KTHB, Osquars backe 31, Stockholm, 14:00
Opponent
Supervisors
Note
QC 20100920
Available from: 2006-06-01 Created: 2006-06-01 Last updated: 2010-09-20 Bibliographically approved

Author
Sjöbergh, Jonas