Automatic and unsupervised methods in natural language processing
KTH, School of Information and Communication Technology (ICT), Computer and Systems Sciences, DSV.
2005 (English) Doctoral thesis, monograph (Other scientific)
Abstract [en]

Natural language processing (NLP) means the computer-aided processing of language produced by a human. But human language is inherently irregular and the most reliable results are obtained when a human is involved in at least some part of the processing. However, manual work is time-consuming and expensive. This thesis focuses on what can be accomplished in NLP when manual work is kept to a minimum.

We describe the construction of two tools that greatly simplify the implementation of automatic evaluation. They are used to implement several supervised, semi-supervised and unsupervised evaluations by introducing artificial spelling errors. We also describe the design of a rule-based shallow parser for Swedish called GTA and a detection algorithm for context-sensitive spelling errors based on semi-supervised learning, called ProbCheck.
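
As an illustration of what introducing artificial spelling errors can look like, here is a minimal sketch. It is not the tool described in the thesis: the function names `misspell` and `add_noise`, the four edit operations, and the ASCII alphabet are assumptions made for the example.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: plain ASCII letters only


def misspell(word, rng):
    """Apply one random single-character edit to word.

    Words shorter than two characters are returned unchanged. A substitution
    or transposition may occasionally reproduce the original word.
    """
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "insert", "substitute", "transpose"])
    letter = rng.choice(ALPHABET)
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + letter + word[i:]
    if op == "substitute":
        return word[:i] + letter + word[i + 1:]
    # transpose the characters at positions i and i + 1
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def add_noise(tokens, error_rate, seed=0):
    """Return a copy of tokens in which each token is misspelled
    with probability error_rate. The token count is preserved."""
    rng = random.Random(seed)
    return [misspell(t, rng) if rng.random() < error_rate else t
            for t in tokens]
```

Because the token count is preserved, the noisy text stays aligned token by token with the original, which is what makes a later comparison of system output on the two versions straightforward.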

In the second part of the thesis, we first implement a supervised evaluation scheme that uses an error-free treebank to determine the robustness of a parser when faced with noisy input such as spelling errors. We evaluate the GTA parser and determine the robustness of the individual components of the parser as well as the robustness for different phrase types. Second, we create an unsupervised evaluation procedure for parser robustness. The procedure allows us to evaluate the robustness of parsers using different parser formalisms on the same text and compare their performance. Five parsers and one tagger are evaluated. For four of these, we have access to annotated material and can verify the estimations given by the unsupervised evaluation procedure. The results turned out to be very accurate with few exceptions and thus, we can reliably establish the robustness of an NLP system without any need for manual work.
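
The basic idea of measuring robustness without annotated data can be sketched as a comparison of a system's output on clean text with its output on the same text after noise has been added. The abstract does not say how the thesis turns such a comparison into its estimates, so the sketch below computes only raw agreement. The names `agreement` and `robustness`, and the `label_tokens` callable standing in for a parser or tagger, are hypothetical.

```python
def agreement(clean_labels, noisy_labels):
    """Fraction of token positions where the two label sequences agree."""
    if len(clean_labels) != len(noisy_labels):
        raise ValueError("label sequences must be aligned token by token")
    if not clean_labels:
        return 1.0
    same = sum(c == n for c, n in zip(clean_labels, noisy_labels))
    return same / len(clean_labels)


def robustness(label_tokens, clean_tokens, noisy_tokens):
    """Run the system on both versions of the text and compare its output.

    label_tokens is any callable that maps a list of tokens to one label
    per token (a stand-in for a real parser or tagger).
    """
    return agreement(label_tokens(clean_tokens), label_tokens(noisy_tokens))
```

No gold-standard annotation is involved: the system's output on the clean text serves as the reference, so the same measurement can be applied to systems built on different formalisms as long as each yields one label per token.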

Third, we implement an unsupervised evaluation scheme for spell checkers. Using this, we perform a very detailed analysis of three spell checkers for Swedish. Last, we evaluate the ProbCheck algorithm. Two methods are included for comparison: a full parser and a method using tagger transition probabilities. The algorithm obtains results superior to the comparison methods. The algorithm is also evaluated on authentic data in combination with a grammar and spell checker.
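
One of the comparison methods relies on tagger transition probabilities. A rough sketch of that general idea, assuming a simple bigram model over part-of-speech tags and a fixed threshold, is shown below. The function names, the threshold value, and the tag names in the usage note are illustrative and not taken from the thesis.

```python
from collections import Counter


def train_transitions(tag_sequences):
    """Estimate P(current tag | previous tag) from tagged sentences."""
    bigrams = Counter()
    previous = Counter()
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            bigrams[(prev, cur)] += 1
            previous[prev] += 1
    return {pair: n / previous[pair[0]] for pair, n in bigrams.items()}


def flag_unlikely(tags, transitions, threshold=0.05):
    """Return the positions whose transition from the preceding tag has an
    estimated probability below threshold (unseen transitions count as 0)."""
    return [i for i in range(1, len(tags))
            if transitions.get((tags[i - 1], tags[i]), 0.0) < threshold]
```

For example, after training on tag sequences such as `["DT", "NN", "VB"]`, the sequence `["DT", "VB", "NN"]` is flagged at both transitions because neither was seen in training. A word that is correctly spelled but wrong in context often produces such an unlikely tag sequence, which is why transition probabilities can serve as a detection signal.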

Place, publisher, year, edition, pages
Stockholm: KTH, 2005. p. xi, 120
Series
Trita-NA, ISSN 0348-2952 ; 2005:08
Keywords [en]
Computer science
Keywords [sv]
Datalogi
National subject category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-156
ISBN: 91-7283-982-1 (print)
OAI: oai:DiVA.org:kth-156
DiVA, id: diva2:7478
Public defence
2005-04-08, Kollegiesalen, KTH, Valhallavägen 79, Stockholm, 14:15
Opponent
Supervisors
Note
QC 20100901. Available from: 2005-04-04. Created: 2005-04-04. Last updated: 2018-01-11. Bibliographically approved.

Open Access in DiVA

fulltext (611 kB), 1641 downloads
File information
File name: FULLTEXT01.pdf, File size: 611 kB, Checksum MD5
c1392aab0c5fa94a4ce22d2c0e6b184394b1b97e2981939088b3ce4f87be1b68ea3c2544
Type: fulltext, Mimetype: application/pdf

By author/editor
Bigert, Johnny