Klassificering av Named Entities utan handannoterad träningsdata [Classification of Named Entities without hand-annotated training data].
KTH, School of Computer Science and Communication (CSC).
2012 (Swedish). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Christopher Raschke, D05

This report treats the subject of Named Entity Recognition. It starts by explaining the problem, why it is interesting and why it is useful. Two systems from one of the MUC conferences (a yearly conference focusing on Information Retrieval oriented tasks) are studied. Both were Named Entity classifiers, but they completed the task in two very different ways. One solved the problem with a machine learner that used hand-annotated training samples to learn the concept of Named Entity Recognition. The other used a very different approach: predefined rules and pattern matching. These systems became important sources of inspiration and provided ideas for the later development of my own Named Entity Recognition system. I developed three different classifiers (one machine-learning based, one rule based and one list based), drawing on the designs found in the MUC systems, and eventually combined them into a single classifier. The purpose of this report is to document that development, from the ideas in the MUC systems to the performance and results of the new system, covering the progress as well as the road bumps and how I got past them. One of the major setbacks was the lack of hand-annotated training data, which led me to develop my own method for generating data. The solution, explained in greater detail in the report, was to take a dump of the English Wikipedia and match it against lists of (almost) sure-fire entities. The results are shown with automatically generated bar charts using the standard Information Retrieval measurements: Precision, Recall and F-Measure. The results section contains quite a few tests: the three classifiers are evaluated both individually and in combination, on texts taken from different contexts. The highest score is achieved when the combined classifiers are run on fact-themed texts, yielding a respectable 71% F-Measure, 90% Precision and 58% Recall.
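
The data-generation idea described above (matching lists of near sure-fire entities against raw Wikipedia text to obtain automatically labelled training samples) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis code: the gazetteer entries, the entity classes and the tagging scheme (class label for matched tokens, 'O' for everything else) are all placeholders.

    import re

    # Hypothetical gazetteer: small sample lists standing in for the
    # thesis's large lists of (almost) sure-fire entities (assumption).
    GAZETTEER = {
        "PERSON": ["Alan Turing", "Marie Curie"],
        "LOCATION": ["Stockholm", "Sweden"],
        "ORGANIZATION": ["KTH", "United Nations"],
    }

    def annotate(text):
        """Tag each token with its gazetteer class, or 'O' (outside)."""
        spans = []  # (start, end, label) for every gazetteer hit
        for label, names in GAZETTEER.items():
            for name in names:
                for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                    spans.append((m.start(), m.end(), label))
        tagged, pos = [], 0
        for start, end, label in sorted(spans):
            if start < pos:  # skip hits overlapping an earlier match
                continue
            tagged += [(tok, "O") for tok in text[pos:start].split()]
            tagged += [(tok, label) for tok in text[start:end].split()]
            pos = end
        tagged += [(tok, "O") for tok in text[pos:].split()]
        return tagged

    # Run over a Wikipedia-like sentence:
    print(annotate("Alan Turing worked in Stockholm"))
    # [('Alan', 'PERSON'), ('Turing', 'PERSON'), ('worked', 'O'),
    #  ('in', 'O'), ('Stockholm', 'LOCATION')]

Run over an entire Wikipedia dump, a matcher of this kind would produce the sort of automatically generated training corpus the report describes, at the cost of missing every entity absent from the lists (one plausible reason the reported Recall trails Precision).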
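As a quick sanity check, the quoted figures are internally consistent if the F-Measure is the balanced F1 score, i.e. the harmonic mean of Precision and Recall (an assumption, though this is the standard choice in Information Retrieval):

    # F1 is the harmonic mean of precision and recall.
    def f_measure(precision, recall):
        return 2 * precision * recall / (precision + recall)

    print(round(f_measure(0.90, 0.58), 2))  # 0.71, matching the quoted 71%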

Abstract [sv, translated to English]

Christopher Raschke, D05

This report addresses the language technology topic of Named Entity Recognition: the topic itself, why it is important and how others have solved the problem. The introductory chapter presents two previously developed systems. Both were built for MUC, a conference covering areas related to Information Retrieval, and they took two entirely different approaches to the problem. The first system was based on a machine learner that learned from text hand-annotated in advance. The second used a completely different technique: it found Named Entities with predefined rules and pattern matching. These two systems became the foundation and source of inspiration on which the whole project rested. During the project, three classifiers were developed: one machine-learning based, one rule based and one list based (similar to pattern matching). The report documents the development of these systems, the problems that arose, and the results. The biggest problem during development was the lack of annotated training data. I solved it by downloading a dump of the English Wikipedia and using lists of reliable entities to generate training data automatically. The results are presented as generated bar charts using the measures that are fairly standard within Information Retrieval: Precision, Recall and F-Measure. Tests are run not only for each classifier on its own but also for combinations of classifiers, on three types of text: factual text, news text and book text. The best result is obtained when all three classifiers are combined to annotate a factual text; in that test the system achieved an F-Measure of 71%, a Precision of 90% and a Recall of 58%.

Place, publisher, year, edition, pages
2012.
Series
Trita-CSC-E, ISSN 1653-5715 ; 2012:035
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-130982
OAI: oai:DiVA.org:kth-130982
DiVA, id: diva2:654428
Educational program
Master of Science in Engineering - Computer Science and Technology
Uppsok
Technology
Available from: 2013-10-07. Created: 2013-10-07. Last updated: 2018-01-11.

Open Access in DiVA

No full text in DiVA

Other links

http://www.nada.kth.se/utbildning/grukth/exjobb/rapportlistor/2012/rapporter12/raschke_christopher_12035.pdf
