Automated Dataset Construction from Web Resources with Tool Kayur
KTH, School of Computer Science and Communication (CSC).
2016 (English). In: 2016 Fourth International Symposium on Computing and Networking (CANDAR), 2016, 98-104 p. Conference paper (Refereed)
Abstract [en]

Many text mining tools cannot be applied directly to documents available on web pages. Tools exist for fetching and preprocessing textual data, but combining them into one working tool chain can be time-consuming. The preprocessing task is even more labor-intensive if documents are located on multiple remote sources with different storage formats. In this paper we propose a way to simplify the data preparation process for cases where data come from a wide range of web resources. We developed an open-source tool, called Kayur, that greatly reduces the time and effort required for routine data preprocessing steps, allowing users to proceed quickly to the main task of data analysis. The datasets generated by the tool are ready to be loaded into a data mining workbench, such as WEKA or Carrot2, to perform classification, feature prediction, and other data mining tasks.

Place, publisher, year, edition, pages
2016. 98-104 p.
Series
International Symposium on Computing and Networking, ISSN 2379-1888
Keyword [en]
automation, information extraction, natural language processing, web content mining
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
URN: urn:nbn:se:kth:diva-203189
DOI: 10.1109/CANDAR.2016.71
ISI: 000393284200012
ISBN: 978-1-5090-2655-5
OAI: oai:DiVA.org:kth-203189
DiVA: diva2:1081286
Conference
4th International Symposium on Computing and Networking (CANDAR), NOV 22-25, 2016, Hiroshima, JAPAN
Note

QC 20170313

Available from: 2017-03-13 Created: 2017-03-13 Last updated: 2017-03-13 Bibliographically approved

By author/editor
Artho, Cyrille