Clustering protein sequences using a non-vectorial self-organizing map
KTH, School of Engineering Sciences (SCI).
KTH, School of Engineering Sciences (SCI).
2015 (English)
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Abstract [en]

Recent decades have seen increasing use of databases that store protein sequence information. These databases are large and continually growing, and methods have been developed to reduce their complexity and redundancy. Some of these methods employ the Self-Organizing Map (SOM) and its extension, the Median Self-Organizing Map (M-SOM), to cluster and visualise databases of protein sequences. They have been successful at clustering protein sequences, but they rely on converting the symbolic sequences to real vectors at some stage of training: the SOM uses converted sequences throughout training, while the M-SOM uses them during initialization of the map. This work investigates the feasibility of initializing the M-SOM with a method that avoids converting the symbolic sequences to real vectors. An M-SOM is implemented, and the qualitative clustering of protein sequences is compared under three initializations: random, vectorial (using a converted vectorial representation of the sequences), and non-vectorial (not relying on conversion of the symbolic sequences). The initialization methods are compared using topographic error, quantization error, frequency plots of mapped groups of categorized sequences, and U-matrices. With these methods and measurements, no definite reason to prefer either the vectorial or the non-vectorial initialization over the other could be determined, but the absence of any need to convert sequences to real vectors could favour the non-vectorial method.
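
To make the non-vectorial idea concrete, the following is a minimal illustrative sketch and not the implementation from the thesis. It assumes that a pairwise dissimilarity between sequences is available (plain Levenshtein distance is used here purely as a stand-in; the record does not say which measure the thesis uses), picks a prototype as the set median of a group of sequences, and computes quantization error as the mean dissimilarity between each sequence and its closest prototype. The function names `edit_distance`, `set_median`, and `quantization_error` are hypothetical.

```python
from typing import Callable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; a stand-in dissimilarity, assumed for illustration."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def set_median(items: Sequence[str], dist: Callable[[str, str], float]) -> str:
    """Return the item whose summed dissimilarity to all items is smallest.

    No sequence is ever converted to a real vector; only pairwise
    dissimilarities between the symbolic sequences are used.
    """
    return min(items, key=lambda c: sum(dist(c, x) for x in items))


def quantization_error(data: Sequence[str], prototypes: Sequence[str],
                       dist: Callable[[str, str], float]) -> float:
    """Mean dissimilarity between each sequence and its best-matching prototype."""
    return sum(min(dist(x, p) for p in prototypes) for x in data) / len(data)


# Toy sequences over the amino-acid alphabet (made up for this example).
group = ["ACDF", "ACGE", "ACDE", "XCDE"]
prototype = set_median(group, edit_distance)
error = quantization_error(group, [prototype], edit_distance)
```

In this toy group, `ACDE` differs from every other sequence by a single substitution, so it is selected as the prototype. In a full map, a median of this kind would be computed per map unit over the sequences mapped to that unit's neighbourhood.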

Place, publisher, year, edition, pages
2015.
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-168105
OAI: oai:DiVA.org:kth-168105
DiVA: diva2:814431
Supervisors
Available from: 2015-05-27
Created: 2015-05-27
Last updated: 2015-05-27
Bibliographically approved
