Clustering protein sequences using a non-vectorial self-organizing map
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
The last decades have seen an increasing use of databases storing sequence information for proteins. These databases have become large and are continually growing. Methods to reduce the complexity and redundancy of these databases have been developed. Some methods employ the Self-Organizing Map (SOM) and its extension, the Median Self-Organizing Map (M-SOM), for clustering and visualisation of databases containing protein sequences. These methods have shown success in the clustering of protein sequences, but they rely on converting the symbolic protein sequences to real vectors at some stage of the training: the SOM uses converted sequences during the whole training of the map, whereas the M-SOM uses converted sequences during the initialization of the map. In this work, the feasibility of initializing the M-SOM with a method that avoids the conversion of the symbolic sequences to real vectors is investigated. This is done by implementing an M-SOM and comparing the qualitative clustering of protein sequences when using either a random initialization, an initialization using a converted vectorial representation of the sequences, or a non-vectorial initialization that does not rely on conversion of the symbolic sequences. The different initialization methods are compared using topographic error, quantization error, frequency plots of mapped groups of categorized sequences, and U-matrices. Using these methods and measurements, no definite reason to prefer either the vectorial or the non-vectorial initialization over the other could be determined, but the absence of a need to convert sequences to real vectors could favour the non-vectorial method.
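To make the two error measures named in the abstract concrete, the following is a minimal sketch of a median SOM that works only on a pairwise dissimilarity matrix, so no sequence is ever converted to a real vector. Everything specific here is an assumption for illustration and is not taken from the thesis: the toy sequences, the Hamming dissimilarity (a real study would more likely use an alignment-based score), the 1x3 rectangular grid, the 4-neighbourhood used for the topographic error, the Gaussian neighbourhood function, and all function names. The initialization methods compared in the thesis are not reproduced; the prototypes are simply set by hand.

```python
import math

def hamming(a, b):
    """Toy dissimilarity (assumption): mismatching positions of equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def two_best_units(D, prototypes, item):
    """Indices of the best and second-best matching units for one item."""
    ranked = sorted(range(len(prototypes)), key=lambda u: D[item][prototypes[u]])
    return ranked[0], ranked[1]

def quantization_error(D, prototypes):
    """Mean dissimilarity between each item and its best-matching prototype."""
    n = len(D)
    return sum(min(D[i][p] for p in prototypes) for i in range(n)) / n

def topographic_error(D, prototypes, grid):
    """Fraction of items whose two best units are not neighbours on the grid."""
    errors = 0
    for i in range(len(D)):
        b1, b2 = two_best_units(D, prototypes, i)
        (r1, c1), (r2, c2) = grid[b1], grid[b2]
        if abs(r1 - r2) + abs(c1 - c2) > 1:  # assumed 4-neighbourhood
            errors += 1
    return errors / len(D)

def median_update(D, prototypes, grid, sigma=1.0):
    """One batch step: each unit takes the data item that minimises the
    neighbourhood-weighted sum of dissimilarities (a generalized median)."""
    n = len(D)
    bmus = [two_best_units(D, prototypes, i)[0] for i in range(n)]
    new_prototypes = []
    for u in range(len(prototypes)):
        def cost(candidate):
            total = 0.0
            for j in range(n):
                (r1, c1), (r2, c2) = grid[bmus[j]], grid[u]
                h = math.exp(-((r1 - r2) ** 2 + (c1 - c2) ** 2) / (2 * sigma ** 2))
                total += h * D[j][candidate]
            return total
        new_prototypes.append(min(range(n), key=cost))
    return new_prototypes

# Hypothetical toy data: six short "sequences" and a 1x3 map.
seqs = ["AAAA", "AAAC", "CCCC", "CCCA", "GGGG", "GGGT"]
D = [[hamming(a, b) for b in seqs] for a in seqs]
grid = [(0, 0), (0, 1), (0, 2)]
prototypes = [0, 2, 4]  # each unit's prototype is the index of a data item

print("quantization error:", quantization_error(D, prototypes))
print("topographic error:", topographic_error(D, prototypes, grid))
print("prototypes after one update:", median_update(D, prototypes, grid))
```

Because each prototype is an index into the data set rather than a point in a vector space, the whole procedure needs nothing beyond the matrix `D`. This is the property that makes a non-vectorial initialization possible in principle.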
Identifiers
URN: urn:nbn:se:kth:diva-168105
OAI: oai:DiVA.org:kth-168105
DiVA: diva2:814431
Herman, Pawel, Assistant Professor