Computational Problems in Modeling Evolution and Inferring Gene Families.
Khan, Mehmood Alam
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). ORCID iD: 0000-0003-4937-0670
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Over the last few decades, phylogenetics has emerged as a very promising field, providing a comparative framework to explain the genetic relationships among all living organisms on earth. These genetic relationships are typically represented by a bifurcating phylogenetic tree, the tree of life. Reconstructing a phylogenetic tree is one of the central tasks in evolutionary biology. Different evolutionary processes, such as gene duplications, gene losses, speciation, and lateral gene transfer events, make the phylogeny reconstruction task more difficult. However, rapid developments in sequencing technologies and the availability of genome-scale sequencing data give us the opportunity to understand these evolutionary processes in a more informed manner and, ultimately, enable us to reconstruct gene and species phylogenies more accurately. This thesis attempts to provide computational methods for phylogenetic inference and tools for conducting genome-scale comparative evolutionary studies, such as detecting homologous sequences and inferring gene families.

In the first project, we present FastPhylo, a software package containing fast tools for reconstructing distance-based phylogenies. It implements previously published efficient algorithms for estimating a distance matrix from the input sequences and for reconstructing an unrooted Neighbour Joining tree from a given distance matrix. Results on simulated datasets show that FastPhylo can handle hundreds of thousands of sequences in a time- and memory-efficient manner. Its ease of use, well-defined interfaces, and modular structure allow FastPhylo to be used in very large bioinformatic pipelines.
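The two steps named above (estimating pairwise distances, then joining neighbours from the distance matrix) can be illustrated with a minimal sketch. This is not FastPhylo's code or interface: the Jukes-Cantor model is an assumed choice of distance estimator, and the function names `jc69_distance`, `distance_matrix`, and `nj_pick_pair` are hypothetical. Only the first pair-selection step of Neighbour Joining is shown.

```python
import math
from itertools import combinations


def jc69_distance(a: str, b: str) -> float:
    """Jukes-Cantor distance between two aligned DNA sequences (assumed model)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only columns where neither sequence has a gap.
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    p = sum(x != y for x, y in sites) / len(sites)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Jukes-Cantor distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = jc69_distance(seqs[i], seqs[j])
    return d


def nj_pick_pair(d):
    """Pair (i, j) minimising the Neighbour Joining criterion
    Q(i, j) = (n - 2) * d[i][j] - r[i] - r[j], where r[i] is the row sum."""
    n = len(d)
    r = [sum(row) for row in d]
    return min(
        combinations(range(n), 2),
        key=lambda ij: (n - 2) * d[ij[0]][ij[1]] - r[ij[0]] - r[ij[1]],
    )
```

A full Neighbour Joining run would repeat the pair selection, replace the joined pair by a new internal node, and update the matrix until three nodes remain. Done naively this costs cubic time in the number of sequences, which is why efficient implementations matter at the scale the abstract describes.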

In the second project, we present a synteny-aware gene homology method, GenFamClust (GFC), which uses gene content and gene order conservation to detect homology. Results on simulated and biological datasets suggest that combining local synteny information with sequence similarity improves the detection of homologs.
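As a rough illustration of how local synteny can support a weak sequence-similarity signal, the sketch below scores a gene pair by the best similarity found among their chromosomal neighbours and blends it with the pair's own similarity. This is not GenFamClust's scoring scheme: the window size `k`, the linear weight `w`, and all function names are assumptions made for the example.

```python
def neighbours(gene, chromosome, k=2):
    """Genes within k positions of `gene` on an ordered chromosome (gene excluded)."""
    i = chromosome.index(gene)
    return chromosome[max(0, i - k):i] + chromosome[i + 1:i + 1 + k]


def similarity(sim, a, b):
    """Symmetric lookup in a dict keyed by gene pairs; missing pairs score 0."""
    return sim.get((a, b), sim.get((b, a), 0.0))


def synteny_score(g1, chrom1, g2, chrom2, sim, k=2):
    """Best similarity between any neighbour of g1 and any neighbour of g2."""
    return max(
        (similarity(sim, a, b)
         for a in neighbours(g1, chrom1, k)
         for b in neighbours(g2, chrom2, k)),
        default=0.0,
    )


def combined_score(g1, chrom1, g2, chrom2, sim, k=2, w=0.5):
    """Hypothetical linear blend of sequence similarity and synteny."""
    return (w * similarity(sim, g1, g2)
            + (1.0 - w) * synteny_score(g1, chrom1, g2, chrom2, sim, k))
```

In this toy setting, a pair with modest similarity of its own but strongly similar neighbours receives a higher combined score than similarity alone would give it, which is the intuition behind using conserved gene order as evidence of homology.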

In the third project, we introduce a novel phylogeny-based clustering method, PhyloGenClust, which partitions a very large gene family into smaller subfamilies. ROC (receiver operating characteristic) analysis on synthetic datasets shows that PhyloGenClust identifies subfamilies more accurately. PhyloGenClust can be used as a middle-tier clustering method between raw clustering methods, such as sequence similarity methods, and more sophisticated Bayesian phylogeny methods.
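For readers unfamiliar with ROC analysis, the following minimal sketch shows how a ROC curve and its area can be computed from scored predictions. The framing is an assumption for illustration: each item could be a gene pair, labelled 1 if the two genes truly share a subfamily and 0 otherwise, with the score being a method's confidence. The function names are hypothetical and the data in the usage note are made up.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) points, one per distinct threshold."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; items scoring >= threshold are called positive.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

For example, `auc(roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))` evaluates to 0.75: three of the four positive-negative pairs are ranked in the correct order.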

Finally, we introduce a novel probabilistic Bayesian method, based on the DLTRS model, to sample reconciliations of a gene tree inside a species tree. The method uses an MCMC framework to integrate lateral gene transfers (LGTs), gene duplications, gene losses, and sequence evolution under a relaxed molecular clock for substitution rates. The proposed sampling method estimates the posterior distribution of gene trees and provides temporal information on LGT events over the lineages of a species tree. Analysis on simulated datasets reveals that our method performs well in identifying the true temporal estimates of LGT events. We applied our method to genome-wide gene families for mollicutes and cyanobacteria, which gave interesting insight into potential LGT highways.
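The general idea of MCMC sampling from a posterior can be shown with a toy Metropolis sampler. This is only a sketch of the generic technique, not the DLTRS-based method: the "states" here are three abstract alternatives with made-up unnormalised posterior weights, whereas the real method samples gene trees and reconciliations. The function name `metropolis` and its parameters are hypothetical.

```python
import random
from collections import Counter


def metropolis(weights, steps, seed=0):
    """Sample state indices with probability proportional to `weights`.

    Proposal: jump uniformly to one of the other states. This proposal is
    symmetric, so the acceptance ratio reduces to weights[new] / weights[current].
    """
    rng = random.Random(seed)
    state = 0
    counts = Counter()
    for _ in range(steps):
        proposal = rng.choice([s for s in range(len(weights)) if s != state])
        if rng.random() < min(1.0, weights[proposal] / weights[state]):
            state = proposal  # accept the move
        counts[state] += 1  # a rejected move counts the current state again
    return {s: counts[s] / steps for s in range(len(weights))}
```

With weights `[1, 2, 7]` and enough steps, the visit frequencies approach 0.1, 0.2, and 0.7, without the sampler ever needing the normalising constant. This is what makes the approach usable when the space of trees and reconciliations is far too large to enumerate.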

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. 57 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2016:24
Keyword [en]
Evolution, Phylogenetics, Lateral Gene Transfer, Gene Families, Clustering
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-193637
ISBN: 978-91-7729-131-2 (print)
OAI: oai:DiVA.org:kth-193637
DiVA: diva2:1033289
Public defence
2016-10-18, Air, SciLifeLab, Tomtebodavägen 23A, Solna, 14:00 (English)
Note

QC 20161010

Available from: 2016-10-10 Created: 2016-10-06 Last updated: 2016-10-10. Bibliographically approved
List of papers
1. fastphylo: Fast tools for phylogenetics
2013 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, no 1, 334. Article in journal (Refereed). Published
Abstract [en]

Background: Distance methods are ubiquitous tools in phylogenetics. Their primary purpose may be to reconstruct evolutionary history, but they are also used as components in bioinformatic pipelines. However, poor computational efficiency has been a constraint on the applicability of distance methods on very large problem instances. Results: We present fastphylo, a software package containing implementations of efficient algorithms for two common problems in phylogenetics: estimating DNA/protein sequence distances and reconstructing a phylogeny from a distance matrix. We compare fastphylo with other neighbor joining based methods and report the results in terms of speed and memory efficiency. Conclusions: Fastphylo is a fast, memory efficient, and easy to use software suite. Due to its modular architecture, fastphylo is a flexible tool for many phylogenetic studies.

Place, publisher, year, edition, pages
BioMed Central, 2013
Keyword
Distance matrices, Distance method, Evolutionary history, Large problems, Memory efficient, Modular architectures, Neighbor joining, Phylogenetic studies
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-136421 (URN)
10.1186/1471-2105-14-334 (DOI)
000329901900001 ()
24255987 (PubMedID)
2-s2.0-84887664660 (Scopus ID)
Funder
Swedish e‐Science Research Center; Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20140205

Available from: 2013-12-05 Created: 2013-12-05 Last updated: 2017-12-06. Bibliographically approved
2. Quantitative synteny scoring improves homology inference and partitioning of gene families
2013 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, S12. Article in journal (Refereed). Published
Abstract [en]

Background: Clustering sequences into families has long been an important step in the characterization of genes and proteins. Many algorithms have been developed for this purpose, most of which are based on either direct similarity between gene pairs or some sort of network structure, where weights on edges of constructed graphs are based on similarity. However, conserved synteny is an important signal that can help distinguish homology, and it has not been utilized to its fullest potential. Results: Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationships and merge known homologs into groups of gene families. GenFamClust identifies homologs in a more informed and accurate manner than similarity-based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets consisting of fully sequenced genomes of eukaryotes and synthetic data. Conclusions: The results obtained from both datasets confirm that synteny helps determine homology and that GenFamClust improves on the Neighborhood Correlation method. The accuracy and the definition of synteny scores are the most valuable contributions of GenFamClust.

Place, publisher, year, edition, pages
BioMed Central, 2013
Keyword
Efficient Algorithm, Eukaryotic Genomes, Protein Families, Orthologs, Identification, Clusters, Alignment, Blast, Link
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-136429 (URN)
10.1186/1471-2105-14-S15-S12 (DOI)
000328316700012 ()
Conference
11th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics, Lyon, France, October 17-19, 2013
Funder
Swedish e‐Science Research Center; Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20131219

Available from: 2013-12-05 Created: 2013-12-05 Last updated: 2017-12-06. Bibliographically approved
3. Phylogenetic Partitioning of Gene Families
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Clustering and organizing molecular sequences is one of the central tasks in bioinformatics. It is a common first step in, for example, phylogenomic analysis. For some tasks, a large gene family needs to be partitioned into more manageable subfamilies. In particular, Bayesian phylogenetic analysis can be very expensive. There is a need for an easy and natural means of breaking up a gene family, with moderate computational requirements, to enable careful analysis of subfamilies with computationally expensive tools. We devised and implemented a method that infers gene trees, reconciles them with species trees, and identifies putative orthogroups as subfamilies. To achieve reasonable speed, approximate ML phylogenies are inferred using the FastTree method and combined with a subfamily-centered bootstrapping procedure to ensure robustness. Using the new method, very large clusters of sequences are now easier to manage in pipelines containing computationally expensive steps. The implementation of PhyloGenClust is available at a public repository, https://github.com/malagori/PhyloGenClust, under the GNU General Public License version 3.

Keyword
Phylogenetic, Clustering, Gene Families
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-193634 (URN)
Note

QC 20161007

Available from: 2016-10-06 Created: 2016-10-06 Last updated: 2016-10-12. Bibliographically approved
4. Probabilistic inference of lateral gene transfer events
(English). Manuscript (preprint) (Other academic)
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-162935 (URN)
Funder
Swedish e‐Science Research Center
Note

QS 2015

Available from: 2015-03-26 Created: 2015-03-26 Last updated: 2016-10-12. Bibliographically approved

Open Access in DiVA

fulltext (1265 kB), 153 downloads
File information
File name: FULLTEXT01.pdf
File size: 1265 kB
Checksum (SHA-512):
3419bcbe0d3c6517de4a2bb5f4675a5d381b7788f43177d2f578050615115fd099fabf09929121f418799aece7f0e3ad6113b7bdeff137d43c97ddb6443f8676
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Khan, Mehmood Alam
By organisation
Computational Science and Technology (CST)
Bioinformatics (Computational Biology)
