Phylogenetic Partitioning of Gene Families
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Clustering and organizing molecular sequences is one of the central tasks in bioinformatics. It is a common first step in, for example, phylogenomic analysis. For some tasks, a large gene family needs to be partitioned into more manageable subfamilies. In particular, Bayesian phylogenetic analysis can be very expensive. There is a need for easy and natural means of breaking up a gene family, with moderate computational requirements, to enable careful analysis of subfamilies with computationally expensive tools. We devised and implemented a method that infers gene trees, reconciles them with species trees, and identifies putative orthogroups as subfamilies. To achieve reasonable speed, approximate ML phylogenies are inferred using the FastTree method and combined with a subfamily-centered bootstrapping procedure to ensure robustness. Using the new method, very large clusters of sequences are now easier to manage in pipelines containing computationally expensive steps. The implementation, PhyloGenClust, is available at a public repository, https://github.com/malagori/PhyloGenClust, under the GNU General Public License version 3.
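
The idea of reconciling a gene tree with a species tree and reading off putative orthogroups can be illustrated with a minimal sketch. It is not the PhyloGenClust implementation: the nested-tuple tree encoding, the `species_gene` naming of leaves, the function names, and the rule of cutting only at the topmost duplication nodes are all assumptions made for illustration, and the FastTree inference and bootstrapping steps are left out.

```python
# Hypothetical sketch: LCA reconciliation of a gene tree with a species tree.
# Trees are nested tuples; leaves are strings. Gene leaves are assumed to be
# named "<species>_<gene id>" (an illustrative convention, not from the paper).

def leaves(tree):
    """Return the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def species_clades(species_tree):
    """Return every clade of the species tree as a frozenset of species."""
    clades = [frozenset(leaves(species_tree))]
    if not isinstance(species_tree, str):
        for child in species_tree:
            clades.extend(species_clades(child))
    return clades

def lca(clades, species):
    """Smallest species-tree clade that contains the given species set."""
    return min((c for c in clades if species <= c), key=len)

def gene_species(gene_tree):
    """Set of species represented among the leaves of a gene (sub)tree."""
    return frozenset(name.split("_")[0] for name in leaves(gene_tree))

def orthogroups(gene_tree, clades):
    """Split the gene tree at its topmost duplication nodes.

    A node counts as a duplication when it maps to the same species-tree
    clade as one of its children (the usual LCA reconciliation rule).
    """
    if isinstance(gene_tree, str):
        return [[gene_tree]]
    here = lca(clades, gene_species(gene_tree))
    child_maps = [lca(clades, gene_species(child)) for child in gene_tree]
    if here in child_maps:  # duplication: each child starts its own subfamily
        groups = []
        for child in gene_tree:
            groups.extend(orthogroups(child, clades))
        return groups
    return [leaves(gene_tree)]  # speciation: keep the subtree as one group

species_tree = (("human", "mouse"), "fish")
gene_tree = ((("human_a", "mouse_a"), "fish_a"),
             (("human_b", "mouse_b"), "fish_b"))
groups = orthogroups(gene_tree, species_clades(species_tree))
```

In this toy example the root of the gene tree is a duplication, so the family is split into an "a" copy and a "b" copy, each spanning all three species.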

Keyword [en]
Phylogenetic, Clustering, Gene Families
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-193634
OAI: oai:DiVA.org:kth-193634
DiVA: diva2:1033271
Note

QC 20161007

Available from: 2016-10-06 Created: 2016-10-06 Last updated: 2016-10-12 Bibliographically approved
In thesis
1. Computational Problems in Modeling Evolution and Inferring Gene Families.
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Over the last few decades, phylogenetics has emerged as a very promising field, providing a comparative framework to explain the genetic relationships among all living organisms on earth. These genetic relationships are typically represented by a bifurcating phylogenetic tree, the tree of life. Reconstructing a phylogenetic tree is one of the central tasks in evolutionary biology. Different evolutionary processes, such as gene duplications, gene losses, speciation, and lateral gene transfer events, make the phylogeny reconstruction task more difficult. However, rapid developments in sequencing technologies and the availability of genome-scale sequencing data give us the opportunity to understand these evolutionary processes in a more informed manner and, ultimately, enable us to reconstruct gene and species phylogenies more accurately. This thesis is an attempt to provide computational methods for phylogenetic inference and tools for conducting genome-scale comparative evolutionary studies, such as detecting homologous sequences and inferring gene families.

In the first project, we present FastPhylo, a software package containing fast tools for reconstructing distance-based phylogenies. It implements previously published efficient algorithms for estimating a distance matrix from the input sequences and reconstructing an unrooted Neighbour Joining tree from a given distance matrix. Results on simulated datasets reveal that FastPhylo can handle hundreds of thousands of sequences in a time- and memory-efficient manner. The easy-to-use, well-defined interfaces and the modular structure of FastPhylo allow it to be used in very large bioinformatic pipelines.
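
As a rough illustration of the first step of a distance-based pipeline, the sketch below estimates a pairwise distance matrix from aligned sequences. It uses the Jukes-Cantor (JC69) correction as a stand-in; the abstract does not say which distance estimators FastPhylo implements, and the function names and input format here are hypothetical.

```python
# Hypothetical sketch: JC69 distance matrix from aligned DNA sequences.
# This is not FastPhylo code; JC69 is chosen only as a familiar example.
import math
from itertools import combinations

def jc69_distance(a, b):
    """Jukes-Cantor distance between two aligned sequences.

    Columns with a gap ("-") in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    p = sum(x != y for x, y in pairs) / len(pairs)  # observed mismatch fraction
    if p >= 0.75:
        return math.inf  # correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Symmetric matrix of JC69 distances as a dict of dicts keyed by name."""
    names = list(seqs)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        dist[a][b] = dist[b][a] = jc69_distance(seqs[a], seqs[b])
    return dist

alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "ACGT"}
matrix = distance_matrix(alignment)
```

A matrix of this kind is the input that a Neighbour Joining step would then turn into an unrooted tree.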

In the second project, we present a synteny-aware gene homology method, GenFamClust (GFC), which uses gene content and gene order conservation to detect homology. Results on simulated and biological datasets suggest that local synteny information combined with sequence similarity improves the detection of homologs.

In the third project, we introduce a novel phylogeny-based clustering method, PhyloGenClust, which partitions a very large gene family into smaller subfamilies. ROC (receiver operating characteristic) analysis on synthetic datasets shows that PhyloGenClust identifies subfamilies more accurately. PhyloGenClust can be used as a middle-tier clustering method between raw clustering methods, such as sequence similarity methods, and more sophisticated Bayesian phylogeny methods.

Finally, we introduce a novel probabilistic Bayesian method, based on the DLTRS model, to sample reconciliations of a gene tree inside a species tree. The method uses an MCMC framework to integrate LGTs, gene duplications, gene losses, and sequence evolution under a relaxed molecular clock for substitution rates. The proposed sampling method estimates the posterior distribution of gene trees and provides temporal information on LGT events over the lineages of a species tree. Analysis on simulated datasets reveals that our method performs well in identifying the true temporal estimates of LGT events. We applied our method to genome-wide gene families for mollicutes and cyanobacteria, which gave interesting insight into potential LGT highways.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. 57 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2016:24
Keyword
Evolution, Phylogenetics, Lateral Gene Transfer, Gene Families, Clustering
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-193637 (URN)
978-91-7729-131-2 (ISBN)
Public defence
2016-10-18, Air, SciLifeLab, Tomtebodavägen 23A, Solna, 14:00 (English)
Note

QC 20161010

Available from: 2016-10-10 Created: 2016-10-06 Last updated: 2016-10-10 Bibliographically approved

Open Access in DiVA

fulltext (393 kB), 56 downloads
File information
File name: FULLTEXT01.pdf
File size: 393 kB
Checksum (SHA-512): 9d0af05eecf17b5b31b23ed20d19e1f2900064f06cc382aedb178353dd1e51b0d54fb5a16fc5474d45e3f31d297f1c75e8db20a7c9ec206e51d89f31b6b4c119
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Khan, Mehmood Alam; Arvestad, Lars
By organisation
Computational Science and Technology (CST)
Bioinformatics (Computational Biology)
