Species tree aware simultaneous reconstruction of gene and domain evolution
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). (Jens Lagergren) ORCID iD: 0000-0002-6664-1607
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Most genes are composed of multiple domains with a common evolutionary history that typically perform a specific function in the resulting protein. As witnessed by many studies of key gene families, it is important to understand how domains have been duplicated, lost, transferred between genes, and rearranged. Similarly to the case of evolutionary events affecting entire genes, these domain events have large consequences for phylogenetic reconstruction and, in addition, they create considerable obstacles for gene sequence alignment algorithms, a prerequisite for phylogenetic reconstruction.

We introduce the Domain-DLRS model, a hierarchical, generative probabilistic model containing three levels corresponding to species, genes, and domains, respectively. From a dated species tree, a gene tree is generated according to the DL model, which is a birth-death model generalized to occur in a dated tree. Then, from the dated gene tree, a pre-specified number of dated domain trees are generated using the DL model and the molecular clock is relaxed, effectively converting edge times to edge lengths. Finally, for each domain tree and its lengths, domain sequences are generated for the leaves based on a selected model of sequence evolution.
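The generative steps above can be illustrated with a minimal sketch. It is not the authors' implementation: it simulates only a single edge of a dated tree, counts lineages rather than building a tree, and assumes that edge rates are drawn independently from a gamma distribution with mean 1. The function names, parameters, and the gamma shape value are hypothetical choices made for illustration.

```python
import random


def lineages_after_edge(edge_time, dup_rate, loss_rate, rng):
    """Run a linear birth-death (duplication-loss) process, started from one
    lineage, over an edge of duration edge_time. Returns the number of
    lineages that survive to the end of the edge."""
    count, elapsed = 1, 0.0
    while count > 0:
        total_rate = count * (dup_rate + loss_rate)
        if total_rate == 0.0:
            break  # no event can ever happen
        elapsed += rng.expovariate(total_rate)  # waiting time to next event
        if elapsed >= edge_time:
            break  # the edge ends before the next event
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            count += 1  # duplication
        else:
            count -= 1  # loss
    return count


def relaxed_edge_length(edge_time, rng, shape=2.0):
    """Relax the molecular clock on one edge: multiply the edge time by a
    rate drawn from a gamma distribution with mean 1 (an assumption of this
    sketch), turning a time into a length."""
    rate = rng.gammavariate(shape, 1.0 / shape)  # mean = shape * scale = 1
    return edge_time * rate


rng = random.Random(1)
copies = lineages_after_edge(edge_time=1.0, dup_rate=0.8, loss_rate=0.5, rng=rng)
length = relaxed_edge_length(edge_time=1.0, rng=rng)
print(copies, length)
```

In the hierarchical setting described above, the same kind of process would be applied twice: once along the edges of the species tree to produce a gene tree, and once along the edges of the gene tree to produce each domain tree.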

For this model, we present an MCMC-based inference framework called Domain-DLRS that takes as input a dated species tree together with a multiple sequence alignment for each domain family, and provides as output an estimated posterior distribution over reconciled gene and domain trees. By requiring aligned domains rather than genes, our framework evades the problem of aligning genes that have been exposed to domain duplications, in particular non-tandem domain duplications. We show that Domain-DLRS performs better than MrBayes on both synthetic and biological data. We analyse several zinc-finger genes and show that most domain duplications have been tandem duplications, some of which have involved two or more domains, but that non-tandem duplications have also been common, in particular in gene families with a complex evolutionary history such as PRDM9.

Keyword [en]
Probabilistic Modeling, Domain Evolution, Bayesian Inference, Domain Tree Reconstruction
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-191349
OAI: oai:DiVA.org:kth-191349
DiVA: diva2:956242
Funder
Swedish e‐Science Research Center
Note

QC 20160902

Available from: 2016-08-29 Created: 2016-08-29 Last updated: 2016-09-02 Bibliographically approved
In thesis
1. Probabilistic Modelling of Domain and Gene Evolution
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Phylogenetic inference relies heavily on statistical models that have been extended and refined over the past years into complex hierarchical models to capture the intricacies of evolutionary processes. The wealth of information in the form of fully sequenced genomes has led to the development of methods that reconstruct gene and species evolutionary histories in greater and more accurate detail. However, genes are composed of evolutionarily conserved sequence segments called domains, and domains can also be affected by duplications, losses, and bifurcations implied by gene or species evolution. This thesis proposes an extension of the evolutionary models previously used for gene evolution, such as duplication-loss, rate, and substitution models, so that they also model domain evolution.

In this thesis, I propose DomainDLRS: a comprehensive, hierarchical Bayesian method, based on the DLRS model by Åkerborg et al., 2009, that models domain evolution as occurring inside the gene and species trees. The method incorporates a birth-death process to model domain duplications and losses, along with a domain sequence evolution model with a relaxed molecular clock assumption. The method employs a variant of the Markov chain Monte Carlo technique called Grouped Independence Metropolis-Hastings to estimate the posterior distribution over domain and gene trees. Using this method, we performed analyses of the Zinc-Finger and PRDM9 gene families, which provide interesting insights into domain evolution.
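The general idea behind Grouped Independence Metropolis-Hastings can be shown on a toy problem. The sketch below is not the DomainDLRS sampler: it assumes a hypothetical one-parameter model in which a latent value z is drawn from N(theta, 1) and an observation y from N(z, 1), with a N(0, 100) prior on theta. The likelihood p(y | theta) is replaced by an unbiased Monte Carlo estimate obtained by averaging over sampled latent values, and the estimate attached to the current state is reused rather than recomputed. All function and parameter names are illustrative.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def estimate_likelihood(y, theta, n_samples, rng):
    """Unbiased Monte Carlo estimate of p(y | theta): average p(y | z) over
    latent values z drawn from N(theta, 1)."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(theta, 1.0)
        total += math.exp(log_normal_pdf(y, z, 1.0))
    return total / n_samples


def gimh(y, n_iter, n_samples, rng, step=1.0, prior_var=100.0):
    """Metropolis-Hastings with an estimated likelihood. The estimate for the
    current state is kept until a proposal is accepted."""
    theta = 0.0
    lik = estimate_likelihood(y, theta, n_samples, rng)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)  # symmetric random walk
        lik_new = estimate_likelihood(y, proposal, n_samples, rng)
        if lik_new > 0.0:
            log_ratio = (math.log(lik_new) + log_normal_pdf(proposal, 0.0, prior_var)
                         - math.log(lik) - log_normal_pdf(theta, 0.0, prior_var))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                theta, lik = proposal, lik_new  # accept and store the new estimate
        chain.append(theta)
    return chain


chain = gimh(y=1.5, n_iter=5000, n_samples=30, rng=random.Random(0))
kept = chain[500:]  # discard burn-in
print(sum(kept) / len(kept))
```

For this toy model the marginal likelihood is available in closed form (y given theta is N(theta, 2)), so the sampled posterior mean can be checked against the exact value y * 100 / 102. In a tree-inference setting no such closed form is generally available, which is the situation in which an estimated likelihood is useful.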

Finally, a synteny-aware approach for gene homology inference, called GenFamClust, is proposed that uses similarity and gene neighbourhood conservation to improve homology inference. We evaluated the accuracy of our method on synthetic data and on two biological datasets consisting of eukaryotic and fungal species. Our results show that combining synteny with similarity provides a significant improvement in homology inference.
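How gene neighbourhood conservation can support a similarity score is sketched below. This is an illustration of the general idea only and should not be read as the GenFamClust scoring scheme: the neighbourhood window, the use of the best-scoring neighbour pair as the synteny score, and the weighted sum are all assumptions of this sketch, and every name in it is hypothetical.

```python
def neighbours(genome, gene, window):
    """Genes within `window` positions of `gene` on an ordered genome list."""
    i = genome.index(gene)
    return [g for g in genome[max(0, i - window): i + window + 1] if g != gene]


def pair_similarity(similarity, a, b):
    """Look up a symmetric pairwise similarity, defaulting to 0."""
    return similarity.get(frozenset((a, b)), 0.0)


def synteny_score(a, genome_a, b, genome_b, similarity, window=2):
    """Best similarity found between any neighbour of a and any neighbour of b."""
    best = 0.0
    for x in neighbours(genome_a, a, window):
        for y in neighbours(genome_b, b, window):
            best = max(best, pair_similarity(similarity, x, y))
    return best


def combined_score(a, genome_a, b, genome_b, similarity, weight=0.5, window=2):
    """Weighted sum of direct similarity and neighbourhood (synteny) support."""
    sim = pair_similarity(similarity, a, b)
    syn = synteny_score(a, genome_a, b, genome_b, similarity, window)
    return (1.0 - weight) * sim + weight * syn


# Toy data: two genomes with conserved gene order.
genome_a = ["a1", "a2", "a3"]
genome_b = ["b1", "b2", "b3"]
similarity = {
    frozenset(("a1", "b1")): 0.9,
    frozenset(("a2", "b2")): 0.4,
    frozenset(("a3", "b3")): 0.8,
}
print(combined_score("a2", genome_a, "b2", genome_b, similarity))
```

In the toy data, the pair (a2, b2) has only moderate direct similarity, but its flanking genes are highly similar, so the combined score is higher than the similarity alone.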

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016. 69 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 19
Keyword
Phylogenetics, Phylogenomics, Evolution, Domain Evolution, Gene tree, Domain tree, Bayesian Inference, Markov Chain Monte Carlo, Homology Inference, Gene families, C2H2 Zinc-Finger, Reelin Protein
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-191352 (URN)
978-91-7729-091-9 (ISBN)
Public defence
2016-09-26, Conference room Air, SciLifeLab, Tomtebodavägen 23A, Solna, Stockholm, Stockholm, 09:00 (English)
Funder
Swedish e‐Science Research Center
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160904

Available from: 2016-09-04 Created: 2016-08-29 Last updated: 2016-09-04 Bibliographically approved

By author/editor
Muhammad, Sayyed Auwn; Lagergren, Jens