Probabilistic Modelling of Domain and Gene Evolution
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). (Jens Lagergren) ORCID iD: 0000-0002-6664-1607
2016 (English) Doctoral thesis, comprising papers (Other academic)
Abstract [en]

Phylogenetic inference relies heavily on statistical models that have been extended and refined over the past years into complex hierarchical models to capture the intricacies of evolutionary processes. The wealth of information in the form of fully sequenced genomes has led to the development of methods that reconstruct gene and species evolutionary histories in greater and more accurate detail. However, genes are composed of evolutionarily conserved sequence segments called domains, and domains can also be affected by duplications, losses, and bifurcations implied by gene or species evolution. This thesis proposes an extension of evolutionary models, such as duplication-loss, rate, and substitution models, that have previously been used to model gene evolution, so that they also model domain evolution.

In this thesis, I propose DomainDLRS: a comprehensive, hierarchical Bayesian method, based on the DLRS model by Åkerborg et al. (2009), that models domain evolution as occurring inside the gene and species trees. The method incorporates a birth-death process to model domain duplications and losses, along with a domain sequence evolution model with a relaxed molecular clock assumption. The method employs a variant of the Markov Chain Monte Carlo technique, called Grouped Independence Metropolis-Hastings, to estimate the posterior distribution over domain and gene trees. Using this method, we analysed the Zinc-Finger and PRDM9 gene families, which provides interesting insights into domain evolution.
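As a rough illustration of the Grouped Independence Metropolis-Hastings idea mentioned above, the sketch below runs the sampler on a toy model, not on trees. It replaces the exact likelihood with an unbiased Monte Carlo estimate over a latent variable and, on acceptance, carries that estimate forward instead of recomputing it. The toy model (latent z ~ N(theta, 1), observation y ~ N(z, 1), prior theta ~ N(0, 10)) and all function names are assumptions made for this example; they are not taken from the thesis.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def estimate_likelihood(theta, y, n_latent, rng):
    """Unbiased Monte Carlo estimate of p(y | theta) for the toy model.

    Latent values z are drawn from N(theta, 1) and p(y | z) = N(y; z, 1)
    is averaged over the draws.
    """
    total = 0.0
    for _ in range(n_latent):
        z = rng.gauss(theta, 1.0)
        total += normal_pdf(y, z, 1.0)
    return total / n_latent


def gimh(y, n_steps, n_latent, step_sd, rng):
    """Toy GIMH sampler for theta with a N(0, 10) prior (hypothetical example)."""
    theta = 0.0
    lik = estimate_likelihood(theta, y, n_latent, rng)
    samples = []
    for _ in range(n_steps):
        proposal = rng.gauss(theta, step_sd)  # symmetric random-walk proposal
        prop_lik = estimate_likelihood(proposal, y, n_latent, rng)
        ratio = (prop_lik * normal_pdf(proposal, 0.0, 10.0)) / (
            lik * normal_pdf(theta, 0.0, 10.0)
        )
        if rng.random() < ratio:
            # Keep the accepted estimate; it is reused, not re-estimated,
            # in later iterations.
            theta, lik = proposal, prop_lik
        samples.append(theta)
    return samples
```

For this toy model the exact posterior is normal with mean 100y/102, so the sample mean gives a quick sanity check. More latent draws per step reduce the noise in the estimate at a higher cost per iteration.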

Finally, a synteny-aware approach for gene homology inference, called GenFamClust, is proposed that uses similarity and gene neighbourhood conservation to improve homology inference. We evaluated the accuracy of our method on one synthetic and two biological datasets, the latter consisting of eukaryotic and fungal species. Our results show that combining synteny with similarity significantly improves homology inference.

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016, p. 69
Series
TRITA-CSC-A, ISSN 1653-5723; 19
Keywords [en]
Phylogenetics, Phylogenomics, Evolution, Domain Evolution, Gene tree, Domain tree, Bayesian Inference, Markov Chain Monte Carlo, Homology Inference, Gene families, C2H2 Zinc-Finger, Reelin Protein
National subject category
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-191352; ISBN: 978-91-7729-091-9 (print); OAI: oai:DiVA.org:kth-191352; DiVA id: diva2:956729
Public defence
2016-09-26, Conference room Air, SciLifeLab, Tomtebodavägen 23A, Solna, Stockholm, 09:00 (English)
Opponent
Supervisor
Funder
Swedish e‐Science Research Center; Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160904

Available from: 2016-09-04 Created: 2016-08-29 Last updated: 2018-01-10 Bibliographically approved
List of papers
1. Species tree aware simultaneous reconstruction of gene and domain evolution
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Most genes are composed of multiple domains with a common evolutionary history that typically perform a specific function in the resulting protein. As witnessed by many studies of key gene families, it is important to understand how domains have been duplicated, lost, transferred between genes, and rearranged. Similarly to the case of evolutionary events affecting entire genes, these domain events have large consequences for phylogenetic reconstruction and, in addition, they create considerable obstacles for gene sequence alignment algorithms, a prerequisite for phylogenetic reconstruction.

We introduce the Domain-DLRS model, a hierarchical, generative probabilistic model containing three levels corresponding to species, genes, and domains, respectively. From a dated species tree, a gene tree is generated according to the DL model, which is a birth-death model generalized to occur in a dated tree. Then, from the dated gene tree, a pre-specified number of dated domain trees are generated using the DL model and the molecular clock is relaxed, effectively converting edge times to edge lengths. Finally, for each domain tree and its lengths, domain sequences are generated for the leaves based on a selected model of sequence evolution.
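The two generative ingredients described above, a birth-death process running along a dated edge and a relaxed clock that turns edge times into edge lengths, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Domain-DLRS implementation: the function names are hypothetical, the birth-death process is simulated event by event for a single edge, and the per-edge rate is assumed to be gamma-distributed.

```python
import random


def simulate_birth_death(t, birth, death, rng):
    """Simulate a linear birth-death process for time t, starting from one lineage.

    Each lineage duplicates at rate `birth` and is lost at rate `death`.
    Returns the number of lineages alive at time t.
    """
    n = 1
    elapsed = 0.0
    while n > 0:
        total_rate = n * (birth + death)
        if total_rate <= 0.0:
            break  # no events can occur
        elapsed += rng.expovariate(total_rate)  # waiting time to the next event
        if elapsed > t:
            break
        if rng.random() < birth / (birth + death):
            n += 1  # duplication
        else:
            n -= 1  # loss
    return n


def relax_clock(edge_time, shape, scale, rng):
    """Convert an edge time to an edge length using a gamma-distributed rate.

    The gamma rate distribution is an assumption made for this sketch.
    """
    rate = rng.gammavariate(shape, scale)
    return edge_time * rate
```

With equal birth and death rates the expected number of surviving lineages stays at one, which makes a convenient check on the simulation.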

For this model, we present an MCMC-based inference framework called Domain-DLRS that takes as input a dated species tree together with a multiple sequence alignment for each domain family, and provides as output an estimated posterior distribution over reconciled gene and domain trees. By requiring aligned domains rather than genes, our framework evades the problem of aligning genes that have been exposed to domain duplications, in particular non-tandem domain duplications. We show that Domain-DLRS performs better than MrBayes on synthetic data and that it outperforms MrBayes on biological data. We analyse several zinc-finger genes and show that most domain duplications have been tandem duplications, of which some have involved two or more domains, but non-tandem duplications have also been common, in particular in gene families of complex evolutionary history such as PRDM9.

Keywords
Probabilistic Modeling, Domain Evolution, Bayesian Inference, Domain Tree Reconstruction
National subject category
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-191349 (URN)
External cooperation:
Funder
Swedish e‐Science Research Center
Note

QC 20160902

Available from: 2016-08-29 Created: 2016-08-29 Last updated: 2018-01-10 Bibliographically approved
2. Sequence Analysis and Evolutionary Studies of Reelin Proteins
2015 (English) In: Bioinformatics and Biology Insights, ISSN 1177-9322, E-ISSN 1177-9322, Vol. 9, p. 187-193. Article in journal (Refereed) Published
Abstract [en]

The reelin gene is conserved across many vertebrate species, including humans. The protein product of this gene plays several important roles in early brain development and regulation of neural network plasticity of a matured brain structure. With an extended structure of 3461 amino acid sequences, consisting of eight reelin repeats, the human reelin sequence stands out as an exceptional model for evolutionary studies. In this study, sequence analysis of the human reelin and its homologues and reelin sequences from 104 other species is described in detail. Interesting sequence conservation patterns of individual repeats have been highlighted. Sequence phylogeny of the reelin sequences indicates a pattern similar to the evolution of the species, thereby serving as a highly conserved family for evolutionary purposes. Multiple sequence alignment of different reelin domain repeats, derived from homologues, suggests specific functions for individual repeats and high sequence conservation across reelin repeats from different organisms, albeit with few unusual domain architectures. A three-dimensional structural model of the full-length human reelin is now available that provides clues on residues at the dimer interface.

Place, publisher, year, edition, pages
Libertas Academica, 2015
Keywords
reelin protein, glycoprotein, domain repeats, phylogeny, domain architecture, neurogenesis, 3D modeling
National subject category
Identifiers
urn:nbn:se:kth:diva-181010 (URN); 10.4137/BBI.S26530 (DOI); 000367288300004 (ISI); 26715843 (PubMedID); 2-s2.0-84961266629 (Scopus ID)
External cooperation:
Note

QC 20160126

Available from: 2016-01-26 Created: 2016-01-26 Last updated: 2017-11-30 Bibliographically approved
3. Quantitative synteny scoring improves homology inference and partitioning of gene families
2013 (English) In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, p. S12-. Article in journal (Refereed) Published
Abstract [en]

Background: Clustering sequences into families has long been an important step in characterization of genes and proteins. There are many algorithms developed for this purpose, most of which are based on either direct similarity between gene pairs or some sort of network structure, where weights on edges of constructed graphs are based on similarity. However, conserved synteny is an important signal that can help distinguish homology and it has not been utilized to its fullest potential. Results: Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationships and merge known homologs into groups of gene families. GenFamClust identifies homologs in a more informed and accurate manner than similarity-based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets consisting of fully sequenced genomes of eukaryotes and synthetic data. Conclusions: The results obtained from both datasets confirm that synteny helps determine homology and that GenFamClust improves on the Neighborhood Correlation method. The accuracy as well as the definition of synteny scores is the most valuable contribution of GenFamClust.
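To make the notion of a synteny signal concrete, the sketch below scores a gene pair by how many of the first gene's chromosomal neighbours have a similar gene among the second gene's neighbours. This is a deliberately simplified, hypothetical score written for illustration; it is not the synteny score defined by GenFamClust, and the function names, the window size `k`, and the `similar` predicate are all assumptions.

```python
def neighbours(genome, gene, k):
    """Genes within k positions of `gene` in an ordered gene list, excluding `gene`."""
    i = genome.index(gene)
    return genome[max(0, i - k):i] + genome[i + 1:i + 1 + k]


def synteny_score(genome_a, gene_a, genome_b, gene_b, similar, k=2):
    """Fraction of gene_a's neighbours with a similar gene among gene_b's neighbours.

    `similar(x, y)` is a caller-supplied predicate, for example a thresholded
    sequence-similarity test. The score is asymmetric in its two gene arguments.
    """
    near_a = neighbours(genome_a, gene_a, k)
    near_b = neighbours(genome_b, gene_b, k)
    if not near_a:
        return 0.0
    matched = sum(1 for x in near_a if any(similar(x, y) for y in near_b))
    return matched / len(near_a)


# Toy usage: genes count as similar when their numeric suffixes agree.
genome_a = ["a1", "a2", "a3", "a4", "a5"]
genome_b = ["b9", "b2", "b3", "b4", "b7"]
same_suffix = lambda x, y: x[1:] == y[1:]
score = synteny_score(genome_a, "a3", genome_b, "b3", same_suffix, k=2)
```

In the toy data, two of the four neighbours of `a3` have a counterpart near `b3`, so `score` is 0.5. A pipeline would combine such a neighbourhood signal with direct sequence similarity before clustering.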

Place, publisher, year, edition, pages
BioMed Central, 2013
Keywords
Efficient Algorithm, Eukaryotic Genomes, Protein Families, Orthologs, Identification, Clusters, Alignment, Blast, Link
National subject category
Identifiers
urn:nbn:se:kth:diva-136429 (URN); 10.1186/1471-2105-14-S15-S12 (DOI); 000328316700012 (ISI)
Conference
11th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics, Lyon, France, Oct 17-19, 2013
Funder
Swedish e‐Science Research Center; Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20131219

Available from: 2013-12-05 Created: 2013-12-05 Last updated: 2018-01-11 Bibliographically approved
4. GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm
2016 (English) In: BMC Evolutionary Biology, ISSN 1471-2148, Vol. 16. Article in journal (Other academic) Published
Abstract [en]

Background: Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. Algorithms have also been designed to utilize conservation in gene order as an indication of homologous regions. We have developed GenFamClust, a method based on quantification of both gene order conservation and sequence similarity. Results: In this study, we validate GenFamClust by comparing it to well known homology inference algorithms on a synthetic dataset. We applied several popular clustering algorithms on homologs inferred by GenFamClust and other algorithms on a metazoan dataset and studied the outcomes. Accuracy, similarity, dependence, and other characteristics were investigated for gene families yielded by the clustering algorithms. GenFamClust was also applied to genes from a set of complete fungal genomes and gene families were inferred using clustering. The resulting gene families were compared with a manually curated gold standard of pillars from the Yeast Gene Order Browser. We found that the gene-order component of GenFamClust is simple, yet biologically realistic, and captures local synteny information for homologs. Conclusions: The study shows that GenFamClust is a more accurate, informed, and comprehensive pipeline to infer homologs and gene families than other commonly used homology and gene-family inference methods.

Place, publisher, year, edition, pages
BioMed Central, 2016
Keywords
Homology inference; Gene synteny; Gene similarity; Gene family; Clustering; Gene order conservation
National subject category
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-180542 (URN); 10.1186/s12862-016-0684-2 (DOI); 000377161400002 (ISI); 27260514 (PubMedID); 2-s2.0-84973324604 (Scopus ID)
Funder
Swedish e‐Science Research Center
Note

QC 20160628

Available from: 2016-01-18 Created: 2016-01-18 Last updated: 2016-08-31 Bibliographically approved

Open Access in DiVA

thesis.pdf (1755 kB), 229 downloads
File information
File: FULLTEXT01.pdf; File size: 1755 kB; Checksum: SHA-512
d7621d06f729673637530b572eb6173e56d9bafd0ff7fb812c82f7b2195d1840e64afca76e86a7d3fa73f34697a60247ac18f4dd81187b887575026be2c2f95e
Type: fulltext; Mimetype: application/pdf
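
A downloaded copy of the full text can be checked against the SHA-512 checksum listed above. The following is a minimal sketch using Python's standard `hashlib` module; the function name and the local file name in the usage note are assumptions.

```python
import hashlib


def sha512_of_stream(stream, chunk_size=65536):
    """Return the hexadecimal SHA-512 digest of a binary file-like object."""
    digest = hashlib.sha512()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()
```

To use it, open the downloaded file in binary mode, for example `with open("FULLTEXT01.pdf", "rb") as f:`, pass `f` to `sha512_of_stream`, and compare the result with the checksum string in the record.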
