GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). (Lars Arvestad) ORCID iD: 0000-0003-0539-3491
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). ORCID iD: 0000-0002-6664-1607
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). ORCID iD: 0000-0001-5341-1733
2016 (English) In: BMC EVOLUTIONARY BIOLOGY, ISSN 1471-2148, Vol. 16. Article in journal (Other academic) Published
Abstract [en]

Background: Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. Algorithms have also been designed to utilize conservation in gene order as an indication of homologous regions. We have developed GenFamClust, a method based on quantification of both gene order conservation and sequence similarity.

Results: In this study, we validate GenFamClust by comparing it to well-known homology inference algorithms on a synthetic dataset. We applied several popular clustering algorithms to homologs inferred by GenFamClust and other algorithms on a metazoan dataset and studied the outcomes. Accuracy, similarity, dependence, and other characteristics were investigated for the gene families yielded by the clustering algorithms. GenFamClust was also applied to genes from a set of complete fungal genomes, and gene families were inferred using clustering. The resulting gene families were compared with a manually curated gold standard of pillars from the Yeast Gene Order Browser. We found that the gene-order component of GenFamClust is simple, yet biologically realistic, and captures local synteny information for homologs.

Conclusions: The study shows that GenFamClust is a more accurate, informed, and comprehensive pipeline to infer homologs and gene families than other commonly used homology and gene-family inference methods.
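The idea of combining sequence similarity with gene-order conservation can be illustrated with a small sketch. This is not the GenFamClust scoring scheme, which the abstract does not specify; the function names (`neighbours`, `synteny_score`, `is_candidate_homolog`), the window size `k`, the thresholds, and the rule of requiring both signals are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List

# Pairwise similarity scores keyed by an unordered gene pair (hypothetical format).
Similarity = Dict[FrozenSet[str], float]


def neighbours(genome: List[str], gene: str, k: int) -> List[str]:
    """Genes within k positions of `gene` in an ordered genome, excluding `gene`."""
    i = genome.index(gene)
    return genome[max(0, i - k):i] + genome[i + 1:i + 1 + k]


def synteny_score(a: str, genome_a: List[str], b: str, genome_b: List[str],
                  sim: Similarity, k: int = 2) -> float:
    """Best similarity between any neighbour of `a` and any neighbour of `b`.

    Assumption: local synteny is summarised by the single best-matching
    pair of flanking genes; missing pairs count as similarity 0.
    """
    best = 0.0
    for x in neighbours(genome_a, a, k):
        for y in neighbours(genome_b, b, k):
            best = max(best, sim.get(frozenset((x, y)), 0.0))
    return best


def is_candidate_homolog(a: str, genome_a: List[str], b: str, genome_b: List[str],
                         sim: Similarity, k: int = 2,
                         min_sim: float = 0.5, min_syn: float = 0.5) -> bool:
    """Toy decision rule: require both direct similarity and neighbourhood support."""
    pair_sim = sim.get(frozenset((a, b)), 0.0)
    syn = synteny_score(a, genome_a, b, genome_b, sim, k)
    return pair_sim >= min_sim and syn >= min_syn


# Two toy genomes given as ordered gene lists.
genome_a = ["a1", "a2", "a3", "a4"]
genome_b = ["b1", "b2", "b3", "b4"]
sim = {
    frozenset(("a1", "b1")): 0.8,
    frozenset(("a2", "b2")): 0.9,
    frozenset(("a2", "b4")): 0.9,
}

# a2 and b2 are similar and share a similar flanking pair (a1, b1).
print(is_candidate_homolog("a2", genome_a, "b2", genome_b, sim, k=1))
# a2 and b4 are equally similar, but their neighbourhoods share nothing.
print(is_candidate_homolog("a2", genome_a, "b4", genome_b, sim, k=1))
```

In this toy data, the pairs (a2, b2) and (a2, b4) have the same sequence similarity, so similarity alone cannot separate them; only the first pair has neighbourhood support.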

Place, publisher, year, edition, pages
BioMed Central, 2016. Vol. 16
Keywords [en]
Homology inference; Gene synteny; Gene similarity; Gene family; Clustering; Gene order conservation
National Category
Bioinformatics and Systems Biology
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-180542; DOI: 10.1186/s12862-016-0684-2; ISI: 000377161400002; PubMedID: 27260514; Scopus ID: 2-s2.0-84973324604; OAI: oai:DiVA.org:kth-180542; DiVA, id: diva2:895223
Funder
Swedish e‐Science Research Center
Note

QC 20160628

Available from: 2016-01-18 Created: 2016-01-18 Last updated: 2016-08-31 Bibliographically approved
In thesis
1. From genomes to post-processing of Bayesian inference of phylogeny
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Life is extremely complex and amazingly diverse, ranging from single-celled prokaryotes to multicellular human beings; it has taken billions of years of evolution to attain the level of complexity we observe in nature today. With the availability of molecular sequence data, algorithms inferring homology and gene families have emerged, and similarity in gene content between two genes has been the major signal utilized for homology inference. Recently there has been a significant rise in the number of species with fully sequenced genomes, which provides an opportunity to investigate and infer homologs with greater accuracy and in a more informed way. Phylogeny analysis explains the relationship between member genes of a gene family in a simple, graphical and plausible way using a tree representation. Bayesian phylogenetic inference is a probabilistic method used to infer gene phylogenies and posteriors of other evolutionary parameters. The Markov chain Monte Carlo (MCMC) algorithm, in particular with the Metropolis-Hastings sampling scheme, is the most commonly employed algorithm to determine the evolutionary history of genes. Many software packages are available that process the results from each MCMC run and explore the parameter posterior, but there is a need for interactive software that can analyse both discrete and real-valued parameters, and which has convergence assessment and burn-in estimation diagnostics specifically designed for Bayesian phylogenetic inference.

In this thesis, a synteny-aware approach for gene homology inference, called GenFamClust (GFC), is proposed that uses gene content and gene order conservation to infer homology. The feature that distinguishes GFC from earlier homology inference methods is that local synteny is combined with gene similarity to infer homologs, without inferring homologous regions. GFC was validated for accuracy on a simulated dataset. Gene families were computed by applying clustering algorithms to homologs inferred by GFC, and compared for accuracy, dependence and similarity with gene families inferred by other popular gene family inference methods on a eukaryotic dataset. Gene families in fungi obtained from GFC were evaluated against pillars from the Yeast Gene Order Browser. Genome-wide gene families for some eukaryotic species are computed using this approach.

Another topic addressed in this thesis is the processing of MCMC traces for Bayesian phylogenetic inference. We introduce a new software tool, VMCMC, which simplifies post-processing of MCMC traces. VMCMC can be used both as a GUI-based application and as a convenient command-line tool. VMCMC supports interactive exploration, is suitable for automated pipelines, and can handle both real-valued and discrete parameters observed in an MCMC trace. We propose and implement joint burn-in estimators that are specifically applicable to Bayesian phylogenetic inference. These methods have been compared for similarity with some other popular convergence diagnostics. We show that Bayesian phylogenetic inference and VMCMC can be applied to infer valuable evolutionary information for a biological case: the evolutionary history of the FERM domain.
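To make the notion of a joint burn-in estimate concrete, here is a minimal sketch. It does not reproduce any of VMCMC's estimators, which the abstract does not describe; the per-parameter rule (first sample that reaches the median of the second half of the trace), the joint rule (the largest per-parameter estimate), and the names `burn_in_single` and `burn_in_joint` are assumptions chosen only for illustration.

```python
from statistics import median
from typing import Dict, Sequence


def burn_in_single(trace: Sequence[float]) -> int:
    """Index of the first sample that reaches the median of the trace's second half.

    Assumes a non-empty trace. Returns len(trace) if the reference level
    is never reached.
    """
    ref = median(trace[len(trace) // 2:])
    starts_below = trace[0] < ref
    for i, x in enumerate(trace):
        reached = x >= ref if starts_below else x <= ref
        if reached:
            return i
    return len(trace)


def burn_in_joint(traces: Dict[str, Sequence[float]]) -> int:
    """Joint burn-in: discard samples until every parameter has passed its own burn-in."""
    return max(burn_in_single(t) for t in traces.values())


# Toy traces for two hypothetical parameters sampled in the same MCMC run.
traces = {
    "log_likelihood": [-50.0, -30.0, -12.0, -10.0, -11.0, -10.0, -9.0, -10.0],
    "rate": [5.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9],
}

print(burn_in_single(traces["log_likelihood"]))
print(burn_in_single(traces["rate"]))
print(burn_in_joint(traces))
```

Taking the maximum over parameters is the most conservative way to combine per-parameter estimates: a sample is kept only if all monitored parameters are judged to have left their initial transient.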

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. p. viii, 65
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2016:01
Keywords
Bayesian inference
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-181319 (URN); 978-91-7595-849-1 (ISBN)
Public defence
2016-02-25, Fire, Tomtebodavägen 23, 171 65, Solna, 14:00 (English)
Opponent
Supervisors
Note

QC 20160201

Available from: 2016-02-01 Created: 2016-01-31 Last updated: 2018-01-10 Bibliographically approved
2. Probabilistic Modelling of Domain and Gene Evolution
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Phylogenetic inference relies heavily on statistical models that have been extended and refined over the past years into complex hierarchical models to capture the intricacies of evolutionary processes. The wealth of information in the form of fully sequenced genomes has led to the development of methods that are used to reconstruct gene and species evolutionary histories in greater and more accurate detail. However, genes are composed of evolutionarily conserved sequence segments called domains, and domains can also be affected by duplications, losses, and bifurcations implied by gene or species evolution. This thesis proposes an extension of evolutionary models, such as duplication-loss, rate, and substitution models, that have previously been used to model gene evolution, so that they also model domain evolution.

In this thesis, I propose DomainDLRS: a comprehensive, hierarchical Bayesian method, based on the DLRS model by Åkerborg et al. (2009), that models domain evolution as occurring inside the gene and species tree. The method incorporates a birth-death process to model domain duplications and losses, along with a domain sequence evolution model with a relaxed molecular clock assumption. The method employs a variant of the Markov Chain Monte Carlo technique called Grouped Independence Metropolis-Hastings to estimate the posterior distribution over domain and gene trees. Using this method, we analysed the Zinc-Finger and PRDM9 gene families, which provides interesting insight into domain evolution.

Finally, a synteny-aware approach for gene homology inference, called GenFamClust, is proposed that uses similarity and gene neighbourhood conservation to improve homology inference. We evaluated the accuracy of our method on a synthetic dataset and two biological datasets consisting of eukaryotic and fungal species. Our results show that using synteny together with similarity provides a significant improvement in homology inference.

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016. p. 69
Series
TRITA-CSC-A, ISSN 1653-5723 ; 19
Keywords
Phylogenetics, Phylogenomics, Evolution, Domain Evolution, Gene tree, Domain tree, Bayesian Inference, Markov Chain Monte Carlo, Homology Inference, Gene families, C2H2 Zinc-Finger, Reelin Protein
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-191352 (URN); 978-91-7729-091-9 (ISBN)
External cooperation:
Public defence
2016-09-26, Conference room Air, SciLifeLab, Tomtebodavägen 23A, Solna, Stockholm, Stockholm, 09:00 (English)
Opponent
Supervisors
Funder
Swedish e‐Science Research Center; Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160904

Available from: 2016-09-04 Created: 2016-08-29 Last updated: 2018-01-10 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Ali, Raja Hashim; Muhammad, Sayyed Auwn; Arvestad, Lars
