GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). (Lars Arvestad) ORCID iD: 0000-0003-0539-3491
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). ORCID iD: 0000-0002-6664-1607
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). ORCID iD: 0000-0001-5341-1733
2016 (English) In: BMC EVOLUTIONARY BIOLOGY, ISSN 1471-2148, Vol. 16. Article in journal (Other academic), Published
Abstract [en]

Background: Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. Algorithms have also been designed to utilize conservation in gene order as an indication of homologous regions. We have developed GenFamClust, a method based on quantification of both gene order conservation and sequence similarity.

Results: In this study, we validate GenFamClust by comparing it to well-known homology inference algorithms on a synthetic dataset. We applied several popular clustering algorithms to homologs inferred by GenFamClust and other algorithms on a metazoan dataset and studied the outcomes. Accuracy, similarity, dependence, and other characteristics were investigated for gene families yielded by the clustering algorithms. GenFamClust was also applied to genes from a set of complete fungal genomes, and gene families were inferred using clustering. The resulting gene families were compared with a manually curated gold standard of pillars from the Yeast Gene Order Browser. We found that the gene-order component of GenFamClust is simple, yet biologically realistic, and captures local synteny information for homologs.

Conclusions: The study shows that GenFamClust is a more accurate, informed, and comprehensive pipeline to infer homologs and gene families than other commonly used homology and gene-family inference methods.

Place, publisher, year, edition, pages
BioMed Central, 2016. Vol. 16
Keyword [en]
Homology inference; Gene synteny; Gene similarity; Gene family; Clustering; Gene order conservation
National Category
Bioinformatics and Systems Biology
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-180542
DOI: 10.1186/s12862-016-0684-2
ISI: 000377161400002
PubMedID: 27260514
ScopusID: 2-s2.0-84973324604
OAI: oai:DiVA.org:kth-180542
DiVA: diva2:895223
Funder
Swedish e‐Science Research Center
Note

QC 20160628

Available from: 2016-01-18 Created: 2016-01-18 Last updated: 2016-08-31 Bibliographically approved
In thesis
1. From genomes to post-processing of Bayesian inference of phylogeny
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Life is extremely complex and amazingly diverse, ranging from single-celled prokaryotes to multicellular human beings; it has taken billions of years of evolution to attain the level of complexity we now observe in nature. With the availability of molecular sequence data, algorithms for inferring homology and gene families have emerged, and similarity in gene content between two genes has been the major signal used for homology inference. Recently there has been a significant rise in the number of species with fully sequenced genomes, which provides an opportunity to investigate and infer homologs with greater accuracy and in a more informed way. Phylogenetic analysis explains the relationships among member genes of a gene family in a simple, graphical and plausible way using a tree representation. Bayesian phylogenetic inference is a probabilistic method used to infer gene phylogenies and posteriors of other evolutionary parameters. The Markov chain Monte Carlo (MCMC) algorithm, in particular with the Metropolis-Hastings sampling scheme, is the most commonly employed algorithm for determining the evolutionary history of genes. Many software packages process the results of each MCMC run and explore the parameter posterior, but there is a need for interactive software that can analyse both discrete and real-valued parameters and that has convergence assessment and burn-in estimation diagnostics specifically designed for Bayesian phylogenetic inference.

In this thesis, a synteny-aware approach for gene homology inference, called GenFamClust (GFC), is proposed that uses gene content and gene order conservation to infer homology. The feature that distinguishes GFC from earlier homology inference methods is that local synteny is combined with gene similarity to infer homologs, without inferring homologous regions. GFC was validated for accuracy on a simulated dataset. Gene families were computed by applying clustering algorithms to homologs inferred by GFC, and compared for accuracy, dependence and similarity with gene families inferred by other popular gene family inference methods on a eukaryotic dataset. Gene families in fungi obtained from GFC were evaluated against pillars from the Yeast Gene Order Browser. Genome-wide gene families for some eukaryotic species are computed using this approach.

Another topic addressed in this thesis is the processing of MCMC traces for Bayesian phylogenetic inference. We introduce a new software tool, VMCMC, which simplifies post-processing of MCMC traces. VMCMC can be used both as a GUI-based application and as a convenient command-line tool. VMCMC supports interactive exploration, is suitable for automated pipelines, and can handle both real-valued and discrete parameters observed in an MCMC trace. We propose and implement joint burn-in estimators that are specifically applicable to Bayesian phylogenetic inference. These methods have been compared for similarity with some other popular convergence diagnostics. We show that Bayesian phylogenetic inference and VMCMC can be applied to infer valuable evolutionary information for a biological case: the evolutionary history of the FERM domain.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. viii, 65 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2016:01
Keyword
Bayesian inference
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-181319 (URN)
978-91-7595-849-1 (ISBN)
Public defence
2016-02-25, Fire, Tomtebodavägen 23, 171 65, Solna, 14:00 (English)
Note

QC 20160201

Available from: 2016-02-01 Created: 2016-01-31 Last updated: 2016-02-01 Bibliographically approved
2. Probabilistic Modelling of Domain and Gene Evolution
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Phylogenetic inference relies heavily on statistical models that have been extended and refined over the past years into complex hierarchical models to capture the intricacies of evolutionary processes. The wealth of information in the form of fully sequenced genomes has led to the development of methods that are used to reconstruct gene and species evolutionary histories in greater and more accurate detail. However, genes are composed of evolutionarily conserved sequence segments called domains, and domains can also be affected by duplications, losses, and bifurcations implied by gene or species evolution. This thesis proposes an extension of evolutionary models, such as duplication-loss, rate, and substitution, that have previously been used to model gene evolution, to model domain evolution.

In this thesis, I propose DomainDLRS: a comprehensive, hierarchical Bayesian method, based on the DLRS model by Åkerborg et al., 2009, that models domain evolution as occurring inside the gene and species tree. The method incorporates a birth-death process to model domain duplications and losses, along with a domain sequence evolution model with a relaxed molecular clock assumption. The method employs a variant of the Markov chain Monte Carlo technique called Grouped Independence Metropolis-Hastings to estimate the posterior distribution over domain and gene trees. Using this method, we performed analyses of the Zinc-Finger and PRDM9 gene families, which provide interesting insight into domain evolution.

Finally, a synteny-aware approach for gene homology inference, called GenFamClust, is proposed that uses similarity and gene neighbourhood conservation to improve homology inference. We evaluated the accuracy of our method on one synthetic and two biological datasets, the latter consisting of eukaryotic and fungal species. Our results show that using synteny together with similarity provides a significant improvement in homology inference.

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016. 69 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 19
Keyword
Phylogenetics, Phylogenomics, Evolution, Domain Evolution, Gene tree, Domain tree, Bayesian Inference, Markov Chain Monte Carlo, Homology Inference, Gene families, C2H2 Zinc-Finger, Reelin Protein
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-191352 (URN)
978-91-7729-091-9 (ISBN)
Public defence
2016-09-26, Conference room Air, SciLifeLab, Tomtebodavägen 23A, Solna, Stockholm, Stockholm, 09:00 (English)
Funder
Swedish e‐Science Research Center
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160904

Available from: 2016-09-04 Created: 2016-08-29 Last updated: 2016-09-04 Bibliographically approved
Authors
Ali, Raja Hashim; Muhammad, Sayyed Auwn; Arvestad, Lars