Burnin estimation and convergence assessment in Bayesian phylogenetic inference
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). ORCID iD: 0000-0003-0539-3491
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). ORCID iD: 0000-0001-5341-1733
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Convergence assessment and burnin estimation are central concepts in Markov chain Monte Carlo algorithms. Over the past few decades, studies have examined the effects and statistical properties of different convergence assessment methods and compared them with one another. However, little work has been done on the effect of convergence diagnostics on the posterior distribution of tree parameters, or on which method researchers should use in Bayesian phylogenetic inference. In this study, we propose and evaluate two novel burnin estimation methods that estimate burnin using all parameters jointly. We also consider some other popular convergence diagnostics, evaluate them in light of parallel chains, and quantify the effect of burnin estimates from various convergence diagnostics on the posterior distribution of trees. We motivate the use of convergence diagnostics to assess convergence and estimate burnin in Bayesian phylogenetic inference, and find that it is better to employ convergence diagnostics than to remove a fixed percentage as burnin. We conclude that the last burnin estimator using effective sample size appears to estimate burnin better than all other convergence diagnostics.
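
The abstract does not spell out how the proposed estimators work, so the following is only an illustrative sketch of one common heuristic: choose the burnin that maximizes the effective sample size (ESS) of the remaining trace. It is not the authors' joint estimator. It handles a single real-valued parameter, and the function names (`effective_sample_size`, `estimate_burnin`, `make_demo_trace`), the candidate step size, the lag cutoff, and the synthetic trace are all assumptions made for the example.

```python
import random


def effective_sample_size(x, max_lag=200):
    """ESS = n / (1 + 2 * sum of autocorrelations).

    The autocorrelation sum is truncated at the first non-positive lag,
    a simple and commonly used cutoff (an assumption of this sketch).
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    var = sum(v * v for v in centered)
    if var == 0.0:
        return float(n)  # constant trace: no autocorrelation can be estimated
    rho_sum = 0.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        rho = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / var
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def estimate_burnin(trace, step=50, max_fraction=0.5):
    """Return (burnin, ess) for the candidate burnin that maximizes ESS.

    Candidates are 0, step, 2*step, ... up to max_fraction of the trace.
    """
    best_b, best_ess = 0, effective_sample_size(trace)
    limit = int(len(trace) * max_fraction)
    for b in range(step, limit + 1, step):
        ess = effective_sample_size(trace[b:])
        if ess > best_ess:
            best_b, best_ess = b, ess
    return best_b, best_ess


def make_demo_trace(seed=1, n=1000, transient=200):
    """Hypothetical trace: a decaying transient followed by stationary noise."""
    rng = random.Random(seed)
    trace = []
    for i in range(n):
        drift = 50.0 * (1.0 - i / transient) if i < transient else 0.0
        trace.append(drift + rng.gauss(0.0, 1.0))
    return trace


trace = make_demo_trace()
burnin, ess = estimate_burnin(trace)
print(f"estimated burnin: {burnin}, ESS after burnin: {ess:.1f}")
```

In contrast, removing a fixed percentage would discard the same number of samples regardless of how long the transient phase of the chain actually lasted.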

Keyword [en]
Convergence assessment
National Category
Bioinformatics and Systems Biology
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-180544
OAI: oai:DiVA.org:kth-180544
DiVA: diva2:895243
Note

QS 2016

Available from: 2016-01-18 Created: 2016-01-18 Last updated: 2016-02-01 Bibliographically approved
In thesis
1. From genomes to post-processing of Bayesian inference of phylogeny
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Life is extremely complex and amazingly diverse, ranging from single-celled prokaryotes to multicellular human beings; it has taken billions of years of evolution to attain the level of complexity we now observe in nature. With the availability of molecular sequence data, algorithms for inferring homology and gene families have emerged, and similarity in gene content between two genes has been the major signal used for homology inference. Recently there has been a significant rise in the number of species with fully sequenced genomes, which provides an opportunity to investigate and infer homologs with greater accuracy and in a more informed way. Phylogenetic analysis explains the relationship between member genes of a gene family in a simple, graphical and plausible way using a tree representation. Bayesian phylogenetic inference is a probabilistic method used to infer gene phylogenies and posteriors of other evolutionary parameters. The Markov chain Monte Carlo (MCMC) algorithm, in particular with the Metropolis-Hastings sampling scheme, is the most commonly employed algorithm for determining the evolutionary history of genes. Many software packages are available that process the results of each MCMC run and explore the parameter posteriors, but there is a need for interactive software that can analyse both discrete and real-valued parameters and that has convergence assessment and burnin estimation diagnostics specifically designed for Bayesian phylogenetic inference.

In this thesis, we propose a synteny-aware approach for gene homology inference, called GenFamClust (GFC), that uses gene content and gene order conservation to infer homology. The feature that distinguishes GFC from earlier homology inference methods is that local synteny is combined with gene similarity to infer homologs, without inferring homologous regions. GFC was validated for accuracy on a simulated dataset. Gene families were computed by applying clustering algorithms to homologs inferred by GFC, and were compared for accuracy, dependence and similarity with gene families inferred by other popular gene family inference methods on a eukaryotic dataset. Gene families in fungi obtained from GFC were evaluated against pillars from the Yeast Gene Order Browser. Genome-wide gene families for some eukaryotic species were computed using this approach.

Another topic addressed in this thesis is the processing of MCMC traces for Bayesian phylogenetic inference. We introduce a new software tool, VMCMC, which simplifies post-processing of MCMC traces. VMCMC can be used both as a GUI-based application and as a convenient command-line tool. VMCMC supports interactive exploration, is suitable for automated pipelines, and can handle both real-valued and discrete parameters observed in an MCMC trace. We propose and implement joint burnin estimators that are specifically applicable to Bayesian phylogenetic inference. These methods have been compared for similarity with some other popular convergence diagnostics. We show that Bayesian phylogenetic inference and VMCMC can be applied to infer valuable evolutionary information for a biological case: the evolutionary history of the FERM domain.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. viii, 65 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2016:01
Keyword
Bayesian inference
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-181319 (URN)
978-91-7595-849-1 (ISBN)
Public defence
2016-02-25, Fire, Tomtebodavägen 23, 171 65, Solna, 14:00 (English)
Opponent
Supervisors
Note

QC 20160201

Available from: 2016-02-01 Created: 2016-01-31 Last updated: 2016-02-01 Bibliographically approved

By author/editor
Ali, Raja Hashim; Arvestad, Lars