Integrating Sequence Evolution into Probabilistic Orthology Analysis
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab. ORCID iD: 0000-0002-2791-8773
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
2015 (English). In: Systematic Biology, ISSN 1063-5157, E-ISSN 1076-836X, Vol. 64, no. 6, p. 969-982. Article in journal (Refereed). Published.
Abstract [en]

Orthology analysis, that is, finding out whether a pair of homologous genes are orthologs (stemming from a speciation) or paralogs (stemming from a gene duplication), is of central importance in computational biology, genome annotation, and phylogenetic inference. In particular, an orthologous relationship makes functional equivalence of the two genes highly likely. A major approach to orthology analysis is to reconcile a gene tree to the corresponding species tree, most commonly using the most parsimonious reconciliation (MPR). However, most such phylogenetic orthology methods infer the gene tree without considering the constraints implied by the species tree and, perhaps even more importantly, only allow the gene sequences to influence the orthology analysis through the a priori reconstructed gene tree. We propose a sound, comprehensive Bayesian MCMC-based method, DLRSOrthology, to compute orthology probabilities. It efficiently sums over the possible gene trees and jointly takes into account the current gene tree, all possible reconciliations to the species tree, and the typically strong signal conveyed by the sequences. We compare our method with PrIME-GEM, a probabilistic orthology approach built on a probabilistic duplication-loss model, and MrBayesMPR, a probabilistic orthology approach that is based on conventional Bayesian inference coupled with MPR. We find that DLRSOrthology outperforms these competing approaches on synthetic data as well as on biological data sets and is robust to incomplete taxon sampling artifacts.
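
To make the MPR baseline mentioned in the abstract concrete, the following is a minimal sketch of the standard LCA-mapping reconciliation for a binary gene tree. It is not the DLRSOrthology method and is not taken from the paper; the toy species tree, the gene names, and the function names (`lca`, `reconcile`) are all hypothetical choices made for illustration.

```python
# Hypothetical toy species tree, stored as a child -> parent map.
SPECIES_PARENT = {
    "human": "primates",
    "chimp": "primates",
    "primates": "root",
    "mouse": "root",
    "root": None,
}


def ancestors(species):
    """Return the path from a species-tree node up to the root, inclusive."""
    path = []
    while species is not None:
        path.append(species)
        species = SPECIES_PARENT[species]
    return path


def lca(a, b):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def reconcile(gene_tree, gene_species):
    """Label every pair of genes as 'ortholog' or 'paralog' under the LCA mapping.

    gene_tree is a nested 2-tuple of gene names (binary tree, an assumption);
    gene_species maps each gene name to its species.
    """
    relations = {}

    def visit(node):
        # Returns (species-tree node this gene-tree node maps to, leaves below it).
        if isinstance(node, str):
            return gene_species[node], [node]
        left_map, left_leaves = visit(node[0])
        right_map, right_leaves = visit(node[1])
        here = lca(left_map, right_map)
        # A vertex mapped to the same species node as one of its children
        # is a duplication; otherwise it is a speciation.
        label = "paralog" if here in (left_map, right_map) else "ortholog"
        for g1 in left_leaves:
            for g2 in right_leaves:
                relations[frozenset((g1, g2))] = label
        return here, left_leaves + right_leaves

    visit(gene_tree)
    return relations


genes = {"h1": "human", "c1": "chimp", "h2": "human", "m1": "mouse"}
relations = reconcile((("h1", "c1"), ("h2", "m1")), genes)
```

In this toy example the root of the gene tree maps to the same species node as its right child, so it is labelled a duplication, and every pair that spans the two subtrees (for instance h1 and m1) is called paralogous. A single parsimonious call of this kind is what the probabilistic approach replaces with orthology probabilities.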

Place, publisher, year, edition, pages
Oxford University Press, 2015. Vol. 64, no 6, 969-982 p.
Keyword [en]
Comparative genomics, orthology, paralogy, gene duplication, gene loss, sequence evolution, relaxed molecular clock, probabilistic modeling, phylogenetics, tree reconciliation, tree realization
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-168167
DOI: 10.1093/sysbio/syv044
ISI: 000363168100007
Scopus ID: 2-s2.0-84946111865
OAI: oai:DiVA.org:kth-168167
DiVA: diva2:814543
Funder
Swedish e-Science Research Center
Swedish Research Council, 2010-4757
Magnus Bergvall Foundation
Note

QC 20151216. Updated from manuscript to article in journal.

Available from: 2015-05-27. Created: 2015-05-27. Last updated: 2017-12-04. Bibliographically approved.
In thesis
1. Probabilistic Models for Species Tree Inference and Orthology Analysis
2015 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

A phylogenetic tree is used to model gene evolution and species evolution using molecular sequence data. For artifactual and biological reasons, a gene tree may differ from a species tree, a phenomenon known as gene tree-species tree incongruence. Assuming the presence of one or more evolutionary events, e.g., gene duplication, gene loss, and lateral gene transfer (LGT), the incongruence may be explained using a reconciliation of a gene tree inside a species tree. Such a reconciliation is biologically useful, e.g., for inferring orthologous relationships between genes.

In this thesis, we present probabilistic models and methods for orthology analysis and species tree inference, while accounting for evolutionary factors such as gene duplication, gene loss, and sequence evolution. Furthermore, we use a probabilistic LGT-aware model for inferring gene trees having temporal information for duplication and LGT events.

In the first project, we present a Bayesian method, called DLRSOrthology, for estimating orthology probabilities using the DLRS model: a probabilistic model integrating gene evolution, a relaxed molecular clock for substitution rates, and sequence evolution. We devise a dynamic programming algorithm for efficiently summing orthology probabilities over all reconciliations of a gene tree inside a species tree. Furthermore, we present heuristics based on receiver operating characteristic (ROC) curves to estimate suitable thresholds for deciding orthology events. Our method, as demonstrated by synthetic and biological results, outperforms existing probabilistic approaches in accuracy and is robust to incomplete taxon sampling artifacts.
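
As an illustration of how a ROC curve can yield a decision threshold for orthology probabilities, the sketch below picks the cutoff that maximises Youden's J statistic (true positive rate minus false positive rate). The abstract does not say which ROC-based heuristic the thesis uses, so Youden's J is swapped in here as one common choice; the function name `roc_threshold` and the example values are hypothetical.

```python
def roc_threshold(probabilities, is_ortholog):
    """Return (threshold, J) maximising Youden's J = TPR - FPR.

    probabilities: estimated orthology probability for each gene pair.
    is_ortholog:   known truth for each pair (e.g. from simulated data).
    A pair is called orthologous when its probability is >= threshold.
    """
    positives = sum(1 for y in is_ortholog if y)
    negatives = len(is_ortholog) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both orthologous and non-orthologous pairs")

    best_threshold, best_j = None, float("-inf")
    # Each distinct probability is a candidate cutoff, i.e. one ROC point.
    for threshold in sorted(set(probabilities)):
        tp = sum(1 for p, y in zip(probabilities, is_ortholog) if y and p >= threshold)
        fp = sum(1 for p, y in zip(probabilities, is_ortholog) if not y and p >= threshold)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Made-up probabilities and truth labels, for illustration only.
probs = [0.95, 0.9, 0.6, 0.4, 0.2, 0.1]
truth = [True, True, True, False, False, False]
threshold, j_stat = roc_threshold(probs, truth)
```

With these made-up values the classes separate perfectly, so the selected cutoff is 0.6 with J equal to 1.0. On real data J would be lower, and the choice of criterion trades sensitivity against specificity.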

In the second project, we present a probabilistic method, based on a mixture model, for species tree inference. The method employs a two-phase approach. In the first phase, a structural expectation maximization algorithm, based on a mixture model, is used to reconstruct a maximum likelihood set of candidate species trees. In the second phase, in order to select the best species tree, each candidate species tree is evaluated using PrIME-DLRS: a method based on the DLRS model. The method is accurate, efficient, and scalable when compared to a recent probabilistic species tree inference method called PHYLDOG. We observe that, in most cases, the first phase alone may also be used for selecting the target species tree, yielding a fast and accurate method for larger datasets.

Finally, we devise a probabilistic method based on the DLTRS model, an extension of the DLRS model that includes LGT events, for sampling reconciliations of a gene tree inside a species tree. The method enables us to estimate gene trees having temporal information for duplication and LGT events. To the best of our knowledge, this is the first probabilistic method that takes gene sequence data directly into account for sampling reconciliations that contain information about LGT events. Based on the synthetic data analysis, we believe that the method has the potential to identify LGT highways.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015. vi, 65 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 12
Keyword
phylogenetics, phylogenomics, gene tree, species tree, expectation maximization, mixture model, dynamic programming, markov chain monte carlo, PrIME, JPrIME
National Category
Bioinformatics (Computational Biology); Computer Science
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-168146
ISBN: 978-91-7595-619-0
Public defence
2015-06-12, Conference room Air, SciLifeLab, Tomtebodavägen 23A, Solna, 13:00 (English)
Opponent
Supervisors
Funder
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20150529

Available from: 2015-05-29. Created: 2015-05-27. Last updated: 2015-05-29. Bibliographically approved.

Authority records BETA

Ullah, Ikram; Lagergren, Jens

Search in DiVA

By author/editor
Ullah, Ikram; Sjöstrand, Joel; Andersson, Peter; Sennblad, Bengt; Lagergren, Jens
By organisation
Computational Biology, CB; Science for Life Laboratory, SciLifeLab; SeRC - Swedish e-Science Research Centre