Taking advantage of phylogenetic trees in comparative genomics
Åkerborg, Örjan
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. ORCID iD: 0000-0002-5896-473X
2008 (English). Doctoral thesis, comprehensive summary (Other scientific)
Abstract [en]

Phylogenomics can be regarded as evolution and genomics in co-operation. Various kinds of evolutionary studies, gene family analysis among them, demand access to genome-scale datasets. It is equally clear that many genomics studies, such as assignment of gene function, are much improved by evolutionary analysis. The work leading to this thesis is a contribution to the phylogenomics field. We have used phylogenetic relationships between species in genome-scale searches for two intriguing genomic features, namely functional pseudogenes and A-to-I RNA editing. In the first case we used pairwise species comparisons, specifically human-mouse and human-chimpanzee, to infer the existence of functional mammalian pseudogenes. In the second case we profited from the rapid growth in recent years of the number of sequenced genomes, and used 17-species multiple sequence alignments. In both studies we used non-genomic data, gene expression data and synteny relations among these, to verify predictions. In the A-to-I editing project we used 454 sequencing for experimental verification.

We have further contributed a maximum a posteriori (MAP) method for fast and accurate dating analysis of speciations and other evolutionary events. This work follows the recent trend of abandoning the strict molecular clock when performing phylogenetic inference. We discretised the time interval from the leaves to the root in the tree, and used a dynamic programming (DP) algorithm to optimally factorise branch lengths into substitution rates and divergence times. We analysed two biological datasets and compared our results with recent MCMC-based methodologies. The dating point estimates that our method delivers were found to be of high quality, while the gain in speed was dramatic.
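The factorisation step described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes leaves at time 0, a user-supplied time grid, an iid log-normal rate prior (the parameters `mu` and `sigma` are hypothetical), and it omits the birth-death prior on times. The names `log_rate_density` and `factorise` are invented for this illustration. For each node and grid time, the DP stores the best score of the subtree below, where every branch length is split as rate × time span.

```python
import math

def log_rate_density(rate, mu=0.0, sigma=0.5):
    """Log density of an assumed log-normal prior on a substitution rate."""
    if rate <= 0:
        return -math.inf
    z = (math.log(rate) - mu) / sigma
    return -math.log(rate * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z

def factorise(children, lengths, root, grid):
    """Split branch lengths into rates and node times on a discrete grid.

    children: dict node -> list of child nodes (leaves map to []).
    lengths:  dict child node -> length of the branch to its parent.
    grid:     increasing list of candidate times, grid[0] being the leaf time.
    Returns (best log score, dict node -> chosen time).
    """
    best, arg = {}, {}  # best[v][k]: best score of v's subtree with v at grid[k]

    def visit(v):
        kids = children[v]
        if not kids:  # leaves are pinned to the first grid point
            best[v] = [0.0] + [-math.inf] * (len(grid) - 1)
            arg[v] = [{} for _ in grid]
            return
        for c in kids:
            visit(c)
        best[v], arg[v] = [], []
        for k in range(len(grid)):
            total, choice = 0.0, {}
            for c in kids:
                top, top_j = -math.inf, None
                for j in range(k):  # a child must be strictly younger
                    rate = lengths[c] / (grid[k] - grid[j])
                    s = best[c][j] + log_rate_density(rate)
                    if s > top:
                        top, top_j = s, j
                total += top
                choice[c] = top_j
            best[v].append(total)
            arg[v].append(choice)

    visit(root)
    k_root = max(range(len(grid)), key=lambda k: best[root][k])
    if best[root][k_root] == -math.inf:
        raise ValueError("grid too coarse for this tree")

    times = {}
    def trace(v, k):  # follow the stored choices back down the tree
        times[v] = grid[k]
        for c, j in arg[v][k].items():
            trace(c, j)
    trace(root, k_root)
    return best[root][k_root], times
```

The implied rate of each branch can be recovered afterwards as `lengths[c] / (times[parent] - times[c])`. The running time of this sketch grows with the number of nodes times the square of the grid size.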

Finally, we applied the DP strategy in a new setting. This time we used a grid laid out on a species tree instead of on an interval. Together with the speciation times, the discretisation gives a common timeframe for a gene tree and the corresponding species tree. This is the key to integrating the sequence evolution process and the gene evolution process. Out of several potential application areas we chose gene tree reconstruction. We performed a genome-wide analysis of yeast gene families and found that our methodology performs very well.

Place, publisher, year, edition, pages
Stockholm: KTH, 2008. 53 p.
Series
Trita-CSC-A, ISSN 1653-5723 ; 2008:09
Keyword [en]
Computer Science
National Category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:kth:diva-4757
ISBN: 978-91-7178-987-7 (print)
OAI: oai:DiVA.org:kth-4757
DiVA: diva2:13796
Public defence
2008-06-04, FD05, Albanova, Roslagstullsbacken 21, Stockholm, 09:30
Note
QC 20100923
Available from: 2008-05-16 Created: 2008-05-16 Last updated: 2010-09-23 Bibliographically approved
List of papers
1. Genome-wide survey for biologically functional pseudogenes
2006 (English). In: PLoS Computational Biology, ISSN 1553-734X, E-ISSN 1553-7358, Vol. 2, no 5, 358-369 p. Article in journal (Refereed) Published
Abstract [en]

According to current estimates there exist about 20,000 pseudogenes in a mammalian genome. The vast majority of these are disabled and nonfunctional copies of protein-coding genes which, therefore, evolve neutrally. Recent findings that a Makorin1 pseudogene, residing on mouse Chromosome 5, is, indeed, in vivo vital and also evolutionarily preserved, encouraged us to conduct a genome-wide survey for other functional pseudogenes in human, mouse, and chimpanzee. We identify to our knowledge the first examples of conserved pseudogenes common to human and mouse, originating from one duplication predating the human-mouse species split and having evolved as pseudogenes since the species split. Functionality is one possible way to explain the apparently contradictory properties of such pseudogene pairs, i.e., high conservation and ancient origin. The hypothesis of functionality is tested by comparing expression evidence and synteny of the candidates with proper test sets. The tests suggest potential biological function. Our candidate set includes a small set of long-lived pseudogenes whose unknown potential function is retained since before the human-mouse species split, and also a larger group of primate-specific ones found from human-chimpanzee searches. Two processed sequences are notable, their conservation since the human-mouse split being as high as most protein-coding genes; one is derived from the protein Ataxin 7-like 3 (ATX7NL3), and one from the Spinocerebellar ataxia type 1 protein (ATX1). Our approach is comparative and can be applied to any pair of species. It is implemented by a semi-automated pipeline based on cross-species BLAST comparisons and maximum-likelihood phylogeny estimations. To separate pseudogenes from protein-coding genes, we use standard methods, utilizing in-frame disablements, as well as a probabilistic filter based on Ka/Ks ratios.
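As an illustration of what screening for in-frame disablements can look like, here is a minimal sketch. It is not the pipeline from the paper: it assumes a candidate sequence already aligned to the coding sequence of its parent gene, with `-` as the gap character, and the function names `gap_runs` and `find_disablements` are invented for this example. It counts frameshifting gaps and premature stop codons read in the parent's frame.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def gap_runs(aligned_seq):
    """Return the lengths of all maximal runs of '-' in an aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]

def find_disablements(parent_aln, candidate_aln):
    """Count (frameshifts, premature_stops) in a candidate aligned to a parent CDS.

    A frameshift is a gap run, in either row, whose length is not a multiple
    of three. Stops are read from the candidate in the parent's codon frame;
    a stop aligned to the parent's final codon is not counted as premature.
    """
    if len(parent_aln) != len(candidate_aln):
        raise ValueError("aligned sequences must have equal length")

    frameshifts = sum(
        1
        for row in (parent_aln, candidate_aln)
        for run in gap_runs(row)
        if run % 3 != 0
    )

    parent_length = len(parent_aln.replace("-", ""))
    premature_stops = 0
    codon = []
    parent_pos = 0
    for p, c in zip(parent_aln, candidate_aln):
        if p == "-":
            continue  # insertion in the candidate lies outside the parent frame
        codon.append(c)
        parent_pos += 1
        if parent_pos % 3 == 0:
            if "".join(codon).upper() in STOP_CODONS and parent_pos < parent_length:
                premature_stops += 1
            codon = []
    return frameshifts, premature_stops
```

A candidate with no disablements under this check is not thereby shown to be a gene; in the abstract above this kind of test is combined with a probabilistic filter based on Ka/Ks ratios, which this sketch does not attempt.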

Keyword
ataxin, ataxin 7, chimpanzee, chromosome 5, gene expression, genome, human, medical research, mouse, nonhuman, phylogeny, pseudogene, review, synteny
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:kth:diva-8462 (URN)
10.1371/journal.pcbi.0020046 (DOI)
000239493900005 ()
2-s2.0-33646941894 (Scopus ID)
Note
QC 20100916
Available from: 2008-05-16 Created: 2008-05-16 Last updated: 2017-12-14 Bibliographically approved
2. Birth-death prior on phylogeny and speed dating
2008 (English). In: BMC Evolutionary Biology, ISSN 1471-2148, E-ISSN 1471-2148, Vol. 8, no 1, 77- p. Article in journal (Refereed) Published
Abstract [en]

Background: In recent years there has been a trend of leaving the strict molecular clock in order to infer dating of speciations and other evolutionary events. Explicit modeling of substitution rates and divergence times makes formulation of informative prior distributions for branch lengths possible. Models with birth-death priors on tree branching and auto-correlated or iid substitution rates among lineages have been proposed, enabling simultaneous inference of substitution rates and divergence times. This problem has, however, mainly been analysed in the Markov chain Monte Carlo (MCMC) framework, an approach requiring computation times of hours or days when applied to large phylogenies.

Results: We demonstrate that a hill-climbing maximum a posteriori (MAP) adaptation of the MCMC scheme results in considerable gain in computational efficiency. We demonstrate also that a novel dynamic programming (DP) algorithm for branch length factorization, useful both in the hill-climbing and in the MCMC setting, further reduces computation time. For the problem of inferring rates and times parameters on a fixed tree, we perform simulations, comparisons between hill-climbing and MCMC on a plant rbcL gene dataset, and dating analysis on an animal mtDNA dataset, showing that our methodology enables efficient, highly accurate analysis of very large trees. Datasets requiring a computation time of several days with MCMC can with our MAP algorithm be accurately analysed in less than a minute. From the results of our example analyses, we conclude that our methodology generally avoids getting trapped early in local optima. For the cases where this nevertheless can be a problem, for instance when we in addition to the parameters also infer the tree topology, we show that the problem can be evaded by using a simulated-annealing like (SAL) method in which we favour tree swaps early in the inference while biasing our focus towards rate and time parameter changes later on.

Conclusion: Our contribution leaves the field open for fast and accurate dating analysis of nucleotide sequence data. Modeling branch substitutions rates and divergence times separately allows us to include birth-death priors on the times without the assumption of a molecular clock. The methodology is easily adapted to take data from fossil records into account and it can be used together with a broad range of rate and substitution models.
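The simulated-annealing like (SAL) idea mentioned in the Results, favouring tree swaps early and rate and time parameter changes later, can be sketched generically as follows. This is an illustration under stated assumptions, not the published algorithm: the geometric decay schedule, the parameters `p_start` and `p_end`, and the callback names `propose_topology` and `propose_params` are all hypothetical, and the caller supplies the scoring function (for instance a log posterior).

```python
import random

def sal_hill_climb(state, score, propose_topology, propose_params,
                   n_iter=1000, p_start=0.9, p_end=0.05, rng=None):
    """Greedy hill climbing with a decaying share of topology proposals.

    state:            any object describing tree topology plus parameters.
    score:            function state -> float, higher is better.
    propose_topology: function (state, rng) -> new state with a tree change.
    propose_params:   function (state, rng) -> new state with changed rates/times.
    Returns (best state found, its score).
    """
    rng = rng or random.Random(0)
    best, best_score = state, score(state)
    for i in range(n_iter):
        frac = i / max(1, n_iter - 1)
        # Probability of a topology move decays geometrically from p_start to p_end.
        p_topology = p_start * (p_end / p_start) ** frac
        move = propose_topology if rng.random() < p_topology else propose_params
        candidate = move(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # accept improvements only
            best, best_score = candidate, candidate_score
    return best, best_score
```

Because only improvements are accepted, this remains a hill climber; the schedule merely changes which kind of move is tried more often at each stage of the search.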

Keyword
CHAIN MONTE-CARLO, ESTIMATING DIVERGENCE TIMES, MOLECULAR CLOCK, LIKELIHOOD APPROACH, EVOLUTIONARY TREES, MAXIMUM-LIKELIHOOD, DNA-SEQUENCES, DATES, INFERENCE, PROBABILITY
National Category
Biological Sciences
Identifiers
urn:nbn:se:kth:diva-8463 (URN)
10.1186/1471-2148-8-77 (DOI)
000254282900001 ()
2-s2.0-41149156444 (Scopus ID)
Note
QC 20100901
Available from: 2008-05-16 Created: 2008-05-16 Last updated: 2017-12-14 Bibliographically approved
3. A computational screen for site selective A-to-I editing detects novel sites in neuron specific Hu proteins
2010 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 11. Article in journal (Refereed) Published
Abstract [en]

Background: Several bioinformatic approaches have previously been used to find novel sites of ADAR mediated A-to-I RNA editing in human. These studies have discovered thousands of genes that are hyper-edited in their non-coding intronic regions, especially in alu retrotransposable elements, but very few substrates that are site-selectively edited in coding regions. Known RNA edited substrates suggest, however, that site selective A-to-I editing is particularly important for normal brain development in mammals.

Results: We have compiled a screen that enables the identification of new sites of site-selective editing, primarily in coding sequences. To avoid hyper-edited repeat regions, we applied our screen to the alu-free mouse genome. Focusing on the mouse also facilitated better experimental verification. To identify candidate sites of RNA editing, we first performed an explorative screen based on RNA structure and genomic sequence conservation. We further evaluated the results of the explorative screen by determining which transcripts were enriched for A-G mismatches between the genomic template and the expressed sequence since the editing product, inosine (I), is read as guanosine (G) by the translational machinery. For expressed sequences, we only considered coding regions to focus entirely on re-coding events. Lastly, we refined the results from the explorative screen using a novel scoring scheme based on characteristics for known A-to-I edited sites. The extent of editing in the final candidate genes was verified using total RNA from mouse brain and 454 sequencing.

Conclusions: Using this method, we identified and confirmed efficient editing at one site in the Gabra3 gene. Editing was also verified at several other novel sites within candidates predicted to be edited. Five of these sites are situated in genes coding for the neuron-specific RNA binding proteins HuB and HuD.
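The A-G mismatch step of such a screen can be illustrated with a small sketch. It assumes expressed sequences that have already been aligned to the genomic template, position by position, with `-` for gaps; the function names and the `min_support` threshold are hypothetical and are not taken from the paper. It tallies mismatch types, reports the share that are A-to-G, and lists positions where a genomic A is read as G in several expressed sequences.

```python
from collections import Counter

def mismatch_profile(genomic, expressed):
    """Count mismatch types (e.g. 'A>G') between two aligned sequences."""
    if len(genomic) != len(expressed):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for g, e in zip(genomic.upper(), expressed.upper()):
        if g != e and g in "ACGT" and e in "ACGT":  # skip gaps and ambiguity codes
            counts[g + ">" + e] += 1
    return counts

def ag_fraction(genomic, expressed_seqs):
    """Fraction of all observed mismatches that are genomic A read as G."""
    total = Counter()
    for e in expressed_seqs:
        total += mismatch_profile(genomic, e)
    n = sum(total.values())
    return total["A>G"] / n if n else 0.0

def candidate_sites(genomic, expressed_seqs, min_support=2):
    """0-based positions with genomic A and G in at least min_support sequences."""
    support = Counter()
    for e in expressed_seqs:
        if len(e) != len(genomic):
            raise ValueError("aligned sequences must have equal length")
        for i, (g, x) in enumerate(zip(genomic.upper(), e.upper())):
            if g == "A" and x == "G":
                support[i] += 1
    return sorted(i for i, n in support.items() if n >= min_support)
```

A high A-to-G share on its own cannot separate editing from SNPs or sequencing errors, which is why the abstract describes further scoring and experimental verification after this step.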

Keyword
double-stranded-rna, pre-messenger-rna, adenosine deamination, snp, database, identification, adar1, gene, sequences, mouse, information
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:kth:diva-19281 (URN)
10.1186/1471-2105-11-6 (DOI)
000275198500001 ()
2-s2.0-77649109065 (Scopus ID)
Funder
Swedish Research Council
Note
QC 20100525
Available from: 2010-08-05 Created: 2010-08-05 Last updated: 2017-12-12 Bibliographically approved
4. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis
2009 (English). In: Proceedings of the National Academy of Sciences of the United States of America, ISSN 0027-8424, E-ISSN 1091-6490, Vol. 106, 5714-5719 p. Article in journal (Refereed) Published
Abstract [en]

We present GSR, a probabilistic model integrating gene duplication, sequence evolution, and a relaxed molecular clock for substitution rates, that enables genomewide analysis of gene families. The gene duplication and loss process is a major cause for incongruence between gene and species tree, and deterministic methods have been developed to explain such differences through tree reconciliations. Although probabilistic methods for phylogenetic inference have been around for decades, probabilistic reconciliation methods are far less established. Based on our model, we have implemented a Bayesian analysis tool, PrIME-GSR, for gene tree inference that takes a known species tree into account. Our implementation is sound and we demonstrate its utility for genomewide gene-family analysis by applying it to recently presented yeast data. We validate PrIME-GSR by comparing with previous analyses of these data that take advantage of gene order information. In a case study we apply our method to the ADH gene family and are able to draw biologically relevant conclusions concerning gene duplications creating key yeast phenotypes. On a higher level this shows the biological relevance of our method. The obtained results demonstrate the value of a relaxed molecular clock. Our good performance will extend to species where gene order conservation is insufficient.
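For contrast with the probabilistic model, the deterministic reconciliation methods that the abstract refers to can be illustrated by the classic least-common-ancestor (LCA) mapping. The sketch below is not PrIME-GSR and involves no probabilities; the data layout (nested tuples for the gene tree, a parent dictionary for the species tree) and the function name `lca_reconcile` are assumptions made for this example. A gene-tree node is labelled a duplication when it maps to the same species-tree node as one of its children, and a speciation otherwise.

```python
def lca_reconcile(gene_tree, species_parent, leaf_species):
    """Label internal gene-tree nodes as duplications or speciations.

    gene_tree:      binary tree as nested 2-tuples with gene names at the leaves.
    species_parent: dict species node -> parent node (the root maps to None).
    leaf_species:   dict gene name -> species it was sampled from.
    Returns a list of (gene subtree, species node, event) in post-order.
    """
    def ancestors(s):
        path = []
        while s is not None:
            path.append(s)
            s = species_parent[s]
        return path

    def lca(a, b):
        seen = set(ancestors(a))
        for s in ancestors(b):  # first shared node walking up from b
            if s in seen:
                return s
        raise ValueError("species are not in the same tree")

    events = []

    def visit(node):
        if isinstance(node, str):
            return leaf_species[node]
        left, right = node
        m_left, m_right = visit(left), visit(right)
        m = lca(m_left, m_right)
        event = "duplication" if m in (m_left, m_right) else "speciation"
        events.append((node, m, event))
        return m

    visit(gene_tree)
    return events
```

This mapping explains incongruence with the fewest duplications but ignores branch lengths and sequence data, which is the information a model such as GSR brings in.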

National Category
Biological Sciences
Identifiers
urn:nbn:se:kth:diva-8465 (URN)
10.1073/pnas.0806251106 (DOI)
000264967500048 ()
2-s2.0-65249107239 (Scopus ID)
Note
Original title: Gene tree analysis reaching maturity
QC 20100923
Available from: 2008-05-16 Created: 2008-05-16 Last updated: 2017-12-14 Bibliographically approved

Open Access in DiVA

fulltext (728 kB), 738 downloads
File information
File name: FULLTEXT01.pdf
File size: 728 kB
Checksum MD5: fae9d13e63cc9c6961d678a0fd9ebe890809be6a8733b7a47da8af91db06bff846ffbbb1
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Åkerborg, Örjan
By organisation
Computational Biology, CB
Bioinformatics (Computational Biology)
