Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Computational problems in evolution: Multiple alignment, genome rearrangements, and tree reconstruction
KTH, School of Computer Science and Communication (CSC), Numerical Analysis and Computer Science, NADA.
2006 (English)Doctoral thesis, comprehensive summary (Other scientific)
Abstract [en]

Reconstructing the evolutionary history of a set of species is a fundamental problem in biology. This thesis concerns computational problems that arise in different settings and stages of phylogenetic tree reconstruction, but also in other contexts. The contributions include:

• A new distance-based tree reconstruction method with optimal reconstruction radius and optimal runtime complexity. Included in the result is a greatly simplified proof that the NJ algorithm also has optimal reconstruction radius. (co-author Jens Lagergren)

• NP-hardness results for the most common variations of Multiple Alignment. In particular, it is shown that SP-score, Star Alignment, and Tree Alignment, are NP hard for all metric symbol distances over all binary or larger alphabets.

• A 1.375-approximation algorithm for Sorting By Transpositions (SBT). SBT is the problem of sorting a permutation using as few block-transpositions as possible. The complexity of this problem is still open and it was a ten-year-old open problem to improve the best known 1.5-approximation ratio. The 1.375-approximation algorithm is based on a new upper bound on the diameter of 3-permutations. Moreover, a new lower bound on the transposition diameter of the symmetric group is presented and the exact transposition diameter of simple permutations is determined. (co-author Tzvika Hartman)

• Approximation, fixed-parameter tractable, and fast heuristic algorithms for two variants of the Ancestral Maximum Likelihood (AML) problem: when the phylogenetic tree is known and when it is unknown. AML is the problem of reconstructing the most likely genetic sequences of extinct ancestors along with the most likely mutation probabilities on the edges, given the phylogenetic tree and sequences at the leafs. (co-author Tamir Tuller)

• An algorithm for computing the number of mutational events between aligned DNA sequences which is several hundred times faster than the famous Phylip packages. Since pairwise distance estimation is a bottleneck in distance-based phylogeny reconstruction, the new algorithm improves the overall running time of many distancebased methods by a factor of several hundred. (co-author Jens Lagergren)

Place, publisher, year, edition, pages
Stockholm: KTH , 2006. , iv, 54 p.
Series
Trita-CSC-A, ISSN 1653-5723 ; 06/22--SE
Keyword [en]
Distance Methods, Tree reconstruction, Sorting by Transpositions, multiple alignment, ancestral sequences
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-4170ISBN: 978-91-7178-511-4 (print)OAI: oai:DiVA.org:kth-4170DiVA: diva2:11046
Public defence
2007-01-15, FA32, Albanova, Roslagstullsbacken 21, Stockholm, 10:00
Opponent
Supervisors
Note
QC 20110121Available from: 2006-11-15 Created: 2006-11-15 Last updated: 2011-01-21Bibliographically approved
List of papers
1. Settling the Intractability of Multiple Alignment
Open this publication in new window or tab >>Settling the Intractability of Multiple Alignment
2006 (English)In: Journal of Computational Biology, ISSN 1066-5277, E-ISSN 1557-8666, Vol. 13, no 7, 1323-1339 p.Article in journal (Refereed) Published
Abstract [en]

Multiple alignment is a core problem in computational biology that has received much attention over the years, both in the line of heuristics and hardness results. In most expositions of the problem it is referred to as NP-hard and references are given to one of the available hardness results. However, previous to this paper not even the most elementary variation of the problem, multiple alignment under the unit metric, had been proved hard. The aim of this paper is to settle the NP-hardness of the most common variations of multiple alignment. The following variations are shown NP-hard for all metrics over binary or larger alphabets: Multiple Alignment with SP-score, Star Alignment, and Tree Alignment ( for a given phylogeny). In addition, NP-hardness results are provided for Consensus Patterns and Substring Parsimony.

Keyword
multiple alignment, NP-hard, SP-score, star alignment, tree alignment, consensus patterns, substring parsimony
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-6352 (URN)000241414800004 ()2-s2.0-33846197539 (Scopus ID)
Note
QC 20110121Available from: 2006-11-15 Created: 2006-11-15 Last updated: 2017-12-14Bibliographically approved
2. Fast Neighbor Joining
Open this publication in new window or tab >>Fast Neighbor Joining
2009 (English)In: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 410, no 21-23, 1993-2000 p.Article in journal (Refereed) Published
Abstract [en]

 Reconstructing the evolutionary history of a set of species is a fundamental problem in biology and methods for solving this problem are gaged based on two characteristics: accuracy and efficiency. Neighbor joining (NJ) is a so-called distance-based method that, thanks to its good accuracy and speed, has been embraced by the phylogeny community. it takes the distances between n taxa and produces in Theta(n(3)) time a phylogenetic tree, i.e., a tree which aims to describe the evolutionary history of the taxa. In addition to performing well in practice, the NJ algorithm has optimal reconstruction radius.

The contribution of this paper is twofold: (1) we present an algorithm called Fast Neighbor Joining(FNJ) with optimal reconstruction radius and optimal run time complexity O(n(2)) and (2) we present a greatly simplified proof for the correctness of NJ. initial experiments show that FNJ in practice has almost the same accuracy as NJ, indicating that the property of optimal reconstruction radius has great importance to their good performance. Moreover, we show how improved running time can be achieved for Computing the so-called correction formulas.

Keyword
Neighbor joining, Phylogeny, Nearly additive, Fast neighbor joining, Pairwise string distance
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-6353 (URN)10.1016/j.tcs.2008.12.040 (DOI)000266178200005 ()2-s2.0-64249108632 (Scopus ID)
Note
QC 20100914Available from: 2006-11-15 Created: 2006-11-15 Last updated: 2017-12-14Bibliographically approved
3. A 1.375-Approximation Algorithm for Sorting by Transpositions
Open this publication in new window or tab >>A 1.375-Approximation Algorithm for Sorting by Transpositions
2006 (English)In: IEEE/ACM Transactions on Computational Biology & Bioinformatics, ISSN 1545-5963, E-ISSN 1557-9964, Vol. 3, no 4, 369-379 p.Article in journal (Refereed) Published
Abstract [en]

Sorting permutations by transpositions is an important problem in genome rearrangements. A transposition is a rearrangement operation in which a segment is cut out of the permutation and pasted in a different location. The complexity of this problem is still open and it has been a 10-year-old open problem to improve the best known 1.5-approximation algorithm. In this paper, we provide a 1.375-approximation algorithm for sorting by transpositions. The algorithm is based on a new upper bound on the diameter of 3-permutations. In addition, we present some new results regarding the transposition diameter: We improve the lower bound for the transposition diameter of the symmetric group and determine the exact transposition diameter of simple permutations.

Keyword
computational biology, genome rearrangements, sorting permutations by transpositions
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-6354 (URN)10.1109/TCBB.2006.44 (DOI)000241720700006 ()2-s2.0-33845654494 (Scopus ID)
Conference
5th International Workshop on Algorithms in Bioinformatics (WABI 2005) Mallorca, Spain, OCT 03-06, 2005
Note

QC 20110121 QC 20110927.

Available from: 2006-11-15 Created: 2006-11-15 Last updated: 2017-12-14Bibliographically approved
4. Fast Computation of Distance Estimators
Open this publication in new window or tab >>Fast Computation of Distance Estimators
2007 (English)In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 8, 89- p.Article in journal (Refereed) Published
Abstract [en]

Background: Some distance methods are among the most commonly used methods for reconstructing phylogenetic trees from sequence data. The input to a distance method is a distance matrix, containing estimated pairwise distances between all pairs of taxa. Distance methods themselves are often fast, e.g., the famous and popular Neighbor Joining (NJ) algorithm reconstructs a phylogeny of n taxa in time O(n3). Unfortunately, the fastest practical algorithms known for Computing the distance matrix, from n sequences of length l, takes time proportional to l·n2. Since the sequence length typically is much larger than the number of taxa, the distance estimation is the bottleneck in phylogeny reconstruction. This bottleneck is especially apparent in reconstruction of large phylogenies or in applications where many trees have to be reconstructed, e.g., bootstrapping and genome wide applications. Results: We give an advanced algorithm for Computing the number of mutational events between DNA sequences which is significantly faster than both Phylip and Paup. Moreover, we give a new method for estimating pairwise distances between sequences which contain ambiguity Symbols. This new method is shown to be more accurate as well as faster than earlier methods. Conclusion: Our novel algorithm for Computing distance estimators provides a valuable tool in phylogeny reconstruction. Since the running time of our distance estimation algorithm is comparable to that of most distance methods, the previous bottleneck is removed. All distance methods, such as NJ, require a distance matrix as input and, hence, our novel algorithm significantly improves the overall running time of all distance methods. In particular, we show for real world biological applications how the running time of phylogeny reconstruction using NJ is improved from a matter of hours to a matter of seconds.

Keyword
DNA, SUBSTITUTIONS, REGION
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-6355 (URN)10.1186//1471-2105-8-89 (DOI)000245188700001 ()2-s2.0-33947657892 (Scopus ID)
Note
QC 20100914. Uppdaterad från Manuskript till Artikel (20100914)Available from: 2006-11-15 Created: 2006-11-15 Last updated: 2017-12-14Bibliographically approved
5. Reconstruction of Ancestral Genomic Sequences Using Likelihood
Open this publication in new window or tab >>Reconstruction of Ancestral Genomic Sequences Using Likelihood
2007 (English)In: Journal of Computational Biology, ISSN 1066-5277, E-ISSN 1557-8666, Vol. 14, no 2, 216-237 p.Article in journal (Refereed) Published
Abstract [en]

A challenging task in computational biology is the reconstruction of genomic sequences of extinct ancestors, given the phylogenetic tree and the sequences at the leafs. This task is best solved by calculating the most likely estimate of the ancestral sequences, along with the most likely edge lengths. We deal with this problem and also the variant in which the phylogenetic tree in addition to the ancestral sequences need to be estimated. The latter problem is known to be NP-hard, while the computational complexity of the former is unknown. Currently, all algorithms for solving these problems are heuristics without performance guarantees. The biological importance of these problems calls for developing better algorithms with guarantees of finding either optimal or approximate solutions. We develop approximation, fix parameter tractable ( FPT), and fast heuristic algorithms for two variants of the problem; when the phylogenetic tree is known and when it is unknown. The approximation algorithm guarantees a solution with a log- likelihood ratio of 2 relative to the optimal solution. The FPT has a running time which is polynomial in the length of the sequences and exponential in the number of taxa. This makes it useful for calculating the optimal solution for small trees. Moreover, we combine the approximation algorithm and the FPT into an algorithm with arbitrary good approximation guarantee ( PTAS). We tested our algorithms on both synthetic and biological data. In particular, we used the FPT for computing the most likely ancestral mitochondrial genomes of hominidae ( the great apes), thereby answering an interesting biological question. Moreover, we show how the approximation algorithms find good solutions for reconstructing the ancestral genomes for a set of lentiviruses ( relatives of HIV). Supplementary material of this work is available at www.nada.kth.se/(similar to)isaac/publications/aml/aml.html.

Keyword
ancestral maximum likelihood, most parsimonious likelihood, PTAS, FPT, 2-approximation
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-6356 (URN)10.1089/cmb.2006.0101 (DOI)000245943700006 ()2-s2.0-34248184703 (Scopus ID)
Note

QC 20110121. Uppdaterad från manuskript till artikel

Available from: 2006-11-15 Created: 2006-11-15 Last updated: 2017-12-14Bibliographically approved

Open Access in DiVA

fulltext(723 kB)493 downloads
File information
File name FULLTEXT01.pdfFile size 723 kBChecksum MD5
bf6a4ee7c650b76c8d5eed53b1e5fc57d7cb1399cf7aca020af1f5b54b5d8e7ad12521e0
Type fulltextMimetype application/pdf

Search in DiVA

By author/editor
Elias, Isaac
By organisation
Numerical Analysis and Computer Science, NADA
Computer Science

Search outside of DiVA

GoogleGoogle Scholar
Total: 493 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

isbn
urn-nbn

Altmetric score

isbn
urn-nbn
Total: 567 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf