  • 1.
    Ali, Raja Hashim
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Burnin estimation and convergence assessment in Bayesian phylogenetic inference
    Manuscript (preprint) (Other academic)
    Abstract [en]

    Convergence assessment and burnin estimation are central concepts in Markov chain Monte Carlo algorithms. Studies on effects, statistical properties, and comparisons between different convergence assessment methods have been conducted during the past few decades. However, not much work has been done on the effect of convergence diagnostics on the posterior distribution of tree parameters, or on which method researchers should use in Bayesian phylogenetic inference. In this study, we propose and evaluate two novel burnin estimation methods that estimate burnin using all parameters jointly. We also consider some other popular convergence diagnostics, evaluate them in light of parallel chains, and quantify the effect of burnin estimates from various convergence diagnostics on the posterior distribution of trees. We motivate the use of convergence diagnostics to assess convergence and estimate burnin in Bayesian phylogenetic inference, and find that it is better to employ convergence diagnostics than to remove a fixed percentage as burnin. We conclude that the last burnin estimator using effective sample size appears to estimate burnin better than all other convergence diagnostics.

  • 2.
    Ali, Raja Hashim
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Bark, Mikael
    KTH, School of Information and Communication Technology (ICT).
    Miro, Jorge
    KTH, School of Information and Communication Technology (ICT).
    Muhammad, Sayyed Auwn
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Sjöstrand, Joel
    Stockholm University.
    Zubair, Syed Muhammad
    KTH, School of Electrical Engineering (EES), Communication Networks. University of Balochistan, Pakistan.
    Abbas, Raja Manzar
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    VMCMC: a graphical and statistical analysis tool for Markov chain Monte Carlo traces
    Manuscript (preprint) (Other academic)
    Abstract [en]

    Motivation: MCMC-based methods are important for Bayesian inference of phylogeny and related parameters. Although computationally expensive, MCMC yields estimates of posterior distributions that are useful for estimating parameter values and are easy to use in subsequent analysis. There are, however, sometimes practical difficulties with MCMC, relating to convergence assessment and determining burn-in, especially in large-scale analyses. Currently, multiple software packages are required to perform, e.g., convergence assessment, mixing analysis, and interactive exploration of both continuous and tree parameters.

    Results: We have written a software called VMCMC to simplify post-processing of MCMC traces with, for example, automatic burn-in estimation. VMCMC can also be used both as a GUI-based application, supporting interactive exploration, and as a command-line tool suitable for automated pipelines.

    Availability: VMCMC is available for Java SE 6+ under the New BSD License. Executable jar files, tutorial manual and source code can be downloaded from https://bitbucket.org/rhali/visualmcmc/.

  • 3.
    Ali, Raja Hashim
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Muhammad, Sayyed Auwn
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm
    2016. In: BMC Evolutionary Biology, ISSN 1471-2148, Vol. 16
    Article in journal (Other academic)
    Abstract [en]

    Background: Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. Algorithms have also been designed to utilize conservation in gene order as an indication of homologous regions. We have developed GenFamClust, a method based on quantification of both gene order conservation and sequence similarity. Results: In this study, we validate GenFamClust by comparing it to well known homology inference algorithms on a synthetic dataset. We applied several popular clustering algorithms on homologs inferred by GenFamClust and other algorithms on a metazoan dataset and studied the outcomes. Accuracy, similarity, dependence, and other characteristics were investigated for gene families yielded by the clustering algorithms. GenFamClust was also applied to genes from a set of complete fungal genomes and gene families were inferred using clustering. The resulting gene families were compared with a manually curated gold standard of pillars from the Yeast Gene Order Browser. We found that the gene-order component of GenFamClust is simple, yet biologically realistic, and captures local synteny information for homologs. Conclusions: The study shows that GenFamClust is a more accurate, informed, and comprehensive pipeline to infer homologs and gene families than other commonly used homology and gene-family inference methods.

  • 4.
    Ali, Raja Hashim
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Muhammad, Sayyed Auwn
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Khan, Mehmood Alam
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Arvestad, Lars
    Stockholms universitet.
    Quantitative synteny scoring improves homology inference and partitioning of gene families
    2013. In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, p. S12-
    Article in journal (Refereed)
    Abstract [en]

    Background: Clustering sequences into families has long been an important step in characterization of genes and proteins. There are many algorithms developed for this purpose, most of which are based on either direct similarity between gene pairs or some sort of network structure, where weights on edges of constructed graphs are based on similarity. However, conserved synteny is an important signal that can help distinguish homology and it has not been utilized to its fullest potential. Results: Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationship and merge known homologs into groups of gene families. GenFamClust identifies homologs in a more informed and accurate manner as compared to similarity based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets consisting of fully sequenced genomes of eukaryotes and synthetic data. Conclusions: The results obtained from both datasets confirm that synteny helps determine homology and GenFamClust improves on Neighborhood Correlation method. The accuracy as well as the definition of synteny scores is the most valuable contribution of GenFamClust.

  • 5.
    Angleby, Helen
    et al.
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Oskarsson, Mattias
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Pang, Junfeng
    Zhang, Ya-ping
    Leitner, Thomas
    Braham, Caitlyn
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Lundeberg, Joakim
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Webb, Kristen M.
    Savolainen, Peter
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Forensic Informativity of ~3000 bp of Coding Sequence of Domestic Dog mtDNA
    2014. In: Journal of Forensic Sciences, ISSN 0022-1198, E-ISSN 1556-4029, Vol. 59, no 4, p. 898-908
    Article in journal (Refereed)
    Abstract [en]

    The discriminatory power of the noncoding control region (CR) of domestic dog mitochondrial DNA alone is relatively low. The extent to which the discriminatory power could be increased by analyzing additional highly variable coding regions of the mitochondrial genome (mtGenome) was therefore investigated. Genetic variability across the mtGenome was evaluated by phylogenetic analysis, and the three most variable ~1 kb coding regions identified. We then sampled 100 Swedish dogs to represent breeds in accordance with their frequency in the Swedish population. A previously published dataset of 59 dog mtGenomes collected in the United States was also analyzed. Inclusion of the three coding regions increased the exclusion capacity considerably for the Swedish sample, from 0.920 for the CR alone to 0.964 for all four regions. The number of mtDNA types among all 159 dogs increased from 41 to 72, the four most frequent CR haplotypes being resolved into 22 different haplotypes.

  • 6.
    Arvestad, Lars
    KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
    Adapting to nature: some improvements on alignment algorithms in computational biology
    1997
    Licentiate thesis, monograph (Other scientific)
  • 7.
    Arvestad, Lars
    KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
    Algorithms for biological sequence alignment
    1999
    Doctoral thesis, monograph (Other scientific)
  • 8.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Numerical Analysis and Computer Science, NADA.
    Efficient methods for estimating amino acid replacement rates
    2006. In: Journal of Molecular Evolution, ISSN 0022-2844, E-ISSN 1432-1432, Vol. 62, no 6, p. 663-673
    Article in journal (Refereed)
    Abstract [en]

    Replacement rate matrices describe the process of evolution at one position in a protein and are used in many applications where proteins are studied with an evolutionary perspective. Several general matrices have been suggested and have proved to be good approximations of the real process. However, there are data for which general matrices are inappropriate, for example, special protein families, certain lineages in the tree of life, or particular parts of proteins. Analysis of such data could benefit from adaption of a data-specific rate matrix. This paper suggests two new methods for estimating replacement rate matrices from independent pairwise protein sequence alignments and also carefully studies Muller-Vingron's resolvent method. Comprehensive tests on synthetic datasets show that both new methods perform better than the resolvent method in a variety of settings. The best method is furthermore demonstrated to be robust on small datasets as well as practical on very large datasets of real data. Neither short nor divergent sequence pairs have to be discarded, making the method economical with data. A generalization to multialignment data is suggested and used in a test on protein-domain family phylogenies, where it is shown that the method offers family-specific rate matrices that often have a significantly better likelihood than a general matrix.

  • 9.
    Arvestad, Lars
    et al.
    KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
    Berglund, Ann-Charlotte
    Stockholm Bioinformatics Center, Dept. of Biochemistry, Stockholm University.
    Lagergren, Jens
    KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
    Sennblad, Bengt
    KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
    Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution
    2004. In: Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 2004, p. 326-335
    Conference paper (Refereed)
    Abstract [en]

    Gene tree and species tree reconstruction, orthology analysis, and reconciliation are important problems in multigenome-based comparative genomics and in biology in general. In the present paper, we advance the frontier of these areas in several respects and provide important computational tools. First, exact algorithms are given for several probabilistic reconciliation problems with respect to the probabilistic gene evolution model previously developed by the authors. Until now, those problems were solved by MCMC estimation algorithms. Second, we extend the gene evolution model to the gene sequence evolution model by including sequence evolution. Third, we develop MCMC algorithms for the gene sequence evolution model that, given gene sequence data, allow: (1) orthology analysis, reconciliation analysis, and gene tree reconstruction, w.r.t. a species tree, that balance a likely/unlikely reconciliation and a likely/unlikely gene tree, and (2) species tree reconstruction that balances a likely/unlikely reconciliation and likely/unlikely gene trees. These MCMC algorithms take advantage of the exact algorithms for the gene evolution model. We have successfully tested our dynamic programming algorithms on real data for a biogeography problem. The MCMC algorithms perform very well both on synthetic and biological data.

  • 10.
    Arvestad, Lars
    et al.
    KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
    Bruno, William
    Los Alamos National Laboratory.
    Estimation of Reversible Substitution Matrices from Multiple Pairs of Sequences
    1997. In: Journal of Molecular Evolution, ISSN 0022-2844, E-ISSN 1432-1432, Vol. 45, no 6, p. 696-703
    Article in journal (Refereed)
    Abstract [en]

    We present a method for estimating the most general reversible substitution matrix corresponding to a given collection of pairwise aligned DNA sequences. This matrix can then be used to calculate evolutionary distances between pairs of sequences in the collection. If only two sequences are considered, our method is equivalent to that of Lanave et al. (1984). The main novelty of our approach is in combining data from different sequence pairs. We describe a weighting method for pairs of taxa related by a known tree that results in uniform weights for all branches. Our method for estimating the rate matrix results in fast execution times, even on large data sets, and does not require knowledge of the phylogenetic relationships among sequences. In a test case on a primate pseudogene, the matrix we arrived at resembles one obtained using maximum likelihood, and the resulting distance measure is shown to have better linearity than is obtained in a less general model.

  • 11.
    Arvestad, Lars
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Lagergren, Jens
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Sennblad, Bengt
    The Gene Evolution Model and Computing Its Associated Probabilities
    2009. In: Journal of the ACM, ISSN 0004-5411, E-ISSN 1557-735X, Vol. 56, no 2
    Article in journal (Refereed)
    Abstract [en]

    Phylogeny is both a fundamental tool in biology and a rich source of fascinating modeling and algorithmic problems. Today's wealth of sequenced genomes makes it increasingly important to understand evolutionary events such as duplications, losses, transpositions, inversions, lateral transfers, and domain shuffling. We focus on the gene duplication event, which constitutes a major force in the creation of genes with new function [Ohno 1970; Lynch and Force 2000] and, thereby also, of biodiversity. We introduce the probabilistic gene evolution model, which describes how a gene tree evolves within a given species tree with respect to speciation, gene duplication, and gene loss. The actual relation between gene tree and species tree is captured by a reconciliation, a concept which we generalize for more expressiveness. The model is a canonical generalization of the classical linear birth-death process, obtained by replacing the interval where the process takes place by a tree. For the gene evolution model, we derive efficient algorithms for some associated probability distributions: the probability of a reconciled tree, the probability of a gene tree, the maximum probability reconciliation, the posterior probability of a reconciliation, and sampling reconciliations with respect to the posterior probability. These algorithms provide the basis for several applications, including species tree construction, reconciliation analysis, orthology analysis, biogeography, and host-parasite co-evolution.

  • 12.
    Arvestad, Lars
    et al.
    KTH, School of Computer Science and Communication (CSC), Numerical Analysis and Computer Science, NADA.
    Visa, N.
    Lundeberg, Joakim
    KTH, School of Biotechnology (BIO), Gene Technology.
    Wieslander, L.
    Savolainen, Peter
    KTH, School of Biotechnology (BIO), Gene Technology.
    Expressed sequence tags from the midgut and an epithelial cell line of Chironomus tentans: annotation, bioinformatic classification of unknown transcripts and analysis of expression levels
    2005. In: Insect molecular biology (Print), ISSN 0962-1075, E-ISSN 1365-2583, Vol. 14, no 6, p. 689-695
    Article in journal (Refereed)
    Abstract [en]

    Expressed sequence tags (ESTs) were generated from two Chironomus tentans cDNA libraries, constructed from an embryo epithelial cell line and from larva midgut tissue. 8584 5'-end ESTs were generated and assembled into 3110 tentative unique transcripts, providing the largest contribution of C. tentans sequences to public databases to date. Annotation using BLAST gave 1975 (63.5%) transcripts with a significant match in the major gene/protein databases, 1170 with a best match to Anopheles gambiae and 480 to Drosophila melanogaster. 1091 transcripts (35.1%) had no match to any database. Studies of open reading frames suggest that at least 323 of these contain a coding sequence, indicating that a large proportion of the genes in C. tentans belong to previously unknown gene families.

  • 13.
    Djerbi, Soraya
    et al.
    KTH, School of Biotechnology (BIO).
    Lindskog, Mats
    KTH, School of Biotechnology (BIO).
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Numerical Analysis and Computer Science, NADA.
    Sterky, Fredrik
    KTH, School of Biotechnology (BIO).
    Teeri, Tuula
    KTH, School of Biotechnology (BIO).
    The genome sequence of black cottonwood (Populus trichocarpa) reveals 18 conserved cellulose synthase (CesA) genes
    2005. In: Planta, ISSN 0032-0935, E-ISSN 1432-2048, Vol. 221, no 5, p. 739-746
    Article in journal (Refereed)
    Abstract [en]

    The genome sequence of Populus trichocarpa was screened for genes encoding cellulose synthases by using full-length cDNA sequences and ESTs previously identified in the tissue specific cDNA libraries of other poplars. The data obtained revealed 18 distinct CesA gene sequences in P. trichocarpa. The identified genes were grouped in seven gene pairs, one group of three sequences and one single gene. Evidence from gene expression studies of hybrid aspen suggests that both copies of at least one pair, CesA3-1 and CesA3-2, are actively transcribed. No sequences corresponding to the gene pair, CesA6-1 and CesA6-2, were found in Arabidopsis or hybrid aspen, while one homologous gene has been identified in the rice genome and an active transcript in Populus tremuloides. A phylogenetic analysis suggests that the CesA genes previously associated with secondary cell wall synthesis originate from a single ancestor gene and group in three distinct subgroups. The newly identified copies of CesA genes in P. trichocarpa give rise to a number of new questions concerning the mechanism of cellulose synthesis in trees.

  • 14.
    Emanuelsson, Olof
    et al.
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Arvestad, Lars
    KTH, Centres, Science for Life Laboratory, SciLifeLab. Stockholm University.
    Käll, Lukas
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Engagera och aktivera studenter med inspiration från konferenser: examination genom poster-presentation
    2014. In: Proceedings 2014, 8:e Pedagogiska inspirationskonferensen 17 december 2014 / [ed] Roy Andersson, Lund, 2014
    Conference paper (Refereed)
    Abstract [sv] (English translation)

    In a research-oriented 7.5-credit (hp) master's-level course in bioinformatics at KTH, just over half of the course consists of a project carried out in groups of three students. Each project has its own assignment, with no or only marginal overlap with the other groups' assignments. The projects are almost exclusively based on current research questions in the teaching team's own research groups or their vicinity. The project is reported partly through a poster presentation and partly through an individual web-based project diary. All course participants attend the poster session, which lasts three hours at the end of the examination period. As far as possible, we try to mimic the situation in which an authentic research result is presented at a real conference. Each participant (student) is thus expected to study every other group's poster, in the same way as at most scientific conferences. We carry out a simple peer assessment at the poster level, in which each student gives a short, confidential comment on each of the other posters. The course teachers naturally assess the posters as well. One of the difficulties is assigning individual grades. Here we use the individual project diaries, which give guidance on each individual's contribution to the project. We have tried this over four course rounds, with at most seven projects. The form of examination is fun and motivating for both students and teachers.

  • 15.
    Frygelius, Jessica
    et al.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Wedell, Anna
    Tohonen, Virpi
    Evolution and human tissue expression of the Cres/Testatin subgroup genes, a reproductive tissue specific subgroup of the type 2 cystatins
    2010. In: Evolution &amp; Development, ISSN 1520-541X, E-ISSN 1525-142X, Vol. 12, no 3, p. 329-342
    Article in journal (Refereed)
    Abstract [en]

    The cystatin family comprises a group of generally broadly expressed protease inhibitors. The Cres/Testatin subgroup (CTES) genes within the type 2 cystatins differ from the classical type 2 cystatins in having a strikingly reproductive tissue-specific expression, and putative functions in reproduction have therefore been discussed. We have performed evolutionary studies of the CTES genes based on gene searches in genomes from 11 species. Ancestors of the cystatin family can be traced back to plants. We have localized the evolutionary origin of the CTES genes to the split of marsupial and placental mammals. A model for the evolution of these genes illustrates that they constitute a dynamic group of genes, which has undergone several gene expansions, and we find indications of a high degree of positive selection, in striking contrast to what is seen for the classical cystatin C. We show with phylogenetic relations that the CTES genes are clustered into three original groups: a testatin, a Cres, and a CstL1 group. We have further characterized the expression patterns of all human members of the subfamily. Of a total of nine identified human genes, four express putative functional transcripts with a predominant expression in the male reproductive system. Our results are compatible with a function of this gene family in reproduction.

  • 16.
    Fugelstad, Johanna
    et al.
    KTH, School of Biotechnology (BIO), Glycoscience.
    Bouzenzana, Jamel
    Djerbi, Soraya
    Guerriero, Gea
    KTH, School of Biotechnology (BIO), Glycoscience.
    Ezcurra, Inés
    KTH, School of Biotechnology (BIO), Glycoscience.
    Teeri, Tuula T.
    KTH, School of Biotechnology (BIO), Glycoscience.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Bulone, Vincent
    KTH, School of Biotechnology (BIO), Glycoscience.
    Identification of the cellulose synthase genes from the Oomycete Saprolegnia monoica and effect of cellulose synthesis inhibitors on gene expression and enzyme activity
    2009. In: Fungal Genetics and Biology, ISSN 1087-1845, E-ISSN 1096-0937, Vol. 46, no 10, p. 759-767
    Article in journal (Refereed)
    Abstract [en]

    Cellulose biosynthesis is a vital but yet poorly understood biochemical process in Oomycetes. Here, we report the identification and characterization of the cellulose synthase genes (CesA) from Saprolegnia monoica. Southern blot experiments revealed the occurrence of three CesA homologues in this species and phylogenetic analyses confirmed that Oomycete CesAs form a clade of their own. All gene products contained the D,D,D,QXXRW signature of most processive glycosyltransferases, including cellulose synthases. However, their N-terminal ends exhibited Oomycete-specific domains, i.e. Pleckstrin Homology domains, or conserved domains of an unknown function together with additional putative transmembrane domains. Mycelial growth was inhibited in the presence of the cellulose biosynthesis inhibitors 2,6-dichlorobenzonitrile or Congo Red. This inhibition was accompanied by a higher expression of all CesA genes in the mycelium and increased in vitro glucan synthase activities. Altogether, our data strongly suggest a direct involvement of the identified CesA genes in cellulose biosynthesis.

  • 17.
    Gustavsson, Martin
    et al.
    KTH, School of Biotechnology (BIO), Industrial Biotechnology.
    Jarmander, Johan
    KTH, School of Biotechnology (BIO), Industrial Biotechnology.
    Arvestad, Lars
    KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Larsson, Gen
    KTH, School of Biotechnology (BIO), Industrial Biotechnology.
    Extended signal peptides in autotransporters are associated with large passenger proteins
    Manuscript (preprint) (Other academic)
  • 18.
    Hollich, V.
    et al.
    Milchert, L.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Numerical Analysis and Computer Science, NADA.
    Sonnhammer, E. L. L.
    Assessment of protein distance measures and tree-building methods for phylogenetic tree reconstruction
    2005. In: Molecular biology and evolution, ISSN 0737-4038, E-ISSN 1537-1719, Vol. 22, no 11, p. 2257-2264
    Article in journal (Refereed)
    Abstract [en]

    Distance-based methods are popular for reconstructing evolutionary trees of protein sequences, mainly because of their speed and generality. A number of variants of the classical neighbor-joining (NJ) algorithm have been proposed, as well as a number of methods to estimate protein distances. We here present a large-scale assessment of performance in reconstructing the correct tree topology for the most popular algorithms. The programs BIONJ, FastME, Weighbor, and standard NJ were run using 12 distance estimators, producing 48 tree-building/distance estimation method combinations. These were evaluated on a test set based on real trees taken from 100 Pfam families. Each tree was used to generate multiple sequence alignments with the ROSE program using three evolutionary models. The accuracy of each method was analyzed as a function of both sequence divergence and location in the tree. We found that BIONJ produced the overall best results, although the average accuracy differed little between the tree-building methods (normally less than 1%). A noticeable trend was that FastME performed poorer than the rest on long branches. Weighbor was several orders of magnitude slower than the other programs. Larger differences were observed when using different distance estimators. Protein-adapted Jukes-Cantor and Kimura distance correction produced clearly poorer results than the other methods, even worse than uncorrected distances. We also assessed the recently developed Scoredist measure, which performed equally well as more complex methods.

  • 19.
    Kahles, André
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Sarqume, Fahad
    KTH, School of Biotechnology (BIO).
    Savolainen, Peter
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab. Stockholms universitet.
    Excap: maximization of haplotypic diversity of linked markers
    2013. In: PLoS ONE, ISSN 1932-6203, E-ISSN 1932-6203, Vol. 8, no 11, p. e79012-
    Article in journal (Refereed)
    Abstract [en]

    Genetic markers, defined as variable regions of DNA, can be utilized for distinguishing individuals or populations. As long as markers are independent, it is easy to combine the information they provide. For nonrecombinant sequences like mtDNA, choosing the right set of markers for forensic applications can be difficult and requires careful consideration. In particular, one wants to maximize the utility of the markers. Until now, this has mainly been done by hand. We propose an algorithm that finds the most informative subset of a set of markers. The algorithm uses a depth first search combined with a branch-and-bound approach. Since the worst case complexity is exponential, we also propose some data-reduction techniques and a heuristic. We implemented the algorithm and applied it to two forensic caseworks using mitochondrial DNA, which resulted in marker sets with significantly improved haplotypic diversity compared to previous suggestions. Additionally, we evaluated the quality of the estimation with an artificial dataset of mtDNA. The heuristic is shown to provide extensive speedup at little cost in accuracy.

  • 20.
    Khan, Mehmood Alam
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Phylogenetic Partitioning of Gene Families. Manuscript (preprint) (Other academic)
    Abstract [en]

    Clustering and organizing molecular sequences is one of the central tasks in Bioinformatics. It is a common first step in, for example, phylogenomic analysis. For some tasks, a large gene family needs to be partitioned into more manageable subfamilies. In particular, Bayesian phylogenetic analysis can be very expensive. There is a need for easy and natural means of breaking up a gene family, with moderate computational requirements, to enable careful analysis of subfamilies with computationally expensive tools. We devised and implemented a method that infers gene trees, reconciles them with species trees, and identifies putative orthogroups as subfamilies. To achieve reasonable speed, approximate ML phylogenies are inferred using the FastTree method and combined with a subfamily-centered bootstrapping procedure to ensure robustness. Using the new method, very large clusters of sequences are now easier to manage in pipelines containing computationally expensive steps. The implementation of PhyloGenClust is available at a public repository, https://github.com/malagori/PhyloGenClust, under the GNU General Public License version 3.

  • 21.
    Khan, Mehmood Alam
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Elias, Isaac
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Sjölund, Erik
    Stockholms universitet.
    Nylander, Kristina
    KTH, School of Computer Science and Communication (CSC).
    Guimera, Roman Valls
    Stockholms universitet.
    Schobesberger, Richard
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. University of Applied Sciences Upper Austria.
    Schmitzberger, Peter
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. University of Applied Sciences Upper Austria.
    Lagergren, Jens
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    fastphylo: Fast tools for phylogenetics. 2013. In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, no 1, p. 334. Article in journal (Refereed)
    Abstract [en]

    Background: Distance methods are ubiquitous tools in phylogenetics. Their primary purpose may be to reconstruct evolutionary history, but they are also used as components in bioinformatic pipelines. However, poor computational efficiency has been a constraint on the applicability of distance methods on very large problem instances. Results: We present fastphylo, a software package containing implementations of efficient algorithms for two common problems in phylogenetics: estimating DNA/protein sequence distances and reconstructing a phylogeny from a distance matrix. We compare fastphylo with other neighbor joining based methods and report the results in terms of speed and memory efficiency. Conclusions: Fastphylo is a fast, memory efficient, and easy to use software suite. Due to its modular architecture, fastphylo is a flexible tool for many phylogenetic studies.

  • 22.
    Khan, Mehmood Alam
    et al.
    KTH, School of Computer Science and Communication (CSC). KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Mahmudi, Owais
    KTH, School of Computer Science and Communication (CSC). KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Ullah, Ikram
    KTH, School of Computer Science and Communication (CSC). KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Arvestad, Lars
    KTH, Centres, Science for Life Laboratory, SciLifeLab. KTH, Centres, SeRC - Swedish e-Science Research Centre. Stockholm Univ, Sweden.
    Lagergren, Jens
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Probabilistic inference of lateral gene transfer events. 2016. In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 17, article id 431. Article in journal (Refereed)
    Abstract [en]

    Background: Lateral gene transfer (LGT) is an evolutionary process that has an important role in biology. It challenges the traditional binary tree-like evolution of species and is attracting increasing attention from molecular biologists due to its involvement in antibiotic resistance. A number of attempts have been made to model LGT in the presence of gene duplication and loss, but reliably placing LGT events in the species tree has remained a challenge. Results: In this paper, we propose a probabilistic method that samples reconciliations of the gene tree with a dated species tree and computes maximum a posteriori probabilities. The MCMC-based method uses the probabilistic model DLTRS, which integrates LGT, gene duplication, gene loss, and sequence evolution under a relaxed molecular clock for substitution rates. We can estimate posterior distributions on gene trees and, in contrast to previous work, the actual placement of potential LGT, which can be used to, e.g., identify "highways" of LGT. Conclusions: Based on a simulation study, we conclude that the method is able to infer the true LGT events on the gene tree and reconcile it to the correct edges on the species tree in most cases. Applied to two biological datasets, containing gene families from Cyanobacteria and Mollicutes, we find potential LGT highways that corroborate other studies as well as previously undetected examples.

  • 23.
    Nystedt, Björn
    et al.
    Stockholm University.
    Vezzi, Francesco
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Alekseenko, Andrey
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Sahlin, Kristoffer
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Hällman, Jimmie
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Käller, Max
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Rilakovic, Nemanja
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Lundeberg, Joakim
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    et al.
    The Norway spruce genome sequence and conifer genome evolution. 2013. In: Nature, ISSN 0028-0836, E-ISSN 1476-4687, Vol. 497, no 7451, p. 579-584. Article in journal (Refereed)
    Abstract [en]

    Conifers have dominated forests for more than 200 million years and are of huge ecological and economic importance. Here we present the draft assembly of the 20-gigabase genome of Norway spruce (Picea abies), the first available for any gymnosperm. The number of well-supported genes (28,354) is similar to the >100 times smaller genome of Arabidopsis thaliana, and there is no evidence of a recent whole-genome duplication in the gymnosperm lineage. Instead, the large genome size seems to result from the slow and steady accumulation of a diverse set of long-terminal repeat transposable elements, possibly owing to the lack of an efficient elimination mechanism. Comparative sequencing of Pinus sylvestris, Abies sibirica, Juniperus communis, Taxus baccata and Gnetum gnemon reveals that the transposable element diversity is shared among extant conifers. Expression of 24-nucleotide small RNAs, previously implicated in transposable element silencing, is tissue-specific and much lower than in other plants. We further identify numerous long (>10,000 base pairs) introns, gene-like fragments, uncharacterized long non-coding RNAs and short RNAs. This opens up new genomic avenues for conifer forestry and breeding.

  • 24.
    Rajangam, Alex S.
    et al.
    KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    Kumar, Manoj
    Aspeborg, Henrik
    KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    Guerriero, Gea
    KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Pansri, Podjamas
    KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    Brown, Christian J. L.
    KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    Hober, Sophia
    KTH, School of Biotechnology (BIO), Proteomics (closed 20130101).
    Blomqvist, Kristina
    KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    Divne, Christina
    KTH, School of Biotechnology (BIO), Glycoscience. KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    Ezcurra, Inés
    KTH, School of Biotechnology (BIO), Glycoscience. KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    Mellerowicz, Ewa
    Sundberg, Bjorn
    Bulone, Vincent
    KTH, School of Biotechnology (BIO), Glycoscience. KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    Teeri, Tuula T.
    KTH, School of Biotechnology (BIO), Glycoscience. KTH, School of Biotechnology (BIO), Centres, Swedish Center for Biomimetic Fiber Engineering, BioMime.
    MAP20, a Microtubule-Associated Protein in the Secondary Cell Walls of Hybrid Aspen, Is a Target of the Cellulose Synthesis Inhibitor 2,6-Dichlorobenzonitrile. 2008. In: Plant Physiology, ISSN 0032-0889, E-ISSN 1532-2548, Vol. 148, no 3, p. 1283-1294. Article in journal (Refereed)
    Abstract [en]

    We have identified a gene, denoted PttMAP20, which is strongly up-regulated during secondary cell wall synthesis and tightly coregulated with the secondary wall-associated CESA genes in hybrid aspen (Populus tremula x tremuloides). Immunolocalization studies with affinity-purified antibodies specific for PttMAP20 revealed that the protein is found in all cell types in developing xylem and that it is most abundant in cells forming secondary cell walls. This PttMAP20 protein sequence contains a highly conserved TPX2 domain first identified in a microtubule-associated protein (MAP) in Xenopus laevis. Overexpression of PttMAP20 in Arabidopsis (Arabidopsis thaliana) leads to helical twisting of epidermal cells, frequently associated with MAPs. In addition, a PttMAP20-yellow fluorescent protein fusion protein expressed in tobacco (Nicotiana tabacum) leaves localizes to microtubules in leaf epidermal pavement cells. Recombinant PttMAP20 expressed in Escherichia coli also binds specifically to in vitro-assembled, taxol-stabilized bovine microtubules. Finally, the herbicide 2,6-dichlorobenzonitrile, which inhibits cellulose synthesis in plants, was found to bind specifically to PttMAP20. Together with the known function of cortical microtubules in orienting cellulose microfibrils, these observations suggest that PttMAP20 has a role in cellulose biosynthesis.

  • 25.
    Rajangam, Alex
    et al.
    KTH, School of Biotechnology (BIO).
    Yang, Hongqian
    KTH, School of Computer Science and Communication (CSC).
    Teeri, Tuula
    KTH, School of Biotechnology (BIO).
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC).
    Evolution of a domain conserved in microtubule-associated proteins of eukaryotes. 2008. In: Advances and Applications in Bioinformatics and Chemistry, ISSN 1178-6949, Vol. 1, no 1, p. 51-69. Article in journal (Refereed)
    Abstract [en]

    The microtubule network, the major organelle of the eukaryotic cytoskeleton, is involved in cell division and differentiation but also with many other cellular functions. In plants, microtubules seem to be involved in the ordered deposition of cellulose microfibrils by a so far unknown mechanism. Microtubule-associated proteins (MAP) typically contain various domains targeting or binding proteins with different functions to microtubules. Here we have investigated a proposed microtubule-targeting domain, TPX2, first identified in the Kinesin-like protein 2 in Xenopus. A TPX2 containing microtubule binding protein, PttMAP20, has been recently identified in poplar tissues undergoing xylogenesis. Furthermore, the herbicide 2,6-dichlorobenzonitrile (DCB), which is a known inhibitor of cellulose synthesis, was shown to bind specifically to PttMAP20. It is thus possible that PttMAP20 may have a role in coupling cellulose biosynthesis and the microtubular networks in poplar secondary cell walls. In order to get more insight into the occurrence, evolution and potential functions of TPX2-containing proteins we have carried out bioinformatic analysis for all genes so far found to encode TPX2 domains with special reference to poplar PttMAP20 and its putative orthologs in other plants.

  • 26.
    Roth, Christian
    et al.
    Rastogi, Shruti
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Dittmar, Katharina
    Light, Sara
    Ekman, Diana
    Liberles, David A.
    Evolution after gene duplication: Models, mechanisms, sequences, systems, and organisms. 2007. In: Journal of Experimental Zoology Part B-Molecular and Developmental Evolution, ISSN 1552-5007, Vol. 308B, no 1, p. 58-73. Article, review/survey (Refereed)
    Abstract [en]

    Gene duplication is postulated to have played a major role in the evolution of biological novelty. Here, gene duplication is examined across levels of biological organization in an attempt to create a unified picture of the mechanistic process by which gene duplication can have played a role in generating biodiversity. Neofunctionalization and subfunctionalization have been proposed as important processes driving the retention of duplicate genes. These models have foundations in population genetic theory, which is now being refined by explicit consideration of the structural constraints placed upon genes encoding proteins through physical chemistry. Further, such models can be examined in the context of comparative genomics, where an integration of gene-level evolution and species-level evolution allows an assessment of the frequency of duplication and the fate of duplicate genes. This process, of course, is dependent upon the biochemical role that duplicated genes play in biological systems, which is in turn dependent upon the mechanism of duplication: whole genome duplication involving a co-duplication of interacting partners vs. single gene duplication. Lastly, the role that these processes may have played in driving speciation is examined.

  • 27.
    Sahlin, Kristoffer
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Chikhi, Rayan
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Genome scaffolding with PE-contaminated mate-pair libraries. 2015. Manuscript (preprint) (Other academic)
  • 28.
    Sahlin, Kristoffer
    et al.
    KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Frånberg, M.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). KTH, Centres, SeRC - Swedish e-Science Research Centre. Stockholm University, Sweden.
    Structural Variation Detection with Read Pair Information: An Improved Null Hypothesis Reduces Bias. 2017. In: Journal of Computational Biology, ISSN 1066-5277, E-ISSN 1557-8666, Vol. 24, no 6, p. 581-589. Article in journal (Refereed)
    Abstract [en]

    Reads from paired-end and mate-pair libraries are often utilized to find structural variation in genomes, and one common approach is to use their fragment length for detection. After aligning read pairs to the reference, read pair distances are analyzed for statistically significant deviations. However, previously proposed methods are based on a simplified model of observed fragment lengths that does not agree with data. We show how this model limits statistical analysis of identifying variants and propose a new model by adapting a model we have previously introduced for contig scaffolding, which agrees with data. From this model, we derive an improved null hypothesis that when applied in the variant caller CLEVER, reduces the number of false positives and corrects a bias that contributes to more deletion calls than insertion calls. We advise developers of variant callers with statistical fragment length-based methods to adapt the concepts in our proposed model and null hypothesis.

  • 29.
    Sahlin, Kristoffer
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Frånberg, Mattias
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Correcting bias from stochastic insert size in read pair data—applications to structural variation detection and genome assembly. 2015. Manuscript (preprint) (Other academic)
  • 30.
    Sahlin, Kristoffer
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Street, Nathaniel
    Lundeberg, Joakim
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Improved gap size estimation for scaffolding algorithms. 2012. In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 28, no 17, p. 2215-2222. Article in journal (Refereed)
    Abstract [en]

    Motivation: One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read-pairs. Scaffolding provides estimates about the order, relative orientation and distance between contigs. We have found that contig distance estimates are generally strongly biased and based on false assumptions. Since erroneous distance estimates can mislead in subsequent analysis, it is important to provide unbiased estimation of contig distance. Results: In this article, we show that state-of-the-art programs for scaffolding are using an incorrect model of gap size estimation. We discuss why current maximum likelihood estimators are biased and describe what different cases of bias we are facing. Furthermore, we provide a model for the distribution of reads that span a gap and derive the maximum likelihood equation for the gap length. We motivate why this estimate is sound and show empirically that it outperforms gap estimators in popular scaffolding programs. Our results have consequences for scaffolding software, for structural variation detection, and for library insert-size estimation as is commonly performed by read aligners.

  • 31.
    Savolainen, Peter
    et al.
    KTH, Superseded Departments, Biotechnology.
    Arvestad, Lars
    KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
    Lundeberg, Joakim
    KTH, Superseded Departments, Biotechnology.
    A novel method for forensic DNA investigations: Repeat-type sequence analysis of tandemly repeated mtDNA in domestic dogs. 2000. In: Journal of Forensic Sciences, ISSN 0022-1198, E-ISSN 1556-4029, Vol. 45, no 5, p. 990-999. Article in journal (Refereed)
    Abstract [en]

    A highly variable and heteroplasmic tandem repeat region situated in the mitochondrial (mt) DNA control region (CR) in domestic dogs and wolves was studied to evaluate its suitability as a forensic genetic marker for analysis of single hairs. The tandem repeat array is composed of three 10-bp repeat types that are distributed so that a secondary DNA sequence is formed. Thus, the region presents two levels of variation: variation in the number of repeats and variation in the secondary DNA sequence of repeat types. Two analysis methods were therefore tested: fragment length analysis and analysis of the sequence of repeat types. Fragment analysis produced unique profiles that could be used to discriminate between blood samples from maternally closely related individuals. However, different hairs from one individual did not have the same fragment profile, and the method is, therefore, not suitable for analysis of single hairs. In contrast, analysis of the repeat type sequences (array types) is highly informative. When different hairs from one individual were studied, identical array types were found. The repeat-type sequence variation was studied among individuals having identical nonrepetitive CR mtDNA sequence variants. Seven, six, and two individuals, representing three different sequence variants, respectively, were analyzed. All these individuals had different array types, which implies a very high genetic variation between individuals in this region. The analysis method considerably improves the exclusion capacity of mtDNA analysis of domestic dogs compared with sequence analysis of non-repetitive DNA.

  • 32.
    Savolainen, Peter
    et al.
    KTH, Superseded Departments, Biotechnology.
    Arvestad, Lars
    KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
    Lundeberg, Joakim
    KTH, Superseded Departments, Biotechnology.
    mtDNA tandem repeats in domestic dogs and wolves: Mutation mechanism studied by analysis of the sequence of imperfect repeats. 2000. In: Molecular biology and evolution, ISSN 0737-4038, E-ISSN 1537-1719, Vol. 17, no 4, p. 474-488. Article in journal (Refereed)
    Abstract [en]

    The mitochondrial (mt) DNA control region (CR) of dogs and wolves contains an array of imperfect 10 bp tandem repeats. This region was studied for 14 domestic dogs representing the four major phylogenetic groups of nonrepetitive CR and for 5 wolves. Three repeat types were found among these individuals, distributed so that different sequences of the repeat types were formed in different molecules. This enabled a detailed study of the arrays and of the mutation events that they undergo. Extensive heteroplasmy was observed in all individuals; 85 different array types were found in one individual, and the total number of types was estimated at 384. Among unrelated individuals, no identical molecules were found, indicating a high rate of evolution of the region. By performing a pedigree analysis, array types which had been inherited from mother to offspring and array types which were the result of somatic mutations, respectively, could be identified, showing that about 20% of the molecules within an individual had somatic mutations. By direct pairwise comparison of the mutated and the original array types, the physiognomy of the inserted or deleted elements (indels) and the approximate positions of the mutations could be determined. All mutations could be explained by replication slippage or point mutations. The majority of the indels were 1-5 repeats long, but deletions of up to 17 repeats were found. Mutations were found in all parts of the arrays, but at a higher frequency in the 5' end. Furthermore, the inherited array types within the mother-offspring pair were aligned and compared so that germ line mutations could be studied. The pattern of the germ line mutations was approximately the same as that of the somatic mutations.

  • 33.
    Savolainen, Peter
    et al.
    KTH, School of Biotechnology (BIO), Gene Technology.
    Fitzsimmons, C.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Numerical Analysis and Computer Science, NADA.
    Andersson, L.
    Lundeberg, Joakim
    KTH, School of Biotechnology (BIO), Gene Technology.
    ESTs from brain and testis of White Leghorn and red junglefowl: annotation, bioinformatic classification of unknown transcripts and analysis of expression levels. 2005. In: Cytogenetic and Genome Research, ISSN 1424-8581, E-ISSN 1424-859X, Vol. 111, no 1, p. 79-87. Article in journal (Refereed)
    Abstract [en]

    We report the generation, assembly and annotation of expressed sequence tags (ESTs) from four chicken cDNA libraries, constructed from brain and testis tissue dissected from red junglefowl and White Leghorn. 21,285 5'-end ESTs were generated and assembled into 2,813 contigs and 9,737 singletons, giving 12,549 tentative unique transcripts. The transcripts were annotated using BLAST by matching to known chicken genes or to putative homologues in other species using the major gene/protein databases. The results for these similarity searches are available on www.sbc.su.se/ -arve/chicken. 4,129 (32.9%) of the transcripts remained without a significant match to gene/protein databases, a proportion of unmatched transcripts similar to earlier non-mammalian EST studies. To estimate how many of these transcripts may represent novel genes, they were studied for the presence of coding sequence. It was shown that most of the unique chicken transcripts do not contain coding parts of genes, but it was estimated that at least 400 of the transcripts contain coding sequence, indicating that 3.2% of avian genes belong to previously unknown gene families. Further BLAST search against dbEST left 1,649 (13.1%) of the transcripts unmatched to any library. The number of completely unmatched transcripts containing coding sequence was estimated at 180, giving a measure of the number of putative novel chicken genes identified in this study. 84.3% of the identified transcripts were found only in testis tissue, which has been poorly studied in earlier chicken EST studies. Large differences in expression levels were found between the brain and testis libraries for a large number of transcripts, and among the 525 most frequently represented transcripts, there were at least 20 transcripts with significant difference in expression levels between red junglefowl and White Leghorn.

  • 34.
    Sennblad, Bengt
    et al.
    Schreil, Eva
    Berglund Sonnhammer, Ann-Charlotte
    Lagergren, Jens
    KTH, School of Computer Science and Communication (CSC).
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC).
    primetv: a viewer for reconciled trees. 2007. In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 8. Article in journal (Refereed)
    Abstract [en]

    Background: Evolutionary processes, such as gene family evolution or parasite-host cospeciation, can often be viewed as a tree evolving inside another tree. Relating two given trees under such a constraint is known as reconciling them. Adequate software tools for generating illustrations of tree reconciliations are instrumental for presenting and communicating results and ideas regarding these phenomena. Available visualization tools have been limited to illustrations of the most parsimonious reconciliation. However, there exists a plethora of biologically relevant non-parsimonious reconciliations. Illustrations of these general reconciliations may not be achieved without manual editing. Results: We have developed a new reconciliation viewer, primetv. It is a simple and compact visualization program that is the first automatic tool for illustrating general tree reconciliations. It reads reconciled trees in an extended Newick format and outputs them as tree-within-tree illustrations in a range of graphic formats. Output attributes, such as colors and layout, can easily be adjusted by the user. To enhance the construction of input to primetv, two helper programs, readReconciliation and reconcile, accompany primetv. Detailed examples of all programs' usage are provided in the text. For the casual user a web-service provides a simple user interface to all programs. Conclusion: With primetv, the first visualization tool for general reconciliations, illustrations of trees-within-trees are easy to produce. Because it clarifies and accentuates an underlying structure in a reconciled tree, e.g., the impact of a species tree on a gene-family phylogeny, it will enhance scientific presentations as well as pedagogic illustrations in an educational setting. primetv is available at http://prime.sbc.su.se/primetv, both as a standalone command-line tool and as a web service. The software is distributed under the GNU General Public License.

  • 35.
    Sjöstrand, Joel
    et al.
    Department of Numerical Analysis and Computer Science, Stockholm University, Stockholm, Sweden.
    Arvestad, Lars
    Department of Numerical Analysis and Computer Science, Stockholm University, Stockholm, Sweden.
    Lagergren, Jens
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Sennblad, Bengt
    Department of Medicine, Karolinska Institutet, Atherosclerosis Research Unit, Stockholm, Sweden.
    GenPhyloData: realistic simulation of gene family evolution. 2013. In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, no 1, p. 209. Article in journal (Refereed)
    Abstract [en]

    Background: PrIME-GenPhyloData is a suite of tools for creating realistic simulated phylogenetic trees, in particular for families of homologous genes. It supports generation of trees based on a birth-death process and, perhaps more interestingly, also supports generation of gene family trees guided by a known (synthetic or biological) species tree while accounting for events such as gene duplication, gene loss, and lateral gene transfer (LGT). The suite also supports a wide range of branch rate models enabling relaxation of the molecular clock. Result: Simulated data created with PrIME-GenPhyloData can be used for benchmarking phylogenetic approaches, or for characterizing models or model parameters with respect to biological data. Conclusion: The concept of tree-in-tree evolution can also be used to model, for instance, biogeography or host-parasite co-evolution.
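    As a rough illustration of the simplest case mentioned in this abstract, a constant-rate birth-death process can be simulated with a few lines of event-driven code. This is a generic sketch, not the PrIME-GenPhyloData implementation: the function name, the node dictionary layout, and the choice to stop at a fixed time horizon are all assumptions made for the example.

```python
import random
from typing import Dict, List, Optional


def simulate_birth_death(birth_rate: float, death_rate: float, max_time: float,
                         seed: Optional[int] = None) -> List[Dict]:
    """Simulate a birth-death tree from a single lineage at time 0 until max_time."""
    rng = random.Random(seed)
    nodes = [{"id": 0, "parent": None, "start": 0.0, "end": None, "fate": None}]
    alive = [0]  # ids of lineages that currently exist
    t = 0.0
    per_lineage_rate = birth_rate + death_rate
    while alive and per_lineage_rate > 0.0:
        # Waiting time to the next event across all living lineages.
        t += rng.expovariate(len(alive) * per_lineage_rate)
        if t >= max_time:
            break
        idx = alive.pop(rng.randrange(len(alive)))  # lineage hit by the event
        nodes[idx]["end"] = t
        if rng.random() < birth_rate / per_lineage_rate:
            nodes[idx]["fate"] = "split"
            for _ in range(2):
                child = {"id": len(nodes), "parent": idx, "start": t,
                         "end": None, "fate": None}
                nodes.append(child)
                alive.append(child["id"])
        else:
            nodes[idx]["fate"] = "extinct"
    for idx in alive:  # lineages surviving to the horizon
        nodes[idx]["end"] = max_time
        nodes[idx]["fate"] = "extant"
    return nodes
```

    Each returned node records its parent, its start and end times, and whether it split, went extinct, or survived to the horizon. Guiding such a process by a species tree, or adding lateral transfer and rate models as the abstract describes, would require considerably more machinery than shown here.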

  • 36.
    Sjöstrand, Joel
    et al.
    Dept. of Numerical Analysis and Computer Science, Stockholm University.
    Sennblad, Bengt
    Karolinska Institutet.
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Lagergren, Jens
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    DLRS: gene tree evolution in light of a species tree. 2012. In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 28, no 22, p. 2994-2995. Article in journal (Refereed)
    Abstract [en]

    PrIME-DLRS (or colloquially: 'Delirious') is a phylogenetic software tool to simultaneously infer and reconcile a gene tree given a species tree. It accounts for duplication and loss events, a relaxed molecular clock and is intended for the study of homologous gene families, for example in a comparative genomics setting involving multiple species. PrIME-DLRS uses a Bayesian MCMC framework, where the input is a known species tree with divergence times and a multiple sequence alignment, and the output is a posterior distribution over gene trees and model parameters.

  • 37. Sjöstrand, Joel
    et al.
    Tofigh, Ali
    Daubin, Vincent
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Sennblad, Bengt
    Lagergren, Jens
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    A Bayesian Method for Analyzing Lateral Gene Transfer. 2014. In: Systematic Biology, ISSN 1063-5157, E-ISSN 1076-836X, Vol. 63, no 3, p. 409-420. Article in journal (Refereed)
    Abstract [en]

    Lateral gene transfer (LGT), which transfers DNA between two non-vertically related individuals belonging to the same or different species, is recognized as a major force in prokaryotic evolution, and evidence of its impact on eukaryotic evolution is ever increasing. LGT has attracted much public attention for its potential to transfer pathogenic elements and antibiotic resistance in bacteria, and to transfer pesticide resistance from genetically modified crops to other plants. In a wider perspective, there is a growing body of studies highlighting the role of LGT in enabling organisms to occupy new niches or adapt to environmental changes. The challenge LGT poses to the standard tree-based conception of evolution is also being debated. Studies of LGT have, however, been severely limited by a lack of computational tools. The best currently available LGT algorithms are parsimony-based phylogenetic methods, which require a pre-computed gene tree and cannot choose between sometimes wildly differing most parsimonious solutions. Moreover, in many studies, simple heuristics are applied that can only handle putative orthologs and completely disregard gene duplications (GDs). Consequently, proposed LGT among specific gene families, and the rate of LGT in general, remain debated. We present a Bayesian Markov-chain Monte Carlo-based method that integrates GD, gene loss, LGT, and sequence evolution, and apply the method in a genome-wide analysis of two groups of bacteria: Mollicutes and Cyanobacteria. Our analyses show that although the LGT rate between distant species is high, the net combined rate of duplication and close-species LGT is on average higher. We also show that the common practice of disregarding reconcilability in gene tree inference overestimates the number of LGT and duplication events. [Bayesian; gene duplication; gene loss; horizontal gene transfer; lateral gene transfer; MCMC; phylogenetics.]

  • 38.
    Stranneheim, Henrik
    et al.
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Kaller, Max
    Allander, Tobias
    Andersson, Björn
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Lundeberg, Joakim
    KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Classification of DNA sequences using Bloom filters. 2010. In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 26, no 13, p. 1595-1600. Article in journal (Refereed)
    Abstract [en]

    Motivation: New generation sequencing technologies producing increasingly complex datasets demand new efficient and specialized sequence analysis algorithms. Often, it is only the 'novel' sequences in a complex dataset that are of interest and the superfluous sequences need to be removed. Results: A novel algorithm, fast and accurate classification of sequences (FACS), is introduced that can accurately and rapidly classify sequences as belonging or not belonging to a reference sequence. FACS was first optimized and validated using a synthetic metagenome dataset. An experimental metagenome dataset was then used to show that FACS achieves accuracy comparable to that of BLAT and SSAHA2 but is at least 21 times faster in classifying sequences.
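
The abstract above does not spell out how the classification works, so the following is only a minimal sketch of the general idea of Bloom-filter-based sequence classification, not the FACS implementation. The k-mer length, filter size, number of hash functions, SHA-256 double hashing, and the match threshold are all assumptions chosen for illustration, as are the names `BloomFilter`, `build_reference_filter`, `match_fraction`, and `classify`.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield all overlapping substrings of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_reference_filter(reference, k=8, **filter_args):
    """Insert every k-mer of the reference sequence into a Bloom filter."""
    bloom = BloomFilter(**filter_args)
    for kmer in kmers(reference.upper(), k):
        bloom.add(kmer)
    return bloom


def match_fraction(read, bloom, k=8):
    """Fraction of the read's k-mers that the filter reports as present."""
    hits = total = 0
    for kmer in kmers(read.upper(), k):
        total += 1
        hits += kmer in bloom
    return hits / total if total else 0.0


def classify(read, bloom, k=8, threshold=0.8):
    """Label a read as belonging to the reference if enough k-mers match."""
    return match_fraction(read, bloom, k) >= threshold
```

Under these assumptions, a read copied from the reference always has a match fraction of 1.0, because a Bloom filter never produces false negatives. An unrelated read can only be misclassified through false positives, whose rate depends on the filter size and the number of inserted k-mers.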

  • 39.
    Svensson, Örjan
    et al.
    KTH, School of Computer Science and Communication (CSC).
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC).
    Lagergren, Jens
    KTH, School of Computer Science and Communication (CSC).
    Genome-wide survey for biologically functional pseudogenes. 2006. In: PLoS Computational Biology, ISSN 1553-734X, E-ISSN 1553-7358, Vol. 2, no 5, p. 358-369. Article in journal (Refereed)
    Abstract [en]

    According to current estimates there exist about 20,000 pseudogenes in a mammalian genome. The vast majority of these are disabled and nonfunctional copies of protein-coding genes which, therefore, evolve neutrally. Recent findings that a Makorin1 pseudogene, residing on mouse Chromosome 5, is, indeed, in vivo vital and also evolutionarily preserved, encouraged us to conduct a genome-wide survey for other functional pseudogenes in human, mouse, and chimpanzee. We identify to our knowledge the first examples of conserved pseudogenes common to human and mouse, originating from one duplication predating the human-mouse species split and having evolved as pseudogenes since the species split. Functionality is one possible way to explain the apparently contradictory properties of such pseudogene pairs, i.e., high conservation and ancient origin. The hypothesis of functionality is tested by comparing expression evidence and synteny of the candidates with proper test sets. The tests suggest potential biological function. Our candidate set includes a small set of long-lived pseudogenes whose unknown potential function is retained since before the human-mouse species split, and also a larger group of primate-specific ones found from human-chimpanzee searches. Two processed sequences are notable, their conservation since the human-mouse split being as high as most protein-coding genes; one is derived from the protein Ataxin 7-like 3 (ATX7NL3), and one from the Spinocerebellar ataxia type 1 protein (ATX1). Our approach is comparative and can be applied to any pair of species. It is implemented by a semi-automated pipeline based on cross-species BLAST comparisons and maximum-likelihood phylogeny estimations. To separate pseudogenes from protein-coding genes, we use standard methods, utilizing in-frame disablements, as well as a probabilistic filter based on Ka/Ks ratios.

  • 40. Vicedomini, Riccardo
    et al.
    Vezzi, Francesco
    KTH, School of Computer Science and Communication (CSC). KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Scalabrin, Simone
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Policriti, Alberto
    GAM-NGS: genomic assemblies merger for next generation sequencing. 2013. In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, p. S6. Article in journal (Refereed)
    Abstract [en]

    Background: In recent years more than 20 assemblers have been proposed to tackle the hard task of assembling NGS data. A common heuristic when assembling a genome is to use several assemblers and then select the best assembly according to some criteria. However, recent results clearly show that some assemblers lead to better statistics than others on specific regions but are outperformed on other regions or on different evaluation measures. To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through read alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph, which allows an optimal resolution of local problematic regions. Results: GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show that GAM-NGS is able to output an improved, reliable set of sequences. GAM-NGS is also a very efficient tool, able to merge assemblies using substantially fewer computational resources than comparable tools. In order to achieve such goals, GAM-NGS avoids global alignment between contigs, making its strategy unique among assembly reconciliation tools. Conclusions: The difficulty of obtaining correct and reliable assemblies using a single assembler is forcing the introduction of new algorithms able to enhance de novo assemblies. GAM-NGS is a tool able to merge two or more assemblies in order to improve contiguity and correctness. It can be used on all NGS-based assembly projects and it shows its full potential with multi-library Illumina-based projects.
With more than 20 available assemblers it is hard to select the best tool. In this context we propose a tool that improves assemblies (and, as a by-product, perhaps even assemblers) by merging them and selecting the sequence that is most likely to be correct.

  • 41.
    Winzell, Anders
    et al.
    KTH, School of Biotechnology (BIO).
    Rajangam, Alex
    KTH, School of Biotechnology (BIO).
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC).
    Filling, Charlotta
    KTH, School of Biotechnology (BIO).
    Divine, Christina
    KTH, School of Biotechnology (BIO).
    Aspeborg, Henrik
    KTH, School of Biotechnology (BIO).
    Master, Emma R.
    KTH, School of Biotechnology (BIO).
    Teeri, Tuula T.
    KTH, School of Biotechnology (BIO).
    Sequence Analysis and Recombinant Expression of Family 43 Glycosyltransferases. Manuscript (preprint) (Other academic)
  • 42.
    Åkerborg, Örjan
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Sennblad, Bengt
    Arvestad, Lars
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Lagergren, Jens
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. 2009. In: Proceedings of the National Academy of Sciences of the United States of America, ISSN 0027-8424, E-ISSN 1091-6490, Vol. 106, p. 5714-5719. Article in journal (Refereed)
    Abstract [en]

    We present GSR, a probabilistic model integrating gene duplication, sequence evolution, and a relaxed molecular clock for substitution rates, that enables genomewide analysis of gene families. The gene duplication and loss process is a major cause for incongruence between gene and species tree, and deterministic methods have been developed to explain such differences through tree reconciliations. Although probabilistic methods for phylogenetic inference have been around for decades, probabilistic reconciliation methods are far less established. Based on our model, we have implemented a Bayesian analysis tool, PrIME-GSR, for gene tree inference that takes a known species tree into account. Our implementation is sound and we demonstrate its utility for genomewide gene-family analysis by applying it to recently presented yeast data. We validate PrIME-GSR by comparing with previous analyses of these data that take advantage of gene order information. In a case study we apply our method to the ADH gene family and are able to draw biologically relevant conclusions concerning gene duplications creating key yeast phenotypes. On a higher level this shows the biological relevance of our method. The obtained results demonstrate the value of a relaxed molecular clock. Our good performance will extend to species where gene order conservation is insufficient.
