The bulk and the tail of minimal absent words in genome sequences
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). Aalto University, Finland. (Computational Biological Physics, CBP)
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). The Hebrew University of Jerusalem, Israel. (Computational Biological Physics, CBP)
State Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China.
2016 (English). In: Physical Biology, ISSN 1478-3967, E-ISSN 1478-3975, Vol. 13, no. 2, article id 026004. Article in journal (Refereed). Published
Abstract [en]

Minimal absent words (MAWs) of a genomic sequence are words that are themselves absent from the sequence but whose proper subwords are all present in it. The characteristic distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms, the bulk being rather short, and only relatively few being long. It has been an open issue whether the reason behind this phenomenon is statistical or reflects a biological mechanism, and what biological information is contained in absent words. In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We introduce the novel concept of the core of a minimal absent word: the sequences present in the genome that are closest to a given MAW. We show that in bacteria and yeast the cores of the longest MAWs, which exist in two or more copies, are located in highly conserved regions, the most prominent example being ribosomal RNAs (rRNAs). We also show that while the distribution of the cores of long MAWs is roughly uniform over these genomes on a coarse-grained level, on a more detailed level it is strongly enhanced in 3' untranslated regions (UTRs) and, to a lesser extent, also in 5' UTRs. This indicates that MAWs and associated MAW cores correspond to fine-tuned evolutionary relationships, and suggests that they can be more widely used as markers for genomic complexity.
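
To make the definition concrete, the following is a minimal brute-force sketch in Python. It is an illustration only, not the method used in the paper: the function names `substrings` and `minimal_absent_words` are hypothetical, the default alphabet is assumed to be the four DNA letters, and only a single strand is considered. It relies on the fact that a word of length k is a MAW exactly when it is absent while both its length k-1 prefix and its length k-1 suffix are present.

```python
def substrings(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minimal_absent_words(seq, alphabet="ACGT"):
    """Brute-force enumeration of the minimal absent words of seq.

    Illustrative sketch (assumed helper, not from the paper): a word w is
    reported if w does not occur in seq while w[:-1] and w[1:] both occur.
    Single letters missing from seq are reported as MAWs of length 1.
    """
    present_prev = substrings(seq, 1)
    maws = [a for a in alphabet if a not in present_prev]
    # No MAW can be longer than len(seq) + 1, since its prefix must occur.
    for k in range(2, len(seq) + 2):
        present_k = substrings(seq, k)
        for prefix in sorted(present_prev):
            for a in alphabet:
                w = prefix + a
                # The prefix is present by construction; check the suffix
                # and the absence of the full word.
                if w not in present_k and w[1:] in present_prev:
                    maws.append(w)
        present_prev = present_k
    return maws


print(minimal_absent_words("ACGA"))
```

For the toy sequence ACGA the word GAC is reported: it does not occur, yet GA and AC both do. This enumeration is only practical for short sequences; for whole genomes, index-based approaches such as suffix arrays are the usual choice.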

Place, publisher, year, edition, pages
Institute of Physics (IOP), 2016. Vol. 13, no. 2, article id 026004
Keywords [en]
Minimal absent word, copy-mutation evolution model, random sequence
Identifiers
URN: urn:nbn:se:kth:diva-173501
DOI: 10.1088/1478-3975/13/2/026004
ISI: 000376415400009
PubMedID: 27043075
Scopus ID: 2-s2.0-84969930876
OAI: oai:DiVA.org:kth-173501
DiVA, id: diva2:853727
Funder
Swedish Research Council, 621-2012-2982
Note

QC 20161102

Available from: 2015-09-14. Created: 2015-09-13. Last updated: 2018-01-11. Bibliographically approved.
In thesis
1. Data Analysis and Next Generation Sequencing : Applications in Microbiology.
2015 (English). Doctoral thesis, comprising papers (Other academic)
Abstract [en]

Next Generation Sequencing (NGS) is a new technology that has revolutionized the way we study living organisms. Where previously only a few genes could be studied at a time through targeted direct probing, NGS offers the possibility to perform measurements for a whole genome at once. The drawback is that the amount of data generated in the process is large and extracting useful information from it requires new methods to process and analyze it.

The main contribution of this thesis is the development of a novel experimental method coined tagRNA-seq, combining 5’tagRACE, a previously developed technique, with RNA-sequencing technology. Briefly, tagRNA-seq makes it possible to identify the 5’ ends of RNAs in bacteria and directly probe for their type, primary or processed, by ligating short RNA sequences, the tags, to the beginnings of RNA molecules. We used the method to directly probe for transcription start and processing sites in two bacterial species, Escherichia coli and Enterococcus faecalis. It was also used to study polyadenylation in E. coli, where the ability to identify processed RNA molecules proved useful for separating direct and indirect regulatory effects of this mechanism. We also demonstrate how data from tagRNA-seq experiments can be used to increase confidence in the discovery of anti-sense transcripts in bacteria. Analyses of RNA-seq data obtained in the context of these experiments revealed subtle artifacts in the coverage signal towards gene ends, which we were able to explain and quantify based on Kolmogorov’s broken stick model. We also discovered evidence for circularization of a few RNA transcripts, both in our own data sets and in publicly available data.

Designing the tags used in tagRNA-seq led us to the problem of words absent from a text. We focus on a particular subset of these, the minimal absent words (MAWs), and develop a theory providing a complete description of their size distribution in random text. We also show that MAWs in genomes from viruses and living organisms almost always exhibit a behavior different from random texts in the tail of the distribution, and that MAWs from this tail are closely related to sequences present in the genome that preferentially appear in regions with important regulatory functions.

Finally, and independently of tagRNA-seq, we propose a new approach to the problem of bacterial community reconstruction in metagenomics, based on techniques from compressed sensing. We provide a novel algorithm competing with state-of-the-art techniques in the field.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015. p. xviii, 154
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2015:15
Keywords
RNA-seq, tagRNA-seq, primary and processed RNA, Enterococcus faecalis, Complex transcription, Metagenomics, 5'tagRACE, minimal absent words, compressed sensing, bacterial community reconstruction
Research subject
Biological Physics
Identifiers
URN: urn:nbn:se:kth:diva-173219
ISBN: 978-91-7595-699-2
Public defence
2015-10-30, FA32, Roslagstullsbacken 21, Stockholm, 14:00 (English)
Note

QC 20150930

Available from: 2015-09-30. Created: 2015-09-07. Last updated: 2018-01-11. Bibliographically approved.

By author/editor
Aurell, Erik; Innocenti, Nicolas