Data Analysis and Next Generation Sequencing : Applications in Microbiology.
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. (Computational Biological Physics, CBP)
2015 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Next Generation Sequencing (NGS) is a new technology that has revolutionized the way we study living organisms. Where previously only a few genes could be studied at a time through targeted direct probing, NGS offers the possibility to perform measurements for a whole genome at once. The drawback is that the amount of data generated in the process is large and extracting useful information from it requires new methods to process and analyze it.

The main contribution of this thesis is the development of a novel experimental method coined tagRNA-seq, combining 5’tagRACE, a previously developed technique, with RNA-sequencing technology. Briefly, tagRNA-seq makes it possible to identify the 5’ ends of RNAs in bacteria and directly probe for their type, primary or processed, by ligating short RNA sequences, the tags, to the beginnings of RNA molecules. We used the method to directly probe for transcription start and processing sites in two bacterial species, Escherichia coli and Enterococcus faecalis. It was also used to study polyadenylation in E. coli, where the ability to identify processed RNA molecules proved useful for separating direct and indirect regulatory effects of this mechanism. We also demonstrate how data from tagRNA-seq experiments can be used to increase confidence in the discovery of antisense transcripts in bacteria. Analyses of RNA-seq data obtained in the context of these experiments revealed subtle artifacts in the coverage signal towards gene ends that we were able to explain and quantify based on Kolmogorov’s broken stick model. We also discovered evidence for circularization of a few RNA transcripts, both in our own data sets and in publicly available data.

Designing the tags used in tagRNA-seq led us to the problem of words absent from a text. We focus on a particular subset of these, the minimal absent words (MAWs), and develop a theory providing a complete description of their size distribution in random text. We also show that MAWs in genomes from viruses and living organisms almost always exhibit a behavior different from random texts in the tail of the distribution, and that MAWs from this tail are closely related to sequences present in the genome that preferentially appear in regions with important regulatory functions.

Finally, and independently from tagRNA-seq, we propose a new approach to the problem of bacterial community reconstruction in metagenomics, based on techniques from compressed sensing. We provide a novel algorithm competing with state-of-the-art techniques in the field.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015, p. xviii, 154
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2015:15
Keywords [en]
RNA-seq, tagRNA-seq, primary and processed RNA, Enterococcus faecalis, Complex transcription, Metagenomics, 5'tagRACE, minimal absent words, compressed sensing, metagenomics, bacterial community reconstruction
National Category
Bioinformatics (Computational Biology); Microbiology; Other Biological Topics; Genetics
Research subject
Biological Physics
Identifiers
URN: urn:nbn:se:kth:diva-173219
ISBN: 978-91-7595-699-2 (print)
OAI: oai:DiVA.org:kth-173219
DiVA, id: diva2:854436
Public defence
2015-10-30, FA32, Roslagstullsbacken 21, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20150930

Available from: 2015-09-30 Created: 2015-09-07 Last updated: 2018-01-11. Bibliographically approved
List of papers
1. Lognormality and oscillations in the coverage of high-throughput transcriptomic data towards gene ends
2013 (English) In: Journal of Statistical Mechanics: Theory and Experiment, ISSN 1742-5468, E-ISSN 1742-5468, Vol. 2013, no. 10, p. P10013. Article in journal (Refereed) Published
Abstract [en]

High-throughput transcriptomics experiments have reached the stage where the count of the number of reads alignable to a given position can be treated as an almost-continuous signal. This allows us to ask questions of biophysical/biotechnical nature, but which may still have biological implications. Here we show that when sequencing RNA fragments from one end, as is the case on most platforms, an oscillation in the read count is observed at the other end. We further show that these oscillations can be well described by Kolmogorov's 1941 broken stick model. We investigate how the model can be used to improve predictions of gene ends (3' transcript ends), but conclude that with present data the improvement is only marginal. The results highlight subtle effects in high-throughput transcriptomics experiments which do not have a biological origin, but which may still be used to obtain biological information.
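
To give a feel for the kind of process behind the model named in this abstract, here is a minimal sketch of repeated random fragmentation, the setting in which fragment sizes tend toward a log-normal distribution. It assumes that each piece is cut once per round at a uniformly random position. That rule, the function names `fragment` and `log_length_stats`, and all parameter values are illustrative assumptions, not taken from the paper, whose model of read coverage is not reproduced here.

```python
import math
import random


def fragment(rounds, seed=0):
    """Break a unit-length stick repeatedly; return the final piece lengths.

    Assumption (not from the paper): in every round each piece is cut once
    at a uniformly random position.
    """
    rng = random.Random(seed)
    pieces = [1.0]
    for _ in range(rounds):
        next_pieces = []
        for length in pieces:
            cut = rng.random()  # relative cut position in [0, 1)
            next_pieces.append(length * cut)
            next_pieces.append(length * (1.0 - cut))
        pieces = next_pieces
    return pieces


def log_length_stats(pieces):
    """Mean and variance of log(length), skipping any zero-length piece."""
    logs = [math.log(p) for p in pieces if p > 0.0]
    mean = sum(logs) / len(logs)
    var = sum((v - mean) ** 2 for v in logs) / len(logs)
    return mean, var


pieces = fragment(rounds=10)
mean, var = log_length_stats(pieces)
print(len(pieces), round(sum(pieces), 6), mean, var)
```

In this sketch the length of each piece is a product of random factors, so its logarithm is a sum of random terms; a histogram of the log lengths therefore looks increasingly bell-shaped as the number of rounds grows.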

Place, publisher, year, edition, pages
Institute of Physics (IOP), 2013
Keywords
Modelling, Artefacts, RNAseq, Log-normal distribution, Broken stick, 3'end of transcripts
National Category
Bioinformatics and Systems Biology; Other Physics Topics
Research subject
SRA - Molecular Bioscience
Identifiers
urn:nbn:se:kth:diva-136077 (URN)
10.1088/1742-5468/2013/10/P10013 (DOI)
000326869000014 ()
2-s2.0-84888614314 (Scopus ID)
Note

QC 20131220

Available from: 2013-12-03 Created: 2013-12-03 Last updated: 2017-12-06. Bibliographically approved
2. SEK: Sparsity exploiting k-mer-based estimation of bacterial community composition
2014 (English) In: Bioinformatics, ISSN 1460-2059, Vol. 30, no. 17, p. 2423-2431. Article in journal (Refereed) Published
Abstract [en]

Motivation: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment.

Results: Using sparsity-enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm that yields a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method.

Availability and implementation: A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above Web site.
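
As a rough illustration of the idea in the Results paragraph, the sketch below represents each reference sequence by its k-mer frequency vector and estimates non-negative mixture weights that best explain a sample's k-mer vector. It is a toy under stated assumptions, not the SEK method: plain projected gradient descent is swapped in for the paper's convex-optimization and greedy solvers, no kernel density model is used, and the function names `kmer_profile` and `estimate_composition`, the value k=2, and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of seq."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def estimate_composition(sample, references, steps=500):
    """Projected-gradient non-negative least squares, then normalize to sum 1.

    Minimizes 0.5 * ||sum_j x_j * references[j] - sample||^2 subject to x >= 0.
    """
    n = len(references)
    step = 1.0 / n  # safe step size: each profile vector has norm <= 1
    x = [1.0 / n] * n
    for _ in range(steps):
        residual = [
            sum(x[j] * references[j][i] for j in range(n)) - sample[i]
            for i in range(len(sample))
        ]
        grad = [
            sum(references[j][i] * residual[i] for i in range(len(sample)))
            for j in range(n)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x[j] - step * grad[j]) for j in range(n)]
    total = sum(x) or 1.0
    return [v / total for v in x]


# Two toy "reference genomes" and a synthetic sample mixed 70% / 30%.
ref_a = kmer_profile("ACACACACAC")
ref_b = kmer_profile("GTGTGTGTGT")
sample = [0.7 * a + 0.3 * b for a, b in zip(ref_a, ref_b)]
print(estimate_composition(sample, [ref_a, ref_b]))
```

The sample here is built directly as a mixture of the two reference profiles, so the estimate should recover weights close to 0.7 and 0.3; real read data would add noise and a much larger reference database, which is where sparsity becomes important.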

Place, publisher, year, edition, pages
Oxford University Press, 2014
Keywords
bacterial community composition, sparsity, metagenomics
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-152814 (URN)
10.1093/bioinformatics/btu320 (DOI)
000342912400046 ()
24812337 (PubMedID)
2-s2.0-84907029456 (Scopus ID)
Funder
Swedish Research Council (Vetenskapsrådet)
Note

QC 20141023

Available from: 2014-10-01 Created: 2014-10-01 Last updated: 2020-03-09. Bibliographically approved
3. Whole-genome mapping of 5′ RNA ends in bacteria by tagged sequencing: a comprehensive view in Enterococcus faecalis
2015 (English) In: RNA, ISSN 1355-8382. Article in journal (Refereed) Published
Abstract [en]

Enterococcus faecalis is the third leading cause of nosocomial infections. To obtain the first snapshot of transcriptional organization in this bacterium, we used a modified RNA-seq approach that discriminates primary from processed 5' RNA ends. We also validated our approach by confirming known features in Escherichia coli. We mapped 559 transcription start sites (TSSs) and 352 processing sites (PSSs) in E. faecalis. A blind motif search retrieved canonical features of SigA- and SigN-dependent promoters preceding the mapped transcription start sites. We discovered 85 novel putative regulatory RNAs, both small and antisense RNAs, and 72 transcriptional antisense organizations. The presented data constitute a significant insight into bacterial RNA landscapes and a step toward the comprehensive inference of regulatory processes at transcriptional and post-transcriptional levels.

Place, publisher, year, edition, pages
RNA Society, 2015
Keywords
primary RNA, processed RNA, promoter, RNA degradation, Enterococcus faecalis
National Category
Bioinformatics and Systems Biology; Microbiology; Genetics
Research subject
Biological Physics
Identifiers
urn:nbn:se:kth:diva-163570 (URN)
10.1261/rna.048470.114 (DOI)
000353068400022 ()
25737579 (PubMedID)
2-s2.0-84928006918 (Scopus ID)
Funder
Swedish Research Council (Vetenskapsrådet), 621-2012-2982
Note

QC 20150417

Available from: 2015-04-08 Created: 2015-04-08 Last updated: 2020-03-09. Bibliographically approved
4. Detection and quantitative estimation of spurious double stranded DNA formation during reverse transcription in bacteria using tagRNA-seq
2015 (English) In: RNA Biology, ISSN 1547-6286, E-ISSN 1555-8584. Article in journal (Refereed) Published
Abstract [en]

Standard RNA-seq has a well-known tendency to generate "ghost" antisense reads due to the formation of spurious second-strand cDNA in the sequencing process. We recently reported on a novel variant of RNA-seq coined "tagRNA-seq", introduced for the purpose of distinguishing primary from processed transcripts in bacteria. Incidentally, the additional information provided by the tag is also well suited to the detection of true antisense RNA transcripts and the quantification of spurious antisense signals in a sample. We briefly explain how to perform such a detection and illustrate it on previously published datasets.

Place, publisher, year, edition, pages
Taylor & Francis, 2015
Keywords
tagRNA-seq, spurious second strand cDNA, antisense RNA, complementary DNA, transcriptome, transcript discovery
National Category
Bioinformatics and Systems Biology; Microbiology
Research subject
Biotechnology; Biological Physics
Identifiers
urn:nbn:se:kth:diva-171378 (URN)
10.1080/15476286.2015.1071010 (DOI)
000361473300018 ()
26177062 (PubMedID)
2-s2.0-84949803433 (Scopus ID)
Note

QC 20150811

Available from: 2015-07-29 Created: 2015-07-29 Last updated: 2020-03-09. Bibliographically approved
5. An observation of circular RNAs in bacterial RNA-seq data.
2015 (English) Manuscript (preprint) (Other academic)
Abstract [en]

Circular RNAs (circRNAs) are a class of RNA with an important role in microRNA (miRNA) regulation, recently discovered in humans and various other eukaryotes as well as in archaea. Here, we have analyzed RNA-seq data obtained from Enterococcus faecalis and Escherichia coli in a way similar to previous studies performed on eukaryotes. We report observations of circRNAs in RNA-seq data that are reproducible across multiple experiments performed with different protocols or growth conditions.

Keywords
Circular RNA, RNA-seq
National Category
Bioinformatics and Systems Biology
Research subject
Biological Physics; Computer Science
Identifiers
urn:nbn:se:kth:diva-173215 (URN)
Note

QS 2015

Available from: 2015-09-13 Created: 2015-09-07 Last updated: 2016-02-02. Bibliographically approved
6. Landscape of RNA polyadenylation in E. coli
2016 (English) In: Nucleic Acids Research, ISSN 0305-1048, E-ISSN 1362-4962. Article in journal (Refereed) Published
Abstract [en]

Polyadenylation is involved in degradation and quality control of bacterial RNAs. We used a combination of 5’-tagRACE and RNA-seq to analyse the total RNA content from a wild-type strain and from a mutant deficient in poly(A)polymerase. We determined that 157 mRNAs, as well as non-coding transcripts, were affected, being up- or downregulated in the mutant when compared to the wild-type strain. Antisense RNAs were also detected and differentially affected by polyadenylation.

Our results clearly reveal a correlation between the RNA folding energy and the requirement of polyadenylation to achieve RNA decay. A new algorithm was developed to detect post-transcriptional modifications in both strains, based on unmappable 3’-ends, and to analyse their position and composition. Therefore, any RNA 3'-end can be polyadenylated, which directs the RNA to the exoribonucleolytic machinery that is essential for degrading structured RNAs. Importantly, poly(A)polymerase also upregulated the expression of genes related to the entire FliA regulon and numerous membrane transporters, while downregulating the expression of the antigen 43 (flu), numerous sRNAs, antisense transcripts and REP sequences, with the accumulation of numerous RNA fragments resulting from the processing of entire transcripts. Altogether we show here that polyadenylation has a broader spectrum of action than was suspected until now.

Place, publisher, year, edition, pages
Oxford University Press, 2016
Keywords
Polyadenylation, degradation, poly(A)polymerase, pcnB deficient mutant
National Category
Microbiology; Bioinformatics (Computational Biology); Genetics
Research subject
Biological Physics
Identifiers
urn:nbn:se:kth:diva-173328 (URN)
10.1093/nar/gkw894 (DOI)
000397286600048 ()
28426097 (PubMedID)
2-s2.0-85018357344 (Scopus ID)
Note

QC 20170119

Available from: 2015-09-09 Created: 2015-09-09 Last updated: 2020-03-09. Bibliographically approved
7. The bulk and the tail of minimal absent words in genome sequences
2016 (English) In: Physical Biology, ISSN 1478-3967, E-ISSN 1478-3975, Vol. 13, no. 2, article id 026004. Article in journal (Refereed) Published
Abstract [en]

Minimal absent words (MAWs) of a genomic sequence are words that are themselves absent from the sequence but all of whose subwords are present in it. The characteristic distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms, the bulk being rather short, and only relatively few being long. It has been an open issue whether the reason behind this phenomenon is statistical or reflects a biological mechanism, and what biological information is contained in absent words. In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We introduce the novel concept of the core of a minimal absent word: cores are sequences present in the genome that are closest to a given MAW. We show that in bacteria and yeast the cores of the longest MAWs, which exist in two or more copies, are located in highly conserved regions, the most prominent example being ribosomal RNAs (rRNAs). We also show that while the distribution of the cores of long MAWs is roughly uniform over these genomes on a coarse-grained level, on a more detailed level it is strongly enhanced in 3' untranslated regions (UTRs) and, to a lesser extent, also in 5' UTRs. This indicates that MAWs and associated MAW cores correspond to fine-tuned evolutionary relationships, and suggests that they can be more widely used as markers for genomic complexity.
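
The definition at the start of this abstract can be made concrete with a small brute-force sketch. It enumerates the substrings of a sequence and reports every word that is absent while its longest proper prefix and suffix are both present. The function name `minimal_absent_words`, the `max_len` cap, and the toy input are illustrative assumptions; this is not the algorithm used in the paper and is only practical for short sequences.

```python
def minimal_absent_words(seq, alphabet="ACGT", max_len=8):
    """Return all minimal absent words of seq with length <= max_len."""
    # Every substring of seq up to max_len, plus the empty word.
    present = {""}
    for start in range(len(seq)):
        for end in range(start + 1, min(start + max_len, len(seq)) + 1):
            present.add(seq[start:end])

    maws = set()
    for prefix in present:
        if len(prefix) >= max_len:
            continue
        for letter in alphabet:
            word = prefix + letter
            # word is a minimal absent word if it is absent while its longest
            # proper prefix (prefix) and suffix (word[1:]) are both present.
            if word not in present and word[1:] in present:
                maws.add(word)
    return maws


print(sorted(minimal_absent_words("AAC", alphabet="AC")))
# → ['AAA', 'CA', 'CC']
```

Checking only the longest proper prefix and suffix is enough, because every other proper subword of a word lies inside one of those two and is therefore present whenever they are.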

Place, publisher, year, edition, pages
Institute of Physics (IOP), 2016
Keywords
Minimal absent word, copy-mutation evolution model, random sequence
National Category
Biophysics; Evolutionary Biology; Genetics; Bioinformatics (Computational Biology); Physics
Identifiers
urn:nbn:se:kth:diva-173501 (URN)
10.1088/1478-3975/13/2/026004 (DOI)
000376415400009 ()
27043075 (PubMedID)
2-s2.0-84969930876 (Scopus ID)
Funder
Swedish Research Council (Vetenskapsrådet), 621-2012-2982
Note

QC 20161102

Available from: 2015-09-14 Created: 2015-09-13 Last updated: 2018-01-11. Bibliographically approved

Open Access in DiVA

Main text (41177 kB), 283 downloads
File information
File name: FULLTEXT02.pdf. File size: 41177 kB. Checksum (SHA-512):
0da7a36ab65d758d4ac247d04e9fb12ad45c1c215ac85b8f5b7cb6b911565caa585c1ab764838e8c5b5ec89417a19dc20a1110f93a295c10b77d60503fa307f6
Type: fulltext. Mimetype: application/pdf
Supplementary Material (833 kB), 92 downloads
File information
File name: ATTACHMENT01.zip. File size: 833 kB. Checksum (SHA-512):
aaf0ff76cc774a699647fea0000a6ea4b14d33b57754949304948358443957230fd175bdf3cbd00dfa6c5f68cd5829e870ed3c7741ff922b5bcc48f64b6e612c
Type: attachment. Mimetype: application/zip
Spikblad (English) (199 kB), 10 downloads
File information
File name: SPIKBLAD01.pdf. File size: 199 kB. Checksum (SHA-512):
749603b4babdc78bdec7fd199c8b4ad70304581adb3b89b50f2aa0addaa0cbbb9784043f834b9595e59ba30e8b207ad1373ea3cb9e9ed3c3b53cd244ecc04bac
Type: spikblad. Mimetype: application/pdf
Spikblad (Swedish) (199 kB), 21 downloads
File information
File name: SPIKBLAD02.pdf. File size: 199 kB. Checksum (SHA-512):
179c21ae6b86d5653c66129f662786a0b075a33e6cd1d3f95c454a150537d62cfc8917852559267e44818a76477e8a22908b190f17c8674828b2ae25b28d5fc4
Type: spikblad. Mimetype: application/pdf
Errata (132 kB), 21 downloads
File information
File name: FULLTEXT03.pdf. File size: 132 kB. Checksum (SHA-512):
439dfe27686604de7c0d3d9cfb50af7c6695643959c33ae13fbdde12b406b14238dcc1a00d1a6fccf4e91c92b4c5ca479b3335e7780d7f023271b7f73e6ff0c6
Type: errata. Mimetype: application/pdf

Search in DiVA

By author/editor
Innocenti, Nicolas
By organisation
Computational Biology, CB
Bioinformatics (Computational Biology); Microbiology; Other Biological Topics; Genetics

Total: 304 downloads
The number of downloads is the sum of downloads for all full texts. It may include, for example, earlier versions that are no longer available.

Total: 874 hits