Data Analysis and Next Generation Sequencing : Applications in Microbiology.
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. (Computational Biological Physics, CBP)
2015 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Next Generation Sequencing (NGS) is a new technology that has revolutionized the way we study living organisms. Where previously only a few genes could be studied at a time through targeted direct probing, NGS offers the possibility to perform measurements for a whole genome at once. The drawback is that the amount of data generated in the process is large and extracting useful information from it requires new methods to process and analyze it.

The main contribution of this thesis is the development of a novel experimental method coined tagRNA-seq, combining 5’tagRACE, a previously developed technique, with RNA-sequencing technology. Briefly, tagRNA-seq makes it possible to identify the 5’ ends of RNAs in bacteria and directly probe for their type, primary or processed, by ligating short RNA sequences, the tags, to the beginnings of RNA molecules. We used the method to directly probe for transcription start and processing sites in two bacterial species, Escherichia coli and Enterococcus faecalis. It was also used to study polyadenylation in E. coli, where the ability to identify processed RNA molecules proved useful for separating direct and indirect regulatory effects of this mechanism. We also demonstrate how data from tagRNA-seq experiments can be used to increase confidence in the discovery of antisense transcripts in bacteria. Analyses of RNA-seq data obtained in the context of these experiments revealed subtle artifacts in the coverage signal towards gene ends, which we were able to explain and quantify based on Kolmogorov’s broken stick model. We also discovered evidence for circularization of a few RNA transcripts, both in our own data sets and in publicly available data.

Designing the tags used in tagRNA-seq led us to the problem of words absent from a text. We focus on a particular subset of these, the minimal absent words (MAWs), and develop a theory providing a complete description of their size distribution in random text. We also show that MAWs in genomes from viruses and living organisms almost always exhibit a behavior different from random texts in the tail of the distribution, and that MAWs from this tail are closely related to sequences present in the genome that preferentially appear in regions with important regulatory functions.

Finally, and independently of tagRNA-seq, we propose a new approach to the problem of bacterial community reconstruction in metagenomics, based on techniques from compressed sensing. We provide a novel algorithm that competes with state-of-the-art techniques in the field.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015, xviii, 154 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2015:15
Keyword [en]
RNA-seq, tagRNA-seq, primary and processed RNA, Enterococcus faecalis, Complex transcription, Metagenomics, 5'tagRACE, minimal absent words, compressed sensing, bacterial community reconstruction
National Category
Bioinformatics (Computational Biology); Microbiology; Other Biological Topics; Genetics
Research subject
Biological Physics
Identifiers
URN: urn:nbn:se:kth:diva-173219; ISBN: 978-91-7595-699-2 (print); OAI: oai:DiVA.org:kth-173219; DiVA: diva2:854436
Public defence
2015-10-30, FA32, Roslagstullsbacken 21, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20150930

Available from: 2015-09-30. Created: 2015-09-07. Last updated: 2015-11-06. Bibliographically approved
List of papers
1. Lognormality and oscillations in the coverage of high-throughput transcriptomic data towards gene ends
2013 (English). In: Journal of Statistical Mechanics: Theory and Experiment, ISSN 1742-5468, E-ISSN 1742-5468, Vol. 2013, no 10, P10013. Article in journal (Refereed) Published
Abstract [en]

High-throughput transcriptomics experiments have reached the stage where the count of the number of reads alignable to a given position can be treated as an almost-continuous signal. This allows us to ask questions of biophysical/biotechnical nature, but which may still have biological implications. Here we show that when sequencing RNA fragments from one end, as is the case on most platforms, an oscillation in the read count is observed at the other end. We further show that these oscillations can be well described by Kolmogorov's 1941 broken stick model. We investigate how the model can be used to improve predictions of gene ends (3' transcript ends), but conclude that with present data the improvement is only marginal. The results highlight subtle effects in high-throughput transcriptomics experiments which do not have a biological origin, but which may still be used to obtain biological information.
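
As a rough illustration of the broken stick idea invoked above, the following Python sketch repeatedly breaks every fragment at a uniformly random point and summarizes the logarithms of the resulting sizes, which is the quantity expected to become approximately normally distributed (hence lognormal sizes). The uniform break point, the number of rounds, the starting length and all function names are assumptions made for illustration; they are not taken from the paper, and this is not the fitting procedure used there.

```python
import math
import random


def broken_stick_fragments(length, rounds, rng):
    """Break every fragment in two at a uniformly random point, `rounds` times.

    Illustrative assumption: each break point is uniform along the fragment.
    """
    fragments = [float(length)]
    for _ in range(rounds):
        next_fragments = []
        for frag in fragments:
            cut = rng.random() * frag  # position of the break along this fragment
            next_fragments.append(cut)
            next_fragments.append(frag - cut)
        fragments = next_fragments
    return fragments


def log_size_stats(fragments):
    """Return mean and variance of log(fragment size), skipping empty fragments."""
    logs = [math.log(f) for f in fragments if f > 0]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, var


rng = random.Random(0)  # fixed seed so the run is repeatable
frags = broken_stick_fragments(1000.0, 12, rng)
mean_log, var_log = log_size_stats(frags)
print(len(frags), round(mean_log, 2), round(var_log, 2))
```

Each fragment size is the starting length multiplied by a product of independent random fractions, so its logarithm is a sum of independent terms; this is why the log sizes tend towards a normal distribution as the number of rounds grows.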

Place, publisher, year, edition, pages
Institute of Physics (IOP), 2013
Keyword
Modelling, Artefacts, RNAseq, Log-normal distribution, Broken stick, 3'end of transcripts
National Category
Bioinformatics and Systems Biology; Other Physics Topics
Research subject
SRA - Molecular Bioscience
Identifiers
urn:nbn:se:kth:diva-136077 (URN); 10.1088/1742-5468/2013/10/P10013 (DOI); 000326869000014 (); 2-s2.0-84888614314 (Scopus ID)
Note

QC 20131220

Available from: 2013-12-03. Created: 2013-12-03. Last updated: 2017-12-06. Bibliographically approved
2. SEK: Sparsity exploiting k-mer-based estimation of bacterial community composition
2014 (English). In: Bioinformatics, ISSN 1460-2059, Vol. 30, no 17, p. 2423-2431. Article in journal (Refereed) Published
Abstract [en]

Motivation: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment.

Results: Using sparsity-enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm that yields a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method.

Availability and implementation: A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above Web site.
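
To make the k-mer-based formulation described in the Results paragraph more concrete, here is a minimal Python sketch under stated assumptions. It represents each reference sequence by its normalized k-mer frequency vector and recovers non-negative mixing proportions by projected gradient descent on a least-squares objective. This is a simple stand-in, not the SEK algorithm: it omits the kernel density model, the convex solver and the greedy algorithm mentioned in the abstract, and the function names, the choice k = 2 and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2, alphabet="ACGT"):
    """Normalized k-mer frequency vector of `seq` over all k-mers of `alphabet`."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total for m in kmers]


def estimate_composition(sample, references, steps=500):
    """Estimate non-negative mixing proportions x minimizing ||A x - b||^2.

    `references` holds the columns of A (one profile per organism) and
    `sample` is b. Plain projected gradient descent, for illustration only.
    """
    n = len(references)
    dim = len(sample)
    x = [1.0 / n] * n
    # Each profile has Euclidean norm <= 1, so 1 / (2 n) is a safe step size.
    lr = 1.0 / (2.0 * n)
    for _ in range(steps):
        residual = [
            sum(x[j] * references[j][i] for j in range(n)) - sample[i]
            for i in range(dim)
        ]
        for j in range(n):
            grad = 2.0 * sum(references[j][i] * residual[i] for i in range(dim))
            x[j] = max(0.0, x[j] - lr * grad)  # project onto x >= 0
    total = sum(x)
    return [v / total for v in x]  # report proportions that sum to one


# Toy reference "genomes" (hypothetical) and a sample mixing the first two.
refs = [kmer_profile("AC" * 20), kmer_profile("GT" * 20), kmer_profile("AG" * 20)]
sample = [0.7 * a + 0.3 * b for a, b in zip(refs[0], refs[1])]
print([round(v, 3) for v in estimate_composition(sample, refs)])
```

The non-negativity constraint combined with proportions that sum to one is what pushes the contribution of absent organisms towards zero in this toy setting; a realistic reference database would need larger k and a solver designed for many, highly similar columns.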

Place, publisher, year, edition, pages
Oxford University Press, 2014
Keyword
bacterial community composition, sparsity, metagenomics
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-152814 (URN); 10.1093/bioinformatics/btu320 (DOI); 000342912400046 (); 2-s2.0-84907029456 (Scopus ID)
Funder
Swedish Research Council
Note

QC 20141023

Available from: 2014-10-01. Created: 2014-10-01. Last updated: 2015-09-30. Bibliographically approved
3. Whole-genome mapping of 5′ RNA ends in bacteria by tagged sequencing: a comprehensive view in Enterococcus faecalis
2015 (English). In: RNA, ISSN 1355-8382. Article in journal (Refereed) Published
Abstract [en]

Enterococcus faecalis is the third leading cause of nosocomial infections. To obtain the first snapshot of transcriptional organizations in this bacterium, we used a modified RNA-seq approach that makes it possible to discriminate primary from processed 5' RNA ends. We also validated our approach by confirming known features in Escherichia coli. We mapped 559 transcription start sites (TSSs) and 352 processing sites (PSSs) in E. faecalis. A blind motif search retrieved canonical features of SigA- and SigN-dependent promoters preceding the mapped transcription start sites. We discovered 85 novel putative regulatory RNAs, small and antisense RNAs, and 72 transcriptional antisense organizations. The presented data constitute a significant insight into bacterial RNA landscapes and a step toward the inference of regulatory processes at transcriptional and post-transcriptional levels in a comprehensive manner.

Place, publisher, year, edition, pages
RNA Society, 2015
Keyword
primary RNA, processed RNA, promoter, RNA degradation, Enterococcus faecalis
National Category
Bioinformatics and Systems Biology; Microbiology; Genetics
Research subject
Biological Physics
Identifiers
urn:nbn:se:kth:diva-163570 (URN); 10.1261/rna.048470.114 (DOI); 000353068400022 (); 2-s2.0-84928006918 (Scopus ID)
Funder
Swedish Research Council, 621-2012-2982
Note

QC 20150417

Available from: 2015-04-08. Created: 2015-04-08. Last updated: 2015-09-30. Bibliographically approved
4. Detection and quantitative estimation of spurious double stranded DNA formation during reverse transcription in bacteria using tagRNA-seq
2015 (English). In: RNA Biology, ISSN 1547-6286, E-ISSN 1555-8584. Article in journal (Refereed) Published
Abstract [en]

Standard RNA-seq has a well-known tendency to generate "ghost" antisense reads due to the formation of spurious second strand cDNA in the sequencing process. We recently reported on a novel variant of RNA-seq coined "tagRNA-seq", introduced for the purpose of distinguishing primary from processed transcripts in bacteria. Incidentally, the additional information provided by the tag is also well suited to the detection of true antisense RNA transcripts and the quantification of spurious antisense signals in a sample. We briefly explain how to perform such a detection and illustrate it on previously published datasets.

Place, publisher, year, edition, pages
Taylor & Francis, 2015
Keyword
tagRNA-seq, spurious second strand cDNA, antisense RNA, complementary DNA, transcriptome, transcript discovery
National Category
Bioinformatics and Systems Biology; Microbiology
Research subject
Biotechnology; Biological Physics
Identifiers
urn:nbn:se:kth:diva-171378 (URN); 10.1080/15476286.2015.1071010 (DOI); 000361473300018 (); 2-s2.0-84949803433 (Scopus ID)
Note

QC 20150811

Available from: 2015-07-29. Created: 2015-07-29. Last updated: 2017-12-04. Bibliographically approved
5. An observation of circular RNAs in bacterial RNA-seq data.
2015 (English). Manuscript (preprint) (Other academic)
Abstract [en]

Circular RNAs (circRNAs) are a class of RNA with an important role in microRNA (miRNA) regulation, recently discovered in humans and various other eukaryotes as well as in archaea. Here, we have analyzed RNA-seq data obtained from Enterococcus faecalis and Escherichia coli in a way similar to previous studies performed on eukaryotes. We report observations of circRNAs in RNA-seq data that are reproducible across multiple experiments performed with different protocols or growth conditions.

Keyword
Circular RNA, RNA-seq
National Category
Bioinformatics and Systems Biology
Research subject
Biological Physics; Computer Science
Identifiers
urn:nbn:se:kth:diva-173215 (URN)
Note

QS 2015

Available from: 2015-09-13. Created: 2015-09-07. Last updated: 2016-02-02. Bibliographically approved
6. Landscape of RNA polyadenylation in E. coli
2016 (English). In: Nucleic Acids Research, ISSN 0305-1048, E-ISSN 1362-4962. Article in journal (Refereed) Published
Abstract [en]

Polyadenylation is involved in degradation and quality control of bacterial RNAs. We used a combination of 5’-tagRACE and RNA-seq to analyse the total RNA content from a wild-type strain and from a mutant deficient in poly(A) polymerase. We determined that 157 mRNAs, as well as non-coding transcripts, were affected, being up- or downregulated in the mutant when compared to the wild-type strain. Antisense RNAs were also detected and differentially affected by polyadenylation.

Our results reveal a correlation between RNA folding energy and the requirement for polyadenylation to achieve RNA decay. A new algorithm was developed to detect post-transcriptional modifications in both strains based on unmappable 3’-ends, and to analyse their position and composition. Therefore, any RNA 3'-end can be polyadenylated, which addresses it to the exoribonucleolytic machinery that is essential for degrading structured RNAs. Importantly, poly(A) polymerase also upregulated the expression of genes of the entire FliA regulon and of numerous membrane transporters, while downregulating the expression of antigen 43 (flu), numerous sRNAs, antisense transcripts and REP sequences, with the accumulation of numerous RNA fragments resulting from the processing of entire transcripts. Altogether, we show here that polyadenylation has a broader spectrum of action than was suspected until now.

Place, publisher, year, edition, pages
Oxford University Press, 2016
Keyword
Polyadenylation, degradation, poly(A)polymerase, pcnB deficient mutant
National Category
Microbiology; Bioinformatics (Computational Biology); Genetics
Research subject
Biological Physics
Identifiers
urn:nbn:se:kth:diva-173328 (URN); 10.1093/nar/gkw894 (DOI); 000397286600048 (); 2-s2.0-85018357344 (Scopus ID)
Note

QC 20170119

Available from: 2015-09-09. Created: 2015-09-09. Last updated: 2017-10-23. Bibliographically approved
7. The bulk and the tail of minimal absent words in genome sequences
2016 (English). In: Physical Biology, ISSN 1478-3967, E-ISSN 1478-3975, Vol. 13, no 2, 026004. Article in journal (Refereed) Published
Abstract [en]

Minimal absent words (MAWs) of a genomic sequence are words that are themselves absent from the sequence but whose proper subwords are all present in it. The characteristic distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms, the bulk being rather short, and only relatively few being long. It has been an open issue whether the reason behind this phenomenon is statistical or reflects a biological mechanism, and what biological information is contained in absent words. In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We introduce the novel concept of the core of a minimal absent word: a sequence that is present in the genome and closest to a given MAW. We show that in bacteria and yeast the cores of the longest MAWs, which exist in two or more copies, are located in highly conserved regions, the most prominent example being ribosomal RNAs (rRNAs). We also show that while the distribution of the cores of long MAWs is roughly uniform over these genomes on a coarse-grained level, on a more detailed level it is strongly enhanced in 3' untranslated regions (UTRs) and, to a lesser extent, also in 5' UTRs. This indicates that MAWs and associated MAW cores correspond to fine-tuned evolutionary relationships, and suggests that they can be more widely used as markers for genomic complexity.
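
The definition of a minimal absent word given at the start of this abstract can be checked by brute force on short sequences. The Python sketch below is an illustration under assumptions, not the method of the paper: it enumerates all substrings explicitly, so it is only practical for toy inputs, and the function name, the `max_len` cut-off and the example sequence are hypothetical. It relies on the fact that a word of length two or more is a MAW exactly when it is absent while both the word without its first letter and the word without its last letter are present; absent single letters are counted as MAWs by convention.

```python
from collections import Counter


def minimal_absent_words(seq, alphabet="ACGT", max_len=None):
    """Return the set of minimal absent words of `seq` up to length `max_len`."""
    n = len(seq)
    if max_len is None:
        max_len = n + 1  # no MAW of a length-n text is longer than n + 1
    # All substrings of seq with length <= max_len.
    present = {
        seq[i:j]
        for i in range(n)
        for j in range(i + 1, min(n, i + max_len) + 1)
    }
    # Letters that never occur are absent words whose only proper subword is empty.
    maws = {a for a in alphabet if a not in present}
    # Extend each present word by one letter; the prefix is present by construction.
    for left in present:
        if len(left) >= max_len:
            continue
        for b in alphabet:
            word = left + b
            if word not in present and word[1:] in present:
                maws.add(word)
    return maws


genome = "ACGTTGCAACGT"  # hypothetical toy sequence
maws = minimal_absent_words(genome, max_len=4)
length_counts = Counter(len(w) for w in maws)
print(sorted(length_counts.items()))  # number of MAWs found at each length
```

Tabulating the counts per length, as in the last lines, gives the kind of size distribution whose bulk and tail the abstract discusses; for genome-scale inputs a suffix-based index would be needed instead of explicit substring enumeration.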

Place, publisher, year, edition, pages
Institute of Physics (IOP), 2016
Keyword
Minimal absent word, copy-mutation evolution model, random sequence
National Category
Biophysics; Evolutionary Biology; Genetics; Bioinformatics (Computational Biology); Physical Sciences
Identifiers
urn:nbn:se:kth:diva-173501 (URN); 10.1088/1478-3975/13/2/026004 (DOI); 000376415400009 (); 27043075 (PubMedID); 2-s2.0-84969930876 (Scopus ID)
Funder
Swedish Research Council, 621-2012-2982
Note

QC 20161102

Available from: 2015-09-14. Created: 2015-09-13. Last updated: 2017-12-04. Bibliographically approved

Open Access in DiVA

Main text (41177 kB), 233 downloads
File information
File name: FULLTEXT02.pdf; File size: 41177 kB; Checksum: SHA-512
0da7a36ab65d758d4ac247d04e9fb12ad45c1c215ac85b8f5b7cb6b911565caa585c1ab764838e8c5b5ec89417a19dc20a1110f93a295c10b77d60503fa307f6
Type: fulltext; Mimetype: application/pdf
Supplementary Material (833 kB), 17 downloads
File information
File name: ATTACHMENT01.zip; File size: 833 kB; Checksum: SHA-512
aaf0ff76cc774a699647fea0000a6ea4b14d33b57754949304948358443957230fd175bdf3cbd00dfa6c5f68cd5829e870ed3c7741ff922b5bcc48f64b6e612c
Type: attachment; Mimetype: application/zip
Spikblad (english) (199 kB), 8 downloads
File information
File name: SPIKBLAD01.pdf; File size: 199 kB; Checksum: SHA-512
749603b4babdc78bdec7fd199c8b4ad70304581adb3b89b50f2aa0addaa0cbbb9784043f834b9595e59ba30e8b207ad1373ea3cb9e9ed3c3b53cd244ecc04bac
Type: spikblad; Mimetype: application/pdf
Spikblad (swedish) (199 kB), 19 downloads
File information
File name: SPIKBLAD02.pdf; File size: 199 kB; Checksum: SHA-512
179c21ae6b86d5653c66129f662786a0b075a33e6cd1d3f95c454a150537d62cfc8917852559267e44818a76477e8a22908b190f17c8674828b2ae25b28d5fc4
Type: spikblad; Mimetype: application/pdf
Errata (132 kB), 18 downloads
File information
File name: FULLTEXT03.pdf; File size: 132 kB; Checksum: SHA-512
439dfe27686604de7c0d3d9cfb50af7c6695643959c33ae13fbdde12b406b14238dcc1a00d1a6fccf4e91c92b4c5ca479b3335e7780d7f023271b7f73e6ff0c6
Type: errata; Mimetype: application/pdf

Search in DiVA

By author/editor
Innocenti, Nicolas
By organisation
Computational Biology, CB
Total: 251 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 563 hits