Lognormality and oscillations in the coverage of high-throughput transcriptomic data towards gene ends
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. (Computational Biological Physics, CBP)
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. (Computational Biological Physics, CBP)
2013 (English). In: Journal of Statistical Mechanics: Theory and Experiment, ISSN 1742-5468, Vol. 2013, no. 10, P10013. Article in journal (Refereed), Published
Abstract [en]

High-throughput transcriptomics experiments have reached the stage where the number of reads alignable to a given position can be treated as an almost-continuous signal. This allows us to ask questions of a biophysical or biotechnical nature that may nevertheless have biological implications. Here we show that when RNA fragments are sequenced from one end, as is the case on most platforms, an oscillation in the read count is observed at the other end. We further show that these oscillations are well described by Kolmogorov's 1941 broken stick model. We investigate how the model can be used to improve predictions of gene ends (3' transcript ends), but conclude that with present data the improvement is only marginal. The results highlight subtle effects in high-throughput transcriptomics experiments which do not have a biological origin, but which may still be used to obtain biological information.
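
The link between a broken stick process and log-normality, which the title and keywords refer to, can be illustrated with a toy simulation. This is only a minimal sketch of repeated random fragmentation, not the model fitted in the paper: the uniform break point, the fixed number of rounds, and the names `broken_stick` and `log_length_moments` are assumptions made for illustration.

```python
import math
import random


def broken_stick(rounds, seed=0):
    """Break a unit stick repeatedly; every piece splits in two each round."""
    rng = random.Random(seed)
    pieces = [1.0]
    for _ in range(rounds):
        next_pieces = []
        for length in pieces:
            # Assumed break rule: uniform break point, clamped away from 0 and 1
            # so that no piece has zero length.
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            next_pieces.append(length * u)
            next_pieces.append(length * (1.0 - u))
        pieces = next_pieces
    return pieces


def log_length_moments(pieces):
    """Return the mean and variance of the log-lengths of the pieces."""
    logs = [math.log(p) for p in pieces]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, var


pieces = broken_stick(rounds=12, seed=1)
mean, var = log_length_moments(pieces)
print(len(pieces), round(sum(pieces), 6), mean, var)
```

The log-length of each piece is a sum of `rounds` independent random terms, one per break along its history, so for many rounds the log-lengths tend towards a normal distribution and the lengths themselves towards a log-normal one.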

Place, publisher, year, edition, pages
Institute of Physics (IOP), 2013. Vol. 2013, no. 10, P10013.
Keyword [en]
Modelling, Artefacts, RNAseq, Log-normal distribution, Broken stick, 3'end of transcripts
National Category
Bioinformatics and Systems Biology; Other Physics Topics
Research subject
SRA - Molecular Bioscience
URN: urn:nbn:se:kth:diva-136077
DOI: 10.1088/1742-5468/2013/10/P10013
ISI: 000326869000014
ScopusID: 2-s2.0-84888614314
OAI: diva2:670501

QC 20131220

Available from: 2013-12-03. Created: 2013-12-03. Last updated: 2015-09-30. Bibliographically approved.
In thesis
1. Data Analysis and Next Generation Sequencing : Applications in Microbiology.
2015 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Next Generation Sequencing (NGS) is a new technology that has revolutionized the way we study living organisms. Where previously only a few genes could be studied at a time through targeted direct probing, NGS offers the possibility to perform measurements for a whole genome at once. The drawback is that the amount of data generated in the process is large and extracting useful information from it requires new methods to process and analyze it.

The main contribution of this thesis is the development of a novel experimental method named tagRNA-seq, which combines 5’tagRACE, a previously developed technique, with RNA-sequencing technology. Briefly, tagRNA-seq makes it possible to identify the 5’ ends of RNAs in bacteria and to directly probe for their type, primary or processed, by ligating short RNA sequences, the tags, to the beginnings of RNA molecules. We used the method to directly probe for transcription start and processing sites in two bacterial species, Escherichia coli and Enterococcus faecalis. It was also used to study polyadenylation in E. coli, where the ability to identify processed RNA molecules proved useful for separating direct and indirect regulatory effects of this mechanism. We also demonstrate how data from tagRNA-seq experiments can be used to increase confidence in the discovery of anti-sense transcripts in bacteria. Analyses of RNA-seq data obtained in the context of these experiments revealed subtle artifacts in the coverage signal towards gene ends, which we were able to explain and quantify based on Kolmogorov’s broken stick model. We also discovered evidence for circularization of a few RNA transcripts, both in our own data sets and in publicly available data.

Designing the tags used in tagRNA-seq led us to the problem of words absent from a text. We focus on a particular subset of these, the minimal absent words (MAWs), and develop a theory providing a complete description of their size distribution in random text. We also show that MAWs in genomes from viruses and living organisms almost always exhibit a behavior different from random texts in the tail of the distribution, and that MAWs from this tail are closely related to sequences present in the genome that preferentially appear in regions with important regulatory functions.
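
For concreteness, a minimal absent word can be defined as a word that does not occur in the text while all of its proper factors do; it is enough to check the longest proper prefix and the longest proper suffix. The sketch below enumerates MAWs by brute force under that definition. It is an illustration only, not the method or theory developed in the thesis; the function name `minimal_absent_words` and the `max_len` cut-off are assumptions.

```python
def minimal_absent_words(text, alphabet, max_len):
    """Return all minimal absent words of `text` with length <= max_len."""
    # Collect every factor (substring) of the text up to max_len,
    # including the empty word.
    factors = {""}
    n = len(text)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            factors.add(text[i:j])

    maws = set()
    for prefix in factors:
        if len(prefix) >= max_len:
            continue
        for letter in alphabet:
            word = prefix + letter
            # `word` is a MAW if it is absent while its longest proper prefix
            # (present by construction) and longest proper suffix both occur.
            if word not in factors and word[1:] in factors:
                maws.add(word)
    return maws


print(sorted(minimal_absent_words("ABA", "AB", max_len=4)))
```

For the text `ABA` over the alphabet `{A, B}` this gives `AA`, `BB` and `BAB`: each is absent from the text, while every shorter factor of each word occurs in it.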

Finally, and independently of tagRNA-seq, we propose a new approach to the problem of bacterial community reconstruction in metagenomics, based on techniques from compressed sensing. We provide a novel algorithm that competes with state-of-the-art techniques in the field.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015. xviii, 154 p.
TRITA-CSC-A, ISSN 1653-5723 ; 2015:15
Keyword [en]
RNA-seq, tagRNA-seq, primary and processed RNA, Enterococcus faecalis, Complex transcription, Metagenomics, 5'tagRACE, minimal absent words, compressed sensing, bacterial community reconstruction
National Category
Bioinformatics (Computational Biology); Microbiology; Other Biological Topics; Genetics
Research subject
Biological Physics
URN: urn:nbn:se:kth:diva-173219
ISBN: 978-91-7595-699-2
Public defence
2015-10-30, FA32, Roslagstullsbacken 21, Stockholm, 14:00 (English)

QC 20150930

Available from: 2015-09-30. Created: 2015-09-07. Last updated: 2015-11-06. Bibliographically approved.

Open Access in DiVA

brokenstick (2036 kB), 119 downloads
File information
File name: FULLTEXT01.pdf
File size: 2036 kB
Checksum: SHA-512
Type: fulltext
Mimetype: application/pdf


Search in DiVA

By author/editor
Innocenti, Nicolas; Aurell, Erik