SEK: Sparsity exploiting k-mer-based estimation of bacterial community composition
KTH, School of Electrical Engineering (EES), Communication Theory. ORCID iD: 0000-0003-2638-6047
Dept of Mathematics, Oregon State University, Corvallis, USA.
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. (Computational Biological Physics, CBP)
2014 (English). In: Bioinformatics, ISSN 1460-2059, Vol. 30, no. 17, pp. 2423-2431. Journal article (peer-reviewed). Published
Abstract [en]

Motivation: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment.

Results: Using sparsity-enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm that yields a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method.

Availability and implementation: A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above Web site.
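
The idea in the Results paragraph above, expressing a sample's k-mer profile as a sparse non-negative mixture of reference k-mer profiles, can be illustrated with a minimal sketch. This is not the SEK algorithm: the kernel density model and the convex solver from the paper are left out, and a plain non-negative matching pursuit stands in for the paper's greedy step. The names `kmer_profile`, `dot`, and `greedy_composition`, and the toy sequences, are assumptions made for illustration only.

```python
from collections import Counter


def kmer_profile(seqs, k):
    """Relative k-mer frequencies pooled over a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def dot(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v.get(key, 0.0) for key, val in u.items())


def greedy_composition(sample, references, max_steps=10, tol=1e-9):
    """Non-negative matching pursuit (illustrative stand-in, not SEK).

    At each step, pick the reference profile most correlated with the
    residual, add a positive amount of it, and update the residual.
    Returns weights normalized to sum to 1 (all zeros if nothing matched).
    """
    if not references:
        return []
    weights = [0.0] * len(references)
    residual = dict(sample)
    sq_norms = [dot(a, a) for a in references]
    for _ in range(max_steps):
        scores = [
            dot(a, residual) / n ** 0.5 if n > 0 else 0.0
            for a, n in zip(references, sq_norms)
        ]
        j = max(range(len(references)), key=scores.__getitem__)
        if scores[j] <= tol:  # no reference explains the residual any further
            break
        step = dot(references[j], residual) / sq_norms[j]
        weights[j] += step
        for kmer, val in references[j].items():
            residual[kmer] = residual.get(kmer, 0.0) - step * val
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights


# Toy data (hypothetical): three references with disjoint 2-mer content.
references = [kmer_profile([g], 2) for g in ("AAAAAAAA", "CCCCCCCC", "GGGGGGGG")]
sample = kmer_profile(["AAAA", "AAAA", "AAAA", "GGGG"], 2)
weights = greedy_composition(sample, references)
```

In this toy case the reference profiles do not overlap, so the greedy steps separate them cleanly; with real genomes the profiles overlap heavily, which is where a statistical model and convex optimization of the kind described in the abstract become relevant.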

Place, publisher, year, edition, pages
Oxford University Press, 2014. Vol. 30, no. 17, pp. 2423-2431
Keywords [en]
bacterial community composition, sparsity, metagenomics
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-152814; DOI: 10.1093/bioinformatics/btu320; ISI: 000342912400046; Scopus ID: 2-s2.0-84907029456; OAI: oai:DiVA.org:kth-152814; DiVA, id: diva2:751624
Funder
Swedish Research Council (Vetenskapsrådet)
Note

QC 20141023

Available from: 2014-10-01. Created: 2014-10-01. Last updated: 2018-01-11. Bibliographically approved
In thesis
1. Data Analysis and Next Generation Sequencing : Applications in Microbiology.
2015 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Next Generation Sequencing (NGS) is a new technology that has revolutionized the way we study living organisms. Where previously only a few genes could be studied at a time through targeted direct probing, NGS offers the possibility to perform measurements for a whole genome at once. The drawback is that the amount of data generated in the process is large and extracting useful information from it requires new methods to process and analyze it.

The main contribution of this thesis is the development of a novel experimental method coined tagRNA-seq, combining 5’tagRACE, a previously developed technique, with RNA-sequencing technology. Briefly, tagRNA-seq makes it possible to identify the 5’ ends of RNAs in bacteria and directly probe for their type, primary or processed, by ligating short RNA sequences, the tags, to the beginnings of RNA molecules. We used the method to directly probe for transcription start and processing sites in two bacterial species, Escherichia coli and Enterococcus faecalis. It was also used to study polyadenylation in E. coli, where the ability to identify processed RNA molecules proved to be useful to separate direct and indirect regulatory effects of this mechanism. We also demonstrate how data from tagRNA-seq experiments can be used to increase confidence in the discovery of anti-sense transcripts in bacteria. Analyses of RNA-seq data obtained in the context of these experiments revealed subtle artifacts in the coverage signal towards gene ends, which we were able to explain and quantify based on Kolmogorov’s broken stick model. We also discovered evidence for circularization of a few RNA transcripts, both in our own data sets and in publicly available data.

Designing the tags used in tagRNA-seq led us to the problem of words absent from a text. We focus on a particular subset of these, the minimal absent words (MAWs), and develop a theory providing a complete description of their size distribution in random text. We also show that MAWs in genomes from viruses and living organisms almost always exhibit a behavior different from random texts in the tail of the distribution, and that MAWs from this tail are closely related to sequences present in the genome that preferentially appear in regions with important regulatory functions.
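
To make the notion of a minimal absent word concrete, here is a brute-force sketch using the standard definition: a word is a MAW of a text if it does not occur in the text but all of its proper factors do, which is equivalent to checking that its longest proper prefix and longest proper suffix both occur. This is not the method or code used in the thesis; the function name `minimal_absent_words` and the `max_len` cutoff are assumptions for illustration, and the quadratic enumeration of factors is only suitable for short texts.

```python
def minimal_absent_words(text, alphabet, max_len):
    """Return all minimal absent words of `text` with length <= max_len.

    Brute-force illustration: collect every factor (substring) of length
    up to max_len, then test each one-letter right extension of a factor.
    """
    n = len(text)
    factors = {
        text[i:j]
        for i in range(n)
        for j in range(i + 1, min(n, i + max_len) + 1)
    }
    maws = set()
    # Length 1: letters of the alphabet that never occur in the text.
    for a in alphabet:
        if a not in factors:
            maws.add(a)
    # Length >= 2: w = f + b, where the prefix f occurs by construction,
    # the suffix w[1:] must occur, and w itself must not.
    for f in factors:
        if len(f) >= max_len:
            continue
        for b in alphabet:
            w = f + b
            if w not in factors and w[1:] in factors:
                maws.add(w)
    return maws


maws_abab = minimal_absent_words("ABAB", "AB", max_len=4)
maws_aac = minimal_absent_words("AAC", "ACGT", max_len=3)
```

For the text "ABAB", for example, the word "BABA" is absent while its factors "BAB" and "ABA" both occur, so it is minimal; "ABB" is absent too, but not minimal, because its factor "BB" is already absent.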

Finally, and independently of tagRNA-seq, we propose a new approach to the problem of bacterial community reconstruction in metagenomics, based on techniques from compressed sensing. We provide a novel algorithm competing with state-of-the-art techniques in the field.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015. p. xviii, 154
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2015:15
Keywords
RNA-seq, tagRNA-seq, primary and processed RNA, Enterococcus faecalis, Complex transcription, Metagenomics, 5'tagRACE, minimal absent words, compressed sensing, metagenomics, bacterial community reconstruction
National Category
Bioinformatics (Computational Biology); Microbiology; Other Biological Topics; Genetics
Research subject
Biological Physics
Identifiers
URN: urn:nbn:se:kth:diva-173219; ISBN: 978-91-7595-699-2
Public defence
2015-10-30, FA32, Roslagstullsbacken 21, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20150930

Available from: 2015-09-30. Created: 2015-09-07. Last updated: 2018-01-11. Bibliographically approved

Open Access in DiVA

SEK paper (672 kB), 97 downloads
File information
File name: FULLTEXT01.pdf. File size: 672 kB. Checksum: SHA-512
300724eff0d99609ac66fb88777eb5e2c9755965c99a9145fcaa94055eb61032eb1d14a62e83d26c2222067689f704de4870e5ad379c6ae39cdb167ca5f06edf
Type: fulltext. Mimetype: application/pdf
SEK Code (3609 kB), 0 downloads
File information
File name: SOFTWARE01.zip. File size: 3609 kB. Checksum: SHA-512
0c9333cd8d836bb1aeddbdc61b9c76bcb893585342e304a2a41727592609ab0e3f15f08afedbe82aa1183617c4dfe77128592a154e7623295571bd9ea45ca596
Type: software. Mimetype: application/zip

Search further in DiVA

By author/editor
Chatterjee, Saikat; Dong, Siyuan; Innocenti, Nicolas; Vehkaperä, Mikko; Skoglund, Mikael; K. Rasmussen, Lars; Aurell, Erik
By organisation
Communication Theory; Computational Biology, CB; ACCESS Linnaeus Centre
Bioinformatics (Computational Biology)

Total: 97 downloads
The number of downloads is the sum of the downloads of all full texts. It may include, for example, earlier versions that are no longer available.

Total: 238 hits