ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition
KTH, School of Electrical Engineering (EES), Communication Theory. ORCID iD: 0000-0003-2638-6047
2015 (English) In: PLoS ONE, ISSN 1932-6203, E-ISSN 1932-6203, Vol. 10, no 10, e0140644. Article in journal (Refereed), Published
Abstract [en]

Motivation: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely affected by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.

Results: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to estimate bacterial community composition. These methods typically summarize the sequence data by the frequencies of low-order k-mers and match this information statistically against a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step in which a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of only a single statistical summary in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.

Availability: An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
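The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' Julia or Matlab implementation: the function names (kmer_frequencies, kmeans, aggregate_reads, weighted_mixture), the choice of k = 2, the fixed iteration count, and the random initialization are all assumptions made for illustration. The sketch turns each read into a k-mer frequency vector, clusters those vectors with a plain K-means loop, and returns one mean vector per cluster together with the fraction of reads in that cluster.

```python
from itertools import product
import random


def kmer_frequencies(read, k=2, alphabet="ACGT"):
    """Normalized k-mer frequency vector of a single read."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0.0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd-style K-means; initialization and n_iter are assumptions."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, n_clusters)
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest center for every vector.
        labels = [min(range(n_clusters), key=lambda c: sq_dist(v, centers[c]))
                  for v in vectors]
        # Update step: each non-empty cluster's center becomes its mean.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = mean_vector(members)
    return labels, centers


def aggregate_reads(reads, n_clusters, k=2):
    """Return (cluster weights, per-cluster mean k-mer frequency vectors)."""
    vectors = [kmer_frequencies(r, k) for r in reads]
    labels, centers = kmeans(vectors, n_clusters)
    weights = [labels.count(c) / len(reads) for c in range(n_clusters)]
    return weights, centers


def weighted_mixture(weights, vectors):
    """Weighted sum of vectors, one weight per vector."""
    return [sum(w * v[j] for w, v in zip(weights, vectors))
            for j in range(len(vectors[0]))]


if __name__ == "__main__":
    reads = ["ACACACAC", "ACACACAT", "GGGGGGGG", "GGGGGGGT"]  # toy data
    weights, centers = aggregate_reads(reads, n_clusters=2)
    print(weights)
```

In a full pipeline, each cluster's mean vector would be passed to an existing k-mer based composition estimator in place of the single whole-sample summary. One reading of the mixture formulation, assumed here rather than taken from the abstract, is that the per-cluster estimates are then combined with the cluster weights, for which weighted_mixture could be reused. As a sanity check, applying weighted_mixture to the cluster means themselves recovers the whole-sample mean k-mer frequency vector.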

Place, publisher, year, edition, pages
Public Library of Science, 2015. Vol. 10, no 10, e0140644
Keyword [en]
Split Vector Quantization, LSF Parameters, Sequences, Megan
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-176956
DOI: 10.1371/journal.pone.0140644
ISI: 000363309200025
PubMedID: 26496191
Scopus ID: 2-s2.0-84949460421
OAI: oai:DiVA.org:kth-176956
DiVA: diva2:883073
Note

QC 20151216

Available from: 2015-12-16 Created: 2015-11-13 Last updated: 2015-12-16 Bibliographically approved

Open Access in DiVA

No full text

Other links

Publisher's full text, PubMed, Scopus, implementation of the method in the Julia programming language, Matlab implementation

By author/editor
Chatterjee, Saikat