ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition
KTH, School of Electrical Engineering (EES), Communication Theory. ORCID iD: 0000-0003-2638-6047
2015 (English). In: PLoS ONE, ISSN 1932-6203, E-ISSN 1932-6203, Vol. 10, no 10, e0140644. Article in journal (Refereed). Published
Abstract [en]

Motivation: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.

Results: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach: a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of only a single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.

Availability: An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
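The aggregation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' Julia or Matlab implementation, and several parts are assumptions made for illustration: the function names, the deterministic farthest-point initialization of K-means, the choice k = 2, the size-weighted combination of per-cluster estimates (one plausible reading of the mixture density formulation), and the toy nearest-reference estimator, which merely stands in for the compressed-sensing or convex-optimization estimators the abstract refers to.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(read, k=2, alphabet="ACGT"):
    """Normalized k-mer frequency vector of a single read."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, n_clusters, n_iter=20):
    """Lloyd's algorithm with farthest-point initialization (an assumption)."""
    centroids = [list(vectors[0])]
    while len(centroids) < n_clusters:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each non-empty cluster's centroid becomes its mean.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return centroids, labels


def aggregate_estimates(reads, n_clusters, estimator, k=2):
    """Cluster reads, estimate per cluster, combine weighted by cluster size."""
    vectors = [kmer_frequencies(r, k) for r in reads]
    centroids, labels = kmeans(vectors, n_clusters)
    sizes = Counter(labels)
    combined = None
    for c, centroid in enumerate(centroids):
        weight = sizes[c] / len(reads)
        if weight == 0:
            continue  # skip empty clusters
        estimate = estimator(centroid)
        if combined is None:
            combined = [0.0] * len(estimate)
        combined = [x + weight * e for x, e in zip(combined, estimate)]
    return combined


def nearest_reference_estimator(reference_profiles):
    """Toy stand-in estimator: all mass on the closest reference profile."""
    def estimate(freq):
        best = min(
            range(len(reference_profiles)),
            key=lambda j: squared_distance(freq, reference_profiles[j]),
        )
        return [1.0 if j == best else 0.0 for j in range(len(reference_profiles))]
    return estimate


# Hypothetical toy data: six reads of one type, two of another.
toy_reads = ["AAAAAAAA"] * 6 + ["CGCGCGCG"] * 2
toy_references = [kmer_frequencies("AAAAAAAA"), kmer_frequencies("CGCGCGCG")]
toy_estimator = nearest_reference_estimator(toy_references)

single_summary = aggregate_estimates(toy_reads, 1, toy_estimator)
with_clustering = aggregate_estimates(toy_reads, 2, toy_estimator)
```

In this toy case, `single_summary` places all mass on the majority profile because one averaged k-mer vector hides the minority reads, whereas `with_clustering` assigns a quarter of the mass to the minority profile. With a real estimator and real reads the behavior would depend on the number of clusters and on the reference database.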

Place, publisher, year, edition, pages
Public Library of Science, 2015. Vol. 10, no 10, e0140644
Keyword [en]
Split Vector Quantization, LSF Parameters, Sequences, Megan
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-176956
DOI: 10.1371/journal.pone.0140644
ISI: 000363309200025
PubMedID: 26496191
ScopusID: 2-s2.0-84949460421
OAI: oai:DiVA.org:kth-176956
DiVA: diva2:883073
Note

QC 20151216

Available from: 2015-12-16. Created: 2015-11-13. Last updated: 2015-12-16. Bibliographically approved

Open Access in DiVA

No full text

Other links

Publisher's full text, PubMed, Scopus, implementation of the method in the Julia programming language, Matlab implementation

By author/editor
Chatterjee, Saikat