Accurate and fast taxonomic profiling of microbial communities
2015 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
With the advent of next-generation sequencing there has been an explosion
in the amount of data that needs to be processed, since next-generation
sequencing yields millions of base pairs of DNA. The rate at which the
amount of data grows outpaces Moore's law, so there is a huge demand for
methods that find meaningful labels for sequenced data. Studying the
microbial diversity of a sample is one such challenge in the field of
metagenomics. Finding the distribution of a bacterial community has many
uses, for example in obesity control. Existing methods often resort to
read-by-read classification, which can take several days of computing time
in a regular desktop environment and thereby excludes genomic scientists
without access to large compute clusters.
By using sparsity-enforcing methods from the general sparse signal
processing field (such as compressed sensing), solutions have been found to
the bacterial community composition estimation problem by a simultaneous
assignment of all sample reads to a pre-processed reference database.
The inference task is reduced to a general statistical model based on
kernel density estimation techniques, which is solved with existing convex
optimization tools. The objective is to offer a reasonably fast community
composition estimation method. This report proposes clustering as a means
of aggregating data to improve the run-time and biological fidelity of
existing techniques. The use of convex optimization tools to increase the
accuracy of mixture model parameters is also explored and tested. The work
concludes with experiments on the proposed improvements, with satisfactory
results.
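As a rough illustration of estimating all proportions at once instead of
classifying read by read, the sketch below fits a sample's k-mer frequency
vector as a non-negative mixture of reference frequency vectors. It is a
minimal Python sketch, not the solver used in the thesis: the function
names, the least-squares objective, and the projected-gradient method are
assumptions chosen for brevity.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def estimate_composition(sample, references, steps=2000):
    """Minimize ||sum_j w_j * references[j] - sample||^2 over the simplex.

    sample     : frequency vector of the whole sample (hypothetical input)
    references : one frequency vector per reference taxon (hypothetical input)
    Returns the estimated proportion of each reference taxon.
    """
    n = len(references)
    # Conservative step size; valid when every reference vector has norm <= 1,
    # which holds for frequency vectors that sum to one.
    lr = 1.0 / (2.0 * n)
    w = [1.0 / n] * n
    for _ in range(steps):
        mix = [sum(w[j] * references[j][i] for j in range(n))
               for i in range(len(sample))]
        resid = [m - y for m, y in zip(mix, sample)]
        grad = [2.0 * sum(r * a for r, a in zip(resid, references[j]))
                for j in range(n)]
        w = project_to_simplex([wj - lr * g for wj, g in zip(w, grad)])
    return w


if __name__ == "__main__":
    refs = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0]]
    sample = [0.7 * a + 0.3 * b for a, b in zip(refs[0], refs[1])]
    print(estimate_composition(sample, refs))
```

The simplex constraint keeps the weights interpretable as proportions; the
problem is convex, so any standard convex solver could replace the loop.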
The use of Dirichlet mixtures is explored as a parametric model of the
sample distribution. The Dirichlet is judged to be a good choice for
aggregating k-mer feature vectors, but Expectation Maximization is found to
be unfit for parameter estimation on bacterial 16S rRNA samples.
Finally, a semi-supervised learning method founded on distance-based
classification of taxa has been implemented and tested on real biological
data, with high biological fidelity.
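To make the idea of distance-based classification on k-mer feature vectors
concrete, the sketch below assigns a read to the nearest labeled sequence
and leaves it unassigned when nothing is close enough. This is only an
assumed, simplified stand-in for the method in the thesis: the names
`kmer_profile` and `classify`, the Euclidean distance, and the `max_dist`
threshold are all hypothetical choices.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Normalized k-mer frequencies of seq over the alphabet ACGT.

    Assumes seq contains at least one k-mer made only of A, C, G, T.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total for m in kmers]


def classify(read, labeled, k=2, max_dist=0.5):
    """Return the label of the nearest labeled sequence, or None if too far.

    labeled  : dict mapping a taxon label to a reference sequence
    max_dist : hypothetical cut-off above which the read stays unassigned
    """
    profile = kmer_profile(read, k)
    best_label, best_dist = None, float("inf")
    for label, ref_seq in labeled.items():
        d = math.dist(profile, kmer_profile(ref_seq, k))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_dist else None


if __name__ == "__main__":
    labeled = {"taxon_a": "AAAAAAAACA", "taxon_b": "GCGCGCGCGC"}
    print(classify("AAAAAAAA", labeled))
    print(classify("TTTTTTTT", labeled))
```

Leaving distant reads unassigned is one simple way to keep unlabeled data
from being forced into a known taxon.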
Abstract [sv] (translated into English)
New techniques in DNA sequencing have given rise to an explosion in the
amount of available data. Next-generation DNA sequencing generates base
pairs in the millions, and the amount of data is growing at an exponential
rate, which is why there is a great need for new scalable methods that can
analyze quantitative data and extract relevant information. The bacterial
species distribution of a sample is one such problem in metagenomics, with
several areas of application, for example studies of obesity. At present,
the most common way to obtain the species distribution is to classify the
DNA strands of the bacteria, a time-consuming solution that can take up to
a day to process high-resolution data. A fast and reliable solution would
therefore allow more researchers to make use of next-generation sequencing
and analyze its data, which in turn would give rise to more innovation in
the field.
Alternative solutions inspired by signal processing have been found that
exploit the sparse nature of the problem through the use of compressed
sensing. Answers are found by simultaneously assigning strands to a
pre-processed reference database. The problem has been simplified to a
statistical model of the sample with non-parametric estimation, in order to
implicitly obtain the distribution of bacterial species by means of convex
optimization.
This report proposes the use of clustering to aggregate data, in order to
improve the reliability of the answers and reduce the time needed to
compute them. The use of parametric models, namely the Dirichlet
distribution, has been explored. The report concludes that the assumptions
about its suitability as a means of aggregating k-mer vectors are
reasonable, but that parameter estimation with Expectation Maximization
does not work well together with the Dirichlet, and that a reformulation
of the parameter would be needed in the vector space spanned by the 16S
rRNA gene.
Finally, distance-based assignment of bacteria has been tested on data
from a real biological context with very high accuracy.
Place, publisher, year, edition, pages
2015. p. 50
Series
EES Examensarbete / Master Thesis ; XR-EE-KT 2015:001
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
URN: urn:nbn:se:kth:diva-162919
OAI: oai:DiVA.org:kth-162919
DiVA, id: diva2:798057
Educational program
Master of Science - Wireless Systems
Presentation
2015-02-23, SIP conference room, Osquldas väg 10, Floor 3, Stockholm, 15:25 (English)
Supervisors
Examiners
2015-03-31, 2015-03-25, 2022-06-23. Bibliographically approved