Classification of DNA sequences using Bloom filters
KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab.
2010 (English)
In: Bioinformatics, ISSN 1367-4803, E-ISSN 1460-2059, Vol. 26, no 13, 1595-1600 p.
Article in journal (Refereed) Published
Abstract [en]

Motivation: New-generation sequencing technologies produce increasingly complex datasets and demand new, efficient and specialized sequence analysis algorithms. Often, only the 'novel' sequences in a complex dataset are of interest, and the superfluous sequences need to be removed.

Results: A novel algorithm, fast and accurate classification of sequences (FACS), is introduced that can accurately and rapidly classify sequences as belonging or not belonging to a reference sequence. FACS was first optimized and validated using a synthetic metagenome dataset. An experimental metagenome dataset was then used to show that FACS achieves accuracy comparable to that of BLAT and SSAHA2 but is at least 21 times faster in classifying sequences.
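The abstract does not spell out how a Bloom filter is used for classification, so the following is only a minimal sketch of the general idea, not the FACS implementation. It assumes that the reference is broken into overlapping k-mers stored in a Bloom filter and that a read counts as "belonging" when a chosen fraction of its k-mers are found. The class and function names, the SHA-256 double-hashing scheme, and the parameters `k`, `num_bits`, `num_hashes` and `min_fraction` are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several hash functions (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from two base hashes.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield all overlapping k-mers of seq (upper-cased)."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_reference_filter(reference, k, num_bits=1 << 20, num_hashes=4):
    """Insert every k-mer of the reference sequence into a new Bloom filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for kmer in kmers(reference, k):
        bf.add(kmer)
    return bf


def classify(read, bf, k, min_fraction=0.8):
    """Return True if at least min_fraction of the read's k-mers hit the filter."""
    words = list(kmers(read, k))
    if not words:  # read shorter than k: cannot be classified as belonging
        return False
    hits = sum(1 for w in words if w in bf)
    return hits / len(words) >= min_fraction
```

Because a Bloom filter can report false positives but not false negatives, a read copied exactly from the reference is always classified as belonging in this sketch, while an unrelated read is rejected with high probability when the filter is sparsely filled. The sketch ignores reverse complements and sequencing errors, which a practical classifier would need to handle.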

Place, publisher, year, edition, pages
2010. Vol. 26, no 13, 1595-1600 p.
National Category
Biochemistry and Molecular Biology
URN: urn:nbn:se:kth:diva-27282
DOI: 10.1093/bioinformatics/btq230
ISI: 000278967500003
ScopusID: 2-s2.0-77954187316
OAI: diva2:377544
QC 20101214
Available from: 2010-12-14 Created: 2010-12-09 Last updated: 2011-11-15 Bibliographically approved
In thesis
1. Enabling massive genomic and transcriptomic analysis
2011 (English)
Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In recent years there have been tremendous advances in our ability to rapidly and cost-effectively sequence DNA. This has revolutionized the fields of genetics and biology, leading to a deeper understanding of the molecular events in life processes. The rapid advances have enormously expanded sequencing opportunities and applications, but also imposed heavy strains on steps prior to sequencing, as well as the subsequent handling and analysis of the massive amounts of sequence data that are generated, in order to exploit the full capacity of these novel platforms. The work presented in this thesis (based on six appended papers) has contributed to balancing the sequencing process by developing techniques to accelerate the rate-limiting steps prior to sequencing, facilitating sequence data analysis and applying the novel techniques to address biological questions.


Papers I and II describe techniques to eliminate expensive and time-consuming preparatory steps through automating library preparation procedures prior to sequencing. The automated procedures were benchmarked against standard manual procedures and were found to substantially increase throughput while maintaining high reproducibility.

In Paper III, a novel algorithm for fast classification of sequences in complex datasets is described. The algorithm was first optimized and validated using a synthetic metagenome dataset and then shown to enable faster analysis of an experimental metagenome dataset than conventional long-read aligners, with similar accuracy.

Paper IV presents an investigation of the molecular effects on the p53 gene of exposing human skin to sunlight during the course of a summer holiday. There was evidence of previously accumulated persistent p53 mutations in 14% of all epidermal cells. Most of these mutations are likely to be passenger events, as the affected cell compartments showed no apparent growth advantage. An annual rate of 35,000 novel sun-induced persistent p53 mutations was estimated to occur in sun-exposed skin of a human individual.

Paper V assesses the effect of using RNA obtained from whole-cell extracts (total RNA) or cytoplasmic RNA on the quantification of transcripts detected in subsequent analysis. Overall, more differentially detected genes were identified when using cytoplasmic RNA. The major reason for this is the reduced complexity of cytoplasmic RNA, but it is apparently also due (at least partly) to the nuclear retention of transcripts with long, structured 5’- and 3’-untranslated regions or long protein-coding sequences.

The last paper, VI, describes whole-genome sequencing of a large, consanguineous family with a history of Leber hereditary optic neuropathy (LHON) on the maternal side. The analysis identified new candidate genes, which could be important in the aetiology of LHON. However, these candidates require further validation before any firm conclusions can be drawn regarding their contribution to the manifestation of LHON.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2011. 45 p.
Trita-BIO-Report, ISSN 1654-2312 ; 2011:24
Keywords
DNA, RNA, sequencing, massively parallel sequencing, alignment, assembly, single nucleotide polymorphism, LHON
National Category
Biological Sciences
URN: urn:nbn:se:kth:diva-45957
ISBN: 978-91-7501-164-6
Public defence
2011-12-02, Petrén‐salen, Nobels väg 12B, Karolinska Institute Campus Solna, Stockholm, 13:00 (English)
QC 20111115
Available from: 2011-11-15 Created: 2011-11-01 Last updated: 2011-11-15 Bibliographically approved

By author/editor
Stranneheim, Henrik; Arvestad, Lars; Lundeberg, Joakim
By organisation
Gene Technology; Science for Life Laboratory, SciLifeLab; Computational Biology, CB