kth.sePublications
Change search
Link to record
Permanent link

Direct link
Publications (5 of 5) Show all publications
Xu, X. (2024). Algorithms and Models in Nanopore DNA Sequencing: Advanced Decoding and Modeling with Hierarchical Hidden Markov Models. (Doctoral dissertation). KTH Royal Institute of Technology
Open this publication in new window or tab >>Algorithms and Models in Nanopore DNA Sequencing: Advanced Decoding and Modeling with Hierarchical Hidden Markov Models
2024 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Within less than four decades, the nanopore sequencing technology accelerated from an implausible idea scribbled on a notebook page to one of the decisive contributors to the complete sequence of the human genome. Its rapid evolution, particularly in recent years, is driven not only by its inherent innovation of the nanopores but also by synergistic advancements in complementary fields, such as GPU acceleration and deep neural networks, as well as cross-disciplinary influences from domains like speech recognition. However, during this rapid advancement, certain methods within nanopore sequencing remain relatively unexplored. This oversight has the potential to create bottlenecks in the technology's further development.

In this thesis work, we delve into these uncharted areas, seeking to fill critical gaps and branch the technology into new frontiers. Our objective is to unleash its potential, enabling further breakthroughs in genomic research and beyond.

Through our research, we have developed two novel algorithms and two innovative models tailored to address these under-explored aspects of nanopore sequencing. The two algorithms, the GMBS and the LFBS, both belonging to the MBS group, offer innovative solutions to the challenging decoding problems inherent in HHMMs. They are two distinct variations tailored to different scenarios. While the GMBS is specifically suited for decoding lengthy sequences, such as those encountered in long-read basecalling, the LFBS is optimized for parallel programming and excels in processing short-length sequences. 

The two innovative models developed in this research, each leveraging variations of HHMMs and employing an end-to-end approach, exhibit distinguished structures. The first model, a hybrid of EDHMM and DNN, showcases the effectiveness of integrating both knowledge-driven and data-driven techniques. In contrast, the second model, a custom-designed Helicase HMM, draws inspiration from pioneering studies on motor proteins found in sequencing devices. With its elaborate hierarchical state architecture boasting over five million emission states, this model offers a comprehensive feature space comparable to its predecessor. 

Abstract [sv]

Inom mindre än fyra decennier har nanopore-sekvenseringsteknologin accelererat från en otrolig idé som skissades i en antckningsbok till en avgörande teknologi som bidragit till den kompletta sekvensensieringen av det mänskliga genomet. Denna snabba utveckling, särskilt under de senaste åren, drivs inte bara av innovation kring nanoporer utan också av synergistiska framsteg inom kompletterande områden, såsom GPU-acceleration och djupa neurala nätverk, samt tvärvetenskapligt inflytande från domäner som taligenkänning. Under denna snabba utveckling har do vissa metoder inom nanopore sekvensering förblivit relativt outforskade. Detta förbiseende riskerar att skapa flaskhalsar i teknologins vidareutveckling.

I denna avhandling utforskar vi dessa outforskade områden, i syfte att fylla kritiska luckor och utveckla tekniken mot nya fronter. Vårt mål är att frigöra dess fulla potential och möjliggöra ytterligare genombrott inom genomisk forskning och därutöver.

Som del av vår forskning har vi utvecklat två nya algoritmer och två innovativa modeller anpassade för att adressera dessa underutredda aspekter av nanopore sekvensering. De två algoritmerna, GMBS och LFBS, som är instanser av det mer generella ramverket av MBS-algorithmer (\emph{eng.} marginalised beam search), erbjuder innovativa lösningar på de utmanande avkodningsproblemen som är inneboende i HHMM:er. De är två distinkta variationer anpassade för olika scenarier. Medan GMBS är speciellt lämpad för avkodning av långa sekvenser, såsom de som stöts på vid läsning av långa sekvenser, är LFBS optimerad för parallell programmering och utmärker sig i bearbetning av korta sekvenser.

De två innovativa modellerna som utvecklats i denna forskning, vilka båda utnyttjar variationer av HHMM:er och använder en ``end-to-end''-ansats, uppvisar distinkta strukturer. Den första modellen, en hybrid av EDHMM och DNN, visar effektiviteten av att integrera både kunskapsdrivna och datadrivna tekniker. I kontrast till detta, drar den andra modellen, en anpassad Helicase HMM, inspiration från pionjärstudier om motorproteiner som finns i sekvenseringsenheter. Med sin detaljerade hierarkiska tillståndsarkitektur med över fem miljoner emissivtillstånd, erbjuder denna modell en omfattande egenskapsrymd jämfört med sina föregångare.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2024. p. 147
Series
TRITA-EECS-AVL ; 2024:40
National Category
Bioinformatics (Computational Biology)
Research subject
Electrical Engineering
Identifiers
urn:nbn:se:kth:diva-346051 (URN)978-91-8040-914-8 (ISBN)
Public defence
2024-05-24, F3, Lindstedtsvägen 26, Stockholm, 13:00 (English)
Opponent
Supervisors
Note

QC 20240502

Hybrid attendance: https://kth-se.zoom.us/j/66248893067

Available from: 2024-05-02 Created: 2024-05-01 Last updated: 2025-03-17Bibliographically approved
Xu, X. & Jaldén, J. (2024). Marginalized Beam Search Algorithms for Hierarchical HMMs. IEEE Transactions on Signal Processing, 72, 3013-3027
Open this publication in new window or tab >>Marginalized Beam Search Algorithms for Hierarchical HMMs
2024 (English)In: IEEE Transactions on Signal Processing, ISSN 1053-587X, E-ISSN 1941-0476, Vol. 72, p. 3013-3027Article in journal (Refereed) Published
Abstract [en]

Inferring a state sequence from a sequence of measurements is a fundamental problem in bioinformatics and natural language processing. The Viterbi and the Beam Search (BS) algorithms are popular inference methods, but they have limitations when applied to Hierarchical Hidden Markov Models (HHMMs), when the primary interest lies in the outer state sequence. The Viterbi algorithm can not infer outer states without inner states, while the BS algorithm requires marginalization over prohibitively large state spaces. We propose two new algorithms to overcome these limitations: the greedy marginalized BS algorithm and the local focus BS algorithm. We show that they approximate the most likely outer state sequence with higher performance than the Viterbi algorithm, and we evaluate the performance of these algorithms on an explicit duration HMM with simulated and real nanopore base calling data.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
Signal processing algorithms, Decoding, Hidden Markov models, Viterbi algorithm, Search problems, RNA, Complexity theory, Viterbi, beam search
National Category
Signal Processing
Identifiers
urn:nbn:se:kth:diva-350799 (URN)10.1109/TSP.2024.3415468 (DOI)001263372000003 ()2-s2.0-85196748675 (Scopus ID)
Note

QC 20240722

Available from: 2024-07-22 Created: 2024-07-22 Last updated: 2024-08-20Bibliographically approved
Hummelgren, L., Palmkvist, V., Stjerna, L., Xu, X., Jaldén, J. & Broman, D. (2024). Trellis: A Domain-Specific Language for Hidden Markov Models with Sparse Transitions. In: Laemmel, R Pereira, JA Mosses, PD (Ed.), PROCEEDINGS OF THE 17TH ACM SIGPLAN INTERNATIONAL CONFERENCE ON SOFTWARE LANGUAGE ENGINEERING, SLE 2024: . Paper presented at 17th ACM SIGPLAN International Conference on Software Language Engineering (SLE), OCT 20-21, 2024, Pasadena, CA (pp. 196-209). Association for Computing Machinery (ACM)
Open this publication in new window or tab >>Trellis: A Domain-Specific Language for Hidden Markov Models with Sparse Transitions
Show others...
2024 (English)In: PROCEEDINGS OF THE 17TH ACM SIGPLAN INTERNATIONAL CONFERENCE ON SOFTWARE LANGUAGE ENGINEERING, SLE 2024 / [ed] Laemmel, R Pereira, JA Mosses, PD, Association for Computing Machinery (ACM) , 2024, p. 196-209Conference paper, Published paper (Refereed)
Abstract [en]

Hidden Markov models (HMMs) are frequently used in areas such as speech recognition and bioinformatics. However, implementing HMM algorithms correctly and efficiently is time-consuming and error-prone. Specifically, using model-specific knowledge to improve performance, such as sparsity in the transition probability matrix, ties the implementation to a particular model, making it harder to modify. Previous work has introduced high-level frameworks for defining HMMs, thus lifting the burden of efficiently implementing HMM algorithms from the user. However, existing tools are ill-suited for sparse HMMs with many states. This paper introduces Trellis, a domain-specific language for succinctly defining sparse HMMs that use GPU acceleration to achieve high performance. We show that Trellis outperforms previous work and is on par with a hand-written CUDA kernel implementation for a particular sparse HMM.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
HiddenMarkovmodels, DSL, parallelization, GPU acceleration
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:kth:diva-357515 (URN)10.1145/3687997.3695641 (DOI)001344239100017 ()2-s2.0-85210805499 (Scopus ID)
Conference
17th ACM SIGPLAN International Conference on Software Language Engineering (SLE), OCT 20-21, 2024, Pasadena, CA
Note

Part of ISBN 979-8-4007-1180-0

QC 20241209

Available from: 2024-12-09 Created: 2024-12-09 Last updated: 2025-05-27Bibliographically approved
Xu, X., Bhalla, N., Ståhl, P. & Jaldén, J. (2023). Lokatt: a hybrid DNA nanopore basecaller with an explicit duration hidden Markov model and a residual LSTM network. BMC Bioinformatics, 24(1), Article ID 461.
Open this publication in new window or tab >>Lokatt: a hybrid DNA nanopore basecaller with an explicit duration hidden Markov model and a residual LSTM network
2023 (English)In: BMC Bioinformatics, E-ISSN 1471-2105, Vol. 24, no 1, article id 461Article in journal (Refereed) Published
Abstract [en]

BackgroundBasecalling long DNA sequences is a crucial step in nanopore-based DNA sequencing protocols. In recent years, the CTC-RNN model has become the leading basecalling model, supplanting preceding hidden Markov models (HMMs) that relied on pre-segmenting ion current measurements. However, the CTC-RNN model operates independently of prior biological and physical insights.ResultsWe present a novel basecaller named Lokatt: explicit duration Markov model and residual-LSTM network. It leverages an explicit duration HMM (EDHMM) designed to model the nanopore sequencing processes. Trained on a newly generated library with methylation-free Ecoli samples and MinION R9.4.1 chemistry, the Lokatt basecaller achieves basecalling performances with a median single read identity score of 0.930, a genome coverage ratio of 99.750%, on par with existing state-of-the-art structure when trained on the same datasets.ConclusionOur research underlines the potential of incorporating prior knowledge into the basecalling processes, particularly through integrating HMMs and recurrent neural networks. The Lokatt basecaller showcases the efficacy of a hybrid approach, emphasizing its capacity to achieve high-quality basecalling performance while accommodating the nuances of nanopore sequencing. These outcomes pave the way for advanced basecalling methodologies, with potential implications for enhancing the accuracy and efficiency of nanopore-based DNA sequencing protocols.

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Basecalling, HMM, LSTM, Nanopore sequencing
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-341527 (URN)10.1186/s12859-023-05580-x (DOI)001115621100003 ()38062356 (PubMedID)2-s2.0-85178887529 (Scopus ID)
Note

QC 20231222

Available from: 2023-12-22 Created: 2023-12-22 Last updated: 2024-05-01Bibliographically approved
Xu, X. & Jaldén, J. (2023). Marginalized Beam Search Algorithms for Hierarchical HMMs.
Open this publication in new window or tab >>Marginalized Beam Search Algorithms for Hierarchical HMMs
2023 (English)Other (Other academic)
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-346020 (URN)
Available from: 2024-04-29 Created: 2024-04-29 Last updated: 2024-05-01Bibliographically approved
Organisations
Identifiers
ORCID iD: ORCID iD iconorcid.org/0000-0003-1850-0946

Search in DiVA

Show all publications