Algorithms and machine learning for single-molecule protein sequencing methods
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Information Science and Engineering. ORCID iD: 0000-0002-6753-8548
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Single-molecule protein sequencing (SMPS) technologies are powerful alternatives to mass spectrometry, offering new opportunities for high-resolution proteomics. These technologies, including nanopores, nanogaps, and fluorosequencing, enable the direct identification of protein molecules at single-molecule resolution. Their potential spans diverse applications, from supporting cutting-edge biological research to developing diagnostics and therapeutics. However, SMPS platforms generate complex and noisy signals in large volumes, making computational analysis a key bottleneck in unlocking their full potential.

This thesis addresses that challenge by developing scalable, model-informed and data-driven algorithms tailored to SMPS data. Drawing on tools from statistical signal processing and machine learning, the work focuses on computational methods that improve signal denoising, inference accuracy, and runtime efficiency across several SMPS technologies.

The contributions span three major sensing platforms. For nanogap tunneling devices, a fast and robust denoising algorithm is introduced to manage the heavy-tailed noise characteristic of electronic tunneling signals. For nanopore DNA sensing, a physics-inspired data augmentation method is proposed to improve the generalization of neural networks without requiring additional experimental data. Alongside this data augmentation, the thesis introduces a novel neural network architecture that leverages the augmentation's benefits and incorporates modern design principles, such as residual connections and attention mechanisms, to outperform state-of-the-art models on a nanopore classification task.

Finally, for fluorosequencing, this thesis presents two complementary contributions: (i) a fast beam search decoder for peptide inference and (ii) an expectation-maximization framework for protein abundance estimation. The proposed decoder achieves up to a tenfold speedup over existing methods with only minimal loss in accuracy. Building on its output, the EM-based protein inference framework enables efficient estimation of protein abundances from peptide-level posteriors. We demonstrate that this approach not only improves quantification accuracy on small-scale datasets but also scales to the full human proteome with tractable computation times, offering a viable route toward single-molecule proteomics at large scale. Together, these tools contribute to the broader effort of making SMPS computationally tractable at the scale required for full-proteome and single-cell analyses.

All methods in this thesis have been made available as open-source software, reflecting a commitment to reproducibility and to supporting the growing SMPS research community. Through the integration of domain knowledge, algorithmic design, and computational efficiency, this thesis aims to push the boundaries of what is achievable in next-generation proteomics.

Abstract [sv] (English translation)

Single-molecule protein sequencing (SMPS) is a powerful complement and alternative to mass spectrometry and opens up new possibilities in high-resolution proteomics. Techniques such as nanopores, nanogap structures, and fluorosequencing enable direct identification of individual protein molecules at single-molecule resolution. The range of applications is broad, from supporting frontline biological research to developing diagnostics and therapies. At the same time, SMPS platforms generate complex and noisy signals in large volumes, which makes computational analysis a central obstacle to realizing the full potential of these techniques.

The thesis addresses this challenge by developing scalable, model-informed, and data-driven algorithms specifically tailored to SMPS data. Building on statistical signal processing and machine learning, it develops methods that improve denoising, inference accuracy, and computational efficiency across several SMPS technologies.

The contributions span three main platforms. For nanogap-based tunneling sensing, a fast and robust denoising algorithm is presented that effectively handles the heavy-tailed noise typical of electronic tunneling signals. For nanopore-based DNA readout, a physics-inspired data augmentation is introduced that improves the generalization ability of neural networks without requiring additional experimental data. In connection with this, a new neural network architecture is proposed that benefits from the augmentation and incorporates modern design principles, including residual connections and attention mechanisms; taken together, this outperforms state-of-the-art methods on a nanopore classification task.

For fluorosequencing, two complementary components are presented: (i) a fast beam search decoder for peptide inference and (ii) a framework for protein quantification based on Expectation Maximization (EM). The decoder is up to ten times faster than existing methods with only a marginal loss in accuracy. Based on its output, the EM-based protein inference framework enables efficient estimation of protein abundances from peptide-level posteriors. We show that the approach not only improves quantification accuracy on small-scale datasets but also scales to the entire human proteome with manageable computation times, thereby offering a practically feasible route toward single-molecule proteomics at large scale. Together, these tools help make SMPS computationally tractable at the scale required for full-proteome and single-cell analyses.

All methods in the thesis have been made available as open-source software, in line with a strong commitment to reproducibility and to supporting the growing research field around SMPS. By combining domain knowledge, well-founded algorithm design, and computational efficiency, the thesis aims to push the boundaries of what is possible in next-generation proteomics.

Place, publisher, year, edition, pages
Kungliga Tekniska högskolan, 2025, p. 135
Series
TRITA-EECS-AVL ; 2025:86
Keywords [en]
Signal processing, Hidden Markov Models, Expectation Maximization, CUSUM, Data augmentation, Convolutional Neural Networks
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering; Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:kth:diva-370661; ISBN: 978-91-8106-409-4 (print); OAI: oai:DiVA.org:kth-370661; DiVA, id: diva2:2002086
Public defence
2025-11-07, F3, Lindstedtvägen 26, Stockholm, 13:00 (English)
Opponent
Supervisors
Note

QC 20250930

Available from: 2025-09-30 Created: 2025-09-29 Last updated: 2025-10-14. Bibliographically approved
List of papers
1. Efficient Implementation of Robust CUSUM Algorithm to Characterize Nanogaps Measurements with Heavy-Tailed Noise
2023 (English) In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2023, p. 1-5. Conference paper, Published paper (Refereed)
Abstract [en]

Detection of bio-molecules through quantum tunneling currents could lead to the next-generation DNA sequencing methods. In order to analyze the stability of these sensitive devices, it is necessary to characterize their conductance switching statistics. This characterization can be realized by denoising the tunneling current signal and clustering the outcomes. The first step can be done with the CUSUM algorithm, which detects abrupt changes and has been used in similar devices. We found heavy-tailed non-Gaussian noise in the measurement setup of the experimental devices. This paper suggests an approximation in the likelihood ratio step of the CUSUM algorithm that is more robust than the simple Gaussian noise assumption and, at the same time, is computationally more efficient than computing the fitted true likelihoods.
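
To make the change-detection step concrete, the following is a minimal sketch of a one-sided CUSUM detector for an upward shift in the mean. It is not the approximation proposed in the paper: the function name `cusum_first_alarm`, its parameters, and the optional clipping of the per-sample log-likelihood ratio (a generic way to limit the influence of outliers) are all assumptions made for illustration.

```python
def cusum_first_alarm(samples, mu0, delta, sigma, threshold, clip=None):
    """One-sided CUSUM for a mean shift from mu0 to mu0 + delta.

    Returns the index of the first alarm, or None if no alarm is raised.
    All names and parameters here are hypothetical, chosen for illustration.
    """
    g = 0.0  # cumulative decision statistic, reset to zero whenever it goes negative
    for k, x in enumerate(samples):
        # Gaussian log-likelihood ratio of "shifted" versus "unshifted" for one sample.
        llr = (delta / sigma ** 2) * (x - mu0 - delta / 2.0)
        if clip is not None:
            # Assumption: bounding the increment keeps a single heavy-tailed
            # outlier from triggering an alarm on its own.
            llr = max(-clip, min(clip, llr))
        g = max(0.0, g + llr)
        if g > threshold:
            return k
    return None


# A noiseless step at index 50 and an isolated spike at index 10.
step = [0.0] * 50 + [1.0] * 50
spike = [0.0] * 10 + [10.0] + [0.0] * 10
first_step_alarm = cusum_first_alarm(step, 0.0, 1.0, 1.0, 2.0)
spike_alarm_plain = cusum_first_alarm(spike, 0.0, 1.0, 1.0, 2.0)
spike_alarm_clipped = cusum_first_alarm(spike, 0.0, 1.0, 1.0, 2.0, clip=1.0)
```

In this toy setting the unclipped detector fires on the isolated spike, while the clipped variant ignores it and still detects the genuine step, which is the kind of behavior a robust likelihood-ratio step is meant to provide.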

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
National Category
Signal Processing; Nano Technology; Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-333693 (URN); 10.1109/ICASSP49357.2023.10096779 (DOI); 2-s2.0-86000377312 (Scopus ID)
Conference
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, June 4-10, 2023
Note

Part of ISBN 9781728163277

QC 20250623

Available from: 2023-08-09 Created: 2023-08-09 Last updated: 2025-09-29. Bibliographically approved
2. Brownian motion data augmentation: a method to push neural network performance on nanopore sensors
2025 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 41, no. 6, article id btaf323. Article in journal (Refereed) Published
Abstract [en]

Motivation: Nanopores are highly sensitive sensors that have achieved commercial success in DNA/RNA sequencing, with potential applications in protein sequencing and biomarker identification. Solid-state nanopores, in particular, face challenges such as instability and low signal-to-noise ratios, which lead scientists to adopt data-driven methods for nanopore signal analysis, although data acquisition remains restrictive.

Results: We address this data scarcity by augmenting the training samples with traces that emulate Brownian motion effects, based on dynamic models in the literature. We apply this method to a publicly available dataset of a classification task containing nanopore reads of DNA with encoded barcodes. A neural network named QuipuNet was previously published for this dataset, and we demonstrate that our augmentation method produces a noticeable increase in QuipuNet’s accuracy. Furthermore, we introduce a novel neural network named YupanaNet, which achieves greater accuracy (95.8%) than QuipuNet (94.6%) on the same dataset. YupanaNet benefits from both the enhanced generalization provided by Brownian motion data augmentation and the incorporation of novel architectures, including skip connections and a soft attention mask.

Availability and implementation: The source code and data are available at: https://github.com/JavierKipen/browDataAug.
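
As a rough illustration of what augmenting traces with Brownian-motion effects can look like, the sketch below time-warps a trace by letting the read position follow a discretized Brownian motion with drift and then resampling the original signal at those positions. This is an assumed simplification, not the method from the paper (whose implementation is in the repository linked above); the function name `brownian_time_warp`, the `noise_std` parameter, and the rule that the position never moves backwards are all hypothetical.

```python
import random


def brownian_time_warp(trace, noise_std=0.5, seed=None):
    """Resample `trace` along a random-walk-with-drift position path.

    Hypothetical helper for illustration; returns a list of the same length.
    """
    rng = random.Random(seed)
    n = len(trace)
    if n < 2:
        return list(trace)
    # Position advances by a unit drift plus Gaussian fluctuation per sample.
    # Assumption: negative steps are clipped so the position never decreases.
    pos = [0.0]
    for _ in range(n - 1):
        step = max(0.0, 1.0 + rng.gauss(0.0, noise_std))
        pos.append(pos[-1] + step)
    if pos[-1] == 0.0:
        return list(trace)
    # Rescale so the warped path covers the whole original trace.
    scale = (n - 1) / pos[-1]
    warped = []
    for p in pos:
        t = p * scale
        i = min(int(t), n - 2)
        frac = t - i
        # Linear interpolation between neighbouring original samples.
        warped.append((1.0 - frac) * trace[i] + frac * trace[i + 1])
    return warped


trace = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
unchanged = brownian_time_warp(trace, noise_std=0.0)
augmented = brownian_time_warp(trace, noise_std=0.5, seed=1)
```

With `noise_std=0.0` the trace is returned unchanged; larger values stretch and compress different parts of the event while keeping its length and amplitude range, so each call with a new seed yields a new training sample.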

Place, publisher, year, edition, pages
Oxford University Press (OUP), 2025
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-370570 (URN); 10.1093/bioinformatics/btaf323 (DOI); 001519659800001 (); 40439147 (PubMedID); 2-s2.0-105010569058 (Scopus ID)
Funder
Vetenskapsrådet (Swedish Research Council), 2018-06169; Stiftelsen för strategisk forskning (Swedish Foundation for Strategic Research, SSF), SSF Grant ITM17-0049
Note

QC 20251007

Available from: 2025-09-26 Created: 2025-09-26 Last updated: 2025-10-07. Bibliographically approved
3. Beam search decoder for enhancing sequence decoding speed in single-molecule peptide sequencing data
2023 (English) In: PLoS Computational Biology, ISSN 1553-734X, E-ISSN 1553-7358, Vol. 19, no. 11, article id e1011345. Article in journal (Refereed) Published
Abstract [en]

Next-generation single-molecule protein sequencing technologies have the potential to significantly accelerate biomedical research. These technologies offer sensitivity and scalability for proteomic analysis. One auspicious method is fluorosequencing, which involves cutting naturalized proteins into peptides, attaching fluorophores to specific amino acids, and observing variations in light intensity as one amino acid is removed at a time. The original peptide is classified from the sequence of light-intensity reads, and proteins can subsequently be recognized with this information. The stepwise amino acid removal is achieved by attaching the peptides to a wall at the C-terminal and using a process called Edman degradation to remove an amino acid from the N-terminal. Even though a framework (Whatprot) has been proposed for the peptide classification task, processing times remain restrictive due to the massively parallel data acquisition system. In this paper, we propose a new beam search decoder with a novel state formulation that obtains considerably lower processing times at the expense of only a slight accuracy drop compared to Whatprot. Furthermore, we explore how our novel state formulation may lead to even faster decoders in the future.
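
For readers unfamiliar with beam search, the sketch below shows the general idea on a toy hidden Markov model: at each observation only the `beam_width` highest-scoring partial state paths are kept, trading a small risk of missing the best path for less computation. It does not reproduce the paper's state formulation or Whatprot; the function `beam_search`, the two-state model, and its probabilities are assumptions made up for illustration.

```python
import heapq
import math


def beam_search(init_logp, trans_logp, emit_logp, observations, beam_width):
    """Approximate most-likely state path, keeping `beam_width` hypotheses per step.

    Returns (log-probability, path). Generic illustration, not the paper's decoder.
    """
    # Each hypothesis is a (log-probability, state path) pair.
    beam = [(init_logp[s] + emit_logp[s][observations[0]], (s,)) for s in init_logp]
    beam = heapq.nlargest(beam_width, beam)
    for obs in observations[1:]:
        candidates = []
        for logp, path in beam:
            # Extend every surviving hypothesis by each possible next state.
            for nxt, t in trans_logp[path[-1]].items():
                candidates.append((logp + t + emit_logp[nxt][obs], path + (nxt,)))
        # Prune: keep only the best `beam_width` extended hypotheses.
        beam = heapq.nlargest(beam_width, candidates)
    return beam[0]


# Toy two-state model with hypothetical probabilities.
init = {"A": math.log(0.6), "B": math.log(0.4)}
trans = {
    "A": {"A": math.log(0.7), "B": math.log(0.3)},
    "B": {"A": math.log(0.4), "B": math.log(0.6)},
}
emit = {
    "A": {"x": math.log(0.9), "y": math.log(0.1)},
    "B": {"x": math.log(0.2), "y": math.log(0.8)},
}
best_logp, best_path = beam_search(init, trans, emit, ["x", "y", "y"], beam_width=2)
```

A wider beam approaches exhaustive search at higher cost, while a narrow beam is faster but may discard the path that would eventually win; the accuracy-for-speed trade-off reported in the abstract is of this general kind.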

Place, publisher, year, edition, pages
Public Library of Science (PLoS), 2023
National Category
Biochemistry; Molecular Biology
Identifiers
urn:nbn:se:kth:diva-340116 (URN); 10.1371/journal.pcbi.1011345 (DOI); 37934778 (PubMedID); 2-s2.0-85176315601 (Scopus ID)
Note

QC 20231128

Available from: 2023-11-28 Created: 2023-11-28 Last updated: 2025-09-29. Bibliographically approved
4. Protein Abundance Inference via Expectation Maximization in Fluorosequencing
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Fluorosequencing produces millions of single-peptide reads, yet a principled strategy for converting these data into quantitative protein abundances has been lacking. We introduce a probabilistic framework that adapts expectation maximization to the fluorosequencing measurement process, estimating relative protein abundances with peptide inference results delivered by previously developed peptide-classification tools. The algorithm iteratively updates protein abundances, maximising the likelihood of the observed reads by obtaining more accurate protein abundance estimations.

We first assess performance on simulated five-protein mixtures that reflect realistic labelling and system errors. A simple Python implementation processes one million reads in under ten seconds on a standard workstation and lowers the mean absolute error in relative abundance by more than an order of magnitude compared with a uniform-abundance guess, demonstrating robustness in protein inference for small-scale settings.

Scalability is then evaluated with simulations of the complete human proteome (20 642 proteins). Ten million reads are processed in less than four hours on an NVIDIA DGX system using one Tesla V100 GPU, confirming that the method remains tractable at proteome scale. Using error rates characteristic of current fluorosequencing, the algorithm produces marginal improvements in relative abundance accuracy. However, when error rates were artificially lowered, estimation error decreased significantly. This result suggests that improvements in fluorosequencing chemistry could directly translate into substantially more accurate quantitative proteomics with this computational framework.

Together, these results establish EM-based inference as a scalable model-driven bridge between peptide-level classification and protein-level quantification in fluorosequencing, laying computational groundwork for high-throughput single-molecule proteomics. Furthermore, the proposed protein inference framework can also be used as a refinement step within other inference methods, enhancing their protein abundance estimates.
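
The iterative update described in the first paragraph of this abstract can be sketched as a standard expectation-maximization loop for mixture proportions. The sketch assumes that each read already comes with a likelihood under every protein (for example, derived from peptide-level scores); that input format, the function name `em_abundances`, and the fixed iteration count are assumptions for illustration, and the manuscript's measurement model is likely more detailed.

```python
def em_abundances(likelihoods, n_iter=100):
    """Estimate relative abundances from per-read, per-protein likelihoods.

    likelihoods[r][j] is an assumed, precomputed P(read r | protein j).
    Hypothetical helper for illustration; returns a list that sums to 1.
    """
    n_prot = len(likelihoods[0])
    pi = [1.0 / n_prot] * n_prot  # start from a uniform-abundance guess
    for _ in range(n_iter):
        totals = [0.0] * n_prot
        for row in likelihoods:
            # E-step: posterior responsibility of each protein for this read.
            weighted = [p * lik for p, lik in zip(pi, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is unexplained by every protein; skip it
            for j, w in enumerate(weighted):
                totals[j] += w / norm
        total = sum(totals)
        if total == 0.0:
            break
        # M-step: abundances are the normalized expected read counts.
        pi = [t / total for t in totals]
    return pi


# Two reads from protein 0, one from protein 1, and one ambiguous read.
reads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
abundances = em_abundances(reads)
```

In this toy case the ambiguous read is split according to the current abundance estimate, so the iteration settles at abundances of 2/3 and 1/3 rather than the 5/8 and 3/8 obtained by splitting that read evenly.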

National Category
Algorithms; Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-370572 (URN)
Note

QC 20251009

Available from: 2025-09-26 Created: 2025-09-26 Last updated: 2025-10-09. Bibliographically approved

Open Access in DiVA

DoctoralThesisJavierKipen (61097 kB), 158 downloads
File information
File name: FULLTEXT01.pdf; File size: 61097 kB; Checksum: SHA-512
e58a067e416d51f413d242678d4a7b3b5e770930955d85b22d7afecf74499c892144e821283a6b3552442e3b2e06ea0a73679ec47dd669d5989cbed150c6090e
Type: fulltext; Mimetype: application/pdf

Person

Kipen, Javier
