Learning Representations for Tandem Mass Spectra: Self-Supervised Methods and Inductive Biases
KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Gene Technology. ORCID iD: 0000-0002-3181-3800
2026 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Mass spectrometry (MS) is central to modern proteomics, enabling analysis of proteins and peptides based on their mass-to-charge ratio. Tandem mass spectrometry (MS2) encodes peptide fragmentation patterns and forms the basis for sequence identification. While database search has long dominated this process, deep learning has opened new paths for the direct interpretation of spectra. This thesis investigates how neural networks can learn representations of MS2 spectra. Two complementary research directions are explored.

First, selected self-supervised pretraining strategies are evaluated through controlled downstream experiments using encoders pretrained on unlabeled MS2 corpora. Self-distillation yields global embeddings that implicitly encode aspects of peptide chemical properties, and masked autoencoding provides modest improvements in de novo sequencing accuracy. However, the resulting improvements fall short of state-of-the-art supervised de novo sequencing performance.

Second, we introduce Pairwise Attention, a transformer architecture that incorporates a domain-aligned relational inductive bias by conditioning attention on pairwise mass differences between peaks. This yields consistent performance improvements on standard de novo sequencing benchmarks and strong generalization across datasets.

Overall, the results show that self-supervised learning can recover meaningful structure from raw MS2 data, while architectural inductive biases currently offer the most robust and reliable gains for de novo peptide sequencing.
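The core idea of conditioning attention on pairwise mass differences can be illustrated with a minimal sketch. This is a toy illustration, not the thesis implementation: the embedding sizes, the binning of mass differences into a lookup table, and the bias values are all assumptions made here for demonstration.

```python
import numpy as np

def pairwise_attention_scores(queries, keys, mz, bias_table, bin_width=1.0):
    """Toy attention weights with an additive bias from pairwise m/z differences.

    queries, keys: (n_peaks, d) arrays of peak embeddings.
    mz: (n_peaks,) array of peak m/z values.
    bias_table: 1-D array of (here hand-set, normally learned) biases
                indexed by binned absolute m/z difference.
    """
    d = queries.shape[1]
    scores = queries @ keys.T / np.sqrt(d)          # standard scaled dot product
    delta = np.abs(mz[:, None] - mz[None, :])       # pairwise mass differences
    bins = np.minimum((delta / bin_width).astype(int), len(bias_table) - 1)
    scores = scores + bias_table[bins]              # condition attention on delta m/z
    # softmax over keys
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy usage: 3 peaks with 4-dim embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
mz = np.array([100.0, 157.08, 500.0])   # 57.08 Da gap ~ a glycine residue mass
bias = np.zeros(600)
bias[57] = 2.0                          # favour peak pairs separated by ~57 Da
attn = pairwise_attention_scores(q, k, mz, bias)
```

The point of the sketch is that the bias depends only on relative masses, so peak pairs whose spacing matches an amino-acid residue mass can attend to each other regardless of where they sit on the m/z axis.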

Abstract [sv]

Mass spectrometry (MS) is central to modern proteomics and enables analysis of proteins and peptides based on their mass. Tandem mass spectrometry (MS2) encodes peptide fragmentation patterns and forms the basis for sequence identification. Although database search has long dominated this process, deep learning has opened new possibilities for the direct interpretation of spectra.

This thesis investigates how neural networks can learn representations of MS2 spectra. Two complementary research directions are studied.

First, selected self-supervised pretraining strategies are evaluated through controlled experiments with encoders pretrained on unlabeled MS2 corpora. Self-distillation yields global embeddings that implicitly encode aspects of peptide chemical properties, and masked autoencoding yields moderate improvements in de novo accuracy. However, the resulting improvements do not reach the performance of today's state-of-the-art methods for supervised de novo sequencing.

Second, Pairwise Attention is introduced, a transformer architecture that incorporates a domain-adapted inductive bias by conditioning attention on pairwise mass differences between peaks. This yields performance improvements on established de novo sequencing benchmarks as well as strong generalization across datasets.

Overall, the results show that self-supervised learning can recover meaningful structure from raw MS2 data, while inductive biases currently offer the most robust improvements for de novo peptide sequencing.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2026, p. 45
Series
TRITA-CBH-FOU ; 2026:21
Keywords [en]
Mass Spectrometry, Deep Learning, De Novo Sequencing, Self-Supervised Learning
National Category
Bioinformatics and Computational Biology
Research subject
Biotechnology
Identifiers
URN: urn:nbn:se:kth:diva-378805
ISBN: 978-91-8106-586-2 (print)
OAI: oai:DiVA.org:kth-378805
DiVA, id: diva2:2049215
Presentation
2026-04-17, Pascal, Gamma-6, Tomtebodavägen 23, Solna, Stockholm, 13:15 (English)
Opponent
Supervisors
Note

QC 2026-03-27

Available from: 2026-03-27 Created: 2026-03-27 Last updated: 2026-03-30 Bibliographically approved
List of papers
1. Pairwise Attention: Leveraging Mass Differences to Enhance De Novo Sequencing of Mass Spectra
2025 (English) In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 24, no 7, p. 3722-3730. Article in journal (Refereed) Published
Abstract [en]

A fundamental challenge in mass spectrometry-based proteomics is determining which peptide generated a given MS2 spectrum. Peptide sequencing typically relies on matching spectra against a known sequence database, which in some applications is not available. Deep learning-based de novo sequencing can address this limitation by directly predicting peptide sequences from MS2 data. Applications of the transformer architecture to de novo sequencing have produced state-of-the-art results on the so-called nine-species benchmark. In this study, we propose an improved transformer encoder inspired by the heuristics used in the manual interpretation of spectra. We modify the attention mechanism with a learned bias based on pairwise mass differences, termed Pairwise Attention (PA). Adding PA improves average peptide precision at 100% coverage by 12.7% (5.9 percentage points) over our base transformer on the original nine-species benchmark, and by 7.4% over the previously published model Casanovo. Our MS2 encoding strategy is largely orthogonal to other transformer-based models encoding MS2 spectra, enabling straightforward integration into existing deep-learning approaches. Our results show that integrating domain-specific knowledge into transformers boosts de novo sequencing performance.
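The quoted relative gain (12.7%) and absolute gain (5.9 percentage points) jointly pin down the baseline precision, which the abstract does not state explicitly; the baseline value below is inferred from those two numbers, not taken from the paper.

```python
# A relative gain r on a baseline b gives an absolute gain of r * b,
# so b = absolute_gain / relative_gain.
absolute_gain = 5.9                         # percentage points
relative_gain = 0.127                       # 12.7%
baseline = absolute_gain / relative_gain    # implied base-model peptide precision
improved = baseline + absolute_gain         # implied precision with PA
print(round(baseline, 1), round(improved, 1))  # prints: 46.5 52.4
```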

Place, publisher, year, edition, pages
American Chemical Society (ACS), 2025
Keywords
Attention, De novo sequencing, Mass spectrometry, MS2, Proteomics, Transformers
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-364463 (URN)
10.1021/acs.jproteome.5c00063 (DOI)
001500093400001 ()
40454436 (PubMedID)
2-s2.0-105007317018 (Scopus ID)
Note

QC 20260127

Available from: 2025-06-12 Created: 2025-06-12 Last updated: 2026-03-27 Bibliographically approved
2. Self-Supervised Learning for Tandem Mass Spectra: Methods, Dynamics and Downstream Effects
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Self-supervised learning provides a way to extract structure from tandem mass spectra without relying on peptide labels, which are typically obtained from database-search pipelines and therefore inherit their assumptions and biases. This work evaluates several self-supervised objectives: masked spectrum modeling, masked autoencoding, trinary m/z perturbation, and DINO-style self-distillation, using a shared transformer encoder trained on large unlabeled MS2 corpora. We analyze their training behavior, document characteristic failure modes such as collapse in DINO, and measure their effect on downstream de novo peptide sequencing and auxiliary prediction tasks.

Across controlled fine-tuning settings, masked autoencoding yields the most consistent improvements in de novo accuracy, with measurable gains even after very limited pretraining. DINO provides modest but reproducible improvements over scratch for de novo decoding and strong gains on global tasks, whereas the trinary perturbation objective produces only small and often inconsistent benefits. These results demonstrate that unsupervised objectives can recover meaningful structure from raw spectra, although the absolute de novo accuracies achieved here lie below those of state-of-the-art supervised systems, meaning the observed gains should be interpreted primarily as an initialization ablation rather than an indication of absolute model capability.
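The masked-autoencoding objective discussed above amounts to hiding part of the spectrum and asking the encoder-decoder to reconstruct it. The following sketch shows only the masking step; the function name, masking fraction, and peak representation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mask_peaks(mz, intensity, mask_frac=0.3, rng=None):
    """Randomly hide a fraction of peaks; a model would be trained to
    reconstruct the hidden (m/z, intensity) pairs from the visible ones.

    Returns the visible peaks and a boolean mask marking the hidden ones.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n = len(mz)
    hidden = np.zeros(n, dtype=bool)
    hidden[rng.choice(n, size=max(1, int(mask_frac * n)), replace=False)] = True
    visible = ~hidden
    return mz[visible], intensity[visible], hidden

# Toy spectrum: 10 peaks with random m/z and intensities.
rng = np.random.default_rng(1)
mz = np.sort(rng.uniform(100, 1500, size=10))
inten = rng.uniform(0, 1, size=10)
mz_vis, int_vis, hidden = mask_peaks(mz, inten, mask_frac=0.3, rng=rng)
# A reconstruction loss would compare the decoder's predictions for the
# hidden peaks against the held-out targets (mz[hidden], inten[hidden]).
```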

Overall, the study shows that self-supervision can influence MS2 representations in useful ways, clarifies which objectives are effective for current transformer architectures, and highlights the need for MS-specific pretraining tasks that more directly support high-quality sequence reconstruction.

Keywords
Deep Learning, Mass Spectrometry, Self-Supervised Learning
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-378799 (URN)
Note

QC 20260331

Available from: 2026-03-27 Created: 2026-03-27 Last updated: 2026-03-31 Bibliographically approved

Open Access in DiVA

Kappa (1420 kB), 30 downloads
File information
File name: FULLTEXT01.pdf
File size: 1420 kB
Checksum (SHA-512): 5facd7d9e408f94953a6d410e92ad2d59ea6700175d0a835b887984481ea8b19705963b02ead0b08aa63e347f06cdfc963ab871c58af63de44b59ff34de5023b
Type: summary
Mimetype: application/pdf

Authority records

Nilsson, Alfred
