Distillation of label-free quantification data by clustering and Bayesian modeling
KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Gene Technology. ORCID iD: 0000-0002-5401-5553
KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Gene Technology. ORCID iD: 0000-0001-5689-9797
(English) Manuscript (preprint) (Other academic)
Abstract [en]

In shotgun proteomics, the amount of information that can be extracted from label-free quantification experiments is typically limited by the identification rate and by the noise level of the quantitative signals. This generally results in low sensitivity in differential expression analysis at the protein level. Here, we present a new method, MaRaQuant, in which we reverse the typical identification-first workflow into a quantification-first approach. Specifically, we apply unsupervised clustering at both the MS1 and MS2 levels to summarize all analytes of interest without assigning identities. This ensures that no valuable information is discarded because analytes miss identification thresholds, and the data reduction achieved by clustering allows us to spend more effort on the identification process. Furthermore, we propagate error probabilities from the feature level all the way to the protein level and input these to our probabilistic protein quantification method, Triqler. Applying this methodology to an engineered dataset, we identified multiple analytes of interest that would have gone unnoticed in traditional pipelines, specifically through the use of open modification and de novo searches. MaRaQuant/Triqler obtains significantly more identifications at all levels than MaxQuant/Perseus, including differentially expressed proteins. Notably, we identified differentially expressed proteins in a clinical dataset where previously none had been discovered. Furthermore, our differentially expressed proteins allowed us to attribute multiple functional annotation terms to both clinical datasets that we investigated.

Keywords [en]
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein quantification, large-scale studies, clustering, machine learning
National Category
Bioinformatics (Computational Biology)
Research subject
Biotechnology
Identifiers
URN: urn:nbn:se:kth:diva-235627
OAI: oai:DiVA.org:kth-235627
DiVA, id: diva2:1252230
Note

QC 20181001

Available from: 2018-10-01 Created: 2018-10-01 Last updated: 2018-10-01 Bibliographically approved
In thesis
1. Statistical and machine learning methods to analyze large-scale mass spectrometry data
2018 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Modern biology is faced with vast amounts of data that contain valuable information yet to be extracted. Proteomics, the study of proteins, has repositories with thousands of mass spectrometry experiments. These data gold mines could further our knowledge of proteins as the main actors in cell processes and signaling. Here, we explore statistical and machine learning methods to extract more information from these data.

First, we present advances for studies that aggregate hundreds of runs. We introduce MaRaCluster, which clusters mass spectra for large-scale datasets using statistical methods to assess similarity of spectra. It identified up to 40% more peptides than the state-of-the-art method, MS-Cluster. Further, we accommodated large-scale data analysis in Percolator, a popular post-processing tool for mass spectrometry data. This reduced the runtime for a draft human proteome study from a full day to 10 minutes.

Second, we clarify the contentious topic of protein false discovery rates (FDRs) and promote their use. Often, studies report lists of proteins but fail to report protein FDRs. We provide a framework for discussing protein FDRs systematically and for reducing the hesitance to report them. We also added protein FDRs to Percolator, opting for the best-peptide approach, which proved superior in a benchmark of scalable protein inference methods.
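As an illustration of the best-peptide idea mentioned above, the following is a minimal sketch and not Percolator's actual implementation. It assumes a target-decoy setting, that higher peptide scores are better, and that each peptide maps to exactly one protein (shared peptides are ignored). The function name `best_peptide_protein_fdr` and the example identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def best_peptide_protein_fdr(
    peptides: List[Tuple[str, float, bool]],
) -> List[Tuple[str, float, bool, float]]:
    """Score each protein by its best peptide and estimate q-values.

    peptides: (protein_id, peptide_score, is_decoy); higher score = better.
    Returns (protein_id, score, is_decoy, q_value), best score first.
    """
    # Best-peptide rule: a protein inherits the score of its top peptide.
    best: Dict[str, Tuple[float, bool]] = {}
    for protein, score, is_decoy in peptides:
        if protein not in best or score > best[protein][0]:
            best[protein] = (score, is_decoy)

    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)

    # Running target-decoy FDR estimate: decoys / targets at each threshold.
    fdrs = []
    targets = decoys = 0
    for _, (_, is_decoy) in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: minimum FDR at this threshold or any more permissive one.
    q_values = [0.0] * len(fdrs)
    running_min = float("inf")
    for i in range(len(fdrs) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        q_values[i] = min(running_min, 1.0)

    return [
        (protein, score, is_decoy, q)
        for (protein, (score, is_decoy)), q in zip(ranked, q_values)
    ]


# Hypothetical toy input: four proteins, one of them a decoy.
example = [
    ("P1", 9.1, False), ("P1", 3.0, False),
    ("P2", 7.5, False),
    ("decoy_P3", 6.0, True),
    ("P4", 5.2, False),
]
results = best_peptide_protein_fdr(example)
for row in results:
    print(row)
```

Taking only the best peptide per protein keeps the computation linear in the number of peptides apart from one sort, which is one reason such a rule scales to large studies. More elaborate inference schemes would need additional modeling that this sketch does not attempt.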

Third, we tackle the low sensitivity of protein quantification methods. Current methods lack proper control of error sources and propagation. To remedy this, we developed Triqler, which controls the protein quantification FDR through a Bayesian framework. We also introduce MaRaQuant, which proposes a quantification-first approach that applies clustering prior to identification. This reduced the number of spectra to be searched and allowed us to spot unidentified analytes of interest. Combining these tools outperformed the state-of-the-art method, MaxQuant/Perseus, and found enriched functional terms for datasets that had none before.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2018. p. 64
Series
TRITA-CBH-FOU ; 2018:45
Keywords
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein inference, large-scale studies, simulation, protein quantification, clustering, machine learning, Bayesian statistics
National Category
Bioinformatics (Computational Biology)
Research subject
Biotechnology
Identifiers
urn:nbn:se:kth:diva-235629 (URN)
978-91-7729-967-7 (ISBN)
Public defence
2018-10-24, Atrium, Nobels väg 12B, Solna, 13:00 (English)
Opponent
Supervisors
Note

QC 20181001

Available from: 2018-10-01 Created: 2018-10-01 Last updated: 2018-10-01 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Search in DiVA

By author/editor
The, Matthew; Käll, Lukas
By organisation
Gene Technology
Bioinformatics (Computational Biology)