Statistical and machine learning methods to analyze large-scale mass spectrometry data
KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab. ORCID iD: 0000-0002-5401-5553
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Modern biology is faced with vast amounts of data that contain valuable information yet to be extracted. Proteomics, the study of proteins, has repositories with thousands of mass spectrometry experiments. These data gold mines could further our knowledge of proteins as the main actors in cell processes and signaling. Here, we explore statistical and machine learning methods to extract more information from these data.

First, we present advances for studies that aggregate hundreds of runs. We introduce MaRaCluster, which clusters mass spectra for large-scale datasets using statistical methods to assess similarity of spectra. It identified up to 40% more peptides than the state-of-the-art method, MS-Cluster. Further, we accommodated large-scale data analysis in Percolator, a popular post-processing tool for mass spectrometry data. This reduced the runtime for a draft human proteome study from a full day to 10 minutes.

Second, we clarify the contentious topic of protein false discovery rates (FDRs) and promote their reporting. Often, studies report lists of proteins but fail to report protein FDRs. We provide a framework to discuss protein FDRs systematically and to reduce the hesitance to use them. We also added protein FDRs to Percolator, opting for the best-peptide approach, which proved superior in a benchmark of scalable protein inference methods.

Third, we tackle the low sensitivity of protein quantification methods. Current methods lack proper control of error sources and propagation. To remedy this, we developed Triqler, which controls the protein quantification FDR through a Bayesian framework. We also introduce MaRaQuant, which proposes a quantification-first approach that applies clustering prior to identification. This reduced the number of spectra to be searched and allowed us to spot unidentified analytes of interest. Combining these tools outperformed the state-of-the-art method, MaxQuant/Perseus, and found enriched functional terms for datasets that had none before.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2018, p. 64
Series
TRITA-CBH-FOU ; 2018:45
Keywords [en]
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein inference, large-scale studies, simulation, protein quantification, clustering, machine learning, Bayesian statistics
National Category
Bioinformatics (Computational Biology)
Research subject
Biotechnology
Identifiers
URN: urn:nbn:se:kth:diva-235629
ISBN: 978-91-7729-967-7 (print)
OAI: oai:DiVA.org:kth-235629
DiVA, id: diva2:1252252
Public defence
2018-10-24, Atrium, Nobels väg 12B, Solna, 13:00 (English)
Opponent
Supervisors
Note

QC 20181001

Available from: 2018-10-01 Created: 2018-10-01 Last updated: 2018-10-01 Bibliographically approved
List of papers
1. MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics
2016 (English). In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 15, no 3, p. 713-720. Article in journal (Refereed), Published
Abstract [en]

Shotgun proteomics experiments generate large amounts of fragment spectra as primary data, normally with high redundancy between and within experiments. Here, we have devised a clustering technique to identify fragment spectra stemming from the same species of peptide. This is a powerful alternative method to traditional search engines for analyzing spectra, specifically useful for larger scale mass spectrometry studies. As an aid in this process, we propose a distance calculation relying on the rarity of experimental fragment peaks, following the intuition that peaks shared by only a few spectra offer more evidence than peaks shared by a large number of spectra. We used this distance calculation and a complete-linkage scheme to cluster data from a recent large-scale mass spectrometry-based study. The clusterings produced by our method have up to 40% more identified peptides for their consensus spectra compared to those produced by the previous state-of-the-art method. We see that our method would advance the construction of spectral libraries as well as serve as a tool for mining large sets of fragment spectra. The source code and Ubuntu binary packages are available at https://github.com/statisticalbiotechnology/maracluster (under an Apache 2.0 license).
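
The rarity intuition and the complete-linkage scheme described above can be illustrated with a toy sketch. This is not the MaRaCluster implementation: the spectra are reduced to sets of integer peak bins, and the weighting (negative log of the fraction of spectra containing a peak), the distance formula, and all function names are assumptions chosen only to show the idea.

```python
from itertools import combinations
from math import log

def rarity_weights(spectra):
    """Weight each peak bin by -log(fraction of spectra containing it).
    Assumed weighting: rare peaks get large weights, ubiquitous peaks get 0."""
    n = len(spectra)
    counts = {}
    for peaks in spectra:
        for peak_bin in peaks:
            counts[peak_bin] = counts.get(peak_bin, 0) + 1
    return {b: -log(c / n) for b, c in counts.items()}

def distance(a, b, weights):
    """1 - (weight of shared peaks / weight of all peaks in either spectrum)."""
    shared = sum(weights[p] for p in a & b)
    total = sum(weights[p] for p in a | b)
    return 1.0 - shared / total if total > 0 else 1.0

def complete_linkage(spectra, threshold):
    """Agglomerative clustering; the distance between two clusters is the
    largest pairwise distance between their members (complete linkage)."""
    weights = rarity_weights(spectra)
    clusters = [[i] for i in range(len(spectra))]

    def link(c1, c2):
        return max(distance(spectra[i], spectra[j], weights)
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        (x, y), d = min(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if d > threshold:
            break  # no remaining pair of clusters is close enough to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical spectra; bin 999 occurs in every spectrum and carries no evidence.
spectra = [
    {100, 200, 300, 999},
    {100, 200, 300, 999},
    {400, 500, 600, 999},
    {400, 500, 999},
]
clusters = complete_linkage(spectra, threshold=0.6)
print(clusters)
```

In this toy input, the peak shared by all four spectra contributes nothing to any distance, so the two groups are separated purely by their rarer peaks.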

Place, publisher, year, edition, pages
American Chemical Society (ACS), 2016
Keywords
Mass spectrometry, proteomics, hierarchical clustering bioinformatics, database search, spectral archives, spectral libraries
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-184544 (URN)
10.1021/acs.jproteome.5b00749 (DOI)
000371754100005 ()
26653874 (PubMedID)
2-s2.0-84960456163 (Scopus ID)
Funder
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160406

Available from: 2016-04-06 Created: 2016-04-01 Last updated: 2018-10-01 Bibliographically approved
2. How to talk about protein-level false discovery rates in shotgun proteomics
2016 (English). In: Proteomics, ISSN 1615-9853, E-ISSN 1615-9861, Vol. 16, no 18, p. 2461-2469. Article in journal (Refereed), Published
Abstract [en]

A frequently sought output from a shotgun proteomics experiment is a list of proteins that we believe to have been present in the analyzed sample before proteolytic digestion. The standard technique to control for errors in such lists is to enforce a preset threshold for the false discovery rate (FDR). Many consider protein-level FDRs a difficult and vague concept, as the measurement entities, spectra, are manifestations of peptides and not proteins. Here, we argue that this confusion is unnecessary and provide a framework on how to think about protein-level FDRs, starting from its basic principle: the null hypothesis. Specifically, we point out that two competing null hypotheses are used concurrently in today's protein inference methods, which has gone unnoticed by many. Using simulations of a shotgun proteomics experiment, we show how confusing one null hypothesis for the other can lead to serious discrepancies in the FDR. Furthermore, we demonstrate how the same simulations can be used to verify FDR estimates of protein inference methods. In particular, we show that, for a simple protein inference method, decoy models can be used to accurately estimate protein-level FDRs for both competing null hypotheses.
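
As a rough illustration of how decoy models yield FDR estimates, the sketch below applies the common target-decoy estimate to a scored list. It is not the simulation framework of the paper: it assumes target and decoy databases of equal size, treats higher scores as better, and uses hypothetical scores and function names.

```python
def target_decoy_qvalues(scored):
    """scored: list of (score, is_decoy) pairs; higher score is better.
    Returns (score, q_value) for each target, best score first."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            # Estimated FDR when accepting everything down to this score
            fdrs.append((score, min(1.0, decoys / targets)))
    # q-value: lowest FDR at this threshold or any more permissive one
    qvalues = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append((score, running_min))
    return qvalues[::-1]

# Hypothetical protein scores; True marks a decoy entry.
scored = [
    (9.1, False), (8.7, False), (7.9, True), (7.5, False),
    (6.2, False), (5.0, True), (4.4, False),
]
qvalues = target_decoy_qvalues(scored)
for score, q in qvalues:
    print(f"score={score:.1f} q={q:.2f}")
```

Which null hypothesis such an estimate corresponds to depends on how the decoys are constructed and scored, which is the distinction the abstract draws.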

Place, publisher, year, edition, pages
Wiley-Blackwell, 2016
Keywords
Bioinformatics, Data processing and analysis, Mass spectrometry-LC-MS/MS, Protein inference, Simulation, Statistical analysis
National Category
Biophysics Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:kth:diva-196441 (URN)
10.1002/pmic.201500431 (DOI)
000385813600005 ()
27503675 (PubMedID)
2-s2.0-84988369698 (Scopus ID)
Note

QC 20161129

Available from: 2016-11-29 Created: 2016-11-14 Last updated: 2018-10-01 Bibliographically approved
3. Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0
2016 (English). In: Journal of the American Society for Mass Spectrometry, ISSN 1044-0305, E-ISSN 1879-1123, Vol. 27, no 11, p. 1719-1727. Article in journal (Refereed), Published
Abstract [en]

Percolator is a widely used software tool that increases yield in shotgun proteomics experiments and assigns reliable statistical confidence measures, such as q values and posterior error probabilities, to peptides and peptide-spectrum matches (PSMs) from such experiments. Percolator’s processing speed has been sufficient for typical data sets consisting of hundreds of thousands of PSMs. With our new scalable approach, we can now also analyze millions of PSMs in a matter of minutes on a commodity computer. Furthermore, with the increasing awareness of the need for reliable statistics on the protein level, we compared several easy-to-understand protein inference methods and implemented the best-performing method in the Percolator package: grouping proteins by their corresponding sets of theoretical peptides and then considering only the best-scoring peptide for each protein. We used Percolator 3.0 to analyze the data from a recent study of the draft human proteome containing 25 million spectra (PMID: 24870542). The source code and Ubuntu, Windows, MacOS, and Fedora binary packages are available from http://percolator.ms/ under an Apache 2.0 license.
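
The best-peptide inference described above can be sketched in a few lines. This is only an illustration of the two stated steps (group proteins with identical theoretical peptide sets, then score each group by its best peptide); it is not Percolator code, and the function name, input dictionaries, and scores are hypothetical.

```python
from collections import defaultdict

def best_peptide_protein_scores(protein_to_peptides, peptide_scores):
    """protein_to_peptides: protein id -> theoretical peptides.
    peptide_scores: observed peptide -> score (higher is better).
    Returns [(protein_group, score)] sorted from best to worst."""
    # Step 1: proteins with identical theoretical peptide sets form one group
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)

    # Step 2: each group is scored by its single best-scoring observed peptide
    results = []
    for peptides, proteins in groups.items():
        observed = [peptide_scores[p] for p in peptides if p in peptide_scores]
        if observed:  # groups without any observed peptide are not reported
            results.append((tuple(sorted(proteins)), max(observed)))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Hypothetical digest and peptide-level scores.
protein_to_peptides = {
    "P1": ["AAK", "CCR"],
    "P2": ["CCR", "AAK"],   # same peptide set as P1, so grouped with it
    "P3": ["DDK"],
    "P4": ["EEK"],          # no observed peptide
}
peptide_scores = {"AAK": 2.5, "CCR": 4.0, "DDK": 1.2}
ranking = best_peptide_protein_scores(protein_to_peptides, peptide_scores)
print(ranking)
```

Running the same procedure on decoy proteins would give the scores needed for a protein-level FDR estimate.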

Place, publisher, year, edition, pages
Springer, 2016
Keywords
Data processing and analysis, Large scale studies, Mass spectrometry - LC-MS/MS, Protein inference, Statistical analysis, Bioinformatics, Data handling, Mass spectrometry, Molecular biology, Peptides, Probability, Statistical methods, Error probabilities, False discovery rate, Large-scale studies, LC-MS/MS, Scalable approach, Shotgun proteomics, Statistical confidence, Proteins
National Category
Biological Sciences
Identifiers
urn:nbn:se:kth:diva-195221 (URN)
10.1007/s13361-016-1460-7 (DOI)
000385158400002 ()
2-s2.0-84991105210 (Scopus ID)
Note

QC 20161117

Available from: 2016-11-17 Created: 2016-11-02 Last updated: 2018-10-01 Bibliographically approved
4. Integrated identification and quantification error probabilities for shotgun proteomics
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Protein quantification by label-free shotgun proteomics experiments is plagued by a multitude of error sources. Typical pipelines for identifying differentially expressed proteins use intermediate filters in an attempt to control the error rate. However, they often ignore certain error sources and, moreover, regard filtered lists as completely correct in subsequent steps. These two indiscretions can easily lead to a loss of control of the false discovery rate (FDR). We propose a probabilistic graphical model, Triqler, that propagates error information through all steps, employing distributions instead of point estimates, most notably for missing value imputation. The model outputs posterior probabilities for fold changes between treatment groups, highlighting uncertainty rather than hiding it. We analyzed three engineered datasets and achieved FDR control and high sensitivity, even for truly absent proteins. In a bladder cancer clinical dataset, we discovered 35 proteins at 5% FDR, with the original study discovering none at this threshold. Compellingly, these proteins showed enrichment for functional annotation terms. The model executes in minutes and is freely available at https://pypi.org/project/triqler/.
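
To make the link between posterior probabilities and FDR control concrete, the sketch below turns posterior samples of a log2 fold change into a posterior error probability (PEP) and estimates the FDR of an accepted list as the mean of its PEPs. This is a generic Bayesian recipe under stated assumptions, not the Triqler model: the sampling step is taken as given, and the function names, the fold change threshold, and the numbers are hypothetical.

```python
def pep_from_samples(log2_fc_samples, min_abs_log2_fc):
    """Fraction of posterior samples whose absolute log2 fold change falls
    below the threshold, i.e. the probability the protein is not changed."""
    below = sum(1 for x in log2_fc_samples if abs(x) < min_abs_log2_fc)
    return below / len(log2_fc_samples)

def fdr_from_peps(peps):
    """Estimated FDR of the top-k list for each k: mean of the k smallest PEPs."""
    fdrs, total = [], 0.0
    for k, pep in enumerate(sorted(peps), start=1):
        total += pep
        fdrs.append(total / k)
    return fdrs

def accepted_at(peps, alpha):
    """Number of proteins reported at FDR threshold alpha.
    The running mean of ascending PEPs never decreases, so counting suffices."""
    return sum(1 for fdr in fdr_from_peps(peps) if fdr <= alpha)

# Hypothetical posterior samples for one protein and PEPs for five proteins.
example_pep = pep_from_samples([1.2, 0.9, 1.5, 0.2], min_abs_log2_fc=1.0)
peps = [0.01, 0.02, 0.30, 0.03, 0.60]
n_accepted = accepted_at(peps, alpha=0.05)
print(example_pep, n_accepted)
```

Because each protein keeps its own error probability, uncertain proteins lower the confidence of the whole list instead of being silently treated as correct after a filter.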

Keywords
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein quantification, large-scale studies, Bayesian statistics
National Category
Bioinformatics (Computational Biology)
Research subject
Biotechnology
Identifiers
urn:nbn:se:kth:diva-235625 (URN)
Note

QC 20181001

Available from: 2018-10-01 Created: 2018-10-01 Last updated: 2018-10-01 Bibliographically approved
5. Distillation of label-free quantification data by clustering and Bayesian modeling
(English) Manuscript (preprint) (Other academic)
Abstract [en]

In shotgun proteomics, the amount of information that can be extracted from label-free quantification experiments is typically limited by the identification rate as well as the noise level of the quantitative signals. This generally causes a low sensitivity in differential expression analysis at the protein level. Here, we present a new method, MaRaQuant, in which we reverse the typical identification-first workflow into a quantification-first approach. Specifically, we apply unsupervised clustering at both the MS1 and MS2 levels to summarize all analytes of interest without assigning identities. This ensures that no valuable information is discarded due to analytes missing identification thresholds, and the data reduction achieved by clustering allows us to spend more effort on the identification process. Furthermore, we propagate error probabilities from the feature level all the way to the protein level and input these to our probabilistic protein quantification method, Triqler. Applying this methodology to an engineered dataset, we managed to identify multiple analytes of interest that would have gone unnoticed in traditional pipelines, specifically through the use of open modification and de novo searches. MaRaQuant/Triqler obtains significantly more identifications on all levels compared to MaxQuant/Perseus, including differentially expressed proteins. Notably, we managed to identify differentially expressed proteins in a clinical dataset where previously none were discovered. Furthermore, our differentially expressed proteins allowed us to attribute multiple functional annotation terms to both clinical datasets that we investigated.

Keywords
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein quantification, large-scale studies, clustering, machine learning
National Category
Bioinformatics (Computational Biology)
Research subject
Biotechnology
Identifiers
urn:nbn:se:kth:diva-235627 (URN)
Note

QC 20181001

Available from: 2018-10-01 Created: 2018-10-01 Last updated: 2018-10-01 Bibliographically approved

Open Access in DiVA

fulltext (1529 kB), 76 downloads
File information
File name: FULLTEXT01.pdf
File size: 1529 kB
Checksum (SHA-512): 7c09c36e04442aaea9bf34404e209e57bde31f9e4d53cceac68ee247ce0b78c031073646bef56083077ea628e8632f3346fb298afa9a2a3ab6a5a47419a9b166
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
The, Matthew
By organisation
Gene Technology
Science for Life Laboratory, SciLifeLab
Bioinformatics (Computational Biology)
