MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics
KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab. ORCID iD: 0000-0002-5401-5553
KTH, School of Biotechnology (BIO), Gene Technology. KTH, Centres, Science for Life Laboratory, SciLifeLab. ORCID iD: 0000-0001-5689-9797
2016 (English) In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 15, no 3, p. 713-720. Article in journal (Refereed), Published
Resource type
Text
Abstract [en]

Shotgun proteomics experiments generate large numbers of fragment spectra as primary data, normally with high redundancy between and within experiments. Here, we have devised a clustering technique to identify fragment spectra stemming from the same species of peptide. This is a powerful alternative to traditional search engines for analyzing spectra, and is especially useful for larger-scale mass spectrometry studies. As an aid in this process, we propose a distance calculation relying on the rarity of experimental fragment peaks, following the intuition that peaks shared by only a few spectra offer more evidence than peaks shared by a large number of spectra. We used this distance calculation and a complete-linkage scheme to cluster data from a recent large-scale mass spectrometry-based study. The clusterings produced by our method have up to 40% more identified peptides for their consensus spectra compared to those produced by the previous state-of-the-art method. We see that our method would advance the construction of spectral libraries as well as serve as a tool for mining large sets of fragment spectra. The source code and Ubuntu binary packages are available at https://github.com/statisticalbiotechnology/maracluster (under an Apache 2.0 license).
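
The intuition in the abstract, that peaks shared by few spectra carry more evidence than common ones, can be illustrated with a minimal Python sketch. It is not the metric defined in the paper: the inverse-frequency weights, the weighted Jaccard distance, the integer peak bins, the merge threshold, and the function names `rarity_weights`, `rarity_distance`, and `complete_linkage` are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def rarity_weights(spectra):
    """Weight each peak bin by -log(fraction of spectra containing it).

    Assumed weighting for illustration; a peak found in every spectrum gets weight 0.
    """
    n = len(spectra)
    counts = {}
    for peaks in spectra.values():
        for p in peaks:
            counts[p] = counts.get(p, 0) + 1
    return {p: -math.log(c / n) for p, c in counts.items()}


def rarity_distance(a, b, weights):
    """1 - weighted Jaccard similarity, so shared rare peaks reduce the distance."""
    shared = sum(weights[p] for p in a & b)
    total = sum(weights[p] for p in a | b)
    return 1.0 - shared / total if total > 0 else 1.0


def complete_linkage(spectra, threshold):
    """Naive complete-linkage clustering: merge the two closest clusters while
    their largest pairwise distance stays at or below the threshold."""
    weights = rarity_weights(spectra)
    dist = {
        frozenset((i, j)): rarity_distance(spectra[i], spectra[j], weights)
        for i, j in combinations(spectra, 2)
    }
    clusters = [frozenset([name]) for name in spectra]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(clusters, 2):
            # Complete linkage: cluster distance is the maximum pairwise distance.
            d = max(dist[frozenset((i, j))] for i in x for j in y)
            if best is None or d < best[0]:
                best = (d, x, y)
        if best[0] > threshold:
            break
        _, x, y = best
        clusters = [c for c in clusters if c not in (x, y)] + [x | y]
    return sorted(sorted(c) for c in clusters)


# Hypothetical spectra, each reduced to a set of integer peak bins.
spectra = {
    "s1": {100, 200, 300, 500},
    "s2": {100, 200, 300, 600},
    "s3": {100, 700, 800, 900},
    "s4": {100, 700, 800, 950},
}
print(complete_linkage(spectra, threshold=0.7))
```

In this toy data, bin 100 occurs in every spectrum and therefore contributes nothing, so the grouping is driven by the rarer shared bins. Because complete linkage uses the largest pairwise distance, a merge is accepted only when every member of one cluster is close to every member of the other, which is why the scheme is considered conservative.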

Place, publisher, year, edition, pages
American Chemical Society (ACS), 2016. Vol. 15, no 3, p. 713-720
Keywords [en]
Mass spectrometry, proteomics, hierarchical clustering bioinformatics, database search, spectral archives, spectral libraries
National Category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:kth:diva-184544
DOI: 10.1021/acs.jproteome.5b00749
ISI: 000371754100005
PubMedID: 26653874
Scopus ID: 2-s2.0-84960456163
OAI: oai:DiVA.org:kth-184544
DiVA, id: diva2:917308
Funder
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160406

Available from: 2016-04-06 Created: 2016-04-01 Last updated: 2018-10-01. Bibliographically approved
In thesis
1. Statistical and machine learning methods to analyze large-scale mass spectrometry data
2016 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Like many other fields, biology is faced with enormous amounts of data that contain valuable information yet to be extracted. The field of proteomics, the study of proteins, has the luxury of having large repositories containing data from tandem mass spectrometry experiments, readily accessible to everyone who is interested. At the same time, there is still a lot to discover about proteins as the main actors in cell processes and cell signaling.

In this thesis, we explore several ways to extract more information from the available data using methods from statistics and machine learning. In particular, we introduce MaRaCluster, a new method for clustering mass spectra on large-scale datasets. This method uses statistical methods to assess similarity between mass spectra, followed by the conservative complete-linkage clustering algorithm. The combination of these two resulted in up to 40% more peptide identifications on its consensus spectra compared to the state-of-the-art method.

Second, we attempt to clarify and promote protein-level false discovery rates (FDRs). Frequently, studies fail to report protein-level FDRs even though the proteins are actually the entities of interest. We provided a framework in which to discuss protein-level FDRs in a systematic manner to open up the discussion and take away potential hesitance. We also benchmarked some scalable protein inference methods and included the best one in the Percolator package. Furthermore, we added functionality to the Percolator package to accommodate the analysis of studies in which many runs are aggregated. This reduced the run time for a recent study regarding a draft human proteome from almost a full day to just 10 minutes on a commodity computer, resulting in a list of proteins together with their corresponding protein-level FDRs.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. p. vi, 44
Series
TRITA-BIO-Report, ISSN 1654-2312 ; 2016:3
Keywords
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein inference, large-scale studies, simulation
National Category
Bioinformatics and Systems Biology
Research subject
Biotechnology
Identifiers
urn:nbn:se:kth:diva-185149 (URN)
978-91-7595-933-7 (ISBN)
Presentation
2016-05-03, Pascal, våning 6 i Gamma-huset, Science for Life Laboratory, Tomtebodavägen 23, Solna, 13:00 (English)
Note

QC 20160412

Available from: 2016-04-12 Created: 2016-04-11 Last updated: 2016-04-12. Bibliographically approved
2. Statistical and machine learning methods to analyze large-scale mass spectrometry data
2018 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Modern biology is faced with vast amounts of data that contain valuable information yet to be extracted. Proteomics, the study of proteins, has repositories with thousands of mass spectrometry experiments. These data gold mines could further our knowledge of proteins as the main actors in cell processes and signaling. Here, we explore methods to extract more information from this data using statistical and machine learning methods.

First, we present advances for studies that aggregate hundreds of runs. We introduce MaRaCluster, which clusters mass spectra for large-scale datasets using statistical methods to assess similarity of spectra. It identified up to 40% more peptides than the state-of-the-art method, MS-Cluster. Further, we accommodated large-scale data analysis in Percolator, a popular post-processing tool for mass spectrometry data. This reduced the runtime for a draft human proteome study from a full day to 10 minutes.

Second, we clarify and promote the contentious topic of protein false discovery rates (FDRs). Often, studies report lists of proteins but fail to report protein FDRs. We provide a framework to systematically discuss protein FDRs and take away hesitance. We also added protein FDRs to Percolator, opting for the best-peptide approach which proved superior in a benchmark of scalable protein inference methods.
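
As a rough illustration of the best-peptide idea mentioned above, the following Python sketch scores each protein by its highest-scoring peptide and estimates protein-level q-values from a target-decoy ranking. It is a simplified sketch under stated assumptions and not Percolator's implementation: the input tuples, the decoy/target ratio used as the FDR estimate, and the function name `best_peptide_protein_fdr` are hypothetical.

```python
def best_peptide_protein_fdr(peptides):
    """Estimate protein-level q-values with a best-peptide score.

    peptides: iterable of (protein_id, score, is_decoy) tuples, where a higher
    score means a more confident peptide. Returns (protein_id, q_value) pairs
    for target proteins, ordered from best to worst score.
    """
    # Best-peptide approach: a protein's score is its single best peptide score.
    best = {}
    for protein, score, is_decoy in peptides:
        key = (protein, is_decoy)
        if key not in best or score > best[key]:
            best[key] = score

    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)

    # Assumed FDR estimate at each rank: decoys seen so far / targets seen so far.
    targets = decoys = 0
    fdrs = []
    for (protein, is_decoy), _score in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this rank or any later, less strict rank.
    q_values = []
    running = float("inf")
    for fdr in reversed(fdrs):
        running = min(running, fdr)
        q_values.append(running)
    q_values.reverse()

    return [
        (protein, q)
        for ((protein, is_decoy), _score), q in zip(ranked, q_values)
        if not is_decoy
    ]


# Hypothetical peptide-level input: (protein, score, is_decoy).
example = [
    ("P1", 9.0, False), ("P1", 4.0, False), ("P2", 7.5, False),
    ("D1", 6.0, True), ("P3", 5.0, False), ("D2", 3.0, True),
]
for protein, q in best_peptide_protein_fdr(example):
    print(protein, round(q, 3))
```

Taking the maximum peptide score per protein keeps the procedure linear in the number of peptides apart from the final sort, which is one reason such an approach scales to studies that aggregate many runs.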

Third, we tackle the low sensitivity of protein quantification methods. Current methods lack proper control of error sources and propagation. To remedy this, we developed Triqler, which controls the protein quantification FDR through a Bayesian framework. We also introduce MaRaQuant, which proposes a quantification-first approach that applies clustering prior to identification. This reduced the number of spectra to be searched and allowed us to spot unidentified analytes of interest. Combining these tools outperformed the state-of-the-art method, MaxQuant/Perseus, and found enriched functional terms for datasets that had none before.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2018. p. 64
Series
TRITA-CBH-FOU ; 2018:45
Keywords
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein inference, large-scale studies, simulation, protein quantification, clustering, machine learning, Bayesian statistics
National Category
Bioinformatics (Computational Biology)
Research subject
Biotechnology
Identifiers
urn:nbn:se:kth:diva-235629 (URN)
978-91-7729-967-7 (ISBN)
Public defence
2018-10-24, Atrium, Nobels väg 12B, Solna, 13:00 (English)
Note

QC 20181001

Available from: 2018-10-01 Created: 2018-10-01 Last updated: 2018-10-01. Bibliographically approved

Authority records

The, Matthew; Käll, Lukas
