Publications (10 of 12)
The, M. & Käll, L. (2019). Integrated Identification and Quantification Error Probabilities for Shotgun Proteomics. Molecular & Cellular Proteomics, 18(3), 561-570
2019 (English). In: Molecular & Cellular Proteomics, ISSN 1535-9476, E-ISSN 1535-9484, Vol. 18, no 3, p. 561-570. Article in journal (Refereed). Published
Abstract [en]

Protein quantification by label-free shotgun proteomics experiments is plagued by a multitude of error sources. Typical pipelines for identifying differential proteins use intermediate filters to control the error rate. However, they often ignore certain error sources and, moreover, regard filtered lists as completely correct in subsequent steps. These two indiscretions can easily lead to a loss of control of the false discovery rate (FDR). We propose a probabilistic graphical model, Triqler, that propagates error information through all steps, employing distributions in favor of point estimates, most notably for missing value imputation. The model outputs posterior probabilities for fold changes between treatment groups, highlighting uncertainty rather than hiding it. We analyzed 3 engineered data sets and achieved FDR control and high sensitivity, even for truly absent proteins. In a bladder cancer clinical data set we discovered 35 proteins at 5% FDR, whereas the original study discovered 1 and MaxQuant/Perseus 4 proteins at this threshold. Compellingly, these 35 proteins showed enrichment for functional annotation terms, whereas the top ranked proteins reported by MaxQuant/Perseus showed no enrichment. The model executes in minutes and is freely available at https://pypi.org/project/triqler/.

Place, publisher, year, edition, pages
AMER SOC BIOCHEMISTRY MOLECULAR BIOLOGY INC, 2019
National Category
Biological Sciences
Identifiers
urn:nbn:se:kth:diva-252661 (URN); 10.1074/mcp.RA118.001018 (DOI); 000467885100013 (); 30482846 (PubMedID); 2-s2.0-85062999333 (Scopus ID)
Note

QC 20190610

Available from: 2019-06-10. Created: 2019-06-10. Last updated: 2019-06-10. Bibliographically approved
Halloran, J. T., Zhang, H., Kara, K., Renggli, C., The, M., Zhang, C., . . . Noble, W. S. (2019). Speeding Up Percolator. Journal of Proteome Research, 18(9), 3353-3359
2019 (English). In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 18, no 9, p. 3353-3359. Article in journal (Refereed). Published
Abstract [en]

The processing of peptide tandem mass spectrometry data involves matching observed spectra against a sequence database. The ranking and calibration of these peptide-spectrum matches can be improved substantially using a machine learning postprocessor. Here, we describe our efforts to speed up one widely used postprocessor, Percolator. The improved software is dramatically faster than the previous version of Percolator, even when using relatively few processors. We tested the new version of Percolator on a data set containing over 215 million spectra and recorded an overall reduction to 23% of the running time as compared to the unoptimized code. We also show that the memory footprint required by these speedups is modest relative to that of the original version of Percolator.

Place, publisher, year, edition, pages
AMER CHEMICAL SOC, 2019
Keywords
tandem mass spectrometry, machine learning, support vector machine, SVM, percolator
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-261034 (URN); 10.1021/acs.jproteome.9b00288 (DOI); 000485089100012 (); 31407580 (PubMedID); 2-s2.0-85071999233 (Scopus ID)
Note

QC 20191002

Available from: 2019-10-02. Created: 2019-10-02. Last updated: 2019-10-02. Bibliographically approved
The, M., Edfors, F., Perez-Riverol, Y., Payne, S. H., Hoopmann, M. R., Palmblad, M., . . . Käll, L. (2018). A Protein Standard That Emulates Homology for the Characterization of Protein Inference Algorithms. Journal of Proteome Research, 17(5), 1879-1886
2018 (English). In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 17, no 5, p. 1879-1886. Article in journal (Refereed). Published
Abstract [en]

A natural way to benchmark the performance of an analytical experimental setup is to use samples of known composition and see to what degree one can correctly infer the content of such a sample from the data. For shotgun proteomics, one of the inherent problems of interpreting data is that the measured analytes are peptides and not the actual proteins themselves. As some proteins share proteolytic peptides, there might be more than one possible causative set of proteins resulting in a given set of peptides, and there is a need for mechanisms that infer proteins from lists of detected peptides. A weakness of commercially available samples of known content is that they consist of proteins that are deliberately selected for producing tryptic peptides that are unique to a single protein. Unfortunately, such samples do not expose any complications in protein inference. Hence, for a realistic benchmark of protein inference procedures, there is a need for samples of known content where the present proteins share peptides with known absent proteins. Here, we present such a standard, which is based on E. coli expressed human protein fragments. To illustrate the application of this standard, we benchmark a set of different protein inference procedures on the data. We observe that inference procedures excluding shared peptides provide more accurate estimates of errors compared to methods that include information from shared peptides, while still giving a reasonable performance in terms of the number of identified proteins. We also demonstrate that using a sample of known protein content without proteins with shared tryptic peptides can give a false sense of accuracy for many protein inference methods.

Place, publisher, year, edition, pages
American Chemical Society (ACS), 2018
Keywords
mass spectrometry, proteomics, protein inference, sample of known content, protein standard, proteoform, peptide, homology, benchmark
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:kth:diva-228270 (URN); 10.1021/acs.jproteome.7b00899 (DOI); 000431726700013 (); 29631402 (PubMedID); 2-s2.0-85046675818 (Scopus ID)
Note

QC 20180522

Available from: 2018-05-22. Created: 2018-05-22. Last updated: 2018-12-05. Bibliographically approved
Lee, J.-Y., Choi, H., Colangelo, C. M., Davis, D., Hoopmann, M. R., Käll, L., . . . Palmblad, M. (2018). ABRF Proteome Informatics Research Group (iPRG) 2016 Study: Inferring Proteoforms from Bottom-up Proteomics Data. Journal of biomolecular techniques : JBT, 29(2), 39-45
2018 (English). In: Journal of biomolecular techniques : JBT, ISSN 1943-4731, Vol. 29, no 2, p. 39-45. Article in journal (Refereed). Published
Abstract [en]

This report presents the results from the 2016 Association of Biomolecular Resource Facilities Proteome Informatics Research Group (iPRG) study on proteoform inference and false discovery rate (FDR) estimation from bottom-up proteomics data. For this study, 3 replicate Q Exactive Orbitrap liquid chromatography-tandem mass spectrometry datasets were generated from each of 4 Escherichia coli samples spiked with different equimolar mixtures of small recombinant proteins selected to mimic pairs of homologous proteins. Participants were given raw data and a sequence file and asked to identify the proteins and provide estimates on the FDR at the proteoform level. As part of this study, we tested a new submission system with a format validator running on a virtual private server (VPS) and allowed methods to be provided as executable R Markdown or IPython Notebooks. The task was perceived as difficult, and only eight unique submissions were received, although those who participated did well with no one method performing best on all samples. However, none of the submissions included a complete Markdown or Notebook, even though examples were provided. Future iPRG studies need to be more successful in promoting and encouraging participation. The VPS and submission validator easily scale to much larger numbers of participants in these types of studies. The unique "ground-truth" dataset for proteoform identification generated for this study is now available to the research community, as are the server-side scripts for validating and managing submissions.

Place, publisher, year, edition, pages
NLM (Medline), 2018
Keywords
best practice, community study, false discovery rate, inference
National Category
Biological Sciences
Identifiers
urn:nbn:se:kth:diva-247210 (URN); 10.7171/jbt.18-2902-003 (DOI); 2-s2.0-85059915162 (Scopus ID)
Note

QC 20190415

Available from: 2019-04-15. Created: 2019-04-15. Last updated: 2019-04-15. Bibliographically approved
Griss, J., Perez-Riverol, Y., The, M., Käll, L. & Vizcaino, J. A. (2018). Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra". Journal of Proteome Research, 17(5), 1993-1996
2018 (English). In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 17, no 5, p. 1993-1996. Article in journal (Refereed). Published
Abstract [en]

In the recent benchmarking article entitled "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra", Rieder et al. compared several different approaches to cluster MS/MS spectra. While we certainly recognize the value of the manuscript, here, we report some shortcomings detected in the original analyses. For most analyses, the authors clustered only single MS/MS runs. In one of the reported analyses, three MS/MS runs were processed together, which already led to computational performance issues in many of the tested approaches. This fact highlights the difficulties of using many of the tested algorithms on the nowadays produced average proteomics data sets. Second, the authors only processed identified spectra when merging MS runs. Thereby, all unidentified spectra that are of lower quality were already removed from the data set and could not influence the clustering results. Next, we found that the authors did not analyze the effect of chimeric spectra on the clustering results. In our analysis, we found that 3% of the spectra in the used data sets were chimeric, and this had marked effects on the behavior of the different clustering algorithms tested. Finally, the authors' choice to evaluate the MS-Cluster and spectra-cluster algorithms using a precursor tolerance of 5 Da for high-resolution Orbitrap data only was, in our opinion, not adequate to assess the performance of MS/MS clustering approaches.

Place, publisher, year, edition, pages
AMER CHEMICAL SOC, 2018
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:kth:diva-228271 (URN); 10.1021/acs.jproteome.7b00824 (DOI); 000431726700024 (); 29682973 (PubMedID); 2-s2.0-85046629294 (Scopus ID)
Note

QC 20180522

Available from: 2018-05-22. Created: 2018-05-22. Last updated: 2018-05-22. Bibliographically approved
Afkham, H. M., Qiu, X., The, M. & Käll, L. (2017). Uncertainty estimation of predictions of peptides' chromatographic retention times in shotgun proteomics. Bioinformatics, 33(4), 508-513
2017 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 33, no 4, p. 508-513. Article in journal (Refereed). Published
Abstract [en]

Motivation: Liquid chromatography is frequently used as a means to reduce the complexity of peptide-mixtures in shotgun proteomics. For such systems, the time when a peptide is released from a chromatography column and registered in the mass spectrometer is referred to as the peptide's retention time. Using heuristics or machine learning techniques, previous studies have demonstrated that it is possible to predict the retention time of a peptide from its amino acid sequence. In this paper, we are applying Gaussian Process Regression to the feature representation of a previously described predictor ELUDE. Using this framework, we demonstrate that it is possible to estimate the uncertainty of the prediction made by the model. Here we show how this uncertainty relates to the actual error of the prediction. Results: In our experiments, we observe a strong correlation between the estimated uncertainty provided by Gaussian Process Regression and the actual prediction error. This relation provides us with new means for assessment of the predictions. We demonstrate how a subset of the peptides can be selected with lower prediction error compared to the whole set. We also demonstrate how such predicted standard deviations can be used for designing adaptive windowing strategies.

Place, publisher, year, edition, pages
OXFORD UNIV PRESS, 2017
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-205074 (URN); 10.1093/bioinformatics/btw619 (DOI); 000397264100006 (); 2-s2.0-85028336596 (Scopus ID)
Note

QC 20170626

Available from: 2017-06-26. Created: 2017-06-26. Last updated: 2018-09-19. Bibliographically approved
The, M., MacCoss, M. J., Noble, W. S. & Käll, L. (2016). Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0. Journal of the American Society for Mass Spectrometry, 27(11), 1719-1727
2016 (English). In: Journal of the American Society for Mass Spectrometry, ISSN 1044-0305, E-ISSN 1879-1123, Vol. 27, no 11, p. 1719-1727. Article in journal (Refereed). Published
Abstract [en]

Percolator is a widely used software tool that increases yield in shotgun proteomics experiments and assigns reliable statistical confidence measures, such as q values and posterior error probabilities, to peptides and peptide-spectrum matches (PSMs) from such experiments. Percolator's processing speed has been sufficient for typical data sets consisting of hundreds of thousands of PSMs. With our new scalable approach, we can now also analyze millions of PSMs in a matter of minutes on a commodity computer. Furthermore, with the increasing awareness for the need for reliable statistics on the protein level, we compared several easy-to-understand protein inference methods and implemented the best-performing method (grouping proteins by their corresponding sets of theoretical peptides and then considering only the best-scoring peptide for each protein) in the Percolator package. We used Percolator 3.0 to analyze the data from a recent study of the draft human proteome containing 25 million spectra (PM:24870542). The source code and Ubuntu, Windows, MacOS, and Fedora binary packages are available from http://percolator.ms/ under an Apache 2.0 license.
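
The inference rule named in this abstract (group proteins by their sets of theoretical peptides, then score each group by its best-scoring peptide) can be illustrated with a minimal Python sketch. This is not Percolator's code: the function name, the input dictionaries, and the toy peptide sequences and scores are all hypothetical, and the sketch assumes that a higher score is better.

```python
from collections import defaultdict


def infer_protein_groups(theoretical_peptides, peptide_scores):
    """Group proteins with identical theoretical peptide sets and score
    each group by its best-scoring observed peptide.

    theoretical_peptides: dict mapping protein id -> iterable of peptides
    peptide_scores: dict mapping observed peptide -> score (higher is better)
    Both inputs are hypothetical illustrations, not a Percolator format.
    """
    # Proteins that produce exactly the same peptides cannot be told apart,
    # so they are merged into one group keyed by the peptide set.
    groups = defaultdict(list)
    for protein, peptides in theoretical_peptides.items():
        groups[frozenset(peptides)].append(protein)

    results = []
    for peptides, proteins in groups.items():
        observed = [peptide_scores[p] for p in peptides if p in peptide_scores]
        if observed:  # groups without any observed peptide are skipped
            results.append((tuple(sorted(proteins)), max(observed)))

    # Rank groups from most to least confident.
    results.sort(key=lambda item: item[1], reverse=True)
    return results


theoretical = {
    "P1": ["AAK", "CCR"],
    "P2": ["AAK", "CCR"],  # indistinguishable from P1
    "P3": ["DDK"],
    "P4": ["EEK"],         # no observed peptide
}
scores = {"AAK": 2.5, "CCR": 0.7, "DDK": 1.2}
ranked = infer_protein_groups(theoretical, scores)
```

In this toy input, P1 and P2 end up in one group scored by the peptide AAK, and P4 is dropped because none of its peptides was observed.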

Place, publisher, year, edition, pages
Springer, 2016
Keywords
Data processing and analysis, Large scale studies, Mass spectrometry - LC-MS/MS, Protein inference, Statistical analysis, Bioinformatics, Data handling, Mass spectrometry, Molecular biology, Peptides, Probability, Statistical methods, Error probabilities, False discovery rate, Large-scale studies, LC-MS/MS, Scalable approach, Shotgun proteomics, Statistical confidence, Proteins
National Category
Biological Sciences
Identifiers
urn:nbn:se:kth:diva-195221 (URN); 10.1007/s13361-016-1460-7 (DOI); 000385158400002 (); 2-s2.0-84991105210 (Scopus ID)
Note

QC 20161117

Available from: 2016-11-17. Created: 2016-11-02. Last updated: 2018-10-01. Bibliographically approved
The, M., Tasnim, A. & Käll, L. (2016). How to talk about protein-level false discovery rates in shotgun proteomics. Proteomics, 16(18), 2461-2469
2016 (English). In: Proteomics, ISSN 1615-9853, E-ISSN 1615-9861, Vol. 16, no 18, p. 2461-2469. Article in journal (Refereed). Published
Abstract [en]

A frequently sought output from a shotgun proteomics experiment is a list of proteins that we believe to have been present in the analyzed sample before proteolytic digestion. The standard technique to control for errors in such lists is to enforce a preset threshold for the false discovery rate (FDR). Many consider protein-level FDRs a difficult and vague concept, as the measurement entities, spectra, are manifestations of peptides and not proteins. Here, we argue that this confusion is unnecessary and provide a framework on how to think about protein-level FDRs, starting from its basic principle: the null hypothesis. Specifically, we point out that two competing null hypotheses are used concurrently in today's protein inference methods, which has gone unnoticed by many. Using simulations of a shotgun proteomics experiment, we show how confusing one null hypothesis for the other can lead to serious discrepancies in the FDR. Furthermore, we demonstrate how the same simulations can be used to verify FDR estimates of protein inference methods. In particular, we show that, for a simple protein inference method, decoy models can be used to accurately estimate protein-level FDRs for both competing null hypotheses.
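
As a rough illustration of how decoys can be used to estimate an FDR, the following Python sketch computes q values from a ranked list of target and decoy entries. It is a generic target-decoy estimate (decoys divided by targets above each score threshold), not the specific decoy models or null hypotheses evaluated in the paper; the function name and the toy scores are hypothetical.

```python
def target_decoy_q_values(scored_entries):
    """Estimate q values for targets from a mixed target/decoy list.

    scored_entries: list of (score, is_decoy) pairs, higher score is better.
    Returns (score, q_value) pairs for the targets, best score first.
    """
    ranked = sorted(scored_entries, key=lambda entry: entry[0], reverse=True)

    # FDR estimate at each threshold: decoys accepted / targets accepted.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q value: the smallest FDR at this threshold or any more lenient one.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [
        (score, q)
        for (score, is_decoy), q in zip(ranked, q_values)
        if not is_decoy
    ]


entries = [(9.0, False), (8.0, False), (7.0, True), (6.0, False), (5.0, True)]
q_list = target_decoy_q_values(entries)
```

Here the two best targets receive a q value of 0, while the target scoring 6.0 receives 1/3 because one decoy outranks it among three accepted targets.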

Place, publisher, year, edition, pages
Wiley-Blackwell, 2016
Keywords
Bioinformatics, Data processing and analysis, Mass spectrometry-LC-MS/MS, Protein inference, Simulation, Statistical analysis
National Category
Biophysics Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:kth:diva-196441 (URN); 10.1002/pmic.201500431 (DOI); 000385813600005 (); 27503675 (PubMedID); 2-s2.0-84988369698 (Scopus ID)
Note

QC 20161129

Available from: 2016-11-29. Created: 2016-11-14. Last updated: 2018-10-01. Bibliographically approved
The, M. & Käll, L. (2016). MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics. Journal of Proteome Research, 15(3), 713-720
2016 (English). In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 15, no 3, p. 713-720. Article in journal (Refereed). Published
Abstract [en]

Shotgun proteomics experiments generate large amounts of fragment spectra as primary data, normally with high redundancy between and within experiments. Here, we have devised a clustering technique to identify fragment spectra stemming from the same species of peptide. This is a powerful alternative method to traditional search engines for analyzing spectra, specifically useful for larger scale mass spectrometry studies. As an aid in this process, we propose a distance calculation relying on the rarity of experimental fragment peaks, following the intuition that peaks shared by only a few spectra offer more evidence than peaks shared by a large number of spectra. We used this distance calculation and a complete-linkage scheme to cluster data from a recent large-scale mass spectrometry-based study. The clusterings produced by our method have up to 40% more identified peptides for their consensus spectra compared to those produced by the previous state-of-the-art method. We see that our method would advance the construction of spectral libraries as well as serve as a tool for mining large sets of fragment spectra. The source code and Ubuntu binary packages are available at https://github.com/statisticalbiotechnology/maracluster (under an Apache 2.0 license).

Place, publisher, year, edition, pages
American Chemical Society (ACS), 2016
Keywords
Mass spectrometry, proteomics, hierarchical clustering bioinformatics, database search, spectral archives, spectral libraries
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-184544 (URN); 10.1021/acs.jproteome.5b00749 (DOI); 000371754100005 (); 26653874 (PubMedID); 2-s2.0-84960456163 (Scopus ID)
Funder
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160406

Available from: 2016-04-06. Created: 2016-04-01. Last updated: 2018-10-01. Bibliographically approved
The, M. (2016). Statistical and machine learning methods to analyze large-scale mass spectrometry data. (Licentiate dissertation). Stockholm: KTH Royal Institute of Technology
2016 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

As in many other fields, biology is faced with enormous amounts of data that contain valuable information that is yet to be extracted. The field of proteomics, the study of proteins, has the luxury of having large repositories containing data from tandem mass-spectrometry experiments, readily accessible for everyone who is interested. At the same time, there is still a lot to discover about proteins as the main actors in cell processes and cell signaling.

In this thesis, we explore several methods to extract more information from the available data using methods from statistics and machine learning. In particular, we introduce MaRaCluster, a new method for clustering mass spectra on large-scale datasets. This method uses statistical methods to assess similarity between mass spectra, followed by the conservative complete-linkage clustering algorithm. The combination of these two resulted in up to 40% more peptide identifications on its consensus spectra compared to the state-of-the-art method.

Second, we attempt to clarify and promote protein-level false discovery rates (FDRs). Frequently, studies fail to report protein-level FDRs even though the proteins are actually the entities of interest. We provided a framework in which to discuss protein-level FDRs in a systematic manner to open up the discussion and take away potential hesitance. We also benchmarked some scalable protein inference methods and included the best one in the Percolator package. Furthermore, we added functionality to the Percolator package to accommodate the analysis of studies in which many runs are aggregated. This reduced the run time for a recent study regarding a draft human proteome from almost a full day to just 10 minutes on a commodity computer, resulting in a list of proteins together with their corresponding protein-level FDRs.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. p. vi, 44
Series
TRITA-BIO-Report, ISSN 1654-2312 ; 2016:3
Keywords
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein inference, large-scale studies, simulation
National Category
Bioinformatics and Systems Biology
Research subject
Biotechnology
Identifiers
urn:nbn:se:kth:diva-185149 (URN); 978-91-7595-933-7 (ISBN)
Presentation
2016-05-03, Pascal, våning 6 i Gamma-huset, Science for Life Laboratory, Tomtebodavägen 23, Solna, 13:00 (English)
Note

QC 20160412

Available from: 2016-04-12. Created: 2016-04-11. Last updated: 2016-04-12. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-5401-5553