Antibodies are crucial for the study of human proteins and have been defined as one of the three pillars in the human chromosome-centric Human Proteome Project (CHPP). In this article the chromosome-centric structure has been used to analyze the availability of antibodies as judged by the presence within the portal Antibodypedia, a database designed to allow comparisons and scoring of publicly available antibodies toward human protein targets. This public database displays antibody data from more than one million antibodies toward human protein targets. A summary of the content in this knowledge resource reveals that there exist more than 10 antibodies to over 70% of all the putative human genes, evenly distributed over the 24 human chromosomes. The analysis also shows that at present, less than 10% of the putative human protein-coding genes (n = 1882) predicted from the genome sequence lack antibodies, suggesting that focused efforts from the antibody-based and mass spectrometry-based proteomic communities should be encouraged to pursue the analysis of these missing proteins. We show that Antibodypedia may be used to track the development of available and validated antibodies to the individual chromosomes, and thus the database is an attractive tool to identify proteins with no or few antibodies yet generated.
Ovarian cancer is usually found at a late stage when the prognosis is often bad. Relative survival rates decrease with tumor stage or grade, and the 5-year survival rate for women with carcinoma is only 38%. Thus, there is a great need to find biomarkers that can be used to carry out routine screening, especially in high-risk patient groups. Here, we present a large-scale study of 64 tissue samples taken from patients at all stages and show that we can identify statistically valid markers using nonsupervised methods that distinguish between normal, benign, borderline, and malignant tissue. We have identified 217 of the significantly changing protein spots. We are expressing and raising antibodies to 35 of these. Currently, we have validated 5 of these antibodies for use in immunohistochemical analysis using tissue microarrays of healthy and diseased ovarian, as well as other, human tissues.
Glioblastoma is the most common primary brain tumor in adults with low average survival time after diagnosis. In order to improve glioblastoma treatment, new drug-accessible targets need to be identified. Cell surface glycoproteins are prime drug targets due to their accessibility at the surface of cancer cells. To overcome the limited availability of suitable antibodies for cell surface protein detection, we performed a comprehensive mass spectrometric investigation of the glioblastoma surfaceome. Our combined cell surface capturing analysis of primary ex vivo glioblastoma cell lines in combination with established glioblastoma cell lines revealed 633 N-glycoproteins, which vastly extends the known data of surfaceome drug targets at subcellular resolution. We provide direct evidence of common glioblastoma cell surface glycoproteins and an approximate estimate of their abundances, information that could not be derived from genomic and/or transcriptomic glioblastoma studies. Apart from our pharmaceutically valuable repertoire of already and potentially drug-accessible cell surface glycoproteins, we built a mass-spectrometry-based toolbox enabling directed, sensitive, and repetitive glycoprotein measurements for clinical follow-up studies. The included Skyline Glioblastoma SRM assay library provides an elevated starting point for parallel testing of the abundance level of the detected glioblastoma surfaceome members in future drug perturbation experiments.
For identification and characterization of proteins in complex samples, immunoenrichment coupled to mass spectrometry is a good alternative due to the sensitivity of the affinity enrichment and the specificity of mass spectrometry analysis. Antibodies are commonly used affinity agents; however, for high-throughput analysis, antibody availability is usually a bottleneck. Here we present a protocol for immunoenrichment coupled to mass spectrometry in a high-throughput setup, where all steps from bead coupling to mass spectrometry sample preparation are performed in parallel in a 96-well format. Antibodies generated within the Human Protein Atlas project were tested for applicability as capture agents. The antibodies were covalently attached to protein A beads, making it possible to reuse the coupled beads at least three times without destroying the antibody binding efficiency. Target proteins were captured from a U251 MG cell lysate, eluted, digested, and analyzed using mass spectrometry. Of 30 investigated antibodies, around 50% could successfully capture the corresponding native target protein, making the available library of more than 21 000 antibodies a valuable resource for immunoenrichment assays. Due to the diversity of different antibodies regarding affinity and specificity, analyzing antibodies in a high-throughput format is challenging. Even though protocol optimization for individual antibodies can be advantageous for future studies, our method enables a fast screening strategy to determine the usefulness of antibodies in immunoenrichment setups. In addition, we show that the specificity of the antibodies can be investigated by using label-free quantification.
Tree biotechnology will soon reach a mature state where it will influence the overall supply of fiber, energy and wood products. We are now ready to make the transition from identifying candidate genes, controlling important biological processes, to discovering the detailed molecular function of these genes on a broader, more holistic, systems biology level. In this paper, a strategy is outlined for informative data generation and integrated modeling of systematic changes in transcript, protein and metabolite profiles measured from hybrid aspen samples. The aim is to study characteristics of common changes in relation to genotype-specific perturbations affecting the lignin biosynthesis and growth. We show that a considerable part of the systematic effects in the system can be tracked across all platforms and that the approach has a high potential value in functional characterization of candidate genes.
The brain is a vital organ and because it is well shielded from the outside environment, possibilities for noninvasive analysis are often limited. Instead, fluids taken from the spinal cord or circulatory system are preferred sources for the discovery of candidate markers within neurological diseases. In the context of multiple sclerosis (MS), we applied an affinity proteomic strategy and screened 22 plasma samples with 4595 antibodies (3450 genes) on bead arrays, then defined 375 antibodies (334 genes) for targeted analysis in a set of 172 samples and finally used 101 antibodies (43 genes) on 443 plasma as well as 573 cerebrospinal fluid (CSF) samples. This revealed alteration of protein profiles in relation to MS subtypes for IRF8, IL7, METTL14, SLC30A7, and GAP43. Respective antibodies were subsequently used for immunofluorescence on human post-mortem brain tissue with MS pathology for expression and association analysis. There, antibodies for IRF8, IL7, and METTL14 stained neurons in proximity of lesions, which highlighted these candidate protein targets for further studies within MS and brain tissue. The affinity proteomic translation of profiles discovered by profiling human body fluids and tissue provides a powerful strategy to suggest additional candidates to studies of neurological disorders.
One of the major challenges of a chromosome-centric proteome project is to explore in a systematic manner the potential proteins identified from the chromosomal genome sequence, but not yet characterized on a protein level. Here, we describe the use of RNA deep sequencing to screen human cell lines for RNA profiles and to use this information to select cell lines suitable for characterization of the corresponding gene product. In this manner, the subcellular localization of proteins can be analyzed systematically using antibody-based confocal microscopy. We demonstrate the usefulness of selecting cell lines with high expression levels of RNA transcripts to increase the likelihood of high quality immunofluorescence staining and subsequent successful subcellular localization of the corresponding protein. The results show a path to combine transcriptomics with affinity proteomics to characterize the proteins in a gene- or chromosome-centric manner.
The 2017 Dagstuhl Seminar on Computational Proteomics provided an opportunity for a broad discussion on the current state and future directions of the generation and use of peptide tandem mass spectrometry spectral libraries. Their use in proteomics is growing slowly, but there are multiple challenges in the field that must be addressed to further increase the adoption of spectral libraries and related techniques. The primary bottlenecks are the paucity of high quality and comprehensive libraries and the general difficulty of adopting spectral library searching into existing workflows. There are several existing spectral library formats, but none captures a satisfactory level of metadata; therefore, a logical next improvement is to design a more advanced, Proteomics Standards Initiative-approved spectral library format that can encode all of the desired metadata. The group discussed a series of metadata requirements organized into three designations of completeness or quality, tentatively dubbed bronze, silver, and gold. The metadata can be organized at four different levels of granularity: at the collection (library) level, at the individual entry (peptide ion) level, at the peak (fragment ion) level, and at the peak annotation level. Strategies for encoding mass modifications in a consistent manner and the requirement for encoding high-quality and commonly seen but as-yet-unidentified spectra were discussed. The group also discussed related topics, including strategies for comparing two spectra, techniques for generating representative spectra for a library, approaches for selection of optimal signature ions for targeted workflows, and issues surrounding the merging of two or more libraries into one. We present here a review of this field and the challenges that the community must address in order to accelerate the adoption of spectral libraries in routine analysis of proteomics datasets.
The availability of proteomics resources hosting protein and peptide standards, as well as the data describing their analytical performances, will continue to enhance our current capabilities to develop targeted proteomics methods for quantitative biology. This study describes the analysis of a resource of 26,840 individually purified recombinant protein fragments corresponding to more than 16,000 human protein-coding genes. The resource was screened to identify proteotypic peptides suitable for targeted proteomics efforts, and we report LC-MS/MS assay coordinates for more than 25,000 proteotypic peptides, corresponding to more than 10,000 unique proteins. Additionally, peptide formation and digestion kinetics were, for a subset of the standards, monitored using a time-course protocol involving parallel digestion of isotope-labeled recombinant protein standards and endogenous human plasma proteins. We show that adding isotope-labeled recombinant proteins before trypsin digestion enables short digestion protocols (≤60 min) with robust quantitative precision. In a proof-of-concept study, we quantified 23 proteins in human plasma using assay parameters defined in our study and used the standards to describe distinct clusters of individuals linked to different levels of LPA, APOE, SERPINA5, and TFRC. In summary, we describe the use and utility of a resource of recombinant proteins to identify proteotypic peptides useful for targeted proteomics assay development.
A new and flexible technology for high throughput analysis of antibody specificity and affinity is presented. The method is based on microfluidics and takes advantage of compact disks (CDs) in which the centrifugal force moves fluids through microstructures containing immobilized metal affinity chromatography columns. Analyses are performed as a sandwich assay, where antigen is captured to the column via a genetically attached His(6)-tag. The antibodies to be analyzed are applied onto the columns. Thereafter, fluorescently labeled secondary antibodies recognize the bound primary antibodies, and detection is carried out by laser-induced fluorescence. The CDs contain 104 microstructures enabling analysis of antibodies against more than 100 different proteins using a single CD. Importantly, through the three-dimensional visualization of the binding patterns in a column it is possible to separate high affinity from low affinity binding. The method presented here is shown to be very sensitive, flexible and reproducible.
A gene-centric Human Proteome Project has been proposed to characterize the human protein-coding genes in a chromosome-centered manner to understand human biology and disease. Here, we report on the protein evidence for all genes predicted from the genome sequence based on manual annotation from literature (UniProt), antibody-based profiling in cells, tissues and organs and analysis of the transcript profiles using next generation sequencing in human cell lines of different origins. We estimate that there is good evidence for protein existence for 69% (n = 13985) of the human protein-coding genes, while 23% have only evidence on the RNA level and 7% still lack experimental evidence. Analysis of the expression patterns shows few tissue-specific proteins and approximately half of the genes expressed in all the analyzed cells. The status for each gene with regards to protein evidence is visualized in a chromosome-centric manner as part of a new version of the Human Protein Atlas (www.proteinatlas.org).
The subcellular locations of proteins are closely related to their function and constitute an essential aspect for understanding the complex machinery of living cells. A systematic effort has been initiated to map the protein distribution in three functionally different cell lines with the aim to provide a subcellular localization index for at least one representative protein from all human protein-encoding genes. Here, we present the results of over 4,000 proteins mapped to 16 subcellular compartments. The results indicate a ubiquitous protein expression with a majority of the proteins found in all three cell lines and a large portion localized to two or more compartments. The inter-relationships between the subcellular compartments are visualized in a protein-compartment network based on all detected proteins. Hierarchical clustering was performed to determine how closely related the organelles are in terms of protein constituents and compare the proteins detected in each cell type. Our results show distinct organelle proteomes, well conserved across the cell types, and demonstrate that biochemically similar organelles are grouped together.
Human cancer cell lines grown in vitro are frequently used to decipher basic cell biological phenomena and to also specifically study different forms of cancer. Here we present the first large-scale study of protein expression patterns in cell lines using an antibody-based proteomics approach. We analyzed the expression pattern of 5436 proteins in 45 different cell lines using hierarchical clustering, principal component analysis, and two-group comparisons for the identification of differentially expressed proteins. Our results show that immunohistochemically determined protein profiles can categorize cell lines into groups that overall reflect the tumor tissue of origin and that hematological cell lines appear to retain their protein profiles to a higher degree than cell lines established from solid tumors. The two-group comparisons reveal well-characterized proteins as well as previously unstudied proteins that could be of potential interest for further investigations. Moreover, multiple myeloma cells and cells of myeloid origin were found to share a protein profile, relative to the protein profile of lymphoid leukemia and lymphoma cells, possibly reflecting their common dependency of bone marrow microenvironment. This work also provides an extensive list of antibodies, for which high-resolution images as well as validation data are available on the Human Protein Atlas (www.proteinatlas.org), that are of potential use in cell line studies.
One can interpret fragmentation spectra stemming from peptides in mass-spectrometry-based proteomics experiments using so-called database search engines. Frequently, one also runs post-processors such as Percolator to assess the confidence, infer unique peptides, and increase the number of identifications. A recent search engine, MS-GF+, has shown promising results, due to a new and efficient scoring algorithm. However, MS-GF+ provides few statistical estimates about the peptide-spectrum matches, hence limiting the biological interpretation. Here, we enabled Percolator processing for MS-GF+ output and observed an increased number of identified peptides for a wide variety of data sets. In addition, Percolator directly reports p values and false discovery rate estimates, such as q values and posterior error probabilities, for peptide-spectrum matches, peptides, and proteins, functions that are useful for the whole proteomics community.
In shotgun proteomics, the quality of a hypothesized match between an observed spectrum and a peptide sequence is quantified by a score function. Because the score function lies at the heart of any peptide identification pipeline, this function greatly affects the final results of a proteomics assay. Consequently, valid statistical methods for assessing the quality of a given score function are extremely important. Previously, several research groups have used samples of known protein composition to assess the quality of a given score function. We demonstrate that this approach is problematic, because the outcome can depend on factors other than the score function itself. We then propose an alternative use of the same type of data to validate a score function. The central idea of our approach is that database matches that are not explained by any protein in the purified sample comprise a robust representation of incorrect matches. We apply our alternative assessment scheme to several commonly used score functions, and we show that our approach generates a reproducible measure of the calibration of a given peptide identification method. Furthermore, we show how our quality test can be useful in the development of novel score functions.
In the recent benchmarking article entitled "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra", Rieder et al. compared several different approaches to cluster MS/MS spectra. While we certainly recognize the value of the manuscript, here, we report some shortcomings detected in the original analyses. First, for most analyses, the authors clustered only single MS/MS runs. In one of the reported analyses, three MS/MS runs were processed together, which already led to computational performance issues in many of the tested approaches. This fact highlights the difficulties of using many of the tested algorithms on the average-sized proteomics data sets produced today. Second, the authors only processed identified spectra when merging MS runs. Thereby, all unidentified spectra that are of lower quality were already removed from the data set and could not influence the clustering results. Next, we found that the authors did not analyze the effect of chimeric spectra on the clustering results. In our analysis, we found that 3% of the spectra in the used data sets were chimeric, and this had marked effects on the behavior of the different clustering algorithms tested. Finally, the authors' choice to evaluate the MS-Cluster and spectra-cluster algorithms using a precursor tolerance of 5 Da for high-resolution Orbitrap data only was, in our opinion, not adequate to assess the performance of MS/MS clustering approaches.
The processing of peptide tandem mass spectrometry data involves matching observed spectra against a sequence database. The ranking and calibration of these peptide-spectrum matches can be improved substantially using a machine learning postprocessor. Here, we describe our efforts to speed up one widely used postprocessor, Percolator. The improved software is dramatically faster than the previous version of Percolator, even when using relatively few processors. We tested the new version of Percolator on a data set containing over 215 million spectra and recorded an overall reduction to 23% of the running time as compared to the unoptimized code. We also show that the memory footprint required by these speedups is modest relative to that of the original version of Percolator.
Osteoarthritis (OA) is the most common rheumatic disease and one of the most disabling pathologies worldwide. To date, the diagnostic methods of OA are very limited, and there are no available medications capable of halting its characteristic cartilage degeneration. Therefore, there is a significant interest in new biomarkers useful for the early diagnosis, prognosis, and therapeutic monitoring. In the recent years, protein microarrays have emerged as a powerful proteomic tool to search for new biomarkers. In this study, we have used two concepts for generating protein arrays, antigen microarrays, and NAPPA (nucleic acid programmable protein arrays), to characterize differential autoantibody profiles in a set of 62 samples from OA, rheumatoid arthritis (RA), and healthy controls. An untargeted screen was performed on 3840 protein fragments spotted on planar antigen arrays, and 373 antigens were selected for validation on bead-based arrays. In the NAPPA approach, a targeted screening was performed on 80 preselected proteins. The autoantibody targeting CHST14 was validated by ELISA in the same set of patients. Altogether, nine and seven disease related autoantibody target candidates were identified, and this work demonstrates a combination of these two array concepts for biomarker discovery and their usefulness for characterizing disease-specific autoantibody profiles.
Enhanced by the growing number of biobanks, biomarker studies can now be performed with reasonable statistical power by using large sets of samples. Antibody-based proteomics by means of suspension bead arrays offers one attractive approach to analyze serum, plasma, or CSF samples for such studies in microtiter plates. To expand measurements beyond single batches, with either 96 or 384 samples per plate, suitable normalization methods are required to minimize the variation between plates. Here we propose two normalization approaches utilizing MA coordinates. The multidimensional MA (multi-MA) and MA-loess both consider all samples of a microtiter plate per suspension bead array assay and thus do not require any external reference samples. We demonstrate the performance of the two MA normalization methods with data obtained from the analysis of 384 samples including both serum and plasma. Samples were randomized across 96-well sample plates, processed, and analyzed in assay plates, respectively. Using principal component analysis (PCA), we could show that plate-wise clusters found in the first two components were eliminated by multi-MA normalization as compared with other normalization methods. Furthermore, we studied the correlation profiles between random pairs of antibodies and found that both MA normalization methods substantially reduced the inflated correlation introduced by plate effects. Normalization approaches using multi-MA and MA-loess minimized batch effects arising from the analysis of several assay plates with antibody suspension bead arrays. In a simulated biomarker study, multi-MA restored associations lost due to plate effects. Our normalization approaches, which are available as the R package MDimNormn, could also be useful in studies using other types of high-throughput assay data.
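As a rough illustration of what "MA coordinates" means, the following minimal Python sketch shows the classic two-sample MA transform (M is the log2 ratio, A is the mean log2 intensity) and the removal of a constant offset by mean-centering M. It is not the multi-MA or MA-loess method of the abstract and it does not reflect the API of the R package; the function names, the pairing of values across two plates, and the toy intensities are all assumptions made for illustration only.

```python
import math

def to_ma(x, y):
    """Classic MA transform of two positive intensities."""
    m = math.log2(x) - math.log2(y)          # M: log2 ratio
    a = 0.5 * (math.log2(x) + math.log2(y))  # A: mean log2 intensity
    return m, a

def from_ma(m, a):
    """Inverse transform: recover the two intensities from (M, A)."""
    return 2 ** (a + m / 2), 2 ** (a - m / 2)

def center_pairwise(plate1, plate2):
    """Remove a constant plate offset by mean-centering M (illustrative only)."""
    ma = [to_ma(x, y) for x, y in zip(plate1, plate2)]
    mean_m = sum(m for m, _ in ma) / len(ma)
    # Subtract the average log ratio, then map back to intensities
    adjusted = [from_ma(m - mean_m, a) for m, a in ma]
    return [p[0] for p in adjusted], [p[1] for p in adjusted]

# Hypothetical toy data: plate 2 reads systematically at half of plate 1
plate1 = [4.0, 8.0, 16.0]
plate2 = [2.0, 4.0, 8.0]
norm1, norm2 = center_pairwise(plate1, plate2)
```

In this toy case the two plates differ only by a constant factor, so centering M brings the paired values into agreement while leaving A, the overall intensity level, unchanged.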
This paper summarizes the recent activities of the Chromosome-Centric Human Proteome Project (C-HPP) consortium, which develops new technologies to identify yet-to-be annotated proteins (termed "missing proteins") in biological samples that lack sufficient experimental evidence at the protein level for confident protein identification. The C-HPP also aims to identify new protein forms that may be caused by genetic variability, post-translational modifications, and alternative splicing. Proteogenomic data integration forms the basis of the C-HPP's activities; therefore, we have summarized some of the key approaches and their roles in the project. We present new analytical technologies that improve the chemical space and lower detection limits coupled to bioinformatics tools and some publicly available resources that can be used to improve data analysis or support the development of analytical assays. Most of this paper's content has been compiled from posters, slides, and discussions presented in the series of C-HPP workshops held during 2014. All data (posters, presentations) used are available at the C-HPP Wiki (http://c-hpp.webhosting.rug.nl/) and in the Supporting Information.
Automated methods for assigning peptides to observed tandem mass spectra typically return a list of peptide-spectrum matches, ranked according to an arbitrary score. In this article, we describe methods for converting these arbitrary scores into more useful statistical significance measures. These methods employ a decoy sequence database as a model of the null hypothesis, and use false discovery rate (FDR) analysis to correct for multiple testing. We first describe a simple FDR inference method and then describe how estimating and taking into account the percentage of incorrectly identified spectra in the entire data set can lead to increased statistical power.
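The decoy-based procedure described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes targets and decoys were searched separately with databases of equal size, that higher scores are better, and that the FDR at a threshold is estimated as the number of decoys divided by the number of targets scoring at or above it. The function name and the `pi0` parameter (standing in for the estimated fraction of incorrect target matches) are hypothetical.

```python
def target_decoy_qvalues(psms, pi0=1.0):
    """Estimate q-values for target PSMs from a target-decoy search.

    psms: list of (score, is_decoy) tuples; higher score = better match.
    pi0:  assumed fraction of incorrect target PSMs (1.0 = conservative).
    Returns a list of (score, q_value) for target PSMs, best score first.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            # Simple FDR estimate at this score threshold, capped at 1
            fdrs.append((score, min(1.0, pi0 * decoys / targets)))
    # q-value: the minimum FDR over this and all more permissive thresholds
    qvals = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append((score, running_min))
    qvals.reverse()
    return qvals

# Hypothetical toy input: four target and two decoy matches
example = [(9.0, False), (8.0, False), (7.0, True),
           (6.0, False), (5.0, False), (4.0, True)]
result = target_decoy_qvalues(example)
```

Passing a `pi0` below 1 scales every FDR estimate down, which is how accounting for the proportion of incorrect identifications can yield more accepted matches at the same threshold.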
A variety of methods have been described in the literature for assigning statistical significance to peptides identified via tandem mass spectrometry. Here, we explain how two types of scores, the q-value and the posterior error probability, are related and complementary to one another.
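One common way to state the relationship is that the posterior error probability (PEP) describes the chance that a single match is wrong, while the FDR of a set of accepted matches is the average of their PEPs, and the q-value is the smallest such FDR at which a given match is accepted. The Python sketch below illustrates this under the assumption that well-calibrated PEPs are already available; the function name and the toy values are hypothetical.

```python
def qvalues_from_peps(peps):
    """Derive q-values from posterior error probabilities (PEPs).

    Assumes the PEPs are well calibrated. Returns (pep, q_value) pairs
    ordered from the most to the least confident match.
    """
    ordered = sorted(peps)
    fdrs = []
    total = 0.0
    for i, pep in enumerate(ordered, start=1):
        total += pep
        # Expected fraction of errors among the i best matches
        fdrs.append(total / i)
    # q-value: minimum FDR over all thresholds that still include the match
    qvals = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append(running_min)
    qvals.reverse()
    return list(zip(ordered, qvals))

# Hypothetical PEPs for four matches
pairs = qvalues_from_peps([0.5, 0.01, 0.10, 0.02])
```

In this toy example the weakest match has a PEP of 0.5 yet a much lower q-value, which shows why the two measures are complementary: the q-value characterizes the accepted set as a whole, whereas the PEP characterizes one match.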
One year ago the Human Proteome Project (HPP) leadership designated the baseline metrics for the Human Proteome Project to be based on neXtProt with a total of 13 664 proteins validated at protein evidence level 1 (PE1) by mass spectrometry, antibody-capture, Edman sequencing, or 3D structures. Corresponding chromosome-specific data were provided from PeptideAtlas, GPMdb, and Human Protein Atlas. This year, the neXtProt total is 15 646 and the other resources, which are inputs to neXtProt, have high-quality identifications and additional annotations for 14 012 in PeptideAtlas, 14 869 in GPMdb, and 10 976 in HPA. We propose to remove 638 genes from the denominator that are "uncertain" or "dubious" in Ensembl, UniProt/SwissProt, and neXtProt. That leaves 3844 "missing proteins", currently having no or inadequate documentation, to be found from a new denominator of 19 490 protein-coding genes. We present those tabulations and web links and discuss current strategies to find the missing proteins.
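The tabulation in this abstract can be checked with simple arithmetic. The short Python sketch below uses only the numbers quoted above; the variable names are illustrative, and the previous denominator is an inferred value that assumes the new denominator equals the old one minus the 638 removed genes.

```python
# Figures quoted in the abstract
pe1_last_year = 13_664    # neXtProt PE1 proteins one year earlier
pe1_this_year = 15_646    # neXtProt PE1 proteins this year
removed_genes = 638       # "uncertain" or "dubious" entries to be removed
new_denominator = 19_490  # protein-coding genes after the removal

# Derived quantities
missing_proteins = new_denominator - pe1_this_year
implied_old_denominator = new_denominator + removed_genes  # inferred, not stated
gain_in_pe1 = pe1_this_year - pe1_last_year
coverage = pe1_this_year / new_denominator

print(f"missing proteins: {missing_proteins}")
print(f"implied previous denominator: {implied_old_denominator}")
print(f"PE1 gain over one year: {gain_in_pe1}")
print(f"PE1 coverage of new denominator: {coverage:.1%}")
```

The difference of 3844 matches the count of "missing proteins" reported in the abstract.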
One subproject within the global Chromosome 19 Consortium is to define chromosome 19 gene and protein expression in glioma-derived cancer stem cells (GSCs). Chromosome 19 is notoriously linked to glioma by 1p/19q codeletions, and clinical tests are established to detect that specific aberration. GSCs are tumor-initiating cells and are hypothesized to provide a repository of cells in tumors that can self-replicate and be refractory to radiation and chemotherapeutic agents developed for the treatment of tumors. In this pilot study, we performed RNA-Seq, label-free quantitative protein measurements in six GSC lines, and targeted transcriptomic analysis using a chromosome 19-specific microarray in an additional six GSC lines. The data have been deposited to the ProteomeXchange with identifier PXD000563. Here we present insights into differences in GSC gene and protein expression, including the identification of proteins listed as having no or low evidence at the protein level in the Human Protein Atlas, as correlated to chromosome 19 and GSC subtype. Furthermore, the upregulation of proteins downstream of adenovirus-associated integration site 1 (AAVS1) in GSC11 in response to oncolytic adenovirus treatment was demonstrated. Taken together, our results may indicate new roles for chromosome 19, beyond the 1p/19q codeletion, in the future of personalized medicine for glioma patients.
RUNX2, a gene involved in skeletal development, has previously been shown to be potentially affected by positive selection during recent human evolution. Here we have used antibody-based proteomics to characterize potential differences in expression patterns of RUNX2 interacting partners during primate evolution. Tissue microarrays consisting of a large set of normal tissues from human and macaque were used for protein profiling of 50 RUNX2 partners with immunohistochemistry. Eleven proteins (AR, CREBBP, EP300, FGF2, HDAC3, JUN, PRKD3, RUNX1, SATB2, TCF3, and YAP1) showed differences in expression between humans and macaques. These proteins were further profiled in tissues from chimpanzee, gorilla, and orangutan, and the corresponding genes were analyzed with regard to genomic features. Moreover, protein expression data were compared with previously obtained RNA sequencing data from six different organs. One gene (TCF3) showed significant expression differences between human and macaque at both the protein and RNA level, with higher expression in a subset of germ cells in human testis compared with macaque. In conclusion, normal tissues from macaque and human showed differences in expression of some RUNX2 partners that could be mapped to various defined cell types. The applied strategy appears advantageous to characterize the consequences of altered genes selected during evolution.
Claudins are the major transmembrane protein components of tight junctions in human endothelia and epithelia. Tissue-specific expression of claudin members suggests that this protein family is not only essential for sustaining the role of tight junctions in cell permeability control but also vital in organizing cell contact signaling by protein-protein interactions. How this protein family is collectively processed and regulated is key to understanding the role of junctional proteins in preserving cell identity and tissue integrity. The focus of this review is to first provide a brief overview of the functional context, on the basis of the extensive body of claudin biology research that has been thoroughly reviewed, for endogenous human claudin members and then ascertain existing and future proteomics techniques that may be applicable to systematically characterizing the chemical forms and interacting protein partners of this protein family in human. The ability to elucidate claudin-based signaling networks may provide new insight into cell development and differentiation programs that are crucial to tissue stability and manipulation.
We report progress assembling the parts list for chromosome 17 and illustrate the various processes that we have developed to integrate available data from diverse genomic and proteomic knowledge bases. As primary resources, we have used GPMDB, neXtProt, PeptideAtlas, Human Protein Atlas (HPA), and GeneCards. All sites share the common resource of Ensembl for the genome modeling information. We have defined the chromosome 17 parts list with the following information: 1169 protein-coding genes, the numbers of proteins confidently identified by various experimental approaches as documented in GPMDB, neXtProt, PeptideAtlas, and HPA, examples of typical data sets obtained by RNASeq and proteomic studies of epithelial derived tumor cell lines (disease proteome) and a normal proteome (peripheral mononuclear cells), reported evidence of post-translational modifications, and examples of alternative splice variants (ASVs). We have constructed a list of the 59 "missing" proteins as well as 201 proteins that have inconclusive mass spectrometric (MS) identifications. In this report we have defined a process to establish a baseline for the incorporation of new evidence on protein identification and characterization as well as related information from transcriptome analyses. This initial list of "missing" proteins will guide the selection of appropriate samples for discovery studies as well as antibody reagents. We have also illustrated the significant diversity of protein variants (including post-translational modifications, PTMs) using regions on chromosome 17 that contain important oncogenes. We emphasize the need for mandated deposition of proteomics data in public databases, the further development of improved PTM, ASV, and single nucleotide variant (SNV) databases, and the construction of Web sites that can integrate and regularly update such information.
In addition, we describe the distribution of both clustered and scattered sets of protein families on the chromosome. Since chromosome 17 is rich in cancer-associated genes, we have focused on the clustering of cancer-associated genes in such genomic regions and have used the ERBB2 amplicon as an example of the value of a proteogenomic approach in which one integrates transcriptomic with proteomic information and captures evidence of coexpression through coordinated regulation.
White adipose tissue (WAT) has a major role in the progression of obesity. Here, we combined data from RNA-Seq and antibody-based immunohistochemistry to describe the normal physiology of human WAT obtained from three female subjects and explored WAT-specific genes by comparing WAT to 26 other major human tissues. Using the protein evidence in WAT, we validated the content of a genome-scale metabolic model for adipocytes. We employed this high-quality model for the analysis of subcutaneous adipose tissue (SAT) gene expression data obtained from subjects included in the Swedish Obese Subjects Sib Pair study to reveal molecular differences between lean and obese individuals. We integrated SAT gene expression and plasma metabolomics data, investigated the contribution of the metabolic differences in the mitochondria of SAT to the occurrence of obesity, and eventually identified cytosolic branched-chain amino acid (BCAA) transaminase 1 as a potential target that can be used for drug development. We observed decreased glutaminolysis and alterations in BCAA metabolism in SAT of obese subjects compared to lean subjects. We also provided mechanistic explanations for the changes in the plasma levels of BCAAs, glutamate, pyruvate, and alpha-ketoglutarate in obese subjects. Finally, we validated a subset of our model-based predictions in 20 SAT samples obtained from 10 lean and 10 obese male and female subjects.
Efficiently and accurately analyzing big protein tandem mass spectrometry data sets requires robust software that incorporates state-of-the-art computational, machine learning, and statistical methods. The Crux mass spectrometry analysis software toolkit (http://cruxtoolkit.sourceforge.net) is an open source project that aims to provide users with a cross-platform suite of analysis tools for interpreting protein mass spectrometry data.
In typical shotgun experiments, the mass spectrometer records the masses of a large set of ionized analytes but fragments only a fraction of them. In the subsequent analyses, normally only the fragmented ions are used to compile a set of peptide identifications, while the unfragmented ones are disregarded. In this work, we show how the unfragmented ions, here denoted MS1-features, can be used to increase the confidence of the proteins identified in shotgun experiments. Specifically, we propose the usage of in silico mass tags, where the observed MS1-features are matched against de novo predicted masses and retention times for all peptides derived from a sequence database. We present a statistical model to assign protein-level probabilities based on the MS1-features and combine this data with the fragmentation spectra. Our approach was evaluated for two triplicate data sets from yeast and human, respectively, leading to up to 7% more protein identifications at a fixed protein-level false discovery rate of 1%. The additional protein identifications were validated both in the context of the mass spectrometry data and by examining their estimated transcript levels generated using RNA-Seq. The proposed method is reproducible, straightforward to apply, and can even be used to reanalyze and increase the yield of existing data sets.
Accurate predictions of peptide retention times (RT) in liquid chromatography have many applications in mass spectrometry-based proteomics. Most notably such predictions are used to weed out incorrect peptide-spectrum matches, and to design targeted proteomics experiments. In this study, we describe a RT predictor, ELUDE, which can be employed in both applications. ELUDE's predictions are based on 60 features derived from the peptide's amino acid composition and optimally combined using kernel regression. When sufficient data is available, ELUDE derives a retention time index for the condition at hand making it fully portable to new chromatographic conditions. In cases when little training data is available, as often is the case in targeted proteomics experiments, ELUDE selects and calibrates a model from a library of pretrained predictors. Both model selection and calibration are carried out via robust statistical methods and thus ELUDE can handle situations where the calibration data contains erroneous data points. We benchmarked our method against two state-of-the-art predictors and showed that ELUDE outperforms these methods and tracked up to 34% more peptides in a theoretical SRM method creation experiment. ELUDE is freely available under Apache License from http://per-colator.com.
There is a need for reliable and sensitive biomarkers for renal impairments to detect early signs of kidney toxicity and to monitor progression of disease. Here, antibody suspension bead arrays were applied to profile plasma samples from patients with four types of kidney disorders: glomerulonephritis, diabetic nephropathy, obstructive uropathy, and analgesic abuse. In total, 200 clinical renal-associated cases and control plasma samples from different cohorts were profiled. Parallel plasma protein profiles were obtained using biotinylated and nonfractionated samples and a selected set of 94 proteins targeted by 129 antigen-purified polyclonal antibodies. Out of the analyzed target proteins, human fibulin-1 was detected at significantly higher levels in the glomerulonephritis patient group compared to the controls and with elevated levels in patient samples for all other renal disorders investigated. Two polyclonal antibodies and one monoclonal antibody directed toward separate, nonoverlapping epitopes showed the same trend in the discovery cohorts. A technical verification using Western blot analysis of selected patient plasma confirmed the trends toward higher abundance of the target protein in disease samples. Furthermore, a verification study was carried out in the context of glomerulonephritis using an independent case and control cohort, and this confirmed the results from the discovery cohort, suggesting that plasma levels of fibulin-1 could serve as a potential indicator to monitor kidney malfunction or kidney damage.
A first research development progress report of the Chromosome 19 Consortium with members from Sweden, Norway, Spain, United States, China and India, a part of the Chromosome-centric Human Proteome Project (C-HPP) global initiative, is presented (http://www.c-hpp.org). From the chromosome 19 peptide-targeted library constituting 6159 peptides, a pilot study was conducted using a subset with 125 isotope-labeled peptides. We applied an annotation strategy with triple quadrupole, ESI-Qtrap, and MALDI mass spectrometry platforms, comparing the quality of data within and in between these instrumental set-ups. LC-MS conditions were outlined by multiplex assay developments, followed by MRM assay developments. SRM was applied to biobank samples, quantifying kallikrein 3 (prostate specific antigen) in plasma from prostate cancer patients. The antibody production has been initiated for more than 1200 genes from the entire chromosome 19, and the progress developments are presented. We developed a dedicated transcript microarray to serve as the mRNA identifier by screening cancer cell lines. NAPPA protein arrays were built to align with the transcript data with the Chromosome 19 NAPPA chip, dedicated to 90 proteins, as the first development delivery. We have introduced an IT-infrastructure utilizing a LIMS system that serves as the key interface for the research teams to share and explore data generated within the project. The cross-site data repository will form the basis for sample processing, including biological samples as well as patient samples from national Biobanks.
We describe the utility of integrated strategies that employ both translation of ENCODE data and major proteomic technology pillars to improve the identification of the "missing proteins", novel proteoforms, and PTMs. On one hand, databases in combination with bioinformatic tools are efficiently utilized to establish microarray-based transcript analysis and supply rapid protein identifications in clinical samples. On the other hand, sequence libraries are the foundation of targeted protein identification and quantification using mass spectrometric and immunoaffinity techniques. The results from combining proteoENCODEdb searches with experimental mass spectral data indicate that some alternative splicing forms detected at the transcript level are in fact translated to proteins. Our results provide a step toward the directives of the C-HPP initiative and related biomedical research.
The HUPO Human Proteome Project (HPP) has two overall goals: (1) stepwise completion of the protein parts list, the draft human proteome, including confidently identifying and characterizing at least one protein product from each protein-coding gene, with increasing emphasis on sequence variants, post-translational modifications (PTMs), and splice isoforms of those proteins; and (2) making proteomics an integrated counterpart to genomics throughout the biomedical and life sciences community. PeptideAtlas and GPMDB reanalyze all major human mass spectrometry data sets available through ProteomeXchange with standardized protocols and stringent quality filters; neXtProt curates and integrates mass spectrometry and other findings to present the most up-to-date authoritative compendium of the human proteome. The HPP Guidelines for Mass Spectrometry Data Interpretation version 2.1 were applied to manuscripts submitted for this 2016 C-HPP-led special issue [www.thehpp.org/guidelines]. The Human Proteome presented as neXtProt version 2016-02 has 16,518 confident protein identifications (Protein Existence [PE] Level 1), up from 13,664 at 2012-12, 15,646 at 2013-09, and 16,491 at 2014-10. There are 485 proteins that would have been PE1 under the Guidelines v1.0 from 2012 but now have insufficient evidence due to the agreed-upon more stringent Guidelines v2.0 to reduce false positives. neXtProt and PeptideAtlas now both require two non-nested, uniquely mapping (proteotypic) peptides of at least 9 aa in length. There are 2,949 missing proteins (PE2+3+4) as the baseline for submissions for this fourth annual C-HPP special issue of Journal of Proteome Research. PeptideAtlas has 14,629 canonical (plus 1187 uncertain and 1755 redundant) entries. GPMDB has 16,190 EC4 entries, and the Human Protein Atlas has 10,475 entries with supportive evidence.
neXtProt, PeptideAtlas, and GPMDB are rich resources of information about post-translational modifications (PTMs), single amino acid variants (SAAVs), and splice isoforms. Meanwhile, the Biology- and Disease-driven (B/D)-HPP has created comprehensive SRM resources, generated popular protein lists to guide targeted proteomics assays for specific diseases, and launched an Early Career Researchers initiative.
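The two-peptide requirement mentioned above (two non-nested, uniquely mapping peptides of at least 9 aa) can be illustrated with a minimal sketch. It assumes the input peptides have already been filtered to those that map uniquely to one protein, and it treats "nested" as plain substring containment; both are simplifying assumptions, and the function names are hypothetical rather than part of any HPP, neXtProt, or PeptideAtlas software.

```python
def is_nested(a: str, b: str) -> bool:
    """Return True if one peptide sequence is fully contained in the other."""
    return a in b or b in a


def meets_two_peptide_rule(peptides, min_len=9):
    """Check for at least two non-nested peptides of min_len residues or more.

    Assumption: `peptides` holds only uniquely mapping (proteotypic) peptides.
    """
    # Drop duplicates and peptides that are too short.
    candidates = sorted({p for p in peptides if len(p) >= min_len},
                        key=len, reverse=True)
    # Look for any pair in which neither peptide contains the other.
    for i, first in enumerate(candidates):
        for second in candidates[i + 1:]:
            if not is_nested(first, second):
                return True
    return False


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    print(meets_two_peptide_rule(["AAAAAAAAAK", "CCCCCCCCCR"]))
    print(meets_two_peptide_rule(["PEPTIDESEQK", "PEPTIDESEQ"]))
```

The first call has two independent peptides of 10 residues each, so the rule is met; in the second call the shorter peptide is nested inside the longer one, so only one independent peptide remains.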
Remarkable progress continues on the annotation of the proteins identified in the Human Proteome and on finding credible proteomic evidence for the expression of "missing proteins". Missing proteins are those with no previous protein-level evidence or insufficient evidence to make a confident identification upon reanalysis in PeptideAtlas and curation in neXtProt. Enhanced with several major new data sets published in 2014, the human proteome presented as neXtProt, version 2014-09-19, has 16 491 unique confident proteins (PE level 1), up from 13 664 at 2012-12 and 15 646 at 2013-09. That leaves 2948 missing proteins from genes classified as having protein existence level PE 2, 3, or 4, as well as 616 dubious proteins at PE 5. Here, we document the progress of the HPP and discuss the importance of assessing the quality of evidence, confirming automated findings and considering alternative protein matches for spectra and peptides. We provide guidelines for proteomics investigators to apply in reporting newly identified proteins.
The Human Proteome Organization (HUPO) Human Proteome Project (HPP) continues to make progress on its two overall goals: (1) completing the protein parts list, with an annual update of the HUPO draft human proteome, and (2) making proteomics an integrated complement to genomics and transcriptomics throughout biomedical and life sciences research. neXtProt version 2017-01-23 has 17 008 confident protein identifications (Protein Existence [PE] level 1) that are compliant with the HPP Guidelines v2.1 (https://hupo.org/Guidelines), up from 13 664 in 2012-12 and 16 518 in 2016-04. Remaining to be found by mass spectrometry and other methods are 2579 "missing proteins" (PE2+3+4), down from 2949 in 2016. PeptideAtlas 2017-01 has 15 173 canonical proteins, accounting for nearly all of the 15 290 PE1 proteins based on MS data. These resources have extensive data on PTMs, single amino acid variants, and splice isoforms. The Human Protein Atlas v16 has 10 492 highly curated protein entries with tissue and subcellular spatial localization of proteins and transcript expression. Organ-specific popular protein lists have been generated for broad use in quantitative targeted proteomics using SRM-MS or DIA-SWATH-MS studies of biology and disease.
The Human Proteome Project (HPP) annually reports on progress throughout the field in credibly identifying and characterizing the human protein parts list and making proteomics an integral part of multiomics studies in medicine and the life sciences. NeXtProt release 2018-01-17, the baseline for this sixth annual HPP special issue of the Journal of Proteome Research, contains 17 470 PE1 proteins, 89% of all neXtProt predicted PE1-4 proteins, up from 17 008 in release 2017-01-23 and 13 975 in release 2012-02-24. Conversely, the number of neXtProt PE2,3,4 missing proteins has been reduced from 2949 to 2579 to 2186 over the past two years. Of the PE1 proteins, 16 092 are based on mass spectrometry results, and 1378 on other kinds of protein studies, notably protein-protein interaction findings. PeptideAtlas has 15 798 canonical proteins, up 625 over the past year, including 269 from SUMOylation studies. The largest reason for missing proteins is low abundance. Meanwhile, the Human Protein Atlas has released its Cell Atlas, Pathology Atlas, and updated Tissue Atlas, and is applying recommendations from the International Working Group on Antibody Validation. Finally, there is progress using the quantitative multiplex organ-specific popular proteins targeted proteomics approach in various disease categories.
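The counts quoted for neXtProt release 2018-01-17 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the predicted PE1-4 total is simply the sum of the PE1 proteins and the PE2,3,4 missing proteins; the variable names are illustrative only.

```python
# Counts quoted for neXtProt release 2018-01-17.
pe1 = 17470            # PE1 proteins
missing = 2186         # PE2,3,4 missing proteins
ms_based = 16092       # PE1 proteins supported by mass spectrometry
other_evidence = 1378  # PE1 proteins supported by other kinds of studies

# Assumption: predicted PE1-4 proteins = PE1 + missing (PE2,3,4).
pe1_to_4 = pe1 + missing
pe1_share = pe1 / pe1_to_4

if __name__ == "__main__":
    print(f"PE1-4 total: {pe1_to_4}")
    print(f"PE1 share: {pe1_share:.1%}")
    print(f"MS-based plus other equals PE1: {ms_based + other_evidence == pe1}")
```

Under that assumption the PE1 share rounds to the 89% stated in the text, and the mass spectrometry and non-mass-spectrometry counts add up to the PE1 total.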
The objective of the international Chromosome-Centric Human Proteome Project (C-HPP) is to map and annotate all proteins encoded by the genes on each human chromosome. The C-HPP consortium was established to organize a collaborative network among the research teams responsible for protein mapping of individual chromosomes and to identify compelling biological and genetic mechanisms influencing colocated genes and their protein products. The C-HPP aims to foster the development of proteome analysis and integration of the findings from related molecular -omics technology platforms through collaborations among universities, industries, and private research groups. The C-HPP consortium leadership has elicited broad input for standard guidelines to manage these international efforts more efficiently by mobilizing existing resources and collaborative networks. The C-HPP guidelines set out the collaborative consensus of the C-HPP teams, introduce topics associated with experimental approaches, data production, quality control, treatment, and transparency of data, governance of the consortium, and collaborative benefits. A companion approach for the Biology and Disease-Driven HPP (B/D-HPP) component of the Human Proteome Project is currently being organized, building upon the Human Proteome Organization's organ-based and biofluid-based initiatives (www.hupo.org/research). The common application of these guidelines in the participating laboratories is expected to facilitate the goal of a comprehensive analysis of the human proteome.
Mass spectrometry, the core technology in the field of proteomics, promises to enable scientists to identify and quantify the entire complement of proteins in a complex biological sample. Currently, the primary bottleneck in this type of experiment is computational. Existing algorithms for interpreting mass spectra are slow and fail to identify a large proportion of the given spectra. We describe a database search program called Crux that reimplements and extends the widely used database search program Sequest. For speed, Crux uses a peptide indexing scheme to rapidly retrieve candidate peptides for a given spectrum. For each peptide in the target database, Crux generates shuffled decoy peptides on the fly, providing a good null model and, hence, accurate false discovery rate estimates. Crux also implements two recently described postprocessing methods: a p value calculation based upon fitting a Weibull distribution to the observed scores, and a semisupervised method that learns to discriminate between target and decoy matches. Both methods significantly improve the overall rate of peptide identification. Crux is implemented in C and is distributed with source code freely to noncommercial users.
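The idea of shuffled decoy peptides and decoy-based false discovery rate estimation can be sketched briefly. This is not Crux's implementation: keeping the C-terminal residue fixed while shuffling, and estimating the FDR as the ratio of decoy to target matches above a score threshold, are common conventions assumed here for illustration, and the function names are hypothetical.

```python
import random


def shuffled_decoy(peptide, rng):
    """Shuffle all residues except the last one.

    Assumption: the C-terminal residue is kept in place so the decoy
    still looks like an enzymatic cleavage product.
    """
    if len(peptide) < 3:
        return peptide  # nothing meaningful to shuffle
    body = list(peptide[:-1])
    rng.shuffle(body)
    return "".join(body) + peptide[-1]


def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR at a score threshold as (#decoys) / (#targets) passing it."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


if __name__ == "__main__":
    rng = random.Random(0)
    print(shuffled_decoy("PEPTIDEK", rng))
    # Made-up scores: 4 targets and 1 decoy score at least 2.
    print(estimate_fdr([5, 4, 3, 2, 1], [2.5, 0.5, 0.2, 0.1, 0.0], threshold=2))
```

Because a shuffled decoy has the same amino acid composition and mass as its target, decoy matches give a reasonable picture of how often an incorrect peptide scores well by chance, which is what the ratio in `estimate_fdr` relies on.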
There is a demand for novel targets and approaches to diagnose and treat prostate cancer (PCA). In this context, serum and plasma samples from a total of 609 individuals from two independent patient cohorts were screened for IgG reactivity against a sum of 3833 human protein fragments. Starting from planar protein arrays with 3786 protein fragments to screen 80 patients with and without PCA diagnosis, 161 fragments (4%) were chosen for further analysis based on their reactivity profiles. Adding 71 antigens from literature, the selection of antigens was corroborated for their reactivity in a set of 550 samples using suspension bead arrays. The antigens prostein (SLC45A3), TATA-box binding protein (TBP), and insulin-like growth factor 2 mRNA binding protein 2 (IGF2BP2) showed higher reactivity in PCA patients with late disease compared with early disease. Because of its prostate tissue specificity, we focused on prostein and continued with mapping epitopes of the 66-mer protein fragment using patient samples. Using bead-based assays and 15-mer peptides, a minimal peptide epitope was identified and refined by alanine scanning to the motif KPxAPFP. Further sequence alignment of this motif revealed homology to transmembrane protein 79 (TMEM79) and TGF-beta-induced factor 2 (TGIF2), thus providing a rationale for the cross-reactivity found in females. A comprehensive workflow to discover and validate IgG reactivity against prostein and homologous targets in human serum and plasma was applied. This study provides useful information when searching for novel biomarkers or drug targets that are guided by the reactivity of the immune system against autoantigens.
One of the most complex organs in the human body is the testis, where spermatogenesis takes place. This physiological process involves thousands of genes and proteins that are activated and repressed, making testis the organ with the highest number of tissue-specific genes. However, the function of a large proportion of the corresponding proteins remains unknown and testis harbors many missing proteins (MPs), defined as products of protein-coding genes that lack experimental mass spectrometry evidence. Here, an integrated omics approach was used for exploring the cell type-specific protein expression of genes with an elevated expression in testis. By combining genome-wide transcriptomics analysis with immunohistochemistry, more than 500 proteins with distinct testicular protein expression patterns were identified, and these were selected for in-depth characterization of their in situ expression in eight different testicular cell types. The cell type-specific protein expression patterns allowed us to identify six distinct clusters of expression at different stages of spermatogenesis. The analysis highlighted numerous poorly characterized proteins in each of these clusters whose expression overlapped with that of known proteins involved in spermatogenesis, including 88 proteins with an unknown function and 60 proteins that previously have been classified as MPs. Furthermore, we were able to characterize the in situ distribution of several proteins that previously lacked spatial information and cell type-specific expression within the testis. The testis elevated expression levels both at the RNA and protein levels suggest that these proteins are related to testis-specific functions. In summary, the study demonstrates the power of combining genome-wide transcriptomics analysis with antibody-based protein profiling to explore the cell type-specific expression of both well-known proteins and MPs. 
The analyzed proteins constitute important targets for further testis-specific research in male reproductive disorders.
The importance of the ligand presentation format for the production of protein capture microarrays was evaluated using different Affibody molecules, produced either as single 6 kDa monomers or genetically linked head-to-tail multimers containing up to four domains. The performances in terms of selectivity and sensitivity of the monomeric and the multidomain Affibody molecules were compared by immobilization of the ligands on microarray slides, followed by incubation with fluorescent-labeled target protein. An increase in signal intensities for the multimers was demonstrated, with the most pronounced difference observed between monomers and dimers. A protein microarray containing six different dimeric Affibody ligands with specificity for IgA, IgE, IgG, TNF-alpha, insulin, or Taq DNA polymerase was characterized for direct detection of fluorescent-labeled analytes. No cross-reactivity was observed and the limits of detection were 600 fM for IgA, 20 pM for IgE, 70 fM for IgG, 20 pM for TNF-alpha, 60 pM for insulin, and 10 pM for Taq DNA polymerase. Also, different sandwich formats for detection of unlabeled protein were evaluated and used for selective detection of IgA or TNF-alpha in human serum or plasma samples, respectively. Finally, the presence of IgA was determined using detection of directly Cy5-labeled normal or IgA-deficient serum samples.
Policies supporting the rapid and open sharing of genomic data have directly fueled the accelerated pace of discovery in large-scale genomics research. The proteomics community is starting to implement analogous policies and infrastructure for making large-scale proteomics data widely available on a precompetitive basis. On August 14, 2008, the National Cancer Institute (NCI) convened the "International Summit on Proteomics Data Release and Sharing Policy" in Amsterdam, The Netherlands, to identify and address potential roadblocks to rapid and open access to data. The six principles agreed upon by key stakeholders at the summit addressed issues surrounding (1) timing, (2) comprehensiveness, (3) format, (4) deposition to repositories, (5) quality metrics, and (6) responsibility for proteomics data release. This summit report explores various approaches to develop a framework of data release and sharing principles that will most effectively fulfill the needs of the funding agencies and the research community.
Antibody microarrays offer a powerful tool to screen for target proteins in complex samples. Here, we describe an approach for systematic analysis of serum, based on antibodies and using color-coded beads for the creation of antibody arrays in suspension. This method, adapted from planar antibody arrays, offers a fast, flexible, and multiplexed procedure to screen larger numbers of serum samples, and no purification steps are required to remove excess labeling substance. The assay system detected proteins down to lower picomolar levels with dynamic ranges over 3 orders of magnitude. The feasibility of this workflow was shown in a study with more than 200 clinical serum samples tested for 20 serum proteins.
Human blood plasma provides a highly accessible window to the proteome of any individual in health and disease. Since its inception in 2002, the Human Proteome Organization's Human Plasma Proteome Project (HPPP) has been promoting advances in the study and understanding of the full protein complement of human plasma and on determining the abundance and modifications of its components. In 2017, we review the history of the HPPP and the advances of human plasma proteomics in general, including several recent achievements. We then present the latest 2017-04 build of Human Plasma PeptideAtlas, which yields ∼43 million peptide-spectrum matches and 122,730 distinct peptide sequences from 178 individual experiments at a 1% protein-level FDR globally across all experiments. Applying the latest Human Proteome Project Data Interpretation Guidelines, we catalog 3509 proteins that have at least two non-nested uniquely mapping peptides of nine amino acids or more and >1300 additional proteins with ambiguous evidence. We apply the same two-peptide guideline to historical PeptideAtlas builds going back to 2006 and examine the progress made in the past ten years in plasma proteome coverage. We also compare the distribution of proteins in historical PeptideAtlas builds in various RNA abundance and cellular localization categories. We then discuss advances in plasma proteomics based on targeted mass spectrometry as well as affinity assays, which during early 2017 target ∼2000 proteins. Finally, we describe considerations about sample handling and study design, concluding with an outlook for future advances in deciphering the human plasma proteome.
Arbitrary cutoffs are ubiquitous in quantitative computational proteomics: maximum acceptable MS/MS PSM or peptide q value, minimum ion intensity to calculate a fold change, the minimum number of peptides that must be available to trust the estimated protein fold change (or the minimum number of PSMs that must be available to trust the estimated peptide fold change), and the "significant" fold change cutoff. Here we introduce a novel experimental setup and nonparametric Bayesian algorithm for determining the statistical quality of a proposed differential set of proteins or peptides. By comparing putatively nonchanging case-control evidence to an empirical null distribution derived from a control-control experiment, we successfully avoid some of these common parameters. We then apply our method to evaluating different fold-change rules and find that for our data a 1.2-fold change is the most permissive of the plausible fold-change rules.
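To make the role of the control-control experiment concrete, the sketch below compares how often a fold-change rule is exceeded in case-control data versus control-control data. It is a deliberately simple empirical ratio, not the nonparametric Bayesian algorithm described in the abstract, and the function names and example fold changes are made up for illustration.

```python
import math


def fraction_exceeding(fold_changes, cutoff):
    """Fraction of fold changes at least as extreme as the cutoff (either direction)."""
    log_cut = abs(math.log2(cutoff))
    hits = sum(abs(math.log2(fc)) >= log_cut for fc in fold_changes)
    return hits / len(fold_changes)


def empirical_fdr(case_control, control_control, cutoff):
    """Rough FDR estimate: null exceedance rate divided by observed exceedance rate."""
    observed = fraction_exceeding(case_control, cutoff)
    null = fraction_exceeding(control_control, cutoff)
    if observed == 0:
        return 0.0
    return min(1.0, null / observed)


if __name__ == "__main__":
    # Hypothetical fold changes, for illustration only.
    case_control = [2.0, 0.5, 1.1, 1.5, 1.0, 0.9, 3.0, 1.05]
    control_control = [1.0, 1.1, 0.95, 1.3, 1.02, 0.98, 1.0, 0.9]
    print(empirical_fdr(case_control, control_control, cutoff=1.2))
```

Working on the log2 scale treats a 1.2-fold increase and a 1.2-fold decrease symmetrically. Repeating the calculation for several candidate cutoffs gives one rough way to compare fold-change rules against the same empirical null.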
In any high-throughput scientific study, it is often essential to estimate the percent of findings that are actually incorrect. This percentage is called the false discovery rate (abbreviated "FDR"), and it is an invariant (albeit, often unknown) quantity for any well-formed study. In proteomics, it has become common practice to incorrectly conflate the protein FDR (the percent of identified proteins that are actually absent) with protein-level target-decoy, a particular method for estimating the protein-level FDR. In this manner, the challenges of one approach have been used as the basis for an argument that the field should abstain from protein-level FDR analysis altogether or even the suggestion that the very notion of a protein FDR is flawed. As we demonstrate in simple but accurate simulations, not only is the protein-level FDR an invariant concept, when analyzing large data sets, the failure to properly acknowledge it or to correct for multiple testing can result in large, unrecognized errors, whereby thousands of absent proteins (and, potentially every protein in the FASTA database being considered) can be incorrectly identified.
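The effect of ignoring multiple testing across many data sets can be illustrated with elementary probability. The sketch below is not the simulation from the abstract: it assumes, purely for illustration, that each data set independently has the same small chance of falsely identifying a given absent protein, and that identifications are pooled across data sets without correction. The names and numbers are hypothetical.

```python
def prob_false_identification(p_single, n_datasets):
    """Chance that an absent protein is falsely identified in at least one data set.

    Assumption: data sets are independent and share the same per-protein
    false-positive probability p_single.
    """
    return 1.0 - (1.0 - p_single) ** n_datasets


def expected_false_proteins(n_absent, p_single, n_datasets):
    """Expected number of absent proteins reported when results are pooled."""
    return n_absent * prob_false_identification(p_single, n_datasets)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(prob_false_identification(0.01, n), 3))
```

Under these assumptions a 1% per-data-set error rate grows to a better-than-even chance of a false identification once roughly 100 data sets are pooled, which is the kind of unrecognized error accumulation the abstract warns about.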
Parsimony and protein grouping are widely employed to enforce economy in the number of identified proteins, with the goal of increasing the quality and reliability of protein identifications; however, in a counterintuitive manner, parsimony and protein grouping may actually decrease the reproducibility and interpretability of protein identifications. We present a simple illustration demonstrating ways in which parsimony and protein grouping may lower the reproducibility or interpretability of results. We then provide an example of a data set where a probabilistic method increases the reproducibility and interpretability of identifications made on replicate analyses of Human Du145 prostate cancer cell lines.
Osteoarthritis (OA) is one of the most prevalent articular diseases. The identification of proteins closely associated with the diagnosis, progression, prognosis, and treatment response is urgently needed for this pathology. In this work, differential serum protein profiles have been identified in OA and rheumatoid arthritis (RA) by antibody arrays containing 151 antibodies against 121 antigens in a cohort of 36 samples. The identified differential serum protein profiles have then been validated in a larger cohort of 282 samples. The overall immunoreactivity is higher in the pathological situations in comparison with the controls. Several proteins have been identified as biomarker candidates for OA and RA. Most of these biomarker candidates are proteins related to inflammatory response, lipid metabolism, or bone and extracellular matrix formation, degradation, or remodeling.