12  Proteomics and Mass Spectrometry

Author: Dr Randy Johnson

Affiliation: Hood College

Published: October 15, 2025

Acknowledgements

Much of the information here comes from a great review article by Chen et al. (2020). NotebookLM was also used to collect and summarize information from many of the articles referenced in the review.

Introduction to Proteomics & Mass Spectrometry

  • Proteomics: the study, identification, quantification, and characterization of all proteins in a cell or organism (the proteome).

Proteomics: The large-scale study of proteins

  • Proteins are fundamental molecules involved in nearly all biological processes, including:

    • Structure

    • Metabolism

    • Signaling

    • Gene regulation

    • Immune function

Why Study Proteins Directly?

  • Central Dogma: DNA \(\to\) RNA \(\to\) Proteins

  • Transcriptome data (mRNA abundance) is often insufficient to reliably infer protein abundance (Anderson and Seilhamer 1997)

Mass Spectrometry (MS)

  • Recent advancements in separation and MS technology allow complex biological systems to be studied as integrated units

  • MS revolutionized proteomics by offering high sensitivity protein identification in complex mixtures

  • Mass Spectrometry determines the mass-to-charge ratio (\(m/z\)) of ions

MALDI-TOF

  • MALDI-TOF is a common approach in high throughput proteomics
Note
  • MALDI: Matrix-Assisted Laser Desorption/Ionization
  • TOF: Time-of-Flight
  • A laser pulse vaporizes and ionizes the matrix-embedded sample, and the resulting ions are accelerated through an electric field
  • The time it takes each peptide ion to travel the length of the flight tube is used to determine the size of the peptide

Tandem Mass Spectrometry (MS/MS)

  • Involves selecting peptides and fragmenting them by collision
  • Helps achieve sufficient resolution for precise \(m/z\) determination

MS Strategies and Protein Identification

  • Two general experimental strategies:

    • Bottom-Up Proteomics (Most Common): Protein samples are proteolytically digested into peptides prior to MS analysis

    • Top-Down Proteomics: Intact proteins are analyzed directly by MS

Note: we will focus on bottom-up proteomics in this document

Data acquisition for bottom-up proteomics

  • Proteolytic digestion into smaller peptides

    • Required before the protein sample can be analyzed in the mass spectrometer
  • Trypsin is the most common enzyme used to degrade proteins into peptides

    • Cuts on the C-terminus of lysine or arginine (unless followed by a proline)
    • Other enzymes can be used but are typically less reliable, making downstream analysis more complex
  • Digestion results in peptides that are easier to distinguish by MS

  • Digested peptides are then fractionated to reduce the complexity of the sample

    • This is often performed using liquid chromatography (LC)
    • Liquid solvent carries the sample through a column
    • Stationary phase (e.g. consisting of beads or resin) interacts with and slows down some peptides
    • Sample is eluted and collected in a series of tubes
    • The smaller, simpler subset in each elution group is easier to distinguish, especially for lower-abundance peptides
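The trypsin cleavage rule described above (cut after lysine or arginine, except when followed by proline) can be sketched as an in silico digestion. This is a minimal illustration; the input sequence is made up, and real pipelines also track missed cleavages, which the optional parameter mimics here:

```python
import re

def trypsin_digest(protein, missed_cleavages=0):
    """In silico trypsin digestion: cut after K or R, but not before P."""
    # Zero-width split after every K/R not immediately followed by P
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = set(fragments)
    # Allow up to `missed_cleavages` skipped cut sites by joining neighbors
    for n in range(1, missed_cleavages + 1):
        for i in range(len(fragments) - n):
            peptides.add("".join(fragments[i : i + n + 1]))
    return peptides

# The R in "RP" is not cleaved; the internal K is
print(sorted(trypsin_digest("MKWVTFRPLLK")))  # ['MK', 'WVTFRPLLK']
```

With `missed_cleavages=1`, the undigested `"MKWVTFRPLLK"` is also reported, reflecting incomplete digestion in a real sample.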

General bottom-up workflow

Following data acquisition:

  • Raw MS data processing: Identification and quantification of peptides and proteins.

  • Downstream analysis:

    • Data preprocessing
    • Statistical analysis
    • Enrichment analysis

Peptide and protein identification

The objective is to determine the sequence of peptides from the fragmentation spectra.

  • De novo peptide sequencing:
    • Analysis of MS/MS spectra to generate short sequence tags
  • Database matching:
    • Compare sequence tags to a database of known protein sequences
    • This narrows the search space
  • Protein inference:
    • Computationally intensive analysis of the search space for high-confidence calls

Database matching

  • In silico digestion of protein sequences creates a target database

  • Peptide Spectrum Match (PSM) score measures the similarity between experimental and theoretical spectra

  • Selecting appropriate precursor and fragment mass tolerances is vital

    • Tolerances that are too wide increase false PSMs
    • Tolerances that are too narrow cause true PSMs to be missed
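To make the precursor mass-tolerance idea concrete, here is a sketch using standard monoisotopic residue masses. The candidate peptides and the simulated measurement error are invented; real search engines additionally score the full fragment spectrum, not just the precursor mass:

```python
# Standard monoisotopic residue masses (Da) for a few amino acids
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
                "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111}
WATER = 18.010565  # mass of the terminal H and OH

def peptide_mass(seq):
    """Theoretical monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

def within_tolerance(observed, theoretical, tol_ppm=10.0):
    """True if the observed mass is within tol_ppm of the theoretical mass."""
    return abs(observed - theoretical) / theoretical * 1e6 <= tol_ppm

# Hypothetical search: one observed precursor mass against two candidates
candidates = ["GASP", "VLKR"]
observed = peptide_mass("VLKR") + 0.001  # simulate ~2 ppm measurement error
hits = [p for p in candidates if within_tolerance(observed, peptide_mass(p))]
```

Widening `tol_ppm` admits more false candidates; narrowing it too far would reject the true match once measurement error exceeds the window.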

False discovery rate (FDR)

  • Target-Decoy Strategy:
    • Performed after database searching
    • Searching against a decoy database (e.g. reversed or shuffled sequences)
    • Estimates a threshold for removing identifications that lack statistical confidence
    • Increases the percentage of true positive hits.
  • As protein sequence databases expand, the target-decoy strategy becomes computationally inefficient
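The target-decoy estimate described above amounts to counting decoy versus target hits above a score threshold. A minimal sketch, with invented PSM scores:

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as (# decoy hits) / (# target hits) above a score threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# (score, is_decoy) pairs from a hypothetical combined target+decoy search
psms = [(9.1, False), (8.7, False), (8.2, True), (7.9, False), (5.0, True)]
fdr_at_threshold(psms, 8.0)  # 1 decoy / 2 targets above 8.0 -> 0.5
```

In practice the threshold is raised until the estimated FDR drops below a chosen level (e.g. 1%), and identifications below it are discarded.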

DDA vs DIA

Data-Dependent Acquisition (DDA):

  • Performs a full MS1 scan and collects the N most intense precursor ions
  • Those are passed to the MS2 scan for fragmentation
  • This approach is biased in that it filters out low abundance results

Data-Independent Acquisition (DIA):

  • All ions are fragmented and detected simultaneously within predefined \(m/z\) ranges
  • Offers less biased results
  • Downstream analysis is more complex
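The top-N precursor selection that makes DDA biased can be sketched in a few lines; the peak list here is invented for illustration:

```python
def top_n_precursors(ms1_peaks, n=3):
    """DDA-style selection: keep the n most intense precursor ions for MS2."""
    return sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)[:n]

# (m/z, intensity) pairs from a hypothetical MS1 scan
ms1 = [(420.7, 1e5), (512.3, 8e6), (633.9, 2e4), (701.2, 3e6), (845.5, 9e5)]
selected = top_n_precursors(ms1)  # the low-intensity ions never reach MS2
```

The ions at 420.7 and 633.9 are dropped regardless of biological interest, which is exactly the low-abundance bias DIA avoids by fragmenting everything in predefined m/z windows.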

Protein inference

  • Problem: Degenerate peptides
    • Shorter peptides may be shared by multiple proteins
    • Results in multiple optimal solutions for protein reconstruction
  • Possible solutions:
    • Parsimonious rule: Use the smallest set of proteins required to account for all detected peptides
    • Probabilistic models
    • FDR estimation: extension of the target-decoy strategy to the protein level to establish stringent cutoffs
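The parsimonious rule is an instance of minimum set cover, which is NP-hard in general; a greedy approximation is a common sketch. The protein-to-peptide mapping below is hypothetical:

```python
def parsimonious_proteins(protein_peptides):
    """Greedy set cover: a small protein set explaining all detected peptides."""
    all_peptides = set().union(*protein_peptides.values())
    covered, chosen = set(), []
    while covered != all_peptides:
        # Pick the protein explaining the most still-unexplained peptides
        best = max(protein_peptides, key=lambda p: len(protein_peptides[p] - covered))
        chosen.append(best)
        covered |= protein_peptides[best]
    return chosen

# Peptide "b" is degenerate: shared by P1 and P2, so P2 is never needed
mapping = {"P1": {"a", "b", "c"}, "P2": {"b"}, "P3": {"c", "d"}}
parsimonious_proteins(mapping)  # ["P1", "P3"]
```

P2's only peptide is already explained by P1, so the parsimonious solution excludes it even though its peptide was detected.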

Protein Quantification Strategies

No single standard analysis pipeline exists because differences in upstream experimental techniques require different approaches.

  • Labeled methods (Isotope-based)
  • Label-free methods (Based on signal intensity/spectral counts)

Labeled quantification: MS1-based

  • Different samples are tagged with distinct isotopes
  • Differently labeled samples are pooled before digestion
  • Peptides from both samples have identical chemical properties and co-elute during fractionation
  • Isotopes result in a double peak during MS1
  • Peak intensities give us a measure of abundance ratios between samples
  • MS2 detection gives us the peptide’s amino acid sequence

Labeled quantification: MS2-based (Isobaric labeling)

  • Different samples are tagged with chemical groups (e.g. Tandem Mass Tagging or TMT) containing

    • Reporter group: isotope with unique mass for each sample
    • Balance group: isotope that balances the differences in reporter group masses
    • Reporter group mass + balance group mass is the same (isobaric) for all samples
  • Differently labeled samples are pooled before digestion

  • Peptides from both samples have identical chemical properties and co-elute during fractionation

  • Only a single peak per peptide is detected during MS1

  • Reporter ions break off during MS2, resulting in different peaks for each sample

  • MS2 detection also gives us the peptide’s amino acid sequence

  • Pros:

    • Achieves relatively large multiplexing capacity (up to 11 samples in one run)
  • Cons:

    • Prone to ratio compression, which occurs when biologically interesting peptides co-elute with uninteresting peptides (i.e. peptides with ratios close to 1:1); noise from the uninteresting peptides can skew results

Label-free quantification (LFQ)

  • Samples are run in separate LC-MS/MS experiments
  • Quantification relies primarily on analyzing the signal intensity of peptide precursor ions
  • Normalization is required to address variability

Label-free vs Labeled quantification

  • Label-free quantification is
    • Cheaper
    • Quicker
    • More flexible (not all samples are able to be labeled)
    • Not limited by multiplexing constraints
    • More sensitive (since each sample is run separately)
  • Labeled quantification
    • Has less run-to-run technical variability
    • Requires fewer biological and technical replicates
    • More accurate abundance ratios

Imaging mass spectrometry (IMS)

  • Typically used for a thin tissue section (similar to spatial transcriptomics)
  • Matrix-Assisted Laser Desorption/Ionization (MALDI) is used to sequentially vaporize specific spots on the tissue section
  • Spatial location of proteins is included in the output

Bioinformatics Pipelines: Data Preprocessing & Analysis

Preprocessing

  • Protein abundance is often inferred from a small number of peptides

  • Strict cutoffs for minimum peptide numbers must be applied to ensure reliable inference

  • Preprocessing and normalization remove non-biological variation and produce reliable, comparable results

  • Inadequate normalization is a common issue

  • Protein abundance distributions are often skewed toward zero

    • Log transformation is common

Normalization methods

  • Variance stabilization normalization:
    • Aims to eliminate dependency between variances and mean abundances
    • VSN generally performs consistently well in differential abundance analysis
  • Linear regression-based:
    • Assumes a linear dependency between measurement bias and protein abundance
  • Local regression:
    • Assumes a nonlinear dependency (e.g. local polynomial regression)
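As a simpler baseline than the methods above, a log transform followed by a median shift already removes a constant multiplicative bias between samples. A minimal sketch with invented intensities; VSN and the regression-based methods model more complex bias structure:

```python
import math
from statistics import median

def median_normalize(samples):
    """Log2-transform, then shift each sample so its median matches the
    grand median (a simple shift normalization)."""
    logged = [[math.log2(x) for x in s] for s in samples]
    medians = [median(s) for s in logged]
    grand = median(medians)
    return [[v - m + grand for v in s] for s, m in zip(logged, medians)]

# Two hypothetical samples; the second carries a systematic 2x intensity bias
raw = [[100.0, 400.0, 1600.0], [200.0, 800.0, 3200.0]]
norm = median_normalize(raw)  # after the shift, the 2x bias is gone
```

On the log2 scale the 2x bias is a constant offset of 1, which the median shift removes exactly; a bias that varies with abundance is what motivates the regression-based methods.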

Missing values

  • Missing values are common due to the stochasticity in sampling, especially for proteins at low concentration

  • Machine learning models can be used to estimate (impute) missing data points
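A k-nearest-neighbors imputer is one of the simplest such estimators. The sketch below works row-wise over a tiny invented log-intensity matrix; production tools add scaling, minimum-neighbor checks, and left-censoring models for values missing because they fell below the detection limit:

```python
def knn_impute(matrix, k=2):
    """Impute missing (None) entries from the k nearest rows, measured by
    Euclidean distance over the columns both rows have observed."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        return sum((x - y) ** 2 for x, y in shared) ** 0.5 if shared else float("inf")

    out = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, val in enumerate(row):
            if val is None:
                donors = [r for r in matrix if r is not row and r[j] is not None]
                donors.sort(key=lambda r: dist(row, r))
                vals = [r[j] for r in donors[:k]]
                if vals:
                    out[i][j] = sum(vals) / len(vals)  # mean of neighbor values
    return out

# Hypothetical matrix (rows = proteins, columns = runs); row 2 has a gap
data = [[1.0, 2.0, 3.0],
        [1.1, None, 3.1],
        [5.0, 6.0, 7.0]]
filled = knn_impute(data, k=1)  # the nearest neighbor is the first row
```

With `k=1` the missing value is borrowed from the most similar protein profile; larger `k` trades fidelity for robustness to noisy neighbors.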

Statistical analysis

  • The t-test (two groups) and ANOVA (more than two groups) are commonly used

  • Statistical power is usually an issue

    • Cost in terms of money, time, technical complexity and sample availability leads to small sample sizes
  • Empirical Bayes procedure employed in LIMMA can be used for

    • Reduction of measurement variances in pooled estimates
    • More robust results, especially on smaller proteomic datasets

Multiple hypothesis testing

  • Since thousands of hypotheses (proteins) are tested simultaneously, FDR must be controlled
    • Benjamini–Hochberg procedure.
    • FDR estimation from permutation.
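The Benjamini-Hochberg procedure is short enough to write out directly: sort the p-values, find the largest rank k with p(k) ≤ (k/m)·α, and reject the k smallest. The p-values below are invented:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a reject/accept flag per hypothesis at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # largest rank whose p-value clears the stepped threshold
        if pvalues[i] <= rank / m * alpha:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
benjamini_hochberg(pvals)  # only the first two clear their thresholds
```

Note the stepped thresholds (0.01, 0.02, 0.03, ...): 0.039 fails against 0.03 even though it would pass a naive per-test cutoff of 0.05, which is the point of controlling FDR across thousands of proteins.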

Enrichment analysis: Identifying functional groups

  • Proteomics allows hypothesis testing on systemic measurements that incorporate information like PTMs (post-translational modifications)

  • Requires converting protein identifiers to gene names and functional annotation

Gene Ontology (GO) enrichment

  • GO enrichment uses a structured vocabulary to characterize protein functions

  • Three main categories:

    • Biological Process
    • Molecular Function
    • Cellular Component
  • GO prediction: For proteins lacking full annotation, informative GO terms from proteins with similar sequences are used to predict functions
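A standard way to score over-representation of a GO term (not detailed in the text, but widely used) is a one-sided hypergeometric test. The counts below are hypothetical:

```python
from math import comb

def enrichment_pvalue(hits_in_term, hits_total, term_size, background):
    """One-sided hypergeometric test: probability of drawing at least
    `hits_in_term` annotated proteins in `hits_total` draws, given
    `term_size` annotated proteins among `background` total."""
    denom = comb(background, hits_total)
    return sum(
        comb(term_size, k) * comb(background - term_size, hits_total - k)
        for k in range(hits_in_term, min(hits_total, term_size) + 1)
    ) / denom

# Hypothetical counts: 8 of 20 significant proteins carry a GO term that
# annotates only 50 of 1000 background proteins
p = enrichment_pvalue(8, 20, 50, 1000)  # far below 0.05: the term is enriched
```

By chance one would expect about 20 x 50/1000 = 1 annotated protein among the hits, so observing 8 yields a very small p-value; these p-values then feed the same FDR correction used for differential abundance.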

Pathway and protein set enrichment

  • Biological pathway analysis: Uses prior knowledge about regulatory pathway networks and diseases.

  • Pathway databases:

    • KEGG (Kyoto Encyclopedia of Genes and Genomes)
    • Reactome
    • PANTHER
  • Protein Set Enrichment Analysis (PSEA)

PTM and interaction databases

  • PTM databases: Curate detailed information on modification type, position, and disease relations

  • Protein-Protein Interaction (PPI) databases: Used to predict general interaction networks

Advanced Bioinformatics & Future Directions

Machine learning and artificial intelligence (ML/AI)

  • ML/AI extracts informative features from large proteomics datasets to construct models for prediction (annotations, locations, clinical outcomes).
  • Supervised learning builds models to

    • Classify (assign discrete labels)
      • Classifying disease subtypes (e.g. breast cancer, lymphoma)
      • Diagnostic biomarkers (e.g. tuberculosis)
    • General prediction
      • Clinical outcomes (e.g. survival time)
      • Protein folding (e.g. Alphafold)
  • Unsupervised learning infers

    • Clusters based on natural structure, patterns, and dependencies within the data
    • Especially useful when explicit labels (ground truth) are unavailable
  • Dimensionality reduction (e.g. PCA, LDA, t-SNE, transformers) is often helpful due to the high dimensionality of proteomics data
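As a minimal unsupervised-learning sketch, here is two-cluster k-means on one-dimensional values, splitting hypothetical protein abundances into low/high groups without any labels; real analyses run this in many dimensions and pair it with the dimensionality-reduction methods above:

```python
def kmeans_1d(values, iters=10):
    """Minimal two-cluster k-means on 1-D values (no ground-truth labels)."""
    centers = [min(values), max(values)]  # initialize centers at the extremes
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer center (True indexes as 1)
            groups[abs(v - centers[0]) > abs(v - centers[1])].append(v)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# Hypothetical abundances with an obvious low group and high group
centers, clusters = kmeans_1d([1.0, 1.2, 0.9, 8.0, 8.3, 7.9])
```

The algorithm alternates assignment and center updates until the two cluster means stabilize, recovering the natural low/high structure in the data.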

Challenges

  • Precisely quantifying low abundance proteins remains difficult
  • Eliminating missing values is challenging

Multiomics

  • Combining proteome data with other technologies (genomics, transcriptomics, metabolomics) leads to comprehensive understanding

  • Comparing proteomics and transcriptomics often reveals differences, highlighting significant transcriptional and post-translational regulation mechanisms

  • Deep learning models are increasingly designed to integrate and analyze multi-omics data effectively

References

Anderson, Leigh, and Jeff Seilhamer. 1997. “A Comparison of Selected mRNA and Protein Abundances in Human Liver.” ELECTROPHORESIS 18 (3-4): 533–37. https://doi.org/10.1002/elps.1150180333.
Chen, Chen, Jie Hou, John J. Tanner, and Jianlin Cheng. 2020. “Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis.” International Journal of Molecular Sciences 21 (8): 2873. https://doi.org/10.3390/ijms21082873.