12 Proteomics and Mass Spectrometry
Acknowledgements
Much of the information here comes from a great review article by Chen et al. (2020). NotebookLM was also used to collect and summarize information from many of the articles referenced in the review.
Introduction to Proteomics & Mass Spectrometry
- Proteomics: the large-scale study of proteins, encompassing the identification, quantification, and characterization of all proteins in a cell or organism (the proteome)
Proteins are fundamental molecules involved in nearly all biological processes, including:
Structure
Metabolism
Signaling
Gene regulation
Immune function
Why Study Proteins Directly?
Central Dogma: DNA \(\to\) RNA \(\to\) Proteins
Transcriptome data (mRNA abundance) is often insufficient to reliably infer protein abundance (Anderson and Seilhamer 1997)
Mass Spectrometry (MS)
Recent advancements in separation and MS technology allow complex biological systems to be studied as integrated units
MS revolutionized proteomics by offering high-sensitivity protein identification in complex mixtures
Mass Spectrometry determines the mass-to-charge ratio (\(m/z\)) of ions
MALDI-TOF
- MALDI-TOF is a common approach in high-throughput proteomics
- MALDI: Matrix-Assisted Laser Desorption/Ionization
- TOF: Time-of-Flight
- A laser pulse vaporizes and ionizes the matrix-embedded sample, and the resulting ions are accelerated through a high-energy electric field
- The time it takes each ion to travel the length of the flight tube is used to determine its mass-to-charge ratio (and hence the size of the peptide)
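A brief sketch of the underlying physics: an ion of charge \(z e\) accelerated through a potential \(U\) gains kinetic energy \(z e U = \frac{1}{2} m v^2\), so its flight time over a flight tube of length \(L\) is

\[
t = \frac{L}{v} = L \sqrt{\frac{m}{2 z e U}} \propto \sqrt{m/z}
\]

Lighter (or more highly charged) ions therefore reach the detector first.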
Tandem Mass Spectrometry (MS/MS)
- Involves selecting peptides and fragmenting them by collision
- Helps achieve sufficient resolution for precise \(m/z\) determination
MS Strategies and Protein Identification
Two general experimental strategies:
Bottom-Up Proteomics (Most Common): Protein samples are proteolytically digested into peptides prior to MS analysis
Top-Down Proteomics: Intact proteins are analyzed directly by MS
Note: we will focus on bottom-up proteomics in this document
Data acquisition for bottom-up proteomics
Proteolytic digestion into smaller peptides
- Required before the sample can be analyzed in the mass spectrometer
Trypsin is the most common enzyme used to degrade proteins into peptides
- Cuts on the C-terminus of lysine or arginine (unless followed by a proline)
- Other enzymes can be used but are typically less reliable, making downstream analysis more complex
Digestion results in peptides that are easier to distinguish by MS (see the in silico digest sketch below)
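A minimal Python sketch of the trypsin rule above (fully cleaved digest only; real search engines also generate missed-cleavage peptides):

```python
import re

def trypsin_digest(protein: str) -> list[str]:
    """In silico tryptic digest: cleave after K or R unless followed by P."""
    # Zero-width split at positions preceded by K/R and not followed by P.
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", protein) if pep]

# "KP" is not cleaved; every other K/R ends a peptide.
print(trypsin_digest("MKWVTFISLLLLFSSAYSRGVFRRKPDAEHK"))
# ['MK', 'WVTFISLLLLFSSAYSR', 'GVFR', 'R', 'KPDAEHK']
```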
Digested peptides are then fractionated to reduce the complexity of the sample
- This is often performed using liquid chromatography (LC)
- Liquid solvent carries the sample through a column
- Stationary phase (e.g. consisting of beads or resin) interacts with and slows down some peptides
- Sample is eluted and collected in a series of tubes
- Smaller, simpler subsets in each elution group are easier to distinguish, especially lower-abundance peptides
General bottom-up workflow
Following data acquisition:
Raw MS data processing: Identification and quantification of peptides and proteins.
Downstream analysis:
- Data preprocessing
- Statistical analysis
- Enrichment analysis
Peptide and protein identification
The objective is to determine the sequence of peptides from the fragmentation spectra.
- De novo peptide sequencing:
- Analysis of MS/MS spectra to generate short sequence tags
- Database matching:
- Compare sequence tags to a database of known protein sequences
- This narrows the search space
- Protein inference:
- Computationally intensive analysis of the search space for high-confidence calls
Database matching
In silico digestion of protein sequences creates a target database
Peptide Spectrum Match (PSM) score measures the similarity between experimental and theoretical spectra
Selecting appropriate precursor and fragment mass tolerances is vital
- Tolerances that are too wide increase false PSMs
- Tolerances that are too narrow cause true PSMs to be missed
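Mass tolerances are commonly expressed in parts per million (ppm). A minimal sketch of a precursor-mass filter, using made-up candidate masses:

```python
def within_tolerance(observed_mz: float, theoretical_mz: float,
                     tol_ppm: float = 10.0) -> bool:
    """Check whether an observed m/z falls within a ppm tolerance window."""
    ppm_error = (observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return abs(ppm_error) <= tol_ppm

# Made-up masses: only the first candidate is within 10 ppm of the precursor.
candidates = {"PEPTIDER": 478.7392, "PEPTIDEK": 464.7314}
observed = 478.7410
print([p for p, mz in candidates.items() if within_tolerance(observed, mz)])
```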
False discovery rate (FDR)
- Target-Decoy Strategy:
- Performed after database searching
- Searching against a decoy database (e.g. reversed or shuffled sequences)
- Estimates a threshold for removing identifications that lack statistical confidence
- Increases the percentage of true positive hits.
- As protein sequence databases expand, the target-decoy strategy becomes computationally inefficient
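A minimal sketch of the target-decoy calculation: at a given score cutoff, FDR is estimated as the number of decoy hits divided by the number of target hits above that cutoff (assuming false target matches score like decoys):

```python
def target_decoy_fdr(scores, is_decoy, threshold):
    """Estimate FDR at a score threshold as #decoy hits / #target hits."""
    targets = sum(s >= threshold and not d for s, d in zip(scores, is_decoy))
    decoys = sum(s >= threshold and d for s, d in zip(scores, is_decoy))
    return decoys / targets if targets else 0.0

def threshold_at_fdr(scores, is_decoy, max_fdr=0.01):
    """Return the loosest score cutoff whose estimated FDR is <= max_fdr."""
    for t in sorted(set(scores)):
        if target_decoy_fdr(scores, is_decoy, t) <= max_fdr:
            return t
    return None
```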
DDA vs DIA
Data-Dependent Acquisition (DDA):
- Performs a full MS1 scan and collects the N most intense precursor ions
- Those are passed to the MS2 scan for fragmentation
- This approach is biased in that low-abundance precursor ions are rarely selected for fragmentation
Data-Independent Acquisition (DIA):
- All ions are fragmented and detected simultaneously within predefined \(m/z\) ranges
- Offers less biased results
- Downstream analysis is more complex
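A simplified sketch of DDA's top-N precursor selection (real instruments also apply dynamic exclusion and charge-state filters, which this ignores):

```python
def select_precursors(ms1_peaks, n=10):
    """DDA-style top-N: pick the N most intense MS1 peaks for fragmentation.

    ms1_peaks: list of (mz, intensity) tuples from one MS1 survey scan.
    """
    return sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)[:n]
```

Anything below the top N is never fragmented, which is exactly the low-abundance bias noted above.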
Protein inference
- Problem: Degenerate peptides
- Shorter peptides may be shared by multiple proteins
- Results in multiple optimal solutions for protein reconstruction
- Possible solutions:
- Parsimonious rule: Use the smallest set of proteins required to account for all detected peptides
- Probabilistic models
- FDR estimation: extension of the target-decoy strategy to the protein level to establish stringent cutoffs
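A minimal sketch of the parsimonious rule as a greedy set cover (a heuristic; exact minimum set cover is NP-hard, and real tools also weigh peptide confidence):

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy set cover: a small protein set explaining all observed peptides."""
    uncovered = set().union(*protein_to_peptides.values())
    selected = []
    while uncovered:
        # Pick the protein that explains the most still-unexplained peptides.
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        if not protein_to_peptides[best] & uncovered:
            break
        selected.append(best)
        uncovered -= protein_to_peptides[best]
    return selected

# Hypothetical example: pep2 is degenerate (shared by A and B), so B is
# unnecessary once A is selected.
print(parsimonious_proteins({
    "A": {"pep1", "pep2"},
    "B": {"pep2"},
    "C": {"pep3"},
}))  # ['A', 'C']
```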
Protein Quantification Strategies
No single standard analysis pipeline exists because differences in upstream experimental techniques require different approaches.
- Labeled methods (Isotope-based)
- Label-free methods (Based on signal intensity/spectral counts)
Labeled quantification: MS1-based
- Different samples are tagged with distinct isotopes
- Differently labeled samples are pooled before digestion
- Peptides from both samples have identical chemical properties and co-elute during fractionation
- The isotopic mass difference produces a double peak (one peak per sample) during MS1
- Peak intensities give us a measure of abundance ratios between samples
- MS2 detection gives us the peptide’s amino acid sequence
Labeled quantification: MS2-based (Isobaric labeling)
Different samples are tagged with chemical groups (e.g. Tandem Mass Tagging or TMT) containing
- Reporter group: isotope with unique mass for each sample
- Balance group: isotope that balances the differences in reporter group masses
- Reporter group mass + balance group mass is the same (isobaric) for all samples
Samples are digested and labeled separately, then pooled (isobaric tags react with peptides, so labeling follows digestion)
Labeled peptides from all samples have identical chemical properties and co-elute during fractionation
Only a single peak per peptide is detected during MS1
Reporter ions break off during MS2, resulting in different peaks for each sample
MS2 detection also gives us the peptide’s amino acid sequence
Pros:
- Achieves relatively large multiplexing capacity (up to 11 samples in one run)
Cons:
- Prone to ratio compression, which occurs when biologically interesting peptides co-isolate with uninteresting peptides (i.e. peptides with ratios close to 1:1); noise from the uninteresting peptides skews observed ratios toward 1:1 (see the toy example below)
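A toy illustration of ratio compression with made-up intensities:

```python
# Made-up reporter-ion intensities: the true signal has a 4:1 ratio between
# conditions; co-isolated background peptides sit at 1:1.
signal_a, signal_b = 400.0, 100.0
background_a, background_b = 300.0, 300.0

observed = (signal_a + background_a) / (signal_b + background_b)
print(observed)  # 1.75 -- compressed from the true 4.0 toward 1.0
```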
Label-free quantification (LFQ)
- Samples are run in separate LC-MS/MS experiments
- Quantification relies primarily on analyzing the signal intensity of peptide precursor ions
- Normalization is required to address variability
Label-free vs Labeled quantification
- Label-free quantification is
- Cheaper
- Quicker
- More flexible (not all sample types can be labeled)
- Not limited by multiplexing constraints
- More sensitive (since each sample is run separately)
- Labeled quantification
- Has less run-to-run technical variability
- Requires fewer biological and technical replicates
- More accurate abundance ratios
Imaging mass spectrometry (IMS)
- Typically used for a thin tissue section (similar to spatial transcriptomics)
- Matrix-Assisted Laser Desorption/Ionization (MALDI) is used to sequentially vaporize specific spots on the tissue section
- Spatial location of proteins is included in the output
Bioinformatics Pipelines: Data Preprocessing & Analysis
Preprocessing
Protein abundance is often inferred from a small number of peptides
Strict cutoffs for minimum peptide numbers must be applied to ensure reliable inference
Preprocessing and normalization remove non-biological variation and produce reliable, comparable results
Inadequate normalization is a common issue
Protein abundance distributions are often skewed toward zero
- Log transformation is common
Normalization methods
- Variance stabilization normalization:
- Aims to eliminate dependency between variances and mean abundances
- VSN performs consistently well in differential abundance analysis
- Linear regression-based:
- Assumes a linear dependency between measurement bias and protein abundance
- Local regression:
- Assumes a nonlinear dependency (e.g. local polynomial regression)
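A minimal sketch of a simple global approach (log transformation followed by per-sample median centering; VSN and the regression-based methods above are more sophisticated):

```python
import numpy as np

def log2_median_normalize(intensities):
    """Log-transform and median-center each sample.

    intensities: proteins x samples array of raw abundances; zeros are
    treated as missing.
    """
    x = np.log2(np.where(intensities > 0, intensities, np.nan))
    # Subtract each sample's median so all samples share a common center.
    return x - np.nanmedian(x, axis=0)

# Hypothetical toy matrix: 3 proteins x 2 samples.
raw = np.array([[100.0, 220.0], [50.0, 95.0], [400.0, 810.0]])
print(log2_median_normalize(raw))
```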
Missing values
Missing values are common due to the stochasticity in sampling, especially for proteins at low concentration
Machine learning models can be used to estimate (impute) missing data points
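A minimal sketch using k-nearest-neighbor imputation from scikit-learn; note that MS missing values are often left-censored (low-abundance proteins drop below the detection limit), so censoring-aware methods such as minimum-value imputation are also common:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical log-intensity matrix: rows are samples, columns are proteins.
x = np.array([
    [20.1, 18.3, np.nan],
    [19.8, np.nan, 15.2],
    [20.5, 18.9, 15.0],
])

# Each missing value is estimated from the most similar samples.
print(KNNImputer(n_neighbors=2).fit_transform(x))
```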
Statistical analysis
T-test (two groups) and ANOVA (more than two groups, or multiple factors) are commonly used
Statistical power is usually an issue
- Cost in terms of money, time, technical complexity and sample availability leads to small sample sizes
The empirical Bayes procedure employed in limma can be used for
- Shrinking per-protein variance estimates toward a pooled estimate
- More robust results, especially on smaller proteomic datasets
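limma itself is an R/Bioconductor package; as a plain-Python baseline without limma's variance moderation, a per-protein two-sample test might look like:

```python
import numpy as np
from scipy import stats

def per_protein_ttest(group_a, group_b):
    """Row-wise Welch t-tests on proteins x replicates arrays."""
    return stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)

# Hypothetical log-intensities: 2 proteins, 3 replicates per condition.
a = np.array([[20.1, 20.3, 19.9], [15.0, 15.2, 14.8]])
b = np.array([[21.5, 21.7, 21.4], [15.1, 14.9, 15.3]])
t, p = per_protein_ttest(a, b)
print(p)  # one p-value per protein
```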
Multiple hypothesis testing
- Since thousands of hypotheses (proteins) are tested simultaneously, FDR must be controlled
- Benjamini–Hochberg procedure.
- FDR estimation from permutation.
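A minimal sketch of the Benjamini–Hochberg adjustment (the same computation offered by statsmodels' multipletests with method='fdr_bh'):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values for an array of raw p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Enforce monotonicity from the largest p-value downward, cap at 1.
    q = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    adjusted = np.empty(m)
    adjusted[order] = q
    return adjusted

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.5]))
```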
Enrichment analysis: Identifying functional groups
Proteomics allows hypothesis testing on systemic measurements that incorporate information like post-translational modifications (PTMs)
Requires converting protein identifiers to gene names and functional annotation
Gene Ontology (GO) enrichment
GO enrichment uses a structured vocabulary to characterize protein functions
Three main categories:
- Biological Process
- Molecular Function
- Cellular Component
GO prediction: For proteins lacking full annotation, informative GO terms from proteins with similar sequences are used to predict function
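Many GO enrichment tools boil down to an over-representation test; a minimal sketch using the hypergeometric distribution, with made-up counts:

```python
from scipy.stats import hypergeom

def enrichment_pvalue(hits_with_term, n_hits, bg_with_term, n_background):
    """P(X >= observed) under the hypergeometric null (one-sided)."""
    # sf(k - 1) gives P(X >= k) for drawing annotated proteins by chance.
    return hypergeom.sf(hits_with_term - 1, n_background, bg_with_term, n_hits)

# Made-up counts: 15 of 100 differential proteins carry a GO term that
# annotates 300 of 5000 background proteins.
print(enrichment_pvalue(15, 100, 300, 5000))
```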
Pathway and protein set enrichment
Biological pathway analysis: Uses prior knowledge about regulatory pathway networks and diseases.
Pathway databases:
- KEGG (Kyoto Encyclopedia of Genes and Genomes)
- Reactome
- PANTHER
Protein Set Enrichment Analysis (PSEA)
PTM and interaction databases
PTM databases: Curate detailed information on modification type, position, and disease relations
Protein-Protein Interaction (PPI) databases: Used to predict general interaction networks
Advanced Bioinformatics & Future Directions
Machine learning and artificial intelligence (ML/AI)
- ML/AI extracts informative features from large proteomics datasets to construct models for prediction (annotations, locations, clinical outcomes).
Supervised learning builds models to
- Classify (assign discrete labels)
- Classifying disease subtypes (e.g. breast cancer, lymphoma)
- Diagnostic biomarkers (e.g. tuberculosis)
- General prediction
- Clinical outcomes (e.g. survival time)
- Protein folding (e.g. AlphaFold)
Unsupervised learning infers
- Clusters based on natural structure, patterns, and dependencies within the data
- Especially useful when explicit labels (ground truth) are unavailable
Dimensionality reduction (e.g. PCA, LDA, t-SNE) is often helpful due to the high dimensionality of proteomics data
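A minimal sketch of PCA on a proteomics matrix with scikit-learn (assuming samples in rows and log-normalized protein abundances in columns; here random data stands in for real abundances):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a samples x proteins matrix of log abundances.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))  # 6 samples, 4 proteins

pca = PCA(n_components=2)
coords = pca.fit_transform(x)  # sample coordinates on PC1/PC2
print(coords.shape, pca.explained_variance_ratio_)
```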
Challenges
- Precisely quantifying low abundance proteins remains difficult
- Eliminating missing values is challenging
Multiomics
Combining proteome data with other technologies (genomics, transcriptomics, metabolomics) leads to comprehensive understanding
Comparing proteomics and transcriptomics often reveals differences, highlighting significant transcriptional and post-translational regulation mechanisms
Deep learning models are increasingly designed to integrate and analyze multi-omics data effectively