14  Metabolomics and Pathway Analysis

Author
Affiliation

Dr Randy Johnson

Hood College

Published

October 29, 2025

Acknowledgements

NotebookLM was used to find and query reference materials cited in these notes. The main source used when preparing these notes is an article on pathway analysis for metabolomics by Wieder et al. (2021). The information below comes from that article unless otherwise noted.

Introduction to Metabolomics

  • Metabolites: Small molecules, which are distinct from larger biomolecules like proteins and nucleic acids

  • Metabolomics: Profiling of small molecules (metabolites) within a biological system

  • Aim: Understand how cellular biochemistry correlates with biology
    • Exposure to environmental conditions
    • Differing genetic backgrounds
    • Disease status
  • Datasets tend to cover a much lower proportion of the total metabolome compared to typical transcriptomic coverage

Applications

  • Biomarker discovery

  • Personalized medicine

  • Agriculture (e.g. crop protection and food security)

Profiling Methods

  • Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR) are typical
    • MS-based approaches typically detect a larger number of compounds compared to standard NMR methods
  • Experimental platforms vary, including
    • UPLC-MS/MS (Ultra-Performance Liquid Chromatography - Tandem Mass Spec)
    • CE-TOF MS (Capilary Electrophoresis - Time-of-Flight Mass Spec)
    • Flow injection TOF MS (which may use no chromatography/electrophoresis separation step)

Data acquisition and identification challenges

  • Untargeted Metabolomics: Metabolites are annotated based on
    • Physicochemical properties (e.g. mass-to-charge ratio (\(\text{m/z}\)) and polarity)
    • Similarity to compounds in spectral databases
  • Uncertainty in identification
    • Significant bottleneck is the metabolite identification uncertainty
      • Chemical structures
      • Database identifiers
  • Metabolomics Standards Initiative (MSI) proposes minimum reporting standards for metabolite identification (Chen et al. 2020)
    • Level 1: identified using an authentic chemical standard
    • Level 2: putatively identified based on physicochemicsl properties in a spectral database
    • Level 3: probable/uncertain annotation
    • Level 4: unknown compound
  • Assay bias
    • Specific analytical platform and assay introduce chemical bias
    • Each is better suited to detect compounds with specific physico-chemical properties (e.g. fatty acids, glycans, vitamins, etc…)
    • Limited metabolic network areas are sampled by each assay type

Data preparation

  • Raw metabolite abundance matrices typically need post-processing
    • Imputation of missing values (e.g. using minimum value divided by 2)
    • \(\log_2\) transformation
    • Auto-scaling (subtracting mean and dividing by standard deviation)

Pathway Analysis (PA) Tools

  • PA is essential for the interpretation of high-dimensional molecular data
    • Pathways are collections of molecules participating in the same biological function
    • Find associations between pathways and specific phenotypes
  • PA was originally developed for transcriptomic data but has been adapted for metabolomics

Pathway analysis methods

-   Over-representation analysis (ORA)
    -   Most common PA approach
    -   Identifies pathways that contain a statistically higher number of certian molecules than would be expected by chance
-   Functional class scoring (FCS), e.g. Gene Set Enrichment Analysis (GSEA)
-   Topology-based methods, i.e. network/graph analysis

ORA Inputs

  • Pathway collection obtained from databases like
    • KEGG (Kyoto Encyclopedia of Genes and Genomes)
    • Reactome and
    • BioCyc, or
    • Commercial tools like IPA
  • Differentially abundant (DA) metabolites of interest
    • List derived from experimental data, typically selected using a statistical threshold
  • Background/reference set
    • Contains all compounds realistically detectable by the experiment
    • e.g. all identified compounds in the assay

Generating DA metabolite lists

  • Determined using statistical comparisons of metabolite abundances between study groups

  • Multiple testing correction must be applied to the resulting p-values

    • Benjamini-Hochberg False Discovery Rate
    • Bonferroni correction

Data Visualization Tools

Pathview (an R package) can be used for pathway-based data integration and visualization, often with KEGG pathways. Image courtesy of the Pathview vignette (Weijun Luo 2017).

Data Visualization Tools

MD-plots show the mean abundance of two groups (x-axis) versus the difference or log-fold change (y-axis). In this instance, red points are significantly up-regulated in basal cells when compared to LP, and blue points are significantly down-regulated.

Data Visualization Tools

Heatmaps with clustering

Data Visualization Tools

Scatter plots for dimensionality reduction

Data Visualization Tools

A matrix bar plot shows patterns of small numbers of metabolite differences across comparisons. Image courtesy of Johnson, Lacroix, and Schwarz (2025).

Challenges and Recommendations for Metabolomics Pathway Analysis

Wieder et al. (2021) demonstrated how changes in ORA parameters can drastically change analysis results.

Impact of input parameters on ORA results

  • Background set selection
    • Using a generic, non-assay-specific background set results in an increase in false-positive pathways (erroneously high levels of enriched pathways)
  • DA metabolite selection
    • Selecting a significance threshold (e.g. \(\text{q} \le 0.05\) or \(\text{p} \le 0.1\)) is an arbitrary choice
    • Impacts the number of significant pathways detected
  • Metabolite misidentification
    • ORA is sensitive to even low levels of misidentification, including
    • Pathway loss (false negatives)
    • Pathway gain (false positives)
  • Pathway database incompleteness
    • Databases are constantly evolving
    • Magnitude of changes between releases suggests ORA results may be “somewhat short-lived”
  • Database ID harmonization
    • Converting metabolite identifiers across databases often results in information loss
    • Not all identifiers map well

Best practices in metabolomics ORA

  • Background set
    • Specify an assay-specific background set
    • Typically includes all metabolites identified in the assay
  • Organism specificity
    • Use an organism-specific pathway set if the database supports it
  • Consensus approach
    • Perform ORA using multiple pathway databases (KEGG, Reactome, BioCyc)
    • Derive a consensus pathway signature
  • Statistical rigor
    • Apply multiple-testing correction when selecting both DA metabolites and significant pathways

Functional Implications: Connecting Pathways to Gene and Protein Expression

  • Systems biology views interactions between genes and their functions as a large network
    • Moving beyond the “one gene, one protein, one function” principle (Chen et al. 2020)
  • Abnormal regulation of protein function is a common factor in disease

Pathway analysis as a functional tool

Pathway analysis fundamentally helps understand the functional implications of molecular changes (metabolites, genes, proteins)

Multi-omics integration

  • Combining metabolomics data with other omics technologies (genomics, transcriptomics, proteomics)

    • Chen et al. (2020) discusses multiomics for MS data
  • Integrating multi-omics data provides a more comprehensive understanding of biological systems

  • Can lead to improved predictions, a better understanding of disease mechanisms, and the identification of novel therapeutic targets

  • Discordant trends
    • Pointwise comparisons between proteomics and transcriptomics can reveal discordant trends
    • Indication of significant transcriptional and post-translational regulation mechanisms
  • Bioinformatics tools (often ML/AI-based) can be used to (Yousef and Allmer 2023; Chen et al. 2020)
    • Reconstruct protein interaction and signaling networks from quantitative protein data
    • Link protein changes directly to cellular processes like metabolism and biosignaling

References

Chen, Chen, Jie Hou, John J. Tanner, and Jianlin Cheng. 2020. “Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis.” International Journal of Molecular Sciences 21 (8): 2873. https://doi.org/10.3390/ijms21082873.
Johnson, Randall C, Ian Lacroix, and Benjamin Schwarz. 2025. matrixBP: An R Package to Generate Matrix Bar Plots.” Zenodo. https://doi.org/10.5281/ZENODO.14749076.
Weijun Luo. 2017. “Pathview.” Bioconductor. https://doi.org/10.18129/B9.BIOC.PATHVIEW.
Wieder, Cecilia, Clément Frainay, Nathalie Poupin, Pablo Rodríguez-Mier, Florence Vinson, Juliette Cooke, Rachel Pj Lai, Jacob G. Bundy, Fabien Jourdan, and Timothy Ebbels. 2021. “Pathway Analysis in Metabolomics: Recommendations for the Use of over-Representation Analysis.” Edited by Kiran Raosaheb Patil. PLOS Computational Biology 17 (9): e1009105. https://doi.org/10.1371/journal.pcbi.1009105.
Yousef, Malik, and Jens Allmer. 2023. “Deep Learning in Bioinformatics.” Turkish Journal of Biology 47 (6): 366–82. https://doi.org/10.55730/1300-0152.2671.