10  Differential Gene Expression

Author
Affiliation

Dr Randy Johnson

Hood College

Published

September 24, 2025

Acknowledgements

This workflow follows the Biodonductor vignette titled RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR (Law et al. 2018).

Workflow Overview

Section Description Key Tools/Methods
Data Wrangling Import raw gene counts, sample information, and gene annotations. edgeR::readDGE, DGEList object
Pre-processing Filter out non-expressed genes and normalize sample distributions. Filtering (Low Counts), TMM Normalization
Exploratory Analysis Visualize sample relationships and check for batch effects. MDS Plots (PCA)
Differential Expression Model the data, estimate the mean-variance relationship, and fit linear models. limma::voom, limma::lmFit, Empirical Bayes
Interpretation Summarize results and visualize top findings. limma::topTable, Glimma plots

Data wrangling

  • Download and unpack data
  • Read raw count matrix (genes × samples) using edgeR::readDGE
  • Add sample metadata (e.g. group, batch) to the DGElist object

Data pre-processing

  • Filtering genes with consistently low expression across all samples
    • Focuses on tests with sufficient statistical power
    • Reduces the multiple testing burden
  • Normalize data to account for
    • Differences in sequencing depth (library size)
    • Compositional biases (e.g. a few highly expressed genes skewing the total count)
  • Trimmed Mean of M-values (TMM)
    • calculates scale factors
    • Result: log-fold-changes between samples are centered at zero on average
    • edgeR::calcNormFactors

Exploratory analysis

  • Multi-Dimensional Scaling (MDS) Plot (similar to PCA)
    • Visualize the distances/relationships between samples using log2 count per million values
    • Samples should cluster tightly by biological group
    • Samples should not cluster by batch

Design matrix and contrasts

  • Design Matrix: A matrix that codes varialbes to be modelled:
    • Experimental groups/conditions (e.g., Basal, LP, ML)
    • Any confounding factors (e.g., Batch/Lane)
  • Contrasts: Specific comparisons of interest, for example:
    • “LP vs. Basal”
    • “ML vs. LP”

Linear modeling

  • RNA-seq counts tend to be heteroscedastic
    • Variance of a gene’s expression level depends on its mean
    • e.g. higher mean → higher variance
  • Linear models assume normally distributed data and constant variance (homoscedasticity)
  • limma::voom
    • Models the mean-variance relationship of the log-counts
    • Uses these precision weights in the linear modeling step
    • Effectively removes heteroscedasticity and
    • Make the data suitable for linear modeling

Model fitting and multiple comparisons

  • limma::lmFit
    • Fits a linear model for every gene based on the design matrix and voom weights
  • limma::eBayes
    • Moderates the gene-wise variance estimates
    • More stable and reliable p-value calculations across all tests

Reviewing results

  • limma::topTable sorts results (e.g. by significance)
    • logFC: log-Fold Change
    • AveExpr: average log count per million expression
    • t: t-statistic
    • P.Value: raw p-value
    • adj.P.Value (FDR): adjusted p-value for multiple testing

Visualization

  • Mean-Difference (MD) Plot
    • Plots logFC vs. Average Expression (AveExpr/log-CPM)
    • Shows the overall pattern of DE and highlights significant genes
  • Interactive MD Plot
    • Allows users to click, hover, and search for specific genes on the MD plot
  • Heatmaps
    • Visualize the expression levels of the top DE genes across all samples
    • Allows visualization of sample clustering
    • Shows expression patterns

References

Law, Charity, Monther Alhamdoosh, Shian Su, Xueyi Dong, Luyi Tian, Gordon K Smyth, and Matthew E Ritchie. 2018. RNA-Seq Analysis Is Easy as 1-2-3 with Limma, Glimma and edgeR.” Vignette. RNA-Seq Analysis Is Easy as 1-2-3 with Limma, Glimma and edgeR. https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html.