10 Differential Gene Expression
Acknowledgements
This workflow follows the Biodonductor vignette titled RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR (Law et al. 2018).
Workflow Overview
| Section | Description | Key Tools/Methods |
| Data Wrangling | Import raw gene counts, sample information, and gene annotations. | edgeR::readDGE, DGEList object |
| Pre-processing | Filter out non-expressed genes and normalize sample distributions. | Filtering (Low Counts), TMM Normalization |
| Exploratory Analysis | Visualize sample relationships and check for batch effects. | MDS Plots (PCA) |
| Differential Expression | Model the data, estimate the mean-variance relationship, and fit linear models. | limma::voom, limma::lmFit, Empirical Bayes |
| Interpretation | Summarize results and visualize top findings. | limma::topTable, Glimma plots |
Data wrangling
- Download and unpack data
- Read raw count matrix (genes × samples) using
edgeR::readDGE - Add sample metadata (e.g. group, batch) to the
DGElistobject
Data pre-processing
- Filtering genes with consistently low expression across all samples
- Focuses on tests with sufficient statistical power
- Reduces the multiple testing burden
- Normalize data to account for
- Differences in sequencing depth (library size)
- Compositional biases (e.g. a few highly expressed genes skewing the total count)
- Trimmed Mean of M-values (TMM)
- calculates scale factors
- Result: log-fold-changes between samples are centered at zero on average
edgeR::calcNormFactors
Exploratory analysis
- Multi-Dimensional Scaling (MDS) Plot (similar to PCA)
- Visualize the distances/relationships between samples using log2 count per million values
- Samples should cluster tightly by biological group
- Samples should not cluster by batch
Design matrix and contrasts
- Design Matrix: A matrix that codes varialbes to be modelled:
- Experimental groups/conditions (e.g., Basal, LP, ML)
- Any confounding factors (e.g., Batch/Lane)
- Contrasts: Specific comparisons of interest, for example:
- “LP vs. Basal”
- “ML vs. LP”
Linear modeling
- RNA-seq counts tend to be heteroscedastic
- Variance of a gene’s expression level depends on its mean
- e.g. higher mean → higher variance
- Linear models assume normally distributed data and constant variance (homoscedasticity)
limma::voom- Models the mean-variance relationship of the log-counts
- Uses these precision weights in the linear modeling step
- Effectively removes heteroscedasticity and
- Make the data suitable for linear modeling
Model fitting and multiple comparisons
limma::lmFit- Fits a linear model for every gene based on the design matrix and
voomweights
- Fits a linear model for every gene based on the design matrix and
limma::eBayes- Moderates the gene-wise variance estimates
- More stable and reliable p-value calculations across all tests
Reviewing results
limma::topTablesorts results (e.g. by significance)- logFC: log-Fold Change
- AveExpr: average log count per million expression
- t: t-statistic
- P.Value: raw p-value
- adj.P.Value (FDR): adjusted p-value for multiple testing
Visualization
- Mean-Difference (MD) Plot
- Plots logFC vs. Average Expression (AveExpr/log-CPM)
- Shows the overall pattern of DE and highlights significant genes
- Interactive MD Plot
- Allows users to click, hover, and search for specific genes on the MD plot
- Heatmaps
- Visualize the expression levels of the top DE genes across all samples
- Allows visualization of sample clustering
- Shows expression patterns
References
Law, Charity, Monther Alhamdoosh, Shian Su, Xueyi Dong, Luyi Tian, Gordon K Smyth, and Matthew E Ritchie. 2018. “RNA-Seq Analysis Is Easy as 1-2-3 with Limma, Glimma and edgeR.” Vignette. RNA-Seq Analysis Is Easy as 1-2-3 with Limma, Glimma and edgeR. https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html.