9  Bulk RNA-seq

Author: Dr Randy Johnson

Affiliation: Hood College

Published: September 17, 2025

Transcriptomics and RNA-Seq

What is Transcriptomics?

Transcriptomics is the study of the entire set of RNA transcripts (the transcriptome) present in a cell or organism at a specific time or under specific conditions.

Central Dogma

The central dogma of molecular biology states that genetic information flows from DNA to RNA to protein. mRNA acts as a template for protein synthesis.

Relevance

  • Measuring how much mRNA is transcribed can estimate how active a gene is under given circumstances.

  • This provides insight into the regulation status of all genes simultaneously.

  • Gene Regulation Insights
    • Measuring gene regulation at specific developmental stages or under specific physiological conditions has greatly expanded our knowledge of how cells respond to those conditions.
    • Example: RNA-seq can identify differentially expressed (DE) genes in a Staphylococcus aureus strain when exposed to oxacillin, compared to a control, to better understand antibiotic resistance mechanisms.

Prokaryotic vs. Eukaryotic Genes

  • Prokaryotic genes are
    • Often organized in operons, groups of functionally related genes that are co-located in a genome
    • Transcribed in polycistronic units, a single mRNA that contains the coding sequences of multiple genes from a single operon
  • Eukaryotic genes are
    • Transcribed as single gene units
    • Often contain non-translated introns and translated exons

Evolution of Gene Expression Analysis Technologies

Northern blotting

  • RNA is extracted and treated with a denaturing agent
  • RNA fragments are separated by size using gel electrophoresis
  • RNA bands are transferred from the gel to a nylon membrane and fixed
  • Radioactive or chemiluminescent probes are hybridized to the nylon membrane
  • Membrane is washed and imaged to detect the location of target RNAs

Northern blot diagram (“Northern Blot” 2024)

Quantitative PCR (qPCR)

Northern blotting is good for identifying the presence of specific RNA sequences, but real-time quantitative PCR is faster and allows RNA quantification of several target RNAs at a time (Heid et al. 1996).

  • RNA is extracted

  • Primers flanking the target sequence(s) and TaqMan probes that bind within each target are added

  • RNA is converted to cDNA and amplified using PCR (polymerase chain reaction)

  • When the polymerase extends a primer and reaches a bound TaqMan probe, its 5’ nuclease activity cleaves the probe, separating the reporter dye from its quencher and releasing a fluorescent signal

  • Fluorescence is monitored during the reaction and used to infer the starting quantity of the target RNA
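
The last step can be illustrated with a minimal sketch. It assumes idealized amplification (the product doubles every cycle when `efficiency` is 1.0) and that each sample is summarized by its threshold cycle (Ct), the cycle at which fluorescence crosses a fixed threshold. The function names and the example Ct values are hypothetical; real assays calibrate against standard curves and measured efficiencies.

```python
def starting_quantity(threshold_copies: float, ct: float, efficiency: float = 1.0) -> float:
    """Estimate starting template copies from the threshold cycle (Ct).

    Assumes N_ct = N_0 * (1 + efficiency) ** ct, where efficiency = 1.0
    means perfect doubling in every cycle (an idealization).
    """
    return threshold_copies / (1.0 + efficiency) ** ct


def fold_difference(ct_a: float, ct_b: float, efficiency: float = 1.0) -> float:
    """Fold difference in starting quantity of sample A relative to sample B."""
    return (1.0 + efficiency) ** (ct_b - ct_a)


# A sample that crosses the threshold 3 cycles earlier started with
# 2 ** 3 = 8 times more template under the perfect-doubling assumption.
print(fold_difference(ct_a=20.0, ct_b=23.0))  # 8.0
```

The key point is that a lower Ct means more starting material, and each cycle of difference corresponds to roughly a two-fold difference in quantity.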

TaqMan probe diagram (“Real-Time qRT-PCR” n.d.)

Real-time qPCR fluorescence curve (“Real-Time qRT-PCR” n.d.)

Microarrays

Microarrays enabled simultaneous quantification of thousands of RNA targets in paired samples (Stears, Martinsky, and Schena 2003).

  • mRNA is extracted from two samples (e.g. case and control)
  • RNA is converted to cDNA and labeled with fluorescent dyes (one red, one green)
  • Labeled cDNA is hybridized to a chip with complementary DNA probes at specific locations on the chip
  • Unhybridized sample is washed away and the chip is imaged
    • Red spots: gene expression is increased in sample 1
    • Green spots: gene expression is increased in sample 2
    • Yellow spots: gene expression is similar between samples 1 and 2
    • Dark spots: gene expression is low in both samples
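
The color interpretation above amounts to comparing the two channel intensities at each spot. The sketch below is only an illustration: the function name, the minimum-signal cutoff, and the two-fold (log2 ratio of 1) threshold are assumptions, not values from any particular microarray pipeline.

```python
import math


def classify_spot(red: float, green: float,
                  min_signal: float = 100.0,
                  min_log2_ratio: float = 1.0) -> str:
    """Label a spot from its red (sample 1) and green (sample 2) intensities."""
    # Both channels near background: low expression in both samples.
    if red < min_signal and green < min_signal:
        return "dark"
    # A pseudocount of 1 avoids division by zero for empty channels.
    log2_ratio = math.log2((red + 1.0) / (green + 1.0))
    if log2_ratio >= min_log2_ratio:
        return "red"      # higher in sample 1
    if log2_ratio <= -min_log2_ratio:
        return "green"    # higher in sample 2
    return "yellow"       # similar in both samples


for red, green in [(5000, 500), (500, 5000), (2000, 2100), (10, 20)]:
    print(red, green, classify_spot(red, green))
```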

Microarray diagram (Stears, Martinsky, and Schena 2003)

Limitations of Traditional Methods

  • Rely on specific hybridization probes and primers
  • You only find what you’re looking for

Image courtesy of Gemini

RNA Sequencing (RNA-seq)

As the cost of next-generation sequencing (NGS) decreased, RNA-seq replaced hybridization techniques in genome-wide expression studies (i.e. transcriptomics).

What is RNA-Seq?

RNA-seq uses high-throughput sequencing to detect genome-wide transcription.

  • Allows genome-wide quantification of RNA
  • Provides a much broader and less biased view than probe-based methods, because it does not require knowing the target sequences in advance

Applications

  • Differential Gene Expression (DGE)
  • Differential alternative splicing
  • Transcript discovery (e.g. long non-coding RNAs, microRNAs)
  • Genome annotation (de novo transcriptome assembly)
  • Allele-specific expression
  • RNA editing, fusion discovery, variant detection

RNA-Seq Experimental Design

RNA-Seq Workflow

  • Extract RNA
  • Convert it to cDNA
  • Sequence the cDNA
  • Match the sequence data to genes

Focus on mRNA

While total RNA or small RNA can be sequenced, we’ll focus on mRNA from a single culture today.

  • mRNA typically constitutes about 2% of total cellular RNA
  • Assumption: changes in mRNA levels correlate with the phenotype (protein expression)

Experimental Design Considerations

  • Biological Replicates:
    • More biological replicates increase statistical power (at least two per condition are required; at least three are recommended)
    • Needed to distinguish observed differences from external factors and random variation
  • Sequencing Platform:
    • Illumina short-read technology is common
    • Long-read technologies are helpful for detecting alternative splicing and different isoforms
    • 10x Genomics is a microfluidics platform that, paired with RNA-seq, enables single-cell sequencing
  • Sequencing Strategy:
    • Paired-end sequencing improves mapping accuracy, which especially helps differential expression analysis of lowly expressed genes
    • Single-end sequencing is less expensive
  • Read length:
    • Typical read length is 50–250 bp
  • Sequencing coverage/depth:
    • Read depth must be considered for each experiment
    • Higher depth ensures better detection of lowly expressed genes
    • More replicates can sometimes be preferred over increased depth
    • Example: 16S rRNA amplicon sequencing typically targets 50,000 reads per sample

Preparing an RNA-seq Library

We’ll review the Illumina TruSeq protocol for this example (Illumina 2012).

  • Start with 0.1–4 μg of total RNA, aiming for an RNA Integrity Number (RIN) value ≥ 8.
  • Depletion or removal of rRNA
    • Ribosomal RNA (rRNA) makes up the majority of total RNA and must be removed or mRNA must be captured
      • Poly-A selection: Captures mRNA by its 3’ poly(A) tail, which is common in mature eukaryotic mRNA.
      • Ribo-depletion: For total RNA sequencing or prokaryotic RNA (which generally lacks stable poly-A tails), rRNA depletion kits (e.g., Ribo-Zero) are used.
  • RNA is fragmented
  • Conversion of RNA into complementary DNA (cDNA)

  • Addition of sequencing adapters to allow attachment to the flow cell for sequencing

    • Barcodes/indexes are used to multiplex multiple samples in one sequencing run
  • PCR amplification

  • Quality control

RNA-Seq Data Analysis Workflow

Today we will use

  • FastQC for QC
  • Trimmomatic for trimming
  • STAR for mapping reads
  • samtools for processing mapped reads
  • Subread (featureCounts) for read counting and summarization

QC

  • Use FastQC to check our data
  • Use Trimmomatic (Cutadapt is a common alternative) to trim or filter low-quality reads and bases
  • Run FastQC again
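
To make the QC step concrete, here is a minimal sketch of what a quality check and a trim operate on. It assumes four-line FASTQ records with Phred+33 quality encoding, and it uses a deliberately simple rule (drop trailing bases below a cutoff), which is not the algorithm used by Trimmomatic or Cutadapt. The function names and the example read are made up.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def mean_quality(qual):
    scores = phred_scores(qual)
    return sum(scores) / len(scores)


def trim_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIII##\n")
for name, seq, qual in read_fastq(example):
    print(name, mean_quality(qual), trim_3prime(seq, qual))
```

Running FastQC before and after trimming is the same idea at scale: the per-base quality summary should improve once low-quality tails are removed.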

Mapping and processing

  • Align reads using STAR
  • Convert output to BAM (if it isn’t already)
  • Sort and index the BAM file for downstream processing
  • Count the reads mapped to each gene in our BAM file
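
The steps above could be assembled as follows for one paired-end sample. This is a sketch under stated assumptions: the sample name, file names, index directory, and thread count are hypothetical placeholders, the FASTQ files are assumed to be uncompressed, and the helper only builds the command lines without running anything. Asking STAR for a coordinate-sorted BAM makes separate conversion and sorting steps unnecessary, so only indexing is left for samtools. Depending on the Subread version, counting read pairs as fragments may need an additional option, so check `featureCounts` help for your installation.

```python
import shlex


def build_commands(sample, fastq_r1, fastq_r2, genome_dir, gtf, threads=4):
    """Return the alignment, indexing, and counting commands for one sample.

    All file names are hypothetical placeholders; nothing is executed here.
    """
    prefix = f"{sample}_"
    # STAR names its coordinate-sorted BAM output after the prefix.
    bam = f"{prefix}Aligned.sortedByCoord.out.bam"
    star = [
        "STAR", "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq_r1, fastq_r2,
        # Writing a sorted BAM directly avoids separate convert and sort steps.
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]
    index = ["samtools", "index", bam]
    # -p marks the library as paired-end, -a is the annotation, -o the output table.
    count = ["featureCounts", "-p", "-T", str(threads),
             "-a", gtf, "-o", f"{sample}_counts.txt", bam]
    return [star, index, count]


commands = build_commands("ctrl1", "ctrl1_R1.fastq", "ctrl1_R2.fastq",
                          "star_index", "genes.gtf")
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the inputs exist.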

Expression Quantification and Differential Gene Expression

  • Subread’s featureCounts (and other programs) will give us a count matrix for quantification of gene expression.
  • The output should be a matrix of integer counts with one row per gene and one column per sample
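
As an illustration of what that table looks like, the sketch below parses a small, made-up example in the layout featureCounts writes: a comment line starting with `#`, six annotation columns, then one count column per BAM file, with one gene per line. The gene names, sample names, and counts are invented.

```python
import csv
import io

# Hypothetical two-sample table in featureCounts layout.
EXAMPLE = (
    "# Program:featureCounts (made-up example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tcontrol.bam\ttreated.bam\n"
    "geneA\tchr1\t100\t1099\t+\t1000\t150\t300\n"
    "geneB\tchr1\t2000\t3999\t-\t2000\t80\t40\n"
)
ANNOTATION_COLUMNS = ("Geneid", "Chr", "Start", "End", "Strand", "Length")


def read_count_table(handle):
    """Return (samples, counts), where counts[gene] lists integer counts per sample."""
    # Skip the leading comment line(s) before handing rows to the CSV reader.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    samples = [name for name in reader.fieldnames if name not in ANNOTATION_COLUMNS]
    counts = {row["Geneid"]: [int(row[s]) for s in samples] for row in reader}
    return samples, counts


samples, counts = read_count_table(io.StringIO(EXAMPLE))
print(samples)
print(counts)
```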

Normalization

  • Normalization is required to account for
    • Sequencing depth bias: If one library has 20M reads and another 40M, the latter will generally show approximately double the counts for most genes.

    • Compositional bias: A few highly expressed genes can take up a large share of the reads, so the remaining genes appear less expressed than they really are.
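
The sequencing depth example can be checked with a small sketch that scales counts to counts per million (CPM). The gene names and counts are invented so that the two libraries total 20M and 40M reads. Note that this simple scaling only addresses depth; it does not correct compositional bias.

```python
def counts_per_million(counts):
    """Scale raw counts so that each library sums to one million."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


# Two hypothetical libraries of the same sample sequenced at 20M and 40M reads.
library_20m = {"geneX": 100, "all_other_genes": 19_999_900}
library_40m = {"geneX": 200, "all_other_genes": 39_999_800}

# Raw counts differ two-fold, but CPM values agree.
print(counts_per_million(library_20m)["geneX"])  # 5.0
print(counts_per_million(library_40m)["geneX"])  # 5.0
```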

Common Normalization Methods

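  • Reads Per Kilobase per Million mapped reads (RPKM):
    • Used for single-end RNA-seq. Calculated by dividing a gene’s read count by its length in kilobases and by the total number of mapped reads in millions.
  • Fragments Per Kilobase per Million mapped fragments (FPKM):
    • Used for paired-end RNA-seq. Calculated the same way, but counting fragments (read pairs) instead of individual reads.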
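
A short worked example of the RPKM calculation, using invented counts and gene lengths. The function name is hypothetical; for FPKM, the same arithmetic applies with fragment counts in place of read counts.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    # Equivalent to count / (length_bp / 1e3) / (total_reads / 1e6).
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical library with 1,000,000 mapped reads.
total = 1_000_000
print(rpkm(150, 1000, total))  # 150.0
# Twice as long and twice the reads: the same RPKM, since length is accounted for.
print(rpkm(300, 2000, total))  # 150.0
```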

Differential Gene Expression Analysis

The aim of differential gene expression analysis is to quantify differences in gene expression between two or more treatments or groups.

Statistical Analysis

  • Bioconductor has several good options for statistical analysis, including:
    • DESeq2
    • edgeR

Multiple Hypothesis Testing

  • When analyzing genomic data, we must account for the number of statistical tests we are performing.
  • The Bonferroni correction (which controls the family-wise error rate) and the Benjamini–Hochberg procedure (which controls the false discovery rate) are common ways to limit false positives.
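
Both corrections are simple enough to sketch directly. The example below is a plain-Python illustration with made-up p-values, not the implementation used by DESeq2 or edgeR; in practice you would rely on the adjusted p-values those packages report.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for four genes
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni is the more conservative of the two, which is why Benjamini–Hochberg is usually preferred when thousands of genes are tested at once.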

Visualization

  • Dimensionality reduction helps identify clusters of samples and gives clues about global patterns in the data
    • Principal Component Analysis (PCA) is a common technique, but it assumes linear relationships and can struggle with massive datasets like those from scRNA-seq
    • t-SNE and UMAP allow for non-linear relationships and are typically used in scRNA-seq

Functional Annotation and Enrichment Analysis

  • Gene Ontology (GO) terms: Relate genes to their roles using the three GO categories (molecular function, biological process, and cellular component).

  • Enrichment analysis: Identify overrepresented gene sets that share common functions or pathways

  • Pathway analysis: Identify gene pathways involved in response to disease or drug treatment
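
One common way to test whether a gene set is overrepresented among DE genes is a one-sided hypergeometric test. The sketch below is a minimal illustration with invented numbers; the function and parameter names are hypothetical, and dedicated enrichment tools add multiple-testing correction and curated gene sets on top of this idea.

```python
from math import comb


def enrichment_pvalue(n_universe, n_in_set, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn at random.

    n_universe: genes tested, n_in_set: genes annotated to the gene set,
    n_de: differentially expressed genes, n_overlap: DE genes in the set.
    """
    upper = min(n_in_set, n_de)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: 20,000 genes tested, 100 in the pathway, 500 DE genes,
# 10 of which are in the pathway (about 2.5 would be expected by chance).
print(enrichment_pvalue(20_000, 100, 500, 10))
```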

Challenges in Annotation

  • Converting protein names to gene names
    • Many-to-many relationships
    • Often labor-intensive and can lead to information loss
    • Outdated databases
  • Inadequate database curation / lack of common data formats for cross-database referencing

  • Automatic annotation procedures can decrease confidence and reliability

    • Manual annotation is often necessary for accurate gene models

Data Visualization Beyond PCA

  • Heatmaps and Boxplots:
    • Present taxonomic comparisons and diversity in microbial communities.
  • Cytoscape:
    • Integrating biological networks and gene expression data
    • Visualizing enriched pathways
  • Network Reconstruction:
    • Proteomics data can be used to reconstruct protein interactions and signaling networks.
    • Most network inference tools are available through R and Python libraries or integrated web services.

References

Heid, C A, J Stevens, K J Livak, and P M Williams. 1996. “Real Time Quantitative PCR.” Genome Research 6 (10): 986–94. https://doi.org/10.1101/gr.6.10.986.
Illumina. 2012. “TruSeq DNA Sample Preparation Guide.” 15026486 Rev. C. Illumina. https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_truseq/truseqdna/TruSeq_DNA_SamplePrep_Guide_15026486_C.pdf.
“Northern Blot.” 2024. Wikipedia. https://en.wikipedia.org/w/index.php?title=Northern_blot&oldid=1258100418.
“Real-Time qRT-PCR.” n.d. Accessed September 17, 2025. https://www.ncbi.nlm.nih.gov/probe/docs/techqpcr/.
Stears, Robin L., Todd Martinsky, and Mark Schena. 2003. “Trends in Microarray Analysis.” Nature Medicine 9 (1): 140–45. https://doi.org/10.1038/nm0103-140.