11 scRNA-seq

Author: Dr Randy Johnson

Affiliation: Hood College

Published: October 8, 2025

Acknowledgments

These notes rely heavily on a publication by Hao et al. (2021) and some tutorials accompanying ShinyCell2 (Ouyang 2025). During the preparation of these notes, Perplexity was used to identify additional source material and Gemini was used to organize and summarize information.

Introduction to Single-Cell Genomics

Single-cell RNA-seq (scRNA-seq): principles and limitations
CITE-seq Technology: a multiomic approach
Data analysis concepts: integrating multiple modalities (WNN Analysis)
Reference mapping

scRNA-seq library prep

See Illumina (2025) and Vargas and Kethireddy (2023) for additional introductory material.

Cells are dissociated into a single-cell suspension
Cells are captured in individual droplets (Gel Beads in Emulsion or GEMs) using the 10x Genomics platform
Each GEM is labeled with a unique barcode

mRNA is converted to cDNA
cDNA is amplified
Sequencing libraries are prepared, with a different protocol for each sequencing target
- Transcriptome
- Antibody tags (ADTs) for cell-surface proteins

Power and limits of scRNA-seq

scRNA-seq profiles the transcriptome (RNA) of thousands of individual cells
- It is very good at characterizing cell types and cellular states in heterogeneous tissues
- It provides an unbiased view of cellular identity
However, RNA doesn’t tell the whole story

Cellular function is determined by RNA and proteins
RNA analysis alone cannot fully account for
- Post-transcriptional modifications
- Protein degradation
- Protein isoforms

In some cell types (like T cells), obtaining high-quality scRNA-seq data can be technically challenging due to minimal RNA content and high RNase expression (Hao et al. 2021)
Many functionally distinct cell categories cannot be separated based on transcriptomics alone (Ding et al. 2020; Mereu et al. 2020)

Multiomics

Many important sources of cellular heterogeneity may not correlate strongly with transcriptomic features
Multimodal single-cell technologies simultaneously profile multiple data types (modalities) in the same cell
CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by sequencing) is one example

CITE-seq Technology and Workflow

CITE-seq quantifies RNA and cell-surface proteins simultaneously
This combines the unbiased discovery power of scRNA-seq with specific antibody-based analysis
- scRNA-seq (transcriptome) measures the gene expression within the cell.
- Antibody-Derived Tags (ADTs) (epitopes) measure cell surface protein expression

Capture of antibody-derived tags

Staining: Cells are dissociated into a single-cell suspension and stained with a panel of antibodies
Conjugation: Each antibody is attached to a unique oligonucleotide barcode
Sequencing readout: Once the barcoded antibodies bind to the surface proteins, the oligonucleotide tags are sequenced along with the mRNA transcripts.

Why use barcodes?

The number of unique oligonucleotide barcodes that can be used is much greater than the number of fluorophores or heavy metal tags traditionally used in flow cytometry or CyTOF, allowing many more markers to be measured in the same experiment.
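A quick calculation shows why barcode space is not the limiting factor. With four possible bases at each position, the number of distinct sequences grows as 4 to the power of the barcode length. The lengths below, including 15, are illustrative assumptions and not values taken from these notes.

```python
def n_barcodes(length: int) -> int:
    """Number of distinct DNA sequences of a given length (4 bases per position)."""
    return 4 ** length

# Illustrative barcode lengths (assumed, not from a specific kit)
for length in (4, 8, 15):
    print(f"length {length}: {n_barcodes(length):,} possible barcodes")
```

In practice far fewer barcodes are used, since barcodes are chosen to differ at several positions to tolerate sequencing errors, and panel size is limited by the antibodies available rather than by the barcodes.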

CITE-seq workflow

Cell encapsulation: antibody-labeled cells are captured in Gel Beads in Emulsion (GEMs)
Barcoding: Inside the droplet, each cell is labeled with a unique barcode
- The cell barcode is added to the mRNA transcripts of the cell
- The same cell barcode is added to the labeled ADTs

Library Preparation: Separate sequencing libraries are prepared for the
- mRNA (transcriptome)
- oligonucleotide tags (ADTs)
Sequencing: A single sequencing run quantifies both RNA and ADTs

Data Output: The output consists of two count matrices for the same set of cells:
- one for RNA
- one for ADT
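A minimal sketch of that output, using made-up numbers: two count matrices with features in rows and a shared set of cell barcodes in columns. The barcodes, feature names, and counts are hypothetical and only illustrate the layout.

```python
import numpy as np

# Hypothetical cell barcodes shared by both libraries
cell_barcodes = ["AAACCTGA", "AAACGGGT", "AAAGATGC"]

# RNA count matrix: genes x cells (toy values)
rna_features = ["CD3E", "CD4", "CD8A", "MS4A1"]
rna_counts = np.array([
    [3, 2, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 4],
])

# ADT count matrix: antibodies x cells (toy values)
adt_features = ["anti-CD4", "anti-CD8"]
adt_counts = np.array([
    [250, 4, 6],
    [3, 310, 5],
])

# Both matrices describe the same cells, so their columns must line up
assert rna_counts.shape[1] == adt_counts.shape[1] == len(cell_barcodes)

for j, barcode in enumerate(cell_barcodes):
    print(barcode, "RNA total:", rna_counts[:, j].sum(), "ADT total:", adt_counts[:, j].sum())
```

Because the columns are matched by cell barcode, any per-cell result from one matrix (a cluster label, for example) can be looked up against the other.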

Why is protein data important?

In one study of Cord Blood Mononuclear Cells (CBMCs), independent analysis of RNA clustered CD4+ and CD8+ T cells together because their transcriptomes were highly similar (Hao et al. 2021)

RNA alone has a hard time separating CD4 and CD8 populations

However, the protein data (ADTs for anti-CD4 and anti-CD8 antibodies) clearly separated these populations

Protein data is able to differentiate CD4 from CD8 cells

Take-home message

Protein (ADT) data is best for precisely identifying known phenotypes
RNA data is best for unbiased, unsupervised discovery of novel or subtle states

High-Level Analysis Considerations

CITE-seq data is inherently challenging to analyze because the quality and information content of the two modalities vary widely
Protein measurements (ADTs)
- Often have higher copy numbers
- Are more robust (less “drop-out”) than RNA measurements
- If (when) the antibody panel is incomplete, the protein data might miss key cellular differences
A computational workflow must be able to define cell states using both modalities jointly

Weighted Nearest Neighbor (WNN) analysis

WNN Analysis is an analytical framework designed to integrate multiple data types measured within a single cell, leading to a joint definition of cellular state (Hao et al. 2021)
- Unsupervised strategy - it does not require prior knowledge of cell types
- Operates by constructing a single WNN graph, which accurately reflects the combination of both RNA and protein data

Modality weights

WNN assigns modality weights to each individual cell
- \(W_{RNA}\) and \(W_{Protein}\)
- Weights reflect the estimated information content of that modality for each specific cell
- They determine relative importance in downstream analysis

How are weights determined?

Neighbor Identification: The algorithm independently finds the nearest neighbors for a target cell based on RNA data alone, and separately based on protein data alone
Profile Prediction: The algorithm then uses each set of neighbors to predict the target cell’s molecular profile
- Predict the cell’s RNA and protein levels from its RNA neighbors
- Predict the cell’s RNA and protein levels from its protein neighbors

Accuracy Assessment: Compare the accuracy of the predictions
- If the protein neighbors yield more accurate predictions than the RNA neighbors, the cell is assigned a higher protein modality weight
Integration: Weights are used to calculate a weighted average of RNA and protein similarities to build the final WNN graph
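The steps above can be sketched in a few lines of NumPy. This is a simplified illustration and not the exact procedure of Hao et al. (2021), who use a kernel-based affinity. Here each modality is scored by the ratio of the cross-modality prediction error to the within-modality prediction error, and a softmax converts the two scores into weights that sum to 1 for each cell. The function names, the choice of `k`, and the toy data are all assumptions.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest rows (Euclidean) for each row of X, excluding itself."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def modality_weights(rna, protein, k=5, eps=1e-8):
    """Simplified per-cell modality weights (assumed scoring rule, see lead-in)."""
    nn_rna = knn_indices(rna, k)
    nn_prot = knn_indices(protein, k)

    def pred_error(profile, nn):
        # Predict each cell as the mean of its neighbors, then measure the error
        predicted = profile[nn].mean(axis=1)
        return np.linalg.norm(profile - predicted, axis=1)

    # RNA is informative when RNA neighbors predict RNA better than protein neighbors do
    s_rna = pred_error(rna, nn_prot) / (pred_error(rna, nn_rna) + eps)
    # Likewise for protein
    s_prot = pred_error(protein, nn_rna) / (pred_error(protein, nn_prot) + eps)

    # Numerically stable softmax over the two scores
    m = np.maximum(s_rna, s_prot)
    e_rna, e_prot = np.exp(s_rna - m), np.exp(s_prot - m)
    w_rna = e_rna / (e_rna + e_prot)
    return w_rna, 1.0 - w_rna

# Toy data: protein separates two groups of cells, RNA is pure noise
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 20)
protein = rng.normal(size=(40, 2)) + 10.0 * group[:, None]
rna = rng.normal(size=(40, 5))

w_rna, w_protein = modality_weights(rna, protein)
print("mean protein weight:", w_protein.mean())
```

In this toy setting the protein neighbors carry the group structure, so cells should on average receive the larger protein weight.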

WNN analysis

Researchers can perform standard single-cell analysis tasks (like visualization and clustering) on a single, unified dataset
WNN analysis improves the ability to resolve cell states compared to independent analyses.

Modality weight intuition

Cells traditionally hard to separate by RNA (like different T cell subsets) often receive high protein weights, allowing the highly specific protein markers to dominate the cell clustering decision.
Rare progenitor cells, which often lack strong surface markers in a standard antibody panel, typically receive high RNA weights, allowing the transcriptome data to dominate and preserve their unique identity.

Advanced Analysis: Reference Mapping

Large-scale CITE-seq experiments can generate comprehensive multimodal atlases
- We will be looking at one developed by Hao et al. (2021) in class
Atlases provide well-annotated foundations for future studies

Reference mapping

Traditional analysis relies on unsupervised clustering (assuming minimal prior knowledge)
Reference mapping offers a supervised alternative
- A new single-cell experiment (the query) is interpreted using a pre-existing, well-defined atlas (the reference)
This is useful for
- Routine profiling
- Clinical contexts (like COVID-19 studies)
- Large-scale immune profiling

Step 1: Supervised PCA

Supervised PCA (sPCA) uses the WNN graph from our analysis to focus on the RNA features that are most biologically relevant.

Identify which RNA features correspond to the protein-reinforced cell identities defined by the WNN graph
- PCA maximizes total variance (possibly capturing distracting noise)
- sPCA maximizes variance that captures the structure defined by the WNN graph (Barshan et al. 2011)

WNN analysis (RNA + Protein) “supervises” the PCA to find the optimal transcriptomic gene features that define the cell states
sPCA transformation is calculated once on the reference
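The contrast between PCA and sPCA can be made concrete with a small sketch of the Barshan et al. (2011) formulation: the loadings are the top eigenvectors of \(X^T H L H X\), where \(X\) is the cells-by-genes matrix, \(H\) is the centering matrix, and \(L\) is a cells-by-cells similarity matrix. Setting \(L\) to the identity recovers ordinary PCA. In the toy example, a same-group indicator matrix stands in for the WNN graph; that stand-in, the gene names, and the simulated values are assumptions for illustration.

```python
import numpy as np

def supervised_pca(X, L, n_components=2):
    """Loadings (genes x components) maximizing variance aligned with similarity L.

    X: cells x genes matrix; L: symmetric cells x cells similarity matrix.
    """
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    Q = X.T @ H @ L @ H @ X                   # genes x genes
    Q = (Q + Q.T) / 2                         # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(Q)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]

# Toy data: one high-variance gene unrelated to cell state, one marker gene
rng = np.random.default_rng(1)
n_per_group = 100
group = np.repeat([0, 1], n_per_group)
noisy_gene = rng.normal(scale=3.0, size=2 * n_per_group)
marker_gene = np.where(group == 0, -2.0, 2.0) + rng.normal(scale=0.3, size=2 * n_per_group)
X = np.column_stack([noisy_gene, marker_gene])

# Stand-in for the WNN graph: cells in the same group are similar (assumption)
L = (group[:, None] == group[None, :]).astype(float)

pca_loading = supervised_pca(X, np.eye(len(group)), n_components=1)[:, 0]
spca_loading = supervised_pca(X, L, n_components=1)[:, 0]
print("PCA loading (noisy, marker): ", np.round(pca_loading, 2))
print("sPCA loading (noisy, marker):", np.round(spca_loading, 2))
```

Ordinary PCA should put most of its first loading on the noisy gene because it has the largest variance, while the supervised version should favor the marker gene that tracks the group structure.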

Step 2: Projection & transfer

Projection: The sPCA transformation (learned from the reference) is mathematically applied (projected) onto the new query dataset
- This can be used on scRNA-seq data in the absence of protein data
Integration: This projection places the query cells into the same low-dimensional space as the reference

Annotation Transfer: Once integrated, the query cells can be labeled from the reference, typically based on the nearest neighbors from the reference.
Visualization: The query cells can be instantly visualized by projecting them onto the reference UMAP plot
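A minimal sketch of projection and annotation transfer, under simplifying assumptions: ordinary PCA loadings learned on the reference stand in for the sPCA transformation, and labels are transferred by a plain majority vote among the nearest reference cells. Published pipelines typically use more elaborate anchor-based weighting. The function names, cell type labels, and simulated data are hypothetical.

```python
import numpy as np
from collections import Counter

def project(X, loadings, center):
    """Place cells into the reference low-dimensional space."""
    return (X - center) @ loadings

def transfer_labels(ref_embed, ref_labels, query_embed, k=5):
    """Label each query cell by majority vote among its k nearest reference cells."""
    d = np.linalg.norm(query_embed[:, None, :] - ref_embed[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return [Counter(ref_labels[i] for i in row).most_common(1)[0][0] for row in nn]

# Toy reference and query: two cell types with distinct mean expression (3 genes)
rng = np.random.default_rng(2)
means = {"CD4 T": np.array([5.0, 0.0, 0.0]), "CD8 T": np.array([0.0, 5.0, 0.0])}
ref_labels = ["CD4 T"] * 30 + ["CD8 T"] * 30
reference = np.array([means[l] for l in ref_labels]) + rng.normal(scale=0.5, size=(60, 3))
query_truth = ["CD8 T"] * 5 + ["CD4 T"] * 5
query = np.array([means[l] for l in query_truth]) + rng.normal(scale=0.5, size=(10, 3))

# The transformation is learned once on the reference (plain PCA here, as a stand-in)
center = reference.mean(axis=0)
_, _, Vt = np.linalg.svd(reference - center, full_matrices=False)
loadings = Vt[:2].T

# The same transformation is applied to both datasets, then labels are transferred
ref_embed = project(reference, loadings, center)
query_embed = project(query, loadings, center)
predicted = transfer_labels(ref_embed, ref_labels, query_embed)
print(predicted)
```

Note that the query is centered with the reference mean and multiplied by the reference loadings, so nothing is re-learned from the query; this is what allows query cells to be compared directly with reference cells.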

Benefits of multimodal reference mapping

Automated Annotation: Provides high-resolution cell type annotations that would be difficult or impossible to find with unsupervised analysis of the query data alone
Protein Imputation: Reference mapping can impute the protein expression levels (ADTs) for the query cells, even if the query experiment did not measure protein
Increased Accuracy: Using a high-quality multimodal reference improves the accuracy of cell type identification
Discovery of Compositional Shifts: Enables the detection of subtle shifts in cell type abundance across disease conditions
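Once query cells carry transferred labels, compositional shifts reduce to comparing cell type fractions between conditions. The sketch below uses invented label lists for two hypothetical samples; a real analysis would need replicate samples and a statistical test before calling a shift meaningful.

```python
from collections import Counter

def cell_type_fractions(labels):
    """Fraction of cells assigned to each cell type."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}

# Hypothetical transferred labels for two samples (made-up numbers)
healthy = ["CD4 T"] * 50 + ["CD8 T"] * 30 + ["B"] * 20
disease = ["CD4 T"] * 30 + ["CD8 T"] * 50 + ["B"] * 20

frac_healthy = cell_type_fractions(healthy)
frac_disease = cell_type_fractions(disease)

# Difference in fraction (disease minus healthy) for every cell type seen
shift = {
    ct: frac_disease.get(ct, 0.0) - frac_healthy.get(ct, 0.0)
    for ct in sorted(set(frac_healthy) | set(frac_disease))
}
for cell_type, delta in shift.items():
    print(f"{cell_type}: {delta:+.2f}")
```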

Summary of CITE-seq and WNN Analysis

Summary table courtesy of Gemini

Feature	scRNA-seq (RNA Only)	CITE-seq (RNA + ADT)	WNN Analysis
Data Types	Transcriptome (RNA)	Transcriptome + Surface Protein (ADT)	Integrates RNA and ADT data
Resolution	Limited resolution for similar cell types (e.g., T cells)	Improved resolution, linking transcriptome and surface-protein phenotype	Maximal Resolution: Solves the integration challenge
Core Concept	Quantify gene expression in single cells	Use oligo-barcodes on antibodies to sequence protein levels	Calculates cell-specific modality weights (\(W_{RNA}\), \(W_{Protein}\))
Benefit	Unbiased discovery of novel RNA states	Reduced technical noise, enhanced phenotype identification	Allows one modality to compensate for weaknesses in the other

References

Barshan, Elnaz, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. 2011. “Supervised Principal Component Analysis: Visualization, Classification and Regression on Subspaces and Submanifolds.” Pattern Recognition 44 (7): 1357–71. https://doi.org/10.1016/j.patcog.2010.12.015.

Ding, Jiarui, Xian Adiconis, Sean K. Simmons, Monika S. Kowalczyk, Cynthia C. Hession, Nemanja D. Marjanovic, Travis K. Hughes, et al. 2020. “Systematic Comparison of Single-Cell and Single-Nucleus RNA-Sequencing Methods.” Nature Biotechnology 38 (6): 737–46. https://doi.org/10.1038/s41587-020-0465-8.

Hao, Yuhan, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck, Shiwei Zheng, Andrew Butler, Maddie J. Lee, et al. 2021. “Integrated Analysis of Multimodal Single-Cell Data.” Cell 184 (13): 3573–3587.e29. https://doi.org/10.1016/j.cell.2021.04.048.

Illumina. 2025. “CITE-Seq Introduction.” https://www.illumina.com/techniques/sequencing/rna-sequencing/cite-seq.html.

Mereu, Elisabetta, Atefeh Lafzi, Catia Moutinho, Christoph Ziegenhain, Davis J. McCarthy, Adrián Álvarez-Varela, Eduard Batlle, et al. 2020. “Benchmarking Single-Cell RNA-Sequencing Protocols for Cell Atlas Projects.” Nature Biotechnology 38 (6): 747–55. https://doi.org/10.1038/s41587-020-0469-4.

Ouyang, John F. 2025. “ShinyCell2.” The Ouyang Lab. https://github.com/the-ouyang-lab/ShinyCell2.

Vargas, Derek, and Anantha Kethireddy. 2023. “Introduction to Single Cell Sequencing – Cite-Seq – Series 3.” Signios Bio. https://www.signiosbio.com/blog/introduction-to-single-cell-sequencing-cite-seq-series-3/.