7  Next-Generation Sequencing

Author: Dr Randy Johnson

Affiliation: Hood College

Published: September 3, 2025

Acknowledgements

Gemini was used for brainstorming and suggesting additional topics.

Introduction to Next-Generation Sequencing (NGS)

Next-gen sequencing is a technology that allows for rapid, high-throughput, cost-effective, parallel sequencing of DNA and RNA.

The Shift from Sanger to NGS

  • Sanger sequencing
    • Predominant sequencing method from 1977 through the mid-2000s
    • It involved the chain terminator method using dideoxynucleotides (ddNTPs) to stop strand elongation, producing fragments of different lengths that were then resolved by electrophoresis.
  1. Chain terminator PCR: PCR is performed with ddNTPs added to the mix, producing many fragments of the original sequence, each terminated at a different length.
  2. Size separation by capillary gel electrophoresis: fragments are sorted by size in a gel.
  3. Laser detection of ddNTPs: a laser is used to detect which ddNTP was used to terminate the PCR reaction for each sequence size.

Three steps of Sanger sequencing (see Sanger sequencing)
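
The three steps above can be mimicked in a few lines of Python. This is only a toy sketch, not a model of the real chemistry: it assumes that a fragment terminated at every possible position is present exactly once, and the function name `sanger_read` and the example sequence are made up for illustration.

```python
import random


def sanger_read(new_strand, seed=0):
    """Toy model of Sanger sequencing for one newly synthesized strand."""
    # Step 1: chain termination. A ddNTP can stop elongation after any base,
    # so the reaction yields one fragment for every prefix of the strand.
    fragments = [new_strand[:end] for end in range(1, len(new_strand) + 1)]

    # The fragments are mixed together in the reaction tube.
    random.Random(seed).shuffle(fragments)

    # Step 2: electrophoresis sorts the fragments by size, shortest first.
    fragments.sort(key=len)

    # Step 3: the laser identifies the labeled ddNTP at the end of each fragment.
    return "".join(fragment[-1] for fragment in fragments)


example = "GATTACA"  # hypothetical sequence
recovered = sanger_read(example)
print(recovered)
```

Reading the terminal base of each fragment in order of increasing length reconstructs the sequence one position at a time, which is why resolution of fragments that differ by a single base matters.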

Challenges

Sanger sequencing is slow and expensive.

  • Sequencing larger genomes like the roughly 3,100 Mb human genome is costly

Advent of NGS

Also known as massively parallel sequencing, NGS technologies enabled the sequencing of many small DNA fragments in parallel.

  • NGS offers the ability to multiplex
  • This allows hundreds of samples to be sequenced in one run resulting in
    • Much lower costs
    • Much faster sequencing

Impact of NGS

  • Generated an explosion of “omic” data at genome-wide scale
    • Challenges for storage and processing
  • Enabled the analysis of thousands to millions of cells in single experiments
    • Opportunities for single-cell analysis
  • Opened new possibilities for
    • Global gene expression studies
    • Methylation patterns and other epigenetic markers

NGS Platforms and Their Underlying Chemistry

NGS technologies

  • Are rapidly evolving
  • Have dramatically reduced sequencing costs
  • Simplify genome assembly

These platforms generally fall into two main categories:

  • Short-read (second-generation sequencing)
  • Long-read (third-generation sequencing)

Sample purification

Torres Benito (2022) has a good explanation of sample purification.

  • Organic extraction and purification
    • Solvents like phenol and chloroform are used to separate proteins and lipids from the nucleic acids
    • DNA/RNA is precipitated from the solution using alcohol
  • Silica-based columns
    • Spin the sample through a silica matrix
    • High-salt buffer causes DNA/RNA to bind to the silica
    • Impurities are washed away
    • DNA/RNA is eluted using low-salt buffer or water
  • Magnetic bead-based purification
    • Magnetic beads are coated so that the desired DNA/RNA binds to them
    • A magnet holds the beads in place during the washing stage
    • DNA/RNA is then eluted from the beads

Methods for cleaning up DNA and RNA samples (see Torres Benito (2022))

Library preparation

An example of the distribution of read length for different protocols using Covaris Adaptive Focused Acoustics technology

Illumina (Massively Parallel, Short-Read Sequencing)

Illumina sequencing is the most frequently used platform today.

Sequencing

  • Fragments are attached to the flow cell surface.
  • Clusters of identical DNA templates (sometimes called “spots”) are formed on the flow cell.
  • Fluorescently labeled nucleotides (A, T, C, G) are added in cycles, along with DNA polymerase.
    • For each cycle, only one nucleotide can be incorporated, after which a wash step and imaging of the flow cell occur.
  • Emitted fluorescence from each cluster identifies the incorporated base.
  • Repeat to build the complementary strand.

Fragment clusters are fluorescently labeled and imaged one base at a time to detect the nucleotide sequence (“An Introduction to Next-Generation Sequencing Technology” 2025).
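
The cycle-by-cycle logic above can be sketched as a small simulation. This is a toy model under simplifying assumptions: every cluster incorporates exactly one base per cycle without error, imaging always identifies the base correctly, and each template is written in the order the polymerase encounters it. The function name `sequence_by_synthesis` and the example templates are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sequence_by_synthesis(cluster_templates, n_cycles):
    """Build one read per cluster, adding a single base in each cycle."""
    reads = ["" for _ in cluster_templates]
    for cycle in range(n_cycles):
        for index, template in enumerate(cluster_templates):
            if cycle >= len(template):
                continue  # this cluster's template is fully copied
            # Only one labeled nucleotide can be incorporated per cycle:
            # the one that pairs with the next template base.
            incorporated = COMPLEMENT[template[cycle]]
            # Imaging step: the fluorescent label reveals which base was added.
            reads[index] += incorporated
        # A wash step would remove the label here before the next cycle.
    return reads


reads = sequence_by_synthesis(["ACGT", "TTGA"], n_cycles=3)
print(reads)
```

The number of cycles sets the read length, which is one reason reads from this kind of chemistry are short.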

Read Characteristics

  • Produces short reads, typically 50-250 bp
  • Paired-end reads are produced when both ends of a longer fragment are sequenced
  • High accuracy, reported at 99.9%

Applications

  • Small whole genome sequencing (DNA)
  • Transcriptomics (RNA)
  • 16S metagenomics
  • DNA-protein interaction analysis (ChIP-Seq)
  • Multiplexing of samples using barcodes

Ion Torrent (Thermo Fisher Scientific)

  • Benchtop sequencer released in 2010
  • Generates millions of short DNA or RNA reads, similar to Illumina data
  • Its size, speed and ease of use make it ideal (see note 1) for
    • Clinical research - personalized medicine like cancer research
    • Microbiology and infectious disease - especially real-time pathogen typing and outbreak surveillance
    • Forensic DNA analysis
    • Agrigenomics

The Ion Torrent is a small, fast benchtop NGS sequencer.

Sequencing by synthesis

  • Unlike Illumina’s optical detection, Ion Torrent uses ion semiconductor sequencing technology

  • It detects hydrogen ions (H+) released during DNA polymerization

  • Each microwell on the chip contains a single template DNA

  • Unmodified deoxynucleotide triphosphates (dNTPs) are washed into the microwells, one species at a time (e.g., all A’s, then all T’s).

  • When a nucleotide is incorporated into the growing complementary strand, a covalent bond is formed, releasing pyrophosphate and a hydrogen ion, which is detected by the sensor.

Ion Torrent technology directly translates chemically encoded information (A, C, G, T) into digital information (0, 1) on a semiconductor chip (Tamang 2024b).
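
Because one nucleotide species is washed in at a time, the readout is a series of signals, one per flow. The sketch below is a toy model that assumes the signal is exactly equal to the number of bases incorporated in that flow (so a run of identical bases gives a proportionally larger signal) and that there is no noise. The flow order `TACG`, the function names, and the template are illustrative assumptions, not the instrument's actual settings.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flowgram(template, flow_order="TACG", max_flows=40):
    """Return (nucleotide, signal) for each flow over a single template."""
    position = 0
    signals = []
    for flow in range(max_flows):
        if position >= len(template):
            break  # template fully copied
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # The polymerase keeps extending while the flowed nucleotide pairs
        # with the next template base; each incorporation releases one H+.
        while position < len(template) and COMPLEMENT[template[position]] == nucleotide:
            incorporated += 1
            position += 1
        signals.append((nucleotide, incorporated))
    return signals


def call_bases(signals):
    """Translate flow signals back into the synthesized sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flowgram("AAGC")  # hypothetical template
print(signals)
print(call_bases(signals))
```

A flow with a signal of zero simply means the nucleotide on offer did not match the next template base.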

Real-Time, Single Molecule Sequencing (Long-Read Technologies)

  • These technologies overcome some limitations of short-read methods by directly sequencing native DNA without prior fragmentation and amplification steps, which can introduce errors and bias
  • Long-read technologies yield significantly longer read lengths and provide real-time sequence information

Pacific Biosciences (PacBio)

  • PacBio RS system was released in 2011
  • Produces long reads, averaging >15 Kb, with some exceeding 100 Kb

Single molecule, real-time (SMRT) sequencing

  • A single DNA polymerase is immobilized at the bottom of a nanoscale well called a zero-mode waveguide (ZMW)
  • A single DNA template is bound to the polymerase
  • Four fluorescently labeled nucleotides are introduced
  • As each labeled nucleotide is incorporated, a fluorescent light pulse is produced

PacBio sequencing happens inside a SMRT cell containing millions of tiny wells (“How PacBio Works” 2025).

Read Characteristics

  • Advantages: Long reads are valuable for
    • Closing small genomes
    • Spanning long, repetitive regions
    • Determining the methylome (epigenetics) of microorganisms
  • Drawbacks:
    • Higher sequencing costs
    • Higher error rates (especially insertion and deletion errors) compared to Illumina

Oxford Nanopore Technologies (ONT)

  • The MinION device was released in 2014
  • Known for its portability

Nanopore sequencing

  • A single-stranded DNA molecule is driven by an applied voltage (e.g., 180 mV)
  • Passes through a nanopore
  • As DNA threads through the nanopore, it causes distinct changes in the ionic current
    • These changes are characteristic of the specific nucleotide bases occupying the pore at any given moment

Oxford nanopore sequencing measures ionic current changes as a DNA strand passes through the nanopore (Tamang 2024a)

Read characteristics

  • Produces variable and very long reads, comparable to PacBio, sometimes up to 900 Kb
  • MinION devices have 512 nanopores, each sequencing approximately 70 bp/second.
  • Advantages:
    • Extreme portability (USB sequencer)
    • Real-time data availability
    • Sequencing can be cheaper than PacBio.
  • Drawbacks:
    • Initially had a high error probability, though recent flow cells (e.g., R10.4.1) report raw read accuracy >99%
    • Requires high-quality DNA input and efficient command-line data analysis
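
The pore count and per-pore speed quoted above give a rough upper bound on throughput. The arithmetic below assumes that all 512 pores are active and sequencing continuously for the whole run, which will not hold in practice, so treat the result as a ceiling rather than an expected yield. The function name is hypothetical.

```python
PORES = 512            # nanopores per device, as quoted above
BASES_PER_SECOND = 70  # approximate speed per pore, as quoted above


def theoretical_yield_bp(hours, pores=PORES, rate=BASES_PER_SECOND):
    """Upper bound on bases sequenced if every pore ran without interruption."""
    return pores * rate * hours * 3600


per_second = PORES * BASES_PER_SECOND
one_hour = theoretical_yield_bp(1)
print(f"{per_second:,} bp per second")
print(f"{one_hour:,} bp per hour")
```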

Raw Data Generation and Quality Assessment

  • Sequence Reads
    • Millions of short or long reads
    • Represent fragments of the original DNA/RNA
  • Data format, typically FASTQ
    • Raw sequence reads
    • Stores both the nucleotide sequence and its corresponding quality scores for each base
  • Quality scores, typically Phred Score (Q)
    • Logarithmic measure that reflects the likelihood that a base call is incorrect
    • Formula: Q = -10 log10(p), where ‘p’ is the probability of an incorrect base call
      • Phred score of 10 (Q10) means 1 error in 10 base calls (90% accuracy)
      • Phred score of 20 (Q20) means 1 error in 100 base calls (99% accuracy)
      • Phred score of 30 (Q30) means 1 error in 1,000 base calls (99.9% accuracy) and is generally considered a desirable minimum threshold for high-quality data
      • Higher Phred scores indicate greater confidence in the reported base
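
As a minimal sketch of how these pieces fit together, the code below parses a single four-line FASTQ record and converts its quality string to Phred scores and error probabilities using Q = -10 log10(p). It assumes the common Sanger / Illumina 1.8+ encoding, in which each quality character is the Phred score plus an ASCII offset of 33. The record itself and the function names are made up for illustration.

```python
import math


def phred_to_error_probability(q):
    """Invert Q = -10 * log10(p) to get the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def parse_fastq_record(lines, offset=33):
    """Parse one 4-line FASTQ record into (read name, sequence, Phred scores)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Each quality character encodes one score as chr(Q + offset).
    scores = [ord(character) - offset for character in quality]
    return header[1:], sequence, scores


record = ["@read1", "GATTACA", "+", "II5+?I!"]  # hypothetical read
name, sequence, scores = parse_fastq_record(record)
for base, q in zip(sequence, scores):
    print(base, q, phred_to_error_probability(q))
```

Because the scale is logarithmic, every increase of 10 in Q corresponds to a tenfold drop in the error probability.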

Initial quality control (QC)

  • Assessing the quality of raw data is a critical first step
  • Tools like FastQC are used to generate QC reports, checking for
    • Consistency
    • Base call quality
    • Overall data characteristics like GC-content distribution
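
To give a feel for what such a report summarizes, here is a small sketch that computes two of the quantities mentioned above: GC content per read and mean base call quality per read position. It is not FastQC's implementation, only an illustration; the function names and the example values are hypothetical.

```python
from statistics import mean


def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def per_position_mean_quality(quality_lists):
    """Mean Phred score at each read position, across reads of any length."""
    longest = max(len(qualities) for qualities in quality_lists)
    return [
        mean(q[position] for q in quality_lists if position < len(q))
        for position in range(longest)
    ]


reads = ["GGCC", "ATGC", "ATAT"]                # hypothetical reads
qualities = [[40, 30, 20], [20, 30]]            # hypothetical Phred scores
print([gc_fraction(read) for read in reads])
print(per_position_mean_quality(qualities))
```

A drop in mean quality toward the end of the reads is a typical pattern that motivates trimming in the preprocessing stage.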

NGS Data Analysis

  • Bioinformatics helps us
    • Extract biological meaning
    • Discover relationships between molecular mechanisms and phenotypic behavior
  • General workflow
    • Preprocessing and QC
    • Assembly
    • Annotation
    • Quantification
    • Downstream interpretation

Data preprocessing

  • Quality control (QC) using tools like FastQC
    • Evaluate the raw sequencing data to ensure it meets acceptable quality thresholds
  • Demultiplexing
    • DNA barcodes added to each sample during library preparation distinguish the samples from one another
    • Barcoded samples can be pooled and sequenced together to save on costs; demultiplexing then assigns each read back to its sample by barcode
  • Trimming using tools like Trimmomatic to remove
    • Low-quality bases from the ends of reads
    • Adapter sequences
    • Barcodes
  • Filtering removes reads that
    • Are too short
    • Contain too many ambiguous bases
    • Contain potential sequencing errors (e.g. reads with low-frequency k-mers, which often arise from sequencing errors)
  • Denoising
    • Computational methods to correct errors introduced during the sequencing process
    • Denoising algorithms use statistical models that consider quality scores to estimate the probability of sequence errors

Tip

Denoising is especially important in 16S rRNA amplicon sequencing, where it helps to:

  • Distinguish low-frequency variants (real) from PCR-amplified sequencing errors (not real)
  • Estimate the number of truly unique sequences, used as a proxy for the number of species / OTUs
  • Separate closely related species with highly similar amplicon sequences

  • Quality control
    • Processed reads are re-evaluated to assess the efficacy of the preceding steps
  • Normalization of expression and abundance
    • Essential for ensuring comparability across different datasets and samples
    • Important for
      • RNA expression
      • Metabolomics
      • Proteomics
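
Three of the preprocessing steps above (demultiplexing, trimming and filtering) can be sketched in plain Python. This is a simplified illustration and not a reimplementation of Trimmomatic or any other tool: it assumes the barcode sits at the start of each read and must match exactly, it trims only from the 3' end, and the thresholds, function names, barcodes and reads are all hypothetical.

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample by its leading barcode, then strip the barcode."""
    by_sample = {sample: [] for sample in barcodes.values()}
    by_sample["unassigned"] = []
    for sequence, qualities in reads:
        for barcode, sample in barcodes.items():
            if sequence.startswith(barcode):
                n = len(barcode)
                by_sample[sample].append((sequence[n:], qualities[n:]))
                break
        else:  # no barcode matched
            by_sample["unassigned"].append((sequence, qualities))
    return by_sample


def trim_low_quality_3prime(sequence, qualities, min_quality=20):
    """Remove bases from the 3' end until one reaches min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def passes_filter(sequence, min_length=5, max_n_fraction=0.1):
    """Reject reads that are too short or have too many ambiguous (N) bases."""
    if len(sequence) < min_length:
        return False
    return sequence.count("N") / len(sequence) <= max_n_fraction


reads = [
    ("ACGTGATTACAGG", [30] * 11 + [10, 5]),  # sample A, poor 3' end
    ("TTAGNNNNNN", [30] * 10),               # sample B, mostly ambiguous
    ("GGGGCATCAT", [30] * 10),               # unknown barcode
]
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}

clean = {}
for sample, sample_reads in demultiplex(reads, barcodes).items():
    if sample == "unassigned":
        continue
    trimmed = [trim_low_quality_3prime(s, q) for s, q in sample_reads]
    clean[sample] = [(s, q) for s, q in trimmed if passes_filter(s)]
print(clean)
```

Running quality control again on the cleaned reads shows whether the chosen thresholds removed the problems without discarding too much data.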

Sequence assembly

Goal: Reconstruct longer, contiguous DNA sequences (contigs, scaffolds, or even full chromosomes) from many sequence reads.

Illustration of read assembly into contigs, mapping those contigs onto a scaffold and assembling scaffolds into full chromosomes (Guo et al. 2017)

  • De novo assembly
    • Used when no reference genome is available
    • Used to detect structural variation (e.g. large indels, chromosomal inversions / duplications)
    • SPAdes is a popular tool for genomes and single-cell sequencing
  • Reference-guided assembly/mapping
    • Aligning reads to an existing, closely related reference genome
    • Bowtie is used for aligning sequence reads in RNA-seq
    • HISAT2 is used for spliced read mapping
    • Samtools handles mapped reads
  • Assembly quality assessment
    • N50: length of the shortest contig such that 50% of the assembly is covered by contigs of this length or longer
    • NG50: N50 but with reference to the known genome size
    • Completeness:
      • How many genes are represented when mapping RNA-seq data?
      • For de novo assembly, Benchmarking Universal Single-Copy Orthologs (BUSCO) is a set of core genes that are highly conserved and expected to be present across a specific taxonomic lineage
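
The N50 and NG50 definitions above translate directly into code: sort the contigs from longest to shortest and walk down the list until half of the target length is covered. The sketch below follows those definitions; the function name and the contig lengths are made up for illustration.

```python
def n50(contig_lengths, genome_size=None):
    """N50 of an assembly, or NG50 when a known genome size is supplied.

    Returns None if the contigs cover less than half of the target size,
    in which case NG50 is undefined.
    """
    target = genome_size if genome_size is not None else sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target / 2:
            return length
    return None


contigs = [80, 70, 50, 40, 30, 20, 10]  # hypothetical contig lengths, total 300
print(n50(contigs))                      # 80 + 70 = 150 covers half of 300
print(n50(contigs, genome_size=400))     # needs 200, reached at the 50 contig
print(n50(contigs, genome_size=1000))    # contigs cover less than half
```

Because NG50 uses the known genome size instead of the assembly size, it allows assemblies of different total length to be compared on the same footing.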

Annotation

Goal: Identify and label all relevant biological features within the assembled genomic sequence.

  • Prediction of coding genes

  • Identification of other structural genes (e.g. rRNA genes)

  • Functional prediction
    • Existing protein databases can be used to infer function based on similarity (e.g. NCBI’s GenPept, UniProt)
  • Functional annotation (e.g. using Gene Ontology (GO) terms)
  • Common tools:
    • RAST (Rapid Annotation using Subsystems Technology) for microbial genomes
    • BlastKOALA for single genomes
    • GhostKOALA for metagenomes
    • Prokka for prokaryotic genome annotation
    • Prodigal for prokaryotic gene recognition
  • Challenges: Automated annotation is fast, but different servers and databases can give dissimilar results, which lowers confidence; manual curation is often required for accuracy.

Quantification and differential expression analysis

Goal: Measure gene expression or protein abundance and identify changes between different experimental conditions.

  • Transcriptomics (RNA-seq)
    • Measures the level of transcribed genes to detect active genes genome-wide
  • Quantification
    • After alignment of reads to a reference genome, number of reads mapping to each gene or transcript is quantified
    • Differential gene expression (DGE): Statistical analysis to identify genes whose expression levels significantly change between conditions (e.g., treated vs. control samples)
  • Tools:
    • DESeq2
    • edgeR
    • baySeq
    • DRIMSeq
    • Sleuth
    • StringTie
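
To make the quantification step concrete, the sketch below scales raw read counts to counts per million (CPM), so that samples sequenced to different depths can be compared, and then computes a log2 fold change per gene. This is deliberately naive: the tools listed above use more careful normalization and statistical models with replicates to decide what is significant, none of which is attempted here. The gene names, counts, pseudocount and function names are hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2(treated / control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
    }


control_counts = {"geneA": 100, "geneB": 300, "geneC": 600}  # 1,000 reads
treated_counts = {"geneA": 800, "geneB": 600, "geneC": 600}  # 2,000 reads

lfc = log2_fold_change(
    counts_per_million(treated_counts),
    counts_per_million(control_counts),
)
for gene, value in lfc.items():
    print(gene, round(value, 2))
```

Note that geneB has twice as many raw reads in the treated sample but no change after scaling, because the treated library is twice as deep.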

Downstream Analysis and Interpretation

Statistical analysis and visualization

Volcano plots visualize differential expression / abundance in context with statistical significance. Arshad (2024) gives step-by-step instructions for creating volcano plots in R.

log2 fold change (x-axis) is plotted against -log10 p-value (y-axis) (Arshad 2024)
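
The coordinates of a volcano plot are simple to compute, as the sketch below shows. It transforms each p-value to -log10(p) and labels each gene using cutoffs on both axes. The cutoffs (an absolute log2 fold change of 1 and p < 0.05) are common conventions but are assumptions here, the input values are invented, and in practice p-values adjusted for multiple testing would normally be used.

```python
import math


def volcano_coordinates(results, fc_threshold=1.0, p_threshold=0.05):
    """Return (gene, x, y, label) with x = log2 fold change, y = -log10(p)."""
    points = []
    for gene, log2_fc, p_value in results:
        y = -math.log10(p_value)
        if p_value < p_threshold and log2_fc >= fc_threshold:
            label = "up"
        elif p_value < p_threshold and log2_fc <= -fc_threshold:
            label = "down"
        else:
            label = "not significant"
        points.append((gene, log2_fc, y, label))
    return points


results = [  # hypothetical (gene, log2 fold change, p-value)
    ("geneA", 2.5, 0.0001),
    ("geneB", -1.8, 0.001),
    ("geneC", 0.2, 0.5),
    ("geneD", 3.0, 0.2),
]
points = volcano_coordinates(results)
for point in points:
    print(point)
```

Genes of interest end up in the upper left and upper right corners: a large change in either direction combined with a small p-value. geneD illustrates why both axes matter, since its large fold change is not statistically supported.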

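MA-plots visualize differential expression or abundance by plotting the log fold change between two groups (M) against the average log signal (A). Red points represent genes that are upregulated in Group 1 compared to Group 2, and blue points represent downregulated genes (Kassambara 2016).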

Example MA-plot from the ggpubr package.
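
The two axes of an MA-plot are computed per gene from the two group values: M is the difference of the log2 values (the log fold change) and A is their average. The sketch below shows only that calculation, not the plotting done by ggpubr; the pseudocount, gene names, values and function name are assumptions for illustration.

```python
import math


def ma_values(group1, group2, pseudocount=1.0):
    """Return {gene: (M, A)} where M = log2 ratio and A = mean log2 signal."""
    values = {}
    for gene in group1:
        log_1 = math.log2(group1[gene] + pseudocount)
        log_2 = math.log2(group2[gene] + pseudocount)
        values[gene] = (log_1 - log_2, (log_1 + log_2) / 2)
    return values


group1 = {"geneA": 1023, "geneB": 63}  # hypothetical mean counts
group2 = {"geneA": 255, "geneB": 63}
ma = ma_values(group1, group2)
print(ma)
```

Plotting M against A makes it easy to see whether large fold changes are concentrated among weakly expressed genes, where estimates are noisier.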

Principal Component Analysis (PCA) can be used for dimensionality reduction and visualizing data clustering (Cheng (2022) has a nice discussion of PCA).

Example PCA plot generated by Casey Cheng.
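
For readers who want to see the mechanics, the sketch below finds the first principal component of a small data set: it centers the data, builds the covariance matrix, and uses power iteration to find the direction of greatest variance. It is written in plain Python for transparency and assumes the starting vector is not perpendicular to that direction; real analyses would use a numerical library instead. The function name and data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction of greatest variance, score of each row on it)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Sample covariance matrix of the centered data.
    covariance = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeated multiplication converges to the top eigenvector.
    vector = [1.0] * d
    for _ in range(iterations):
        product = [sum(covariance[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            raise ValueError("starting vector has no component along the data variance")
        vector = [x / norm for x in product]

    scores = [sum(c[j] * vector[j] for j in range(d)) for c in centered]
    return vector, scores


data = [(1, 2), (2, 4), (3, 6), (4, 8)]  # hypothetical samples lying on y = 2x
direction, scores = first_principal_component(data)
print(direction)
print(scores)
```

In this example all of the variation lies along a single line, so one component captures it completely and the two-dimensional data can be described by one score per sample.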

Uniform manifold approximation and projection (UMAP) is a more appropriate dimensionality reduction technique for data that violate the linearity assumption of PCA. Mueler and Casimo (2025) have a good description of UMAP with examples.

UMAP of ~4 million single-cell transcriptomes from adult mouse brain, labeled by source brain region (Yao et al. 2023)

Enrichment or pathway analysis can identify biological pathways or gene sets that are statistically over-represented in a list of differentially expressed genes or proteins. For example, the following resources are widely used:

  • KEGG (Kyoto Encyclopedia of Genes and Genomes)
  • Reactome
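
One common way to test for over-representation is the hypergeometric test, which asks how likely it is to see at least the observed overlap between a gene list and a pathway by chance. The sketch below is a minimal version under that assumption; it does not query KEGG or Reactome, it ignores multiple-testing correction, and the gene counts and function name are invented for illustration.

```python
from math import comb


def enrichment_p_value(population, pathway_size, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `pathway_size` belong to the pathway."""
    largest_overlap = min(pathway_size, selected)
    total = comb(population, selected)
    favorable = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, largest_overlap + 1)
    )
    return favorable / total


# Hypothetical numbers: 20 genes measured, 5 of them in the pathway,
# 4 differentially expressed genes, 3 of which are in the pathway.
p_value = enrichment_p_value(population=20, pathway_size=5, selected=4, overlap=3)
print(round(p_value, 4))
```

A small p-value suggests that the pathway is represented in the gene list more often than expected by chance. When many pathways are tested, the p-values need to be corrected for multiple testing.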

Quantitative protein data can be used to reconstruct protein interactions and signaling networks. The STRING database provides quality-controlled protein-protein association networks.

Machine learning applications

Increasingly used to glean insights from complex biological data.

  • Sample Clustering
  • Classification
  • Deep Learning (e.g. AlphaFold for protein structure prediction)

Challenges in NGS Data Science

  • Data volume and storage

  • Computational resources

  • Data quality issues

    • Sparsity (e.g. in scRNA-seq, a large fraction of observed zeros (dropout events) can hinder downstream analysis)
    • Missing values (e.g. low-abundance proteins can result in missing values)
    • Noise and errors (e.g. technical variability)
  • Model interpretability, especially in deep learning

  • Data integration

“An Introduction to Next-Generation Sequencing Technology.” 2025. Illumina. https://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf.
Arshad, Tanzeela. 2024. “Step-by-Step Guide to Creating a Volcano Plot RNA-Seq.” Data Science for Bio. https://datascienceforbio.com/volcano-plot-rna-seq/.
Cheng, Casey. 2022. “Principal Component Analysis (PCA) Explained Visually with Zero Math.” Towards Data Science. https://towardsdatascience.com/principal-component-analysis-pca-explained-visually-with-zero-math-1cbf392b9e7d/.
Guo, Yan, Yulin Dai, Hui Yu, Shilin Zhao, David C. Samuels, and Yu Shyr. 2017. “Improvements and Impacts of GRCh38 Human Reference on High Throughput Sequencing Data Analysis.” Genomics 109 (2): 83–90. https://doi.org/10.1016/j.ygeno.2017.01.005.
“How PacBio Works.” 2025. Blog. Pacific Biosciences. https://www.pacb.com/technology/hifi-sequencing/how-it-works/.
Kassambara, Alboukadel. 2016. “Ggpubr: ’Ggplot2’ Based Publication Ready Plots.” https://doi.org/10.32614/CRAN.package.ggpubr.
Mueler, Madison, and Kaitlyn Casimo. 2025. “What Is a UMAP.” Blog. Allen Institute. https://alleninstitute.org/resource/what-is-a-umap/.
Tamang, Sanju. 2024a. “Oxford Nanopore Sequencing: Principle, Protocol, Uses, Diagram.” Blog. Microbe Notes. https://microbenotes.com/oxford-nanopore-sequencing/.
———. 2024b. “Ion Torrent Sequencing: Principle, Steps Method, Uses, Diagram.” Blog. Microbe Notes. https://microbenotes.com/ion-torrent-sequencing/.
Torres Benito, Laura. 2022. “Clean up Your Act: DNA and RNA Purity.” Blog. BioEcho. https://www.bioecho.com/usd-na/blog/clean-up-your-act-dna-and-rna-purity.
Yao, Zizhen, Cindy T. J. Van Velthoven, Michael Kunst, Meng Zhang, Delissa McMillen, Changkyu Lee, Won Jung, et al. 2023. “A High-Resolution Transcriptomic and Spatial Atlas of Cell Types in the Whole Mouse Brain.” Nature 624 (7991): 317–32. https://doi.org/10.1038/s41586-023-06812-z.

  1. See key Ion Torrent applications