15  Metagenomics

Author
Affiliation

Dr Randy Johnson

Hood College

Published

November 5, 2025

Acknowledgments

NotebookLM, Perplexity and Google were used for collecting and summarizing references while preparing these lecture notes.

What is Metagenomics?

  • Metagenomics: the analysis of the collective genomes (the microbiome) of all microbiota in a sample

  • High-throughput sequencing of genetic material recovered directly from environmental samples

  • Like other high-throughput technologies in their own fields, metagenomics has revolutionized microbiology

  • Applications to human medicine (e.g. the human microbiome)

Targeted Approach: 16S rRNA Amplicon Sequencing

  • 16S rRNA amplicon sequencing

    • Offers a quick, inexpensive way to characterize the bacteria and archaea in a sample (the prokaryotic part of the microbiome)

    • Allows annotation using extremely detailed and well-curated taxonomic databases

    • DNA is extracted from the environmental sample (e.g. fecal samples)

    • PCR is used to amplify a variable region (e.g. V4-V5) using primers constructed with adaptors and barcodes

16S rRNA

Molecular structure of a ribosomal subunit from Thermus thermophilus with protein shown in blue and rRNA shown in light brown

  • Entire gene is ~1500 bp long

  • Many regions are highly conserved across all prokaryotes

    • Forms the backbone of the small ribosomal subunit

    • Encodes the structure needed for translation of mRNA to protein

  • Other regions (V1-V9) are hypervariable and correlate strongly with taxonomy (Gray, Sankoff, and Cedergren 1984; Yang, Wang, and Qian 2016)

    • Vary in length from 10 to 100 bp

Secondary structure of 16S rRNA with variable regions annotated (Yarza et al. 2014)

16S bioinformatic pipeline

  • Major software platforms used for analysis include

    • QIIME2 (Quantitative Insights into Microbial Ecology, Python-based)
    • Mothur

Quality control and alignment

  • Raw data undergoes quality trimming to remove low-quality bases
  • Pairing of forward and reverse reads into contigs
  • Highly conserved regions provide a strong anchor
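
The trimming and read-pairing steps above can be sketched in a few lines of Python. This is a minimal illustration, not what QIIME2 or Mothur actually run: it assumes Phred+33 quality encoding, trims only from the 3' end, and merges a read pair only when the overlap matches exactly. The function names (`trim_3prime`, `merge_pair`) and the default thresholds are hypothetical choices made for this example.

```python
# Toy sketch of 3' quality trimming and paired-read merging.
# Assumptions: Phred+33 encoding, exact-match overlaps, illustrative thresholds.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_q=20):
    """Drop trailing bases whose quality is below min_q."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair into one contig using the longest exact overlap.

    The reverse read is reverse-complemented first so that both reads
    are on the same strand. Returns None if no overlap is found.
    """
    rc = reverse_complement(reverse)
    for overlap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None
```

Production tools tolerate mismatches in the overlap and use the quality scores of both reads to choose the consensus base, which this sketch does not attempt.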

Primers target highly conserved regions near hypervariable regions of 16S rRNA (Alexandrino et al. 2021)

Removal of artifacts

  • Chimeras (artifactual sequences formed during PCR when fragments from two different templates are joined) are identified and removed, as they artificially inflate diversity
  • Mitochondrial and chloroplast sequences may be filtered
  • DNA from other domains (e.g. human) is filtered

Grouping reads

  • Reads are grouped into Operational Taxonomic Units (OTUs)
  • Typically clustered using a 97% similarity threshold
  • Roughly corresponds to the conventional species-level threshold for prokaryotes
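
A minimal sketch of how reads might be grouped at a 97% threshold is given here. It uses greedy centroid clustering: each read joins the first existing OTU whose centroid it matches at or above the threshold, and otherwise starts a new OTU. Identity is approximated as one minus the edit distance divided by the longer length, which is an assumption made for this example rather than the definition used by any particular tool, and the function names are hypothetical.

```python
# Toy greedy OTU clustering at a similarity threshold.
# Assumption: identity = 1 - edit_distance / longer_length (a simplification).


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def identity(a, b):
    """Fraction of identical positions, approximated via edit distance."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def greedy_otus(reads, threshold=0.97):
    """Assign each read to the first centroid it matches, else open a new OTU.

    Reads should be ordered so that likely centroids come first
    (for example, most abundant first); that ordering is assumed, not done here.
    """
    centroids = []
    otus = []
    for read in reads:
        for idx, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                otus[idx].append(read)
                break
        else:
            centroids.append(read)
            otus.append([read])
    return otus
```

Because the result depends on read order and on which read becomes the centroid, greedy clustering of this kind is a heuristic; the example is only meant to show what the 97% cutoff does.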

Cladogram showing clustering of variants by microbial species (López-Aladid et al. 2023)

Taxonomic Assignment

  • OTUs are aligned against specialized 16S rRNA gene sequence databases for annotation
    • SILVA
    • Greengenes
    • RDP

Alpha (α)-Diversity

  • Measures diversity within a single sample

  • Rarefaction analysis assesses whether the sequencing depth was sufficient to capture the true diversity (a plateauing curve indicates sufficiency)
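
To make these two ideas concrete, the sketch below computes the Shannon index (one common alpha-diversity measure, chosen here as an example) and a simple rarefaction curve from a vector of per-taxon read counts. It is an illustration under stated assumptions: the counts are invented, only one random subsample is drawn per depth (real analyses usually average many), and the function names are hypothetical.

```python
# Toy alpha-diversity and rarefaction calculation from per-taxon read counts.
import math
import random


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def rarefaction_curve(counts, depths, seed=0):
    """Number of distinct taxa observed when subsampling to each depth.

    Reads are drawn without replacement; a single draw per depth is used
    here for brevity.
    """
    rng = random.Random(seed)
    # Expand counts into one entry per read, labelled by taxon index.
    pool = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    curve = []
    for depth in depths:
        subsample = rng.sample(pool, depth)
        curve.append(len(set(subsample)))
    return curve


# Hypothetical sample: five taxa, 100 reads in total.
example_counts = [50, 30, 15, 4, 1]
example_curve = rarefaction_curve(example_counts, depths=[10, 50, 100])
```

If the number of taxa stops increasing as the depth grows, the curve has plateaued and deeper sequencing would be unlikely to reveal many new taxa.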

Image courtesy of QIIME2view

Beta (β)-Diversity

  • Measures differences in diversity between samples or groups (similarity/dissimilarity comparison)

  • Principal Coordinates Analysis (PCoA) can be used to observe sample clustering and differences
    • Dimensionality reduction using dissimilarity matrix as input
    • Differs from traditional Principal Component Analysis (PCA), which analyzes the features directly (e.g. protein abundance)
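
The sketch below illustrates why PCoA starts from a dissimilarity matrix rather than from the features themselves. It computes Bray-Curtis dissimilarities (one common choice, assumed here), double-centres the squared matrix, and extracts the first principal coordinate by power iteration. The abundance table and function names are invented for illustration, and the power-iteration shortcut assumes the leading eigenvalue is positive and dominant; a real analysis would use a full eigendecomposition from a numerical library.

```python
# Toy PCoA: Bray-Curtis dissimilarities -> Gower centring -> first axis.
import math


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    difference = sum(abs(x - y) for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return difference / total if total else 0.0


def dissimilarity_matrix(samples):
    """Pairwise Bray-Curtis matrix for a list of abundance vectors."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]


def gower_center(d):
    """Double-centre -0.5 * d**2; its eigenvectors give the PCoA axes."""
    n = len(d)
    a = [[-0.5 * d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    return [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def first_pcoa_axis(d, iterations=200):
    """Coordinates on the first principal coordinate via power iteration."""
    b = gower_center(d)
    n = len(b)
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [x * scale for x in v]


# Hypothetical abundance table: two samples from each of two conditions.
samples = [[10, 8, 0, 0], [9, 9, 1, 0], [0, 1, 10, 9], [0, 0, 8, 10]]
axis1 = first_pcoa_axis(dissimilarity_matrix(samples))
```

In this invented table, the first two samples fall on one side of the axis and the last two on the other, which is the kind of clustering a PCoA plot is meant to reveal. The sign of the axis itself is arbitrary.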

PCoA plot of gut microbiota: Ori represents original gut microbiota; VI_bif and VI_starch represent gut microbiota fermented with VI media plus bifidobacteria-produced exopolysaccharides and starch, respectively; and VI represents gut microbiota fermented with VI media only (Liu et al. 2019).

Comprehensive Approach: Full Shotgun DNA Metagenomics

  • Full DNA Shotgun Metagenomics sequences all DNA in a sample, aiming to reconstruct genome fragments (Metagenome-Assembled Genomes, MAGs)

  • Provides functional characteristics and allows investigation of the general diversity of all organisms

  • DNA extraction should yield at least 50% microbial DNA

  • Samples are sequenced using high-throughput platforms like Illumina or Nanopore

Pipeline for shotgun data

  • Preprocessing

    • Similar to 16S, reads are demultiplexed and subject to quality control

    • Removal of host DNA by mapping raw reads to the host genome for filtering (e.g. using tools like Bowtie 2)

  • Assembly

    • Environmental / microbiome samples are highly complex

    • Specialized assemblers are used (e.g. metaSPAdes, MEGAHIT)

    • Assess the quality of the assemblies (e.g. completeness and contamination, using CheckM)

  • Annotation (functional assignment)

    • Open Reading Frames (ORFs) that encode proteins are predicted

    • Functional annotation compares predicted proteins to databases like M5nr, SEED Subsystems, or KEGG

    • Sequence information is linked to function, revealing the functional potential of the community

  • Dedicated pipelines/servers

    • MG-RAST (Metagenomics RAST): User-friendly public resource that automates quality control, annotation, and comparative analysis against multiple databases. It uses BLAST or BLAT for searching.

    • MEGAN: Links taxonomy with function using the Lowest Common Ancestor (LCA) algorithm, often comparing against NCBI taxonomy

  • Prediction of function via taxonomy shortcut

    • Tools like PICRUSt predict the abundance of gene families in microbial communities based on 16S rRNA marker gene sequences, providing a functional estimate without full shotgun sequencing
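
The Lowest Common Ancestor idea mentioned for MEGAN above can be illustrated with a short sketch: when a read matches several reference organisms, it is assigned to the deepest taxon that all of the matches share. This is a simplified illustration, not MEGAN's implementation; the lineages are written by hand for the example, and the function name is hypothetical.

```python
# Toy Lowest Common Ancestor (LCA) assignment over root-to-leaf lineages.


def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by every lineage, or None if none is shared.

    Each lineage is a list of taxon names ordered from root to leaf.
    """
    if not lineages:
        return None
    shared = None
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared = ranks[0]
        else:
            break
    return shared


# Hand-written lineages for illustration (not pulled from a database).
e_coli = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae", "Escherichia",
          "Escherichia coli"]
salmonella = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
              "Enterobacterales", "Enterobacteriaceae", "Salmonella",
              "Salmonella enterica"]
b_subtilis = ["Bacteria", "Firmicutes", "Bacilli", "Bacillales",
              "Bacillaceae", "Bacillus", "Bacillus subtilis"]

family_level = lowest_common_ancestor([e_coli, salmonella])
domain_level = lowest_common_ancestor([e_coli, salmonella, b_subtilis])
```

A read matching only the two enteric species is placed at the family level, while a read that also matches a distant organism is pushed up to the domain. This is the conservative behaviour the LCA approach is designed to give.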

Single-Cell Techniques and Advanced Analysis

  • Broadly similar to regular scRNA-seq, but with some unique challenges

  • No standardized methodologies yet (Ling et al. 2025; Gourlé et al. 2025)

  • DNA sequencing for single-cell sequencing follows the same principle as sequencing for single genomes

  • Specialized assemblers (like SPAdes) can handle single-cell genomes/mini-metagenomes

  • scRNA-seq provides high granularity, allowing full insight into the interplay of transcripts within individual cells

Challenges in scRNA-seq for metagenomics

  • Sparsity/dropout (similar to regular scRNA-seq)

    • Measurements often have large fractions of observed zeros, referred to as “dropout”
    • Combination of technical noise and the true biological absence of expression
    • Sparsity hinders downstream analysis
  • Statistical Frameworks

    • New statistical frameworks are needed to deal with the high granularity of changes and the uncertainty in clustering/cell type assignment prior to differential analysis

General computational techniques for single cell metagenomics

  • Data preprocessing and normalization

    • Necessary for deep-learning models and large datasets like metagenomics

    • Helps reduce the impact of noise and ensures comparability

    • Normalization methods include total count, upper quartile, median, DESeq2 scaling factors, and RPKM/FPKM/TPM, which normalize by gene length and number of mapped reads

  • ML / AI: MetaVelvet-DL (Liang and Sakakibara 2021)

    • Extension of the MetaVelvet assembler
    • Uses deep learning to more accurately identify and partition de Bruijn graphs
    • Improves genome assembly and resolution of individual species within complex microbial communities
  • Exploratory data analysis

    • Methods like Principal Component Analysis (PCA), t-SNE, and UMAP are employed for dimension reduction and visualization
    • Visualize global gene expression patterns
    • Cluster samples with similar profiles
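
As a worked example of the length-aware normalization listed above, the sketch below computes TPM (transcripts per million) for one sample. Each count is first divided by the gene length in kilobases, and the resulting rates are then rescaled to sum to one million. The counts, lengths, and function name are hypothetical values chosen for illustration.

```python
# Toy TPM normalization for a single sample.


def tpm(counts, lengths_bp):
    """Transcripts per million from raw read counts and gene lengths in bp."""
    # Step 1: reads per kilobase, which corrects for gene length.
    rates = [count / (length / 1000) for count, length in zip(counts, lengths_bp)]
    # Step 2: rescale so the values sum to one million, correcting for depth.
    scale = sum(rates)
    if scale == 0:
        return [0.0] * len(rates)
    return [rate / scale * 1_000_000 for rate in rates]


# Hypothetical genes: the second has twice the reads but is twice as long.
example_tpm = tpm(counts=[100, 200, 0], lengths_bp=[1000, 2000, 500])
```

In this example the first two genes receive the same TPM value, because the extra reads on the second gene are fully explained by its greater length.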

Summary

Metagenomics (16S or shotgun): characterization of entire microbial communities living in specific environments

Human microbiomes. Image courtesy of genome.gov.

Example

  • Environment: Human skin
  • Sample: Skin swabs are taken from the faces of many individuals
  • 16S approach: Characterization of microbial diversity on human faces
  • Shotgun metagenomics: Characterization of microbial diversity and of the functional / genetic diversity of microbes on human faces

Future Directions

  • Whole-genome sequencing (WGS) data is increasingly used for classification and identification
  • Single-cell methods are being developed for metagenomics
  • Ongoing improvement of sequencing platforms (like Nanopore) and computational tools (like deep learning assemblers) continues to drive the field forward

References

Alexandrino, F., J. S. Malgarin, M. A. Krieger, and L. G. Morello. 2021. “Optimized Broad-Range Real-Time PCR-Based Method for Bacterial Screening of Platelet Concentrates.” Brazilian Journal of Biology 81 (3): 692–700. https://doi.org/10.1590/1519-6984.229893.
Gourlé, Hadrien, Iryna Yakovenko, Jyoti Verma, Julian Dicken, Florian Albrecht, Linas Mažutis, Johan Normark, et al. 2025. “Scalable Single-Cell Metagenomic Analysis with Bascet and Zorn.” Microbiology. https://doi.org/10.1101/2025.06.20.660799.
Gray, Michael W., David Sankoff, and Robert J. Cedergren. 1984. “On the Evolutionary Descent of Organisms and Organelles: A Global Phylogeny Based on a Highly Conserved Structural Core in Small Subunit Ribosomal RNA.” Nucleic Acids Research 12 (14): 5837–52. https://doi.org/10.1093/nar/12.14.5837.
Liang, Kuo-ching, and Yasubumi Sakakibara. 2021. “MetaVelvet-DL: A MetaVelvet Deep Learning Extension for de Novo Metagenome Assembly.” BMC Bioinformatics 22 (S6): 427. https://doi.org/10.1186/s12859-020-03737-6.
Ling, Meilee, Judit Szarvas, Vaida Kurmauskaitė, Vaidotas Kiseliovas, Rapolas Žilionis, Baptiste Avot, Patrick Munk, and Frank M. Aarestrup. 2025. “High Throughput Single Cell Metagenomic Sequencing with Semi-Permeable Capsules: Unraveling Microbial Diversity at the Single-Cell Level in Sewage and Fecal Microbiomes.” Frontiers in Microbiology 15 (February): 1516656. https://doi.org/10.3389/fmicb.2024.1516656.
Liu, Guiyang, Huahai Chen, Junkui Chen, Xin Wang, Qing Gu, and Yeshi Yin. 2019. “Effects of Bifidobacteria-Produced Exopolysaccharides on Human Gut Microbiota in Vitro.” Applied Microbiology and Biotechnology 103 (4): 1693–1702. https://doi.org/10.1007/s00253-018-9572-6.
López-Aladid, Ruben, Laia Fernández-Barat, Victoria Alcaraz-Serrano, Leticia Bueno-Freire, Nil Vázquez, Roque Pastor-Ibáñez, Andrea Palomeque, Patricia Oscanoa, and Antoni Torres. 2023. “Determining the Most Accurate 16S rRNA Hypervariable Region for Taxonomic Identification from Respiratory Samples.” Scientific Reports 13 (1): 3974. https://doi.org/10.1038/s41598-023-30764-z.
Yang, Bo, Yong Wang, and Pei-Yuan Qian. 2016. “Sensitivity and Correlation of Hypervariable Regions in 16S rRNA Genes in Phylogenetic Analysis.” BMC Bioinformatics 17 (1): 135. https://doi.org/10.1186/s12859-016-0992-y.
Yarza, Pablo, Pelin Yilmaz, Elmar Pruesse, Frank Oliver Glöckner, Wolfgang Ludwig, Karl-Heinz Schleifer, William B. Whitman, Jean Euzéby, Rudolf Amann, and Ramon Rosselló-Móra. 2014. “Uniting the Classification of Cultured and Uncultured Bacteria and Archaea Using 16S rRNA Gene Sequences.” Nature Reviews Microbiology 12 (9): 635–45. https://doi.org/10.1038/nrmicro3330.