16 Review

Author

Affiliation

Dr Randy Johnson

Hood College

Published

November 12, 2025

Acknowlegments

NotebookLM, Perplexity and Google were used for collecting and summarizing references while preparing these lecture notes.

Overview

This week we will be reviewing 2 recent papers the cover the topics we have been learning about this term.
Each of you will have the opportunity to select, review and share a paper with the class next week.
We will also review some sample exam questions (see slides for these questions).

Papers

Logsdon et al. (2025) discusses a study including 65 diverse human genomes, which they used to close many existing gaps in the current human reference genome
Mubarak and Zahir (2022) discusses recent transcripomics studies that have lead to advances in personalized medicine

Genomics for a Better Reference

Logsdon et al. (2025) details the sequencing and analysis of 65 diverse human genomes to improve the understanding of human genetic diversity and enhance genomic reference resources.

Background required to understand the paper

Diverse sets of complete human genomes are required
- Construct a robust pangenome reference
- Characterize complex structural variation (SV)

Long-Read Sequencing technologies
- Generate the first complete human genome sequence
- Required for highly contiguous diploid assemblies
- Significantly increases sensitivity for detecting SVs (variants >50 bp in length)

Phasing
- Long-read data is coupled with phasing data
- Hi-C, Strand-seq or trio data to assemble both haplotypes accurately

Previous Gaps
- Gaps continue to persist genome assemblies
- Genetically complex loci are especially difficult to close
- Examples: centromeres, MHC, highly repetitive regions

Phasing with trios

Parent-child trios are sequenced
Enough genetic diversity exists between parents that haplotypes in the child are unambiguous

Phasing with Hi-C

DNA is cross-linked with formaldehyde
Links are created between DNA that is nearby in 3D space but potentially far appart in the 2D sequence
When DNA is fragmented, chimeric reads are created
Paired-end chimeric reads are aligned to the linear genome
Read pairs that originate from distant locations are inferred to be from the same haplotype

Strand-seq

mRNA is transcribed from a single DNA molecule
Complete mRNA strands are mapped back to the DNA to determine haplotypes

What fundamental questions does the paper address?

The paper addresses how to create a comprehensive and diverse genomic resource to resolve the most challenging regions of the human genome and utilize this resource to improve variant detection

What is the full extent of complex structural variation across diverse human populations?
Can near-complete, phased assemblies of diverse genomes be achieved, closing gaps, particularly at centromeres and complex segmental duplications (SDs)?

Can complex, medically relevant loci (e.g. MHC/HLA) be sequenced to complete continuity?
Can these highly contiguous assemblies improve genotyping accuracy and the detection of rare SVs from standard short-read data?

Study design

Aim: produce a genetically diverse sampling of nearly gapless chromosomes
Haplotype Assembly: 130 haplotype-resolved assemblies were constructed from 65 diploid individuals
The approach combined the complementary strengths of PacBio HiFi reads and ultra-long Oxford Nanopore reads
Additional multiomics data includes Strand-seq, Hi-C, RNA-seq, Iso-seq

Pangenome Integration:
- Newly assembled genomes were combined with the draft pangenome reference
- Create a pangenome graph for subsequent genome inference and genotyping experiments
Multiple variant callers were used, including: PAV, DipCall, SVIM-asm, SV-Pop

Sample collection

65 human lymphoblastoid cell lines were selected for sequencing
63 of 65 came from the 1000 Genomes Project (1kGP) cohort

Additional two samples selected to add diversity
- African, American, East Asian, European, South Asian
- 28 population groups
- Maximize genetic and Y chromosome lineage diversity

Coverage:
- ~47-fold for PacBio HiFi
- ~56-fold for Oxford Nanopore (about 36-fold ultra-long)
Three family trios included for validation
12 individuals had long-read Iso-Seq data generated for functional analysis

Data analysis

130 resulting haploid assemblies were
- Highly contiguous (median contig size of 137 Mb)
- Highly accurate (median base-pair quality value between 54 and 57)
- Estimated to be 99% complete for known single-copy genes

Assemblies closed 92% of previously reported gaps
602 chromosomes were assembled as a single gapless contig (telomere to telomere (T2T))
An additional 559 as a single scaffold

Variant Discovery
- Phased assemblies identified 188,500 SVs, 6.3 million indels, and 23.9 million single-nucleotide variants
- Overall size of the SV callset increased by 59% compared to previous efforts

Segmental duplications
- Average of 168.1 Mb of segmental duplications were found per human genome
- Genomes with African ancestry had the highest number (3.97 Mb per individual)
1,247 Complex Structural Variations were identified across all genomes
1,246 centromeres were completely and accurately assembled

Genotyping and pangenome

Pangenome graph was used to genotype variants across 3,202 1kGP individuals using short reads
- Average of 26,115 SVs detected per genome

Figure 1g from Logsdon et al. (2025), showing the number of new SVs per individual, colored by population.

Genome reconstruction from short reads using this method reached a median k-mer-based quality value (QV) of 45
More rare SVs inferred than previous short-read methods
- 1,490 rare SVs per African individual (compared to 382 previously)
New assemblies increased accuracy to 86.3% (up from ~81%) for medically relevant genes

Conclusions

Successfully sequenced 65 diverse human genomes, producing 130 highly contiguous haplotype-resolved assemblies
Reducing remaining gaps in human genome references by 92%

Assemblies provided complete sequence continuity for highly complex loci like MHC, and fully resolved 1,246 human centromeres

Figure 3c from Logsdon et al. (2025), showing high-resolution complexity of repeat maps and locations of gene/pseudo gene in the HLA-DR region

Centromeric analysis revealed
- Substantial variation in \(\alpha\)-satellite length (up to 30-fold variation)
- Documented mobile element insertions into these regions

Figure 5d from Logsdon et al. (2025), showing diversity in centromere size on chromosome 2

Integrating these assemblies into a pangenome reference significantly increased genotyping accuracy when using short-read data only
Detection of a substantially higher number of SVs per individual (26,115 SVs on average)

Transcriptomics for Precision and Personalized Medicine

Mubarak and Zahir (2022) reviews recent recent transcriptomics contributions toward personalized and precision medicine.

Background required to understand the paper

Transcriptomics is the study of all RNA molecules in a sample
- mRNA, which codes for protein
- rRNA, which codes for ribosomes
- tRNA, is used in protein synthesis
- Other non-coding RNAs like lncRNA and miRNA

~70–80% of the human genome is transcribed, but only about 2% is protein-coding
Epitranscriptomics refers to the study of post-transcriptional chemical modifications (e.g. methylation) of RNA
Precision vs. Personalized Medicine
- Precision medicine seeks to pinpoint the precise cause of a disease
- Personalized medicine seeks to identify the cause of the disease within individual contex (e.g. genomic background, health metrics, environment)

What fundamental questions does the paper address?

Review paper: synthesizes many studies to form general conclusions
What are the impacts of transcriptomics and epitranscriptomics on personalized and precision medicine efforts?

What are the major efforts concerning transcriptomics/epitranscriptomics in key disease areas:
- Cancer
- Cardiovascular disease (CVD)
- Neurodevelopmental disorders (ND)

How are advances being made in both basic research into disease pathophysiology and diagnostics?
How does the specialized area of epitranscriptomics contribute to the understanding and study of neurodevelopmental disorders?

Study design

This is a review article. It does not report on an original research study design but rather synthesizes findings from existing research:

Summarizes and overviews major transcriptomics and epitranscriptomics efforts
Focuses on major advances in transcriptomics for three diseases: cancer, CVD, and ND
The recently emerging field of epitranscriptomics is discussed
- Focusing in depth on its contributions to neurodevelopmental disease

Sample collection

As a review, the authors did not perform sample collection; however, the paper details various sample types and methodologies employed by the studies being reviewed:

Typical preparation involves
- Isolating total RNA
- Depletion of ribosomal RNA
- Sometimes mRNA is pulled down via its poly-A tail prior to sequencing

Sample Types in Reviewed Studies
- Single-cells
- Tissue-level samples (e.g. cardiac, brain, blood, saliva)
- Spatial transcriptomics for cancer diagnostics and tracing tumor evolution

Data analysis

The primary output discussed for both microarray and RNAseq assays is a differential gene expression (DEG) profile. Data analysis in the reviewed fields emphasizes:

Microarray experiments comparing gene expression levels between samples (e.g. two-color arrays) or comparing levels to a normal reference (one-color arrays)
Regulatory RNA characterization, including lncRNA and miRNA

Analysis methods include
- Transcriptome-wide association studies
- eQTL (expression quantitative trait locus/loci) characterization
Transcriptomics assays often work best when analyzed in the context of multi-omics
- These serve to validate each other
- Present a more complete picture of genomic activity and biological impact

Conclusions

Significant progress has been made in cancer, CVD, and, to some extent, in ND
Advances are occurring in three areas:
- Characterizing normal and diseased tissue states
- Developing accurate diagnostic profiles that are both personalized and precise
- Investigating the impact of environmental interventions and therapies

Epitranscriptomics, especially m6A RNA modifications, appears to be especially impactful in ND
- The brain appears particularly susceptible to epitranscriptome signaturing
- Brain epitranscriptome is mechanistically significant for brain plasticity

Therapeutic inroads have been published for
- Cancer (e.g. precision tumor drug targeting)
- CVD (risk mitigation via lifestyle interventions)
- Precision therapeutics for ND is currently lagging

As the costs of these screens reduce and databases of transcriptome profiles grow in size and rigor, interpretation of transcriptomics will become easier
- Leading to more advances in diagnostics, prophylaxis, and therapeutics

Specific highlights

Haferlach and Schmidts (2020) reported the use of transcriptomics (among other technologies) for acute myeloid leukemia (AML) diagnosis.

Figure 1 from Haferlach and Schmidts (2020)

Aghdam et al. (2019) reported on the use of miRNAs in prostate cancer prognosis, diagnosis and perhaps treatment.

Illustration of the genesis and impact of an miRNA on gene expression (Venneri and Passantino 2023)

Li et al. (2021) showed strong correlation between lung cancer survival and specfiic long noncoding RNAs (lncRNA).

References

Aghdam, Arad Mobasher, Atefeh Amiri, Reza Salarinia, Aria Masoudifar, Faezeh Ghasemi, and Hamed Mirzaei. 2019. “MicroRNAs as Diagnostic, Prognostic, and Therapeutic Biomarkers in Prostate Cancer.” Critical Reviews in Eukaryotic Gene Expression 29 (2): 127–39. https://doi.org/10.1615/CritRevEukaryotGeneExpr.2019025273.

Haferlach, Torsten, and Ines Schmidts. 2020. “The Power and Potential of Integrated Diagnostics in Acute Myeloid Leukaemia.” British Journal of Haematology 188 (1): 36–48. https://doi.org/10.1111/bjh.16360.

Li, Albert, Wen-Hsuan Yu, Chia-Lang Hsu, Hsuan-Cheng Huang, and Hsueh-Fen Juan. 2021. “Modular Signature of Long Non-Coding RNA Association Networks as a Prognostic Biomarker in Lung Cancer.” BMC Medical Genomics 14 (S3): 290. https://doi.org/10.1186/s12920-021-01137-0.

Logsdon, Glennis A., Peter Ebert, Peter A. Audano, Mark Loftus, David Porubsky, Jana Ebler, Feyza Yilmaz, et al. 2025. “Complex Genetic Variation in Nearly Complete Human Genomes.” Nature 644 (8076): 430–41. https://doi.org/10.1038/s41586-025-09140-6.

Mubarak, Ghada, and Farah R. Zahir. 2022. “Recent Major Transcriptomics and Epitranscriptomics Contributions Toward Personalized and Precision Medicine.” Journal of Personalized Medicine 12 (2): 199. https://doi.org/10.3390/jpm12020199.

Venneri, Maria, and Andrea Passantino. 2023. “MiRNA: What Clinicians Need to Know.” European Journal of Internal Medicine 113 (July): 6–9. https://doi.org/10.1016/j.ejim.2023.05.024.