8 NGS Assembly & Annotation

Author: Dr Randy Johnson

Affiliation: Hood College

Published: September 10, 2025

Acknowledgements

Preparation of these notes included brainstorming and summarization by Gemini.

Genome Assembly and Annotation

Recap

Last week, we discussed NGS technologies, which generate millions to billions of short or long DNA reads.
These raw reads represent fragments of an organism’s entire genome or transcriptome.
These technologies have revolutionized biology, making genome-wide studies feasible and affordable.

The Challenge: Making Sense of the Data

The vast amount of raw sequencing data presents significant computational challenges for storage and processing.
Raw reads are meaningless on their own; they need to be pieced together and interpreted to extract biological information.
This week, we will focus on two critical bioinformatics processes that turn raw sequencing data into useful biological knowledge:
- Genome Assembly
- Genome Annotation

Initial Steps in Data Processing

Base Calling: Converting the raw electronic signals from the sequencing machine into nucleotide information (A, T, C, G) and assigning a quality score (Phred score) to each base call.
Trimming and Quality Control: Removing low-quality bases, adapter sequences, and filtering out low-quality reads to reduce noise and errors. Commonly used programs:
- FastQC
- Trimmomatic
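The relationship between a Phred score and the error probability it encodes, and the idea behind quality trimming, can be illustrated with a minimal sketch. It assumes the common Phred+33 FASTQ encoding and a simple "trim from the 3' end" rule; the function names and the threshold of 20 are illustrative choices, not the behavior of FastQC or Trimmomatic.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def trim_3prime(seq, quality_string, min_q=20, offset=33):
    """Drop bases from the 3' end until a base meets the quality threshold."""
    scores = phred_scores(quality_string, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality_string[:end]


# Toy read: eight high-quality bases followed by two low-quality bases.
seq = "ACGTACGTAC"
qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(phred_scores(qual))
print(trimmed_seq)
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000, which is why thresholds in that range are common starting points.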

Assembly Approaches

De Novo Assembly

Reconstructing a genome entirely from scratch without using a pre-existing reference genome.

Useful for:
- Newly characterized organisms
- Capturing large-scale structural variation

Algorithms

Overlap-Layout-Consensus (OLC):
- Historically used for Sanger reads
- Finds overlapping regions between reads
- Builds a graph of these overlaps
- Generates a consensus sequence (contig)
- Computationally demanding
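The overlap and consensus steps above can be sketched on toy reads. This is a minimal illustration that assumes error-free reads and exact suffix-prefix matches; the function names and the `min_length` parameter are hypothetical, and real OLC assemblers use approximate alignment and far more careful layout logic.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_length=3):
    """Map each ordered read pair with an overlap to the overlap length."""
    edges = {}
    for a, b in permutations(reads, 2):
        olen = overlap(a, b, min_length)
        if olen > 0:
            edges[(a, b)] = olen
    return edges


def merge(a, b, olen):
    """Join two reads, writing the shared overlap only once."""
    return a + b[olen:]


reads = ["ACGTTGCA", "TGCATTAG", "TTAGGC"]
graph = overlap_graph(reads)
contig = merge(merge(reads[0], reads[1], 4), reads[2], 4)
print(graph)
print(contig)
```

Every ordered pair of reads is compared, so the work grows roughly with the square of the number of reads. This is one reason the approach is computationally demanding.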
De Bruijn Graphs:
- More efficient for massive parallel short-read data
- Reads are broken into smaller “k-mers” (short words of DNA sequence)
- k-mers form nodes in a graph
- Edges connect k-mers that overlap by k-1 bases
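The k-mer graph described above can be built in a few lines. This is a minimal sketch assuming error-free reads and a tiny `k`; the function names are illustrative, and production assemblers add error correction, coverage tracking, and graph simplification.

```python
def kmers(read, k):
    """All overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Nodes are k-mers; an edge joins consecutive k-mers, which share k-1 bases."""
    graph = {}
    for read in reads:
        words = kmers(read, k)
        for word in words:
            graph.setdefault(word, set())
        for left, right in zip(words, words[1:]):
            graph[left].add(right)
    return graph


def walk_unbranched(graph, start):
    """Follow edges while each node has exactly one successor, spelling a contig."""
    contig, node = start, start
    visited = {start}
    while len(graph[node]) == 1:
        (node,) = graph[node]
        if node in visited:  # stop on a cycle, e.g. a repeat
            break
        visited.add(node)
        contig += node[-1]
    return contig


reads = ["ACGTC", "CGTCA"]
graph = de_bruijn(reads, k=3)
print(graph)
print(walk_unbranched(graph, "ACG"))
```

Reads are never compared with each other directly: two reads that share a k-mer simply land on the same node. That is what makes this approach scale to very large numbers of short reads.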

Tools

- Velvet
- SPAdes
- Canu
- Flye
- MaSuRCA
- SMARTdenovo
- MECAT
- ABySS2
- MEGAHIT
- SOAPdenovo

Reference-Guided/Assisted Assembly

Aligning sequencing reads to a known, closely related reference genome.

Advantages:
- Simplifies the process
- Reduces computational requirements
Drawback:
- May introduce biases from the reference genome
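The core idea of placing reads on a reference can be sketched with a k-mer index and a mismatch check. This is a toy seed-and-verify scheme under simplifying assumptions (no indels, seed taken from the start of the read); it is not how any particular aligner is implemented, and all names and parameters here are illustrative.

```python
def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Seed with the read's first k-mer, then verify by counting mismatches."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "GATTACAGATTTCAGGA"
index = build_index(reference, k=4)
read = "GATTTCAG"
print(map_read(read, reference, index, k=4, max_mismatches=0))
print(map_read(read, reference, index, k=4, max_mismatches=1))
```

With one mismatch allowed, the read also matches a similar region at position 0. A real sample that differs from the reference can therefore be placed ambiguously or pulled toward the reference sequence, which is one source of the bias noted above.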

Tools

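- BWA
- Bowtie
- HISAT2 (splice-aware, typically used for RNA-seq reads)
- kallisto (pseudoalignment for transcript quantification rather than base-level alignment)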

Challenges in Assembly

Repetitive Sequences: Long stretches of repetitive DNA are difficult for assemblers to resolve, leading to fragmented assemblies or misassemblies.
Heterozygosity/Ploidy: Genomes with high heterozygosity (variations between two copies of a chromosome) or polyploidy (multiple sets of chromosomes) are challenging, as assemblers often collapse these differences into a single consensus sequence.
Gaps: Even with advanced methods, assemblies often contain gaps (regions where no reads could be unambiguously assembled).
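Gaps in a scaffolded assembly are conventionally written as runs of `N`. Under that assumption, a short sketch can locate and measure them; the function name and the toy scaffold are illustrative.

```python
import re


def find_gaps(sequence):
    """Return (start, length) for every run of N in a sequence (0-based start)."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[Nn]+", sequence)]


scaffold = "ACGTNNNNNACGTTNNACG"
gaps = find_gaps(scaffold)
gap_bases = sum(length for _, length in gaps)
print(gaps)
print(f"{gap_bases} of {len(scaffold)} bases are in gaps")
```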

Genome Closing

Producing a fully closed genome is challenging, especially with short reads.
Long-read platforms like PacBio and Nanopore are preferred for generating full-length sequences that enable complete genome assembly.
Additional techniques such as optical mapping (Bionano) and chromatin conformation capture (Hi-C) are recommended for validating assemblies and improving their contiguity up to chromosome level.

Genome Annotation

The process of identifying and labeling biologically meaningful features within an assembled DNA sequence.

Identifying Noncoding Regions

- Repetitive elements
- Transposable elements
- Noncoding RNAs (e.g., tRNAs, rRNAs)
- Resources include RepeatMasker (a tool for finding and masking repeats) and NONCODE (a database of noncoding RNAs)

Gene Prediction

Identifying potential protein-coding genes.
Open Reading Frames (ORFs)
- Stretches of DNA that can be translated into amino acids without encountering a stop codon
- There are six possible reading frames
- Typically only one will be long enough to be a candidate gene.
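A six-frame ORF scan can be sketched as follows. This minimal version makes assumptions beyond the definition above: it requires an `ATG` start codon, uses the three standard stop codons, and keeps only ORFs of at least `min_codons` codons. The function names and the threshold are illustrative, and real gene finders handle alternative start codons, introns, and sequence ambiguity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) of ATG...stop stretches in one reading frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=3):
    """Scan three frames on each strand; return (strand, frame, ORF sequence)."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, s[start:end]))
    return results


print(six_frame_orfs("CCATGAAACCCGGGTAAGG"))
```

In practice the minimum length is set much higher (often 100 codons or more) so that short ORFs arising by chance are not reported as candidate genes.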

Approaches

Ab Initio Prediction
- Computational models trained on known gene characteristics to predict genes and proteins without external evidence
Evidence-Based Prediction
- Aligning experimental data to the genome to identify gene structures
- RNA-seq can be used to identify protein-coding genes based on transcriptome data

Functional Annotation

Assigning biological function to the identified genes and other features.

Homology Search
- Comparing newly assembled sequences to databases of genes with known functions (e.g. with BLAST)
Gene Ontology (GO) Term Mapping
- Tools include AmiGO, Blast2GO, GO-FEAT, and eggNOG-Mapper
- DeepGOPlus uses deep learning to predict GO terms from protein sequences

Pathway Mapping
- Identifying biological pathways including genes of interest
Protein Databases
- Comparison with protein databases like UniProt and NCBI RefSeq

Annotation Tools and Platforms

Automatic Pipelines
- NCBI and Ensembl offer automatic genome annotation pipelines
- Ideal for beginners due to their flexibility and speed

Specialized Servers
- RAST (Rapid Annotation using Subsystem Technology) is commonly used for prokaryotic genomes, providing ORF finding and annotation
- BlastKOALA and GhostKOALA (from KEGG) are also used for annotation
- GhostKOALA specializes in metagenomes

Software Suites
- Used for downstream analyses like core genome phylogeny
- Example: Prokka, rapid prokaryotic genome annotation
Deep Learning in Annotation
- DeepAnnotator: genome annotation using deep neural networks to classify protein sequences into functional categories
- DeepGOPlus: predict GO terms from protein sequences

Challenges in Annotation

Manual vs. Automatic
- Manual annotation is time-consuming and expensive but often necessary for accuracy
- Semi-automatic approach integrating different results is often preferred to balance speed and reliability
- Automatic methods are fast and cheap, but their results are less reliable and often differ between tools

Outdated Annotations
- Gene annotations can become outdated, impacting downstream analyses like pathway enrichment
Non-Model Organisms
- Annotation is more challenging for species without extensive existing data or closely related reference genomes

Future Perspectives

Continuing Advancements

New technologies continue to reduce sequencing costs, and bioinformatics pipelines are constantly improving, making genome projects more accessible to smaller labs.

Deep Learning

Deep learning is a machine learning technique that uses artificial neural networks to learn from large amounts of biological data. It is increasingly applied across bioinformatics for tasks like genome annotation, gene function prediction, and protein structure prediction, and for some of these tasks it can give more accurate predictions than traditional methods.

- DeepVariant for variant calling
- DeepAnnotator for functional annotation of proteins
- DeepGOPlus for predicting protein function using Gene Ontology terms
- AlphaGenome for predicting the functional effects of DNA sequence variants