19  NGS analysis

Author: Dr Randy Johnson

Affiliation: Hood College

Published: September 3, 2025

Introduction

Today we will be looking at some data published by Tenaillon et al. (2016). In this paper, the authors analyze 264 complete genomes from 12 Escherichia coli populations over 50,000 generations to characterize the dynamics of their evolution. The experiment, known as the Long-Term Evolution Experiment (LTEE), has been running since 1988 in a defined, glucose-limited medium. There is an interesting video by Veritasium about this experiment if you would like to learn more.

The specific data files we’ll be analyzing were curated for teaching as part of the Data Carpentry Genomics Workshop. The full data can be found elsewhere, but these files are small enough to analyze in class.

Set up

FASTQ

The first file we’ll look at is a FASTQ file. Here are the first two reads:

@SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
TATTCTGCCATAATGAAATTCGCCACTTGTTAGTGT
+SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
CCCCCCCCCCCCCCC>CCCCC7CCCCCCACA?5A5<
@SRR097977.2 209DTAAXX_Lenski2_1_7:8:3:365:371 length=36
GGTTACTCTTTTAACCTTGATGTTTCGACGCTGTAT
+SRR097977.2 209DTAAXX_Lenski2_1_7:8:3:365:371 length=36
CC:?:CC:?CCCCC??C?:?C-&:C:,?<&*?+7?<

The first two lines of each record are similar to FASTA format:

  • “@” followed by a sequence ID and metadata
  • Then the nucleotide sequence

The next two lines are additional quality score information:

  • ‘+’ followed by the sequence ID and metadata
  • Quality scores, one character per base (see this page for score translation)
    • Remember: Q = -10 log10(p), where ‘p’ is the probability of an incorrect base call
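
The link between quality characters, Q, and p can be made concrete with a short Python sketch. It assumes the Phred+33 (Sanger) ASCII encoding, which is consistent with the quality strings shown above but is not stated in the file itself. The function names are illustrative, not part of any library.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores (Q).

    Assumption: Phred+33 encoding, i.e. Q = ASCII code - 33.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Invert Q = -10 * log10(p) to recover p, the chance of a wrong base call."""
    return 10 ** (-q / 10)


# Quality line of the first read shown above
quality = "CCCCCCCCCCCCCCC>CCCCC7CCCCCCACA?5A5<"
scores = phred_scores(quality)

# Show the last five bases, where the quality drops off
for ch, q in list(zip(quality, scores))[-5:]:
    print(f"{ch!r}: Q = {q}, p = {error_probability(q):.4f}")
```

Under this encoding, a score of 30 corresponds to a 1 in 1,000 chance that the base call is wrong, and a score of 20 to a 1 in 100 chance.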

Then we pick up with the next sequence. Let’s break down those metadata:

  • @: signals the start of a new sequence
  • SRR097977.1: the first read in sequencing run SRR097977 (SRR is the prefix for a Sequence Read Archive run accession)
  • 209DTAAXX_Lenski2_1_7: This portion of the header provides details about the sequencing experiment and sample
    • “209DTAAXX” is likely the flowcell identifier
    • “Lenski2” probably refers to a specific sample or population from the Lenski LTEE
    • “1_7” could denote the lane and index on the flowcell
  • 8:3:710:178: These numbers represent the coordinates of the read on the flowcell, which are useful for diagnostic purposes
    • 8 is the lane number
    • 3 is the tile number within the lane
    • 710 is the x-coordinate of the cluster
    • 178 is the y-coordinate of the cluster
  • length=36: This indicates that the sequencing read is 36 bases long. This information is redundant because the length can also be determined from the sequence line itself, but it is often included for convenience.
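
The breakdown above can be sketched as a small header parser in Python. The regular expression and the field names (`run`, `experiment`, `lane`, and so on) are assumptions made for illustration; they fit the header layout shown in this file, and headers from other instruments or software versions will likely need a different pattern.

```python
import re

# Hypothetical pattern for headers such as:
# @SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
HEADER_RE = re.compile(
    r"^@(?P<run>[^.\s]+)\.(?P<read_number>\d+)\s+"
    r"(?P<experiment>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:\s+length=(?P<length>\d+))?$"
)

NUMERIC_FIELDS = ("read_number", "lane", "tile", "x", "y", "length")


def parse_header(line):
    """Split a FASTQ header line into named fields (numbers become ints)."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized FASTQ header: {line!r}")
    fields = match.groupdict()
    for key in NUMERIC_FIELDS:
        if fields[key] is not None:  # length= is optional
            fields[key] = int(fields[key])
    return fields


header = "@SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36"
for name, value in parse_header(header).items():
    print(f"{name}: {value}")
```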

FastQC

SRR097977.fastq

  • Log into Galaxy
  • Upload SRR097977.fastq
  • Find and run the FastQC tool under the ‘tools’ menu. The resulting report contains the following modules:
    • Basic Statistics: Provides a high-level overview of the sequencing run, including the file name, file type, number of reads, and total sequence length.
    • Per Base Sequence Quality: A graph that shows the quality score distribution at each position in the read. High-quality data will have all box plots in the green zone, which corresponds to quality scores of 28 or higher.
    • Per Tile Sequence Quality: Identifies any quality problems that may be localized to specific areas of the flowcell, such as a tile with a low-quality score. This can indicate issues with the sequencing instrument.
    • Per Sequence Quality Scores: Shows the distribution of average quality scores over all reads. A good run will have a tight peak towards the high end of the quality scale.
    • Per Base Sequence Content: Plots the proportion of each nucleotide (A, C, G, T) at each position. In random libraries, the lines should be flat and close together. A deviation, especially at the start of the reads, can indicate issues like overrepresented sequences or adapter contamination.
    • Per Sequence GC Content: Displays the GC content distribution over the entire dataset. This should ideally form a normal distribution that matches the expected GC content of the organism being sequenced.
    • Per Base N Content: Tracks the percentage of bases that are identified as “N” (meaning the sequencer could not determine the base) at each position. A high percentage of Ns indicates a problem with the run.
    • Sequence Length Distribution: Shows the distribution of read lengths. For a run with reads of a fixed length, this will be a single peak.
    • Sequence Duplication Levels: Measures the percentage of reads that are identical. A high duplication level in a genomic DNA library can indicate PCR bias or a low diversity library.
    • Overrepresented Sequences: Lists sequences that appear with an unusually high frequency. This can be caused by contamination from sequencing adapters, primers, or other artifacts.
    • Adapter Content: Checks for the presence of common sequencing adapters. A good run will have a low or zero percentage of adapter sequences.
    • K-mer Content: Reports on short, overrepresented sequences of a specified length (k-mers). A high level of a specific k-mer can indicate contamination or a bias in the library preparation.
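
To get a feel for what a few of these modules measure, the following Python sketch computes simplified versions of three of them (per-base quality, per-sequence GC content, and per-base N content) for the two reads shown earlier. It is a rough approximation under stated assumptions, not FastQC's actual implementation: it assumes Phred+33 quality encoding, and all function names are made up for this example.

```python
from statistics import mean


def columns(strings):
    """Group characters by read position; reads may differ in length."""
    cols = []
    for s in strings:
        for i, ch in enumerate(s):
            if i == len(cols):
                cols.append([])
            cols[i].append(ch)
    return cols


def per_base_mean_quality(qualities, offset=33):
    """Mean Phred score at each position (assumes Phred+33 encoding)."""
    return [mean(ord(ch) - offset for ch in col) for col in columns(qualities)]


def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) in one read that are G or C."""
    called = [b for b in sequence.upper() if b in "ACGT"]
    if not called:
        return 0.0
    return sum(b in "GC" for b in called) / len(called)


def per_base_n_percent(sequences):
    """Percentage of reads with an uncalled base (N) at each position."""
    return [100 * sum(ch in "Nn" for ch in col) / len(col)
            for col in columns(sequences)]


# The two reads shown in the FASTQ section
reads = [
    ("TATTCTGCCATAATGAAATTCGCCACTTGTTAGTGT",
     "CCCCCCCCCCCCCCC>CCCCC7CCCCCCACA?5A5<"),
    ("GGTTACTCTTTTAACCTTGATGTTTCGACGCTGTAT",
     "CC:?:CC:?CCCCC??C?:?C-&:C:,?<&*?+7?<"),
]
sequences = [seq for seq, _ in reads]
qualities = [qual for _, qual in reads]

print("Per-base mean quality:", per_base_mean_quality(qualities))
print("GC fraction per read:", [round(gc_fraction(s), 3) for s in sequences])
print("Per-base N percent:", per_base_n_percent(sequences))
```

With only two reads the numbers are not meaningful on their own; the point is to show which quantity each plot summarizes across the whole file.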

SRR2584863_1.trim.sub.fastq

  • Visually inspect the first few lines of this file.
  • What differences do you see in this set?
  • Run FastQC on SRR2584863_1.trim.sub.fastq
    • What additional differences do you see?
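
If you prefer to inspect the first few records programmatically rather than in a text editor, a minimal Python sketch follows. It assumes every record occupies exactly four lines (no wrapped sequences and no blank lines), and the helper names `read_fastq` and `first_records` are hypothetical. The demonstration uses the two reads from SRR097977.fastq shown earlier so that it runs without any file on disk.

```python
import io
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, separator, quality) for each four-line record.

    Assumption: records are exactly four lines each, with no blank lines.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    while True:
        record = tuple(islice(stripped, 4))
        if not record:
            return  # clean end of input
        if len(record) < 4:
            raise ValueError("Truncated FASTQ record")
        if not record[0].startswith("@") or not record[2].startswith("+"):
            raise ValueError(f"Malformed FASTQ record: {record[0]!r}")
        yield record


def first_records(lines, n=2):
    """Return the first n records from an iterable of lines (e.g. an open file)."""
    return list(islice(read_fastq(lines), n))


EXAMPLE = """\
@SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
TATTCTGCCATAATGAAATTCGCCACTTGTTAGTGT
+SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
CCCCCCCCCCCCCCC>CCCCC7CCCCCCACA?5A5<
@SRR097977.2 209DTAAXX_Lenski2_1_7:8:3:365:371 length=36
GGTTACTCTTTTAACCTTGATGTTTCGACGCTGTAT
+SRR097977.2 209DTAAXX_Lenski2_1_7:8:3:365:371 length=36
CC:?:CC:?CCCCC??C?:?C-&:C:,?<&*?+7?<
"""

for header, sequence, _, quality in first_records(io.StringIO(EXAMPLE), n=2):
    print(header)
    print(f"  bases: {len(sequence)}, quality characters: {len(quality)}")
```

To look at the exercise file instead, pass an open file handle (for example, one obtained from `open("SRR2584863_1.trim.sub.fastq")`) in place of the `io.StringIO` object.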

References

Tenaillon, Olivier, Jeffrey E. Barrick, Noah Ribeck, Daniel E. Deatherage, Jeffrey L. Blanchard, Aurko Dasgupta, Gabriel C. Wu, et al. 2016. “Tempo and Mode of Genome Evolution in a 50,000-Generation Experiment.” Nature 536 (7615): 165–70. https://doi.org/10.1038/nature18959.