27 Final Review
Acknowledgements
The draft version of these notes was an AI-generated summary of the weekly lecture notes covering the material that will appear on the final exam.
Week 8 Review: NGS Assembly & Annotation
This week focused on the critical computational steps necessary to convert raw NGS reads—millions to billions of short or long DNA fragments—into biologically meaningful information: Genome Assembly and Genome Annotation.
Initial Data Processing
| Step | Purpose | Key Tools/Concepts |
|---|---|---|
| Base Calling | Converts raw sequencing signals into the A, T, C, G bases. | Assigns a Phred score (quality score) to each base. |
| Trimming & QC | Removes low-quality bases, adapter sequences, and filters out bad reads to reduce errors. | FastQC (Quality Control), Trimmomatic (Trimming). |
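The Phred score encodes the estimated probability \(P\) that a base call is wrong via \(Q = -10 \log_{10} P\). A minimal sketch of that conversion (the example scores are illustrative):

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into the probability the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert an error probability back into a Phred score."""
    return -10 * math.log10(p)

# Q20 -> 1% error, Q30 -> 0.1% error
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability = {phred_to_error_prob(q):.4%}")
```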
Genome Assembly
The process of piecing together short reads to reconstruct the complete DNA sequence (contigs and scaffolds) of the organism.
De Novo Assembly
- Definition: Reconstructing a genome without a pre-existing reference. Essential for new organisms or capturing large structural variations.
Reference-Guided Assembly
Definition: Aligning reads to a known, closely related reference genome.
Trade-off: Simplifies the process and reduces computation, but may introduce bias from the reference.
Assembly Challenges
Repetitive Sequences: Difficult to resolve, leading to fragmentation or misassemblies.
Heterozygosity/Ploidy: Differences between chromosome copies can be collapsed into a single, less accurate consensus.
Gaps: Regions with ambiguous or missing reads.
Genome Closing & Validation
Achieving a fully complete (closed) genome often requires:
Long-Read Platforms: PacBio and Nanopore reads can span repetitive regions.
Validation Techniques: Optical mapping (BioNano) and chromatin association (Hi-C) improve contiguity and integrity, helping move assemblies toward chromosome-level resolution.
Genome Annotation
The process of identifying and labeling the functional elements within the assembled DNA sequence.
Feature Identification
Noncoding Regions: Identifying repetitive elements, transposable elements, and noncoding RNAs.
Gene Prediction: Identifying potential protein-coding genes based on Open Reading Frames (ORFs), stretches of in-frame codons that begin with a start codon and run uninterrupted to the next stop codon (see the sketch after this list).
Ab Initio Prediction: Uses computational models trained on known gene characteristics.
Evidence-Based Prediction: Uses experimental data (e.g. RNA-seq) to confirm gene structures.
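As a toy illustration of ORF-based gene prediction (not any particular annotation tool), the sketch below scans the forward strand for in-frame ATG-to-stop stretches; the minimum-length cutoff is an arbitrary assumption:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_len: int = 300):
    """Return (start, end, frame) for ORFs on the forward strand.

    An ORF is taken here as an ATG followed by in-frame codons up to and
    including the first stop codon; min_len is in nucleotides.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if pos + 3 - start >= min_len:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs

# Example: one short ORF (ATG ... TAA) in frame 0
print(find_orfs("ATGAAATTTGGGTAA", min_len=6))
```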
Functional Annotation
Assigning biological context and function to the identified features:
Homology Search: Comparing sequences to known genes/proteins using tools like BLAST.
Gene Ontology (GO) Term Mapping: Assigning standardized, structured terms describing function, process, and location.
Pathway Mapping: Linking genes to known biological pathways.
Protein Databases: Comparison against databases like UniProt and NCBI RefSeq.
Annotation Challenges
Balancing Manual Accuracy (time-consuming, expensive) with Automatic Speed (fast, cheap, but less reliable).
Annotations for non-model organisms and outdated annotations also pose difficulties.
Week 9 Review: Bulk RNA-seq
This week covered Transcriptomics—the study of all RNA transcripts (the transcriptome) in a cell at a given time—and the primary modern method for this analysis, Bulk RNA Sequencing (RNA-seq).
The Basis of Transcriptomics
Central Dogma: Information flows from DNA \(\rightarrow\) RNA \(\rightarrow\) Protein. Measuring mRNA levels is a key way to estimate gene activity.
Relevance: Transcriptomics reveals the regulation status of all genes simultaneously, providing insight into developmental stages, disease, or physiological conditions (e.g. antibiotic resistance mechanisms).
Prokaryotic vs. Eukaryotic Genes:
Prokaryotic: Often in operons, transcribed into polycistronic mRNA (one mRNA codes for multiple genes).
Eukaryotic: Transcribed as single gene units; genes contain introns (removed by splicing) and exons (retained in the mature mRNA).
Evolution of Gene Expression Analysis
Traditional methods were limited because they relied on specific probes, meaning “you only find what you’re looking for.”
RNA Sequencing (RNA-seq)
RNA-seq uses high-throughput sequencing to provide an unbiased, genome-wide view of transcription, replacing hybridization techniques.
RNA-seq Applications
Differential Gene Expression (DGE)
Transcript discovery (e.g. lncRNAs, microRNAs)
Differential alternative splicing and isoform detection
Allele-specific expression and RNA editing
Genome annotation
Experimental Design Considerations
| Factor | Considerations |
|---|---|
| Biological Replicates | Required (at least 3) to achieve statistical power and distinguish true differences from random variation. |
| Sequencing Strategy | Paired-end improves mapping accuracy, especially for low-expressed genes. Single-end is less expensive. |
| Read Depth/Coverage | Higher depth ensures better detection of lowly expressed genes. More replicates can sometimes be preferred over higher depth. |
| Platforms | Illumina (short-read) is common for bulk RNA-seq. 10x Genomics enables single-cell RNA-seq. Long-read platforms detect alternative splicing. |
Preparing an RNA-seq Library (Example: Tru-seq Protocol)
The main goal is to isolate mRNA, convert it to cDNA, and prepare it for sequencing.
RNA QC: Start with high-quality total RNA.
rRNA Depletion: rRNA must be removed or avoided, as it constitutes the majority of total RNA.
Poly-A Selection: Used for eukaryotic mRNA (which have poly(A) tails).
Ribo-depletion: Used for whole RNA sequencing or prokaryotic RNA.
Fragmentation: RNA is broken into smaller pieces.
cDNA Conversion: RNA is converted into complementary DNA (cDNA).
Adapter/Barcode Addition: Sequencing adapters and unique barcodes (indexes) are added for multiplexing.
PCR Amplification and final Quality Control.
Data Analysis Workflow
QC, Mapping, and Counting
QC & Trimming: Use FastQC to check quality, Trimmomatic or Cutadapt to trim low-quality bases and adapters, and re-run QC.
Mapping: Align reads to the genome/reference using aligners like STAR.
Processing: Convert output (e.g. to BAM file) using samtools, then sort and index.
Quantification: Use tools like Subread (featureCounts) to generate a count matrix of integer counts per gene per sample.
Expression Quantification and DGE
Normalization: Required to account for sequencing depth bias and compositional bias.
Differential Gene Expression (DGE): Quantifies differences in gene expression between groups (e.g. case vs. control).
- Statistical Analysis: Programs like DESeq2 and EdgeR (from Bioconductor) are commonly used.
Multiple Hypothesis Testing: Necessary because thousands of statistical tests are performed simultaneously.
- Benjamini–Hochberg (FDR) and Bonferroni are common corrections for controlling false positives; limma's eBayes moderation is applied beforehand to stabilize gene-wise variance estimates (see the sketch below).
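A minimal sketch of the Benjamini–Hochberg procedure applied to a vector of raw p-values; the p-values are invented, and in practice DESeq2, edgeR, or limma report these adjusted values directly:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values (FDR) for an array of raw p-values."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)                        # ascending p-values
    ranked = p[order] * n / np.arange(1, n + 1)  # p * n / rank
    # enforce monotonicity from the largest p-value downwards
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)
    return out

raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(raw).round(3))
```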
Functional Annotation & Visualization
Functional Annotation: Uses Gene Ontology (GO) terms to relate genes to biological function, process, or location.
Enrichment Analysis: Identifies over-represented gene sets that share common functions or pathways.
Dimensionality Reduction: Techniques to visualize global patterns and clustering in the data:
PCA (Principal Component Analysis): Assumes linear relationships.
tSNE / UMAP: Allow for non-linear relationships, typically used for massive datasets like scRNA-seq.
Other Visualization: Heatmaps, Boxplots, and network visualization tools like Cytoscape for integrating pathways and protein interaction data.
Challenges: Annotation difficulties persist because of outdated databases, the lack of common data formats, and the many-to-many mappings between protein and gene identifiers, often necessitating manual review for accuracy.
Week 10 Review: Differential Gene Expression (DGE)
This week detailed the standard bioinformatics workflow for performing Differential Gene Expression (DGE) analysis on RNA-seq count data, primarily utilizing the Bioconductor packages edgeR, limma, and Glimma.
Workflow Overview and Key Tools
The DGE pipeline is a sequential process that transforms raw counts into statistically significant results.
| Section | Description |
|---|---|
| Data Wrangling | Importing raw counts and sample metadata. |
| Pre-processing | Removing low-expression genes and normalizing data. |
| Exploratory Analysis | Visualizing sample relationships (clustering). |
| Differential Expression | Modeling mean-variance relationship and fitting linear models. |
| Interpretation | Summarizing and visualizing significant results. |
Data Wrangling and Pre-processing
Data Wrangling
Raw gene counts (genes \(\times\) samples) and associated sample metadata are imported.
Data are stored in a data object (a DGEList in the edgeR/limma workflow), which holds the counts and sample metadata together.
Data Pre-processing
Filtering: Genes with consistently low expression across all samples are filtered out to increase statistical power and reduce the multiple testing burden.
Normalization: Required to account for:
Sequencing Depth Bias: Differences in total read counts (library size) between samples.
Compositional Biases: Highly expressed genes skewing the total count, making other genes appear artificially lower.
Trimmed Mean of M-values (TMM): The preferred normalization method; calculates scale factors so that the log-fold-changes between samples are, on average, centered at zero.
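A simplified sketch of both corrections, assuming a genes-by-samples count matrix: counts-per-million for sequencing depth, and a stripped-down TMM-style factor that trims only on M-values (it omits edgeR's A-value trimming and precision weighting, so it approximates rather than reproduces calcNormFactors):

```python
import numpy as np

def cpm(counts):
    """Counts-per-million: corrects for library-size (sequencing depth) differences."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0) * 1e6   # genes x samples

def tmm_factor(sample, ref, trim=0.3):
    """Simplified TMM: trimmed mean of per-gene log2 ratios (M-values) vs. a reference."""
    s, r = np.asarray(sample, float), np.asarray(ref, float)
    keep = (s > 0) & (r > 0)                   # drop genes with zero counts
    m = np.log2((s[keep] / s.sum()) / (r[keep] / r.sum()))
    m.sort()
    k = int(len(m) * trim)
    trimmed = m[k:len(m) - k] if len(m) > 2 * k else m
    return 2 ** trimmed.mean()                 # ~1 when composition is balanced

counts = np.array([[100, 300], [50, 40], [10, 900]])  # 3 genes x 2 samples (toy)
print(cpm(counts).round(1))
print(round(tmm_factor(counts[:, 1], counts[:, 0]), 3))
```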
Exploratory Analysis and Design
Multi-Dimensional Scaling (MDS) Plot: A visualization (similar to PCA) used to check the overall relationships between samples.
Expected Result: Samples should cluster tightly by biological group.
Warning Sign: Samples clustering by a non-biological variable (like Batch/Lane) indicates a potential batch effect that needs to be modeled.
Design Matrix: A matrix that codes the variables of interest (experimental groups) and any confounding factors (e.g. Batch) that must be included in the linear model.
Contrasts: Define the specific pairwise comparisons of interest (e.g. “LP vs. Basal”).
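A hedged numerical sketch of a design matrix with a lane (batch) covariate and an "LP vs. Basal" contrast; the sample labels are illustrative, and a real limma analysis would build these with model.matrix and makeContrasts in R:

```python
import pandas as pd

samples = pd.DataFrame({
    "group": ["Basal", "Basal", "LP", "LP", "ML", "ML"],   # hypothetical groups
    "lane":  ["L004", "L006", "L004", "L006", "L004", "L006"],  # batch variable
})

# One indicator column per group plus lane as a blocking factor (no intercept)
design = pd.concat(
    [pd.get_dummies(samples["group"]),
     pd.get_dummies(samples["lane"], drop_first=True)],
    axis=1,
).astype(int)
print(design)

# Contrast "LP vs. Basal": +1 on LP, -1 on Basal, 0 elsewhere
contrast = pd.Series(0, index=design.columns)
contrast[["LP", "Basal"]] = [1, -1]
print(contrast)
```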
Differential Expression Modeling
RNA-seq count data is typically heteroscedastic, meaning a gene’s variance depends on its mean (higher mean \(\rightarrow\) higher variance). Since standard linear models assume constant variance (homoscedasticity), a transformation is necessary.
voom: models the mean-variance relationship of the log-counts and computes precision weights, making the data suitable for linear modeling.
lmFit: fits a linear model for every gene in the dataset using the design matrix and the voom weights.
eBayes: empirical Bayes moderation borrows information across all genes to moderate the gene-wise variance estimates, leading to more stable and reliable p-value calculations.
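To make the per-gene linear-model step concrete, here is a minimal weighted least-squares sketch in numpy; it only illustrates the idea of precision-weighted fitting, not limma's actual voom/lmFit/eBayes implementation, and the data are invented:

```python
import numpy as np

def fit_gene_wls(y, X, w):
    """Weighted least-squares fit for one gene.

    y: log-expression values across samples, X: design matrix,
    w: voom-style precision weights (larger = more trusted observation).
    Returns the estimated coefficients (here: intercept and group effect).
    """
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

X = np.array([[1, 0], [1, 0], [1, 1], [1, 1]])   # intercept + group indicator
y = np.array([5.0, 5.2, 7.1, 6.9])               # one gene's log-CPM values (toy)
w = np.array([1.0, 0.8, 1.2, 1.0])               # hypothetical precision weights
print(fit_gene_wls(y, X, w))                     # second coefficient ~ logFC
```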
Reviewing Results and Visualization
The results are sorted and summarized in a table reporting, for each gene:
logFC: Log-fold change (magnitude and direction of the expression change).
AveExpr: Average log-CPM expression across samples.
t: Moderated t-statistic.
P.Value: Raw p-value.
adj.P.Val (FDR): Adjusted p-value (controlling the False Discovery Rate) across the multiple tests.
Mean-Difference (MD) Plot: Plots the logFC against the average expression (AveExpr/log-CPM), showing the overall pattern of DGE and highlighting significant genes (see the sketch after this list).
Interactive MD Plot: Allows users to interactively explore and search for specific genes on the plot.
Heatmaps: Used to visualize the expression levels of the top differentially expressed genes across all samples, showing sample clustering and expression patterns.
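A small matplotlib sketch of a mean-difference (MD) plot; the expression values and the significance cutoff are simulated stand-ins for a real DGE result table:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
ave_expr = rng.uniform(0, 12, 2000)            # average log-CPM per gene (simulated)
logfc = rng.normal(0, 0.3, 2000)
logfc[:50] += rng.choice([-2, 2], 50)          # spike in some "DE" genes
significant = np.abs(logfc) > 1                # stand-in for an FDR-based cutoff

plt.scatter(ave_expr[~significant], logfc[~significant], s=4, c="grey", label="not significant")
plt.scatter(ave_expr[significant], logfc[significant], s=6, c="red", label="significant")
plt.axhline(0, lw=0.8)
plt.xlabel("Average expression (log-CPM)")
plt.ylabel("log-fold change")
plt.legend()
plt.title("Mean-difference (MD) plot")
plt.show()
```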
Week 12 Review: Proteomics and Mass Spectrometry
This week focused on Proteomics—the large-scale study, identification, and characterization of all proteins (the proteome) in a cell or organism—and the core technology that enables it, Mass Spectrometry (MS).
Introduction to Proteomics and Mass Spectrometry
Proteomics: The study of proteins, which are fundamental molecules involved in structure, metabolism, signaling, and gene regulation.
Why Study Proteins Directly? Transcriptome data (mRNA abundance) is often insufficient to reliably infer actual protein abundance due to complex post-transcriptional and post-translational regulatory mechanisms.
Mass Spectrometry (MS): A technology that revolutionized proteomics by offering high sensitivity. MS determines the mass-to-charge ratio (m/z) of ions.
MALDI-TOF: A common high-throughput approach in which samples are ionized and the time the ions take to travel through the flight tube (Time-of-Flight) is used to determine their mass-to-charge ratio.
Tandem Mass Spectrometry (MS/MS): Involves selecting peptides and fragmenting them by collision (MS2) to achieve the resolution needed for precise sequence determination.
Bottom-Up Proteomics Workflow
The most common strategy where intact proteins are broken down into smaller peptides prior to MS analysis.
Data Acquisition Steps
Proteolytic Digestion: Proteins are cleaved into peptides. Trypsin is the most common enzyme, cutting at the C-terminus of lysine or arginine (generally not when the next residue is proline); see the sketch after these steps.
Fractionation: Digested peptides are separated, often using Liquid Chromatography (LC), to reduce sample complexity and make lower-abundance peptides easier to detect.
MS Analysis: Mass spectrometer determines the m/z of the peptides and their fragments.
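A small sketch of the in silico tryptic digestion that database search engines perform when building theoretical peptide lists, assuming the common "cleave after K or R unless followed by P" rule and ignoring missed cleavages:

```python
def trypsin_digest(protein: str, min_len: int = 6):
    """Cleave after K or R unless the next residue is P; return peptides >= min_len."""
    protein = protein.upper()
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    peptides.append(protein[start:])           # C-terminal fragment (may be empty)
    return [p for p in peptides if len(p) >= min_len]

print(trypsin_digest("MKWVTFISLLLLFSSAYSRGVFRRDTHK", min_len=4))
```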
Peptide and Protein Identification
The goal is to determine the amino acid sequence from the fragmentation spectra.
Database Matching: The primary method; known protein sequences are digested in silico to create a target database, and observed MS/MS spectra are matched against the theoretical spectra of those peptides.
De Novo Sequencing: Directly generating short sequence tags from the MS/MS spectra without relying on a database.
False Discovery Rate (FDR): Controlled using the Target-Decoy Strategy, in which the search is also run against a decoy (reversed/shuffled) database; the decoy hit rate is used to set a score threshold that removes low-confidence identifications (see the sketch after this list).
Protein Inference Problem: Degenerate peptides (peptides shared by multiple proteins) result in ambiguity. Solutions include the Parsimonious Rule (using the smallest set of proteins to explain all detected peptides) and probabilistic models.
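A minimal sketch of target-decoy FDR estimation: at a given score cutoff, FDR is estimated as the number of decoy hits divided by the number of target hits above that cutoff. The scores below are invented:

```python
def target_decoy_fdr(target_scores, decoy_scores, cutoff):
    """Estimate FDR at a score cutoff as (#decoys >= cutoff) / (#targets >= cutoff)."""
    n_target = sum(s >= cutoff for s in target_scores)
    n_decoy = sum(s >= cutoff for s in decoy_scores)
    return n_decoy / n_target if n_target else 0.0

targets = [95, 88, 80, 74, 60, 55, 41, 30]   # hypothetical PSM scores vs. target DB
decoys  = [52, 40, 33, 25, 20, 18, 12, 9]    # scores vs. reversed/shuffled decoy DB

for cutoff in (30, 45, 60):
    print(f"cutoff {cutoff}: estimated FDR ~ {target_decoy_fdr(targets, decoys, cutoff):.2%}")
```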
Protein Quantification Strategies
No single pipeline exists due to upstream experimental variation.
| Method | Principle | Detection | Pros & Cons |
|---|---|---|---|
| Label-Free Quantification (LFQ) | Samples run separately. Quantification based on signal intensity or spectral counts of peptide precursor ions. | MS1 | Pros: Cheaper, quicker, more sensitive, no multiplexing limit. Cons: Higher run-to-run variability. |
| Labeled: MS1-based (Isotope) | Different samples tagged with distinct isotopes (e.g., heavy/light). Samples pooled before digestion. | MS1 | Double peak in MS1 provides abundance ratio. Co-elution prevents ratio compression. |
| Labeled: MS2-based (Isobaric/TMT) | Samples tagged with Tandem Mass Tags (TMT) (Reporter + Balance groups). Samples pooled before digestion. | MS2 | Pros: High multiplexing (up to 11 samples). Cons: Prone to ratio compression if biologically uninteresting peptides co-elute. |
Bioinformatics: Analysis and Interpretation
Data Preprocessing and Statistical Analysis
Preprocessing: Strict cutoffs for minimum peptide numbers must be applied for reliable protein inference.
Normalization: Essential for removing non-biological variations.
Missing Values: Common for low-concentration proteins due to stochastic sampling. Machine learning models are often used to impute missing data.
Statistical Analysis:
Common Tests: T-test (two groups) and ANOVA (two or more factors).
Small Sample Sizes: Empirical Bayes procedures are used to pool variance estimates, providing more robust results.
Multiple Hypothesis Testing: FDR must be controlled (e.g. using Benjamini–Hochberg) due to thousands of proteins being tested simultaneously.
Functional Enrichment Analysis
Relates identified proteins to biological function, process, and pathways.
Gene Ontology (GO) Enrichment: Uses a structured vocabulary to categorize proteins into three main categories: Biological Process, Molecular Function, and Cellular Component.
Pathway Analysis: Uses prior knowledge from databases like KEGG and Reactome to identify involved regulatory networks.
Protein Set Enrichment Analysis (PSEA): Used for sets of proteins that may share common characteristics or modifications.
PTM and PPI Databases: Used to curate information on post-translational modifications (PTMs) and protein-protein interaction (PPI) networks.
Future Directions
Machine Learning (ML/AI): Used for classification (e.g. disease subtypes), prediction (e.g. protein folding via AlphaFold, clinical outcomes), and unsupervised learning (inferring clusters/patterns).
Multiomics: Combining proteome data with genomics, transcriptomics, and metabolomics to gain a comprehensive view, often revealing differences between mRNA and protein levels (post-translational regulation).
Challenges: Precisely quantifying low abundance proteins and eliminating missing values remain difficult.
Week 13 Review: Structural Bioinformatics and Protein Modeling
This week explored Structural Bioinformatics, the specialized field focusing on understanding, modeling, and predicting protein structure, interactions, and function. The three-dimensional (3D) structure of a protein is fundamentally linked to its function.
Fundamentals of Protein Structure
Proteins are essential molecules, and their structure is described at four levels:
Primary structure: The unique linear sequence of amino acids.
Secondary structure: Local folding patterns, typically stabilized structures like \(\alpha\)-helices and \(\beta\)-sheets.
Tertiary structure (3D): The complete folding pattern of a single polypeptide chain.
Quaternary structure: The arrangement of multiple polypeptide chains (subunits) in a protein complex.
- Structure Quality and Validation: The Ramachandran plot is a key tool used to check the quality of protein structures by validating the observed vs. expected \(\phi\) (phi) and \(\psi\) (psi) backbone torsion angles.
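A hedged sketch of extracting the backbone \(\phi\)/\(\psi\) torsion angles needed for a Ramachandran plot with Biopython's Bio.PDB module; the PDB file name is a placeholder:

```python
import math
from Bio.PDB import PDBParser, PPBuilder

parser = PDBParser(QUIET=True)
structure = parser.get_structure("model", "example.pdb")  # placeholder file name

phi_psi = []
for peptide in PPBuilder().build_peptides(structure):
    for phi, psi in peptide.get_phi_psi_list():            # radians; None at chain ends
        if phi is not None and psi is not None:
            phi_psi.append((math.degrees(phi), math.degrees(psi)))

# Each (phi, psi) pair is one point on the Ramachandran plot; most points
# should fall in the allowed alpha-helix / beta-sheet regions.
print(phi_psi[:5])
```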
Protein Structure Databases and Visualization
| Structure Level | Database Example | Content/Determination |
|---|---|---|
| Primary | GenBank (NCBI) | Nucleotide sequences (flat file structure). |
| Secondary | UniProt (Universal Protein Resource) | Protein sequences, functional annotations, and structural details (Swiss-Prot is the high-quality, manually annotated section). |
| 3D (Tertiary/Quaternary) | Protein Data Bank (PDB) | Experimentally determined 3D structures of macromolecules (e.g. X-ray crystallography, NMR). |
- Visualization Tools: PyMOL is a widely used tool for protein modeling. Swiss-PdbViewer is used for analysis, including the superposition of multiple proteins and examining the Ramachandran plot.
Computational Protein Modeling Methods
Since experimental structure determination is costly and time-consuming, computational methods are essential for predicting the structure of most known proteins.
| Modeling Strategy | Definition | When is it used? |
|---|---|---|
| Template-Based Modeling (TBM) (Homology) | Builds a target protein’s structure based on the known 3D structure (template) of a homologous protein. | Most effective when target and template share at least 30–40% sequence similarity. |
| De Novo Modeling (Ab Initio) | Estimates the protein’s structure from scratch by calculating the most energetically favorable conformation based on chemical and physical principles. | When a suitable template structure is not available. |
TBM Prediction Process
Reference Identification: Identify similar sequences with resolved structures (e.g. using BLAST).
Template Selection: Choose the most suitable template(s).
Alignment: Align target and template using scoring matrices (e.g. BLOSUM, PAM) and gap penalties.
Construction: Transfer template coordinates to build the 3D model.
Model Validation: Verify quality by checking for errors like unsuitable torsion angles (Ramachandran plot).
Key Modeling Tools and Servers
SWISS-MODEL: Automated homology modeling server using HMMs to find templates.
Phyre2: Web-based program that employs homology modeling and ab initio modeling as a fallback.
I-TASSER: Hierarchical system combining threading (LOMETS) with assembly and refinement.
AlphaFold 2 (Google DeepMind): Employs deep learning to predict amino acid distances and angles in 3D space with high accuracy.
Modeling Interactions and Functional Prediction
Protein-Protein Interactions (PPI) and Dynamics
Docking Methods: Used to model PPIs and predict the 3D structure of protein complexes.
Rigid Body Docking: Assumes structures don’t change upon binding.
Energy-Based Docking: Uses force fields to estimate binding free energy.
Molecular Dynamics (MD) Simulations: Mimic the behavior and conformational changes of proteins over time.
Atomistic MD: Mimics individual atoms using classical force fields.
Coarse-Grained (CG) MD: Simplifies molecular representation to reduce complexity, allowing longer timescale simulations.
Interaction Databases:
STRING: Provides quality-controlled protein-protein association networks.
MINT: Repository of experimentally supported PPIs.
Cytoscape: Software used for visualizing and analyzing intricate PPI networks.
Functional Annotation and Pathway Analysis
Gene Ontology (GO): Standardized vocabulary to describe function: Cellular Component, Biological Process, and Molecular Function.
InterPro: Combines data from multiple sources (Pfam, PROSITE) to predict protein domains, families, and functional sites.
KEGG: Database providing detailed pathway maps that integrate genetic, functional, and chemical domains.
ScanNet: A geometric deep-learning model that learns features directly from protein structures to predict functional sites.
Future Directions
The field is rapidly advancing, driven heavily by AI/Deep Learning (DL), which is capable of integrating information across molecular and systemic levels. The ultimate goal is Multiomics integration—combining proteome, genome, and transcriptome data—for a more complete understanding of biological processes.
Week 14 Review: Metabolomics and Pathway Analysis
This week introduced Metabolomics, the study and profiling of all small molecules (metabolites) within a biological system. The goal is to understand how cellular biochemistry is linked to conditions like disease status, genetic background, or environmental exposure.
Profiling Methods and Identification Challenges
Profiling Methods
Mass Spectrometry (MS): The most common method, often paired with separation techniques (e.g., UPLC-MS/MS, CE-TOF MS). MS-based approaches typically detect a larger number of compounds than standard NMR.
Nuclear Magnetic Resonance (NMR): Another key profiling technique.
Data Acquisition and Identification Uncertainty
Metabolite identification is a significant bottleneck. Untargeted Metabolomics annotates compounds based on physicochemical properties and spectral database similarity, leading to high uncertainty.
The Metabolomics Standards Initiative (MSI) proposes four levels of identification:
| Level | Description | Confidence |
|---|---|---|
| Level 1 | Identified using an authentic chemical standard. | Highest |
| Level 2 | Putatively identified using spectral databases. | Moderate |
| Level 3 | Probable/uncertain annotation. | Low |
| Level 4 | Unknown compound. | Lowest |
- Assay Bias: Each analytical platform (e.g., UPLC-MS/MS) introduces chemical bias, favoring the detection of compounds with specific properties (e.g., fatty acids, glycans), thereby sampling only limited areas of the metabolic network.
Data Preparation
Raw metabolite abundance matrices require post-processing:
Imputation of missing values (e.g. using the minimum value divided by 2).
Log transformation to stabilize variance.
Auto-scaling (subtracting the mean and dividing by the standard deviation).
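A minimal pandas/numpy sketch of those three preparation steps on a toy abundance matrix, assuming the imputation uses each metabolite's own minimum observed value:

```python
import numpy as np
import pandas as pd

# Toy metabolite abundance matrix: rows = metabolites, columns = samples
raw = pd.DataFrame(
    {"s1": [1200.0, np.nan, 15.0], "s2": [900.0, 30.0, np.nan], "s3": [1500.0, 25.0, 12.0]},
    index=["citrate", "lactate", "serine"],
)

# 1. Impute missing values with (per-metabolite minimum) / 2
imputed = raw.apply(lambda row: row.fillna(row.min() / 2), axis=1)

# 2. Log-transform to stabilize variance
logged = np.log2(imputed)

# 3. Auto-scale each metabolite: subtract the mean, divide by the standard deviation
scaled = logged.sub(logged.mean(axis=1), axis=0).div(logged.std(axis=1), axis=0)
print(scaled.round(2))
```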
Pathway Analysis (PA) in Metabolomics
Pathway Analysis (PA) is essential for interpreting high-dimensional molecular data by grouping individual molecules into biologically functional units. PA finds associations between these pathways and specific phenotypes.
Pathway Analysis Methods
Over-representation Analysis (ORA): The most common PA approach. It identifies pathways that contain a statistically higher number of Differentially Abundant (DA) molecules than expected by chance.
Functional Class Scoring (FCS): e.g. Gene Set Enrichment Analysis (GSEA).
Topology-based methods: Network or graph analysis methods.
ORA Inputs
Pathway Collection: Obtained from databases like KEGG (Kyoto Encyclopedia of Genes and Genomes), Reactome, and BioCyc.
Differentially Abundant (DA) Metabolites: A list selected using a statistical threshold (e.g. p < 0.05 or specific log-fold change). Multiple testing correction (like Benjamini–Hochberg FDR or Bonferroni) must be applied.
Background/Reference Set: Contains all compounds realistically detectable by the assay (ideally, all identified compounds in the experiment).
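A minimal sketch of the ORA test for a single pathway using the hypergeometric distribution (scipy.stats.hypergeom); the counts are invented, and the background is assumed to be the assay-specific set of detected compounds:

```python
from scipy.stats import hypergeom

def ora_pvalue(n_background, n_pathway, n_da, n_da_in_pathway):
    """P(observing >= n_da_in_pathway pathway members among the DA metabolites by chance),
    given an assay-specific background of n_background detected compounds."""
    # sf(k - 1) gives P(X >= k) for the hypergeometric distribution
    return hypergeom.sf(n_da_in_pathway - 1, n_background, n_pathway, n_da)

# 500 detected compounds, 40 map to the pathway, 60 are differentially abundant,
# and 12 of those 60 fall in the pathway (all numbers hypothetical)
print(f"ORA p-value: {ora_pvalue(500, 40, 60, 12):.4g}")
```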
Challenges and Best Practices
ORA is highly sensitive to input parameters, which can drastically change results.
| Challenge | Impact on ORA | Best Practice |
|---|---|---|
| Background Set Selection | Using a generic, non-assay-specific background increases false-positive pathways. | Use an assay-specific background set (all metabolites identified in the assay). |
| DA Metabolite Selection | Arbitrary significance threshold impacts the number of significant pathways detected. | Apply rigorous multiple-testing correction for both DA metabolites and resulting pathways. |
| Metabolite Misidentification | Even low levels of misidentification can lead to Pathway Loss (false negatives) or Pathway Gain (false positives). | Improve ID quality (aim for MSI Level 1 or 2). |
| Database Incompleteness | Databases are constantly evolving, leading to potential inconsistencies across time. | Perform a Consensus approach using multiple databases (KEGG, Reactome, BioCyc). |
Systems Biology and Multi-Omics Integration
Systems Biology: Views biological interactions as a large network, moving beyond the simplistic “one gene, one protein, one function” principle.
Multi-omics Integration: Combining metabolomics with other omics data (genomics, transcriptomics, proteomics) provides a more comprehensive view of biological systems.
Discordant Trends: Pointwise comparisons between transcriptomics and proteomics/metabolomics often reveal discordant trends, which are key indications of significant post-transcriptional and post-translational regulation mechanisms.
Machine Learning/AI: Increasingly used in multi-omics integration to reconstruct interaction networks and link molecular changes directly to cellular processes.
Week 15 Review: Metagenomics
This week covered Metagenomics, the analysis of all genetic material (genomes and genes) recovered directly from environmental samples to characterize the entire microbial community (the microbiome).
Targeted Approach: 16S rRNA Amplicon Sequencing
This approach is quick and cost-effective, focusing on taxonomic characterization.
16S rRNA Gene
Function: Encodes a component of the small subunit of the prokaryotic ribosome; the gene is ~1,500 bp long.
Structure: Contains both highly conserved regions (used as primer targets) and hypervariable regions (V1-V9). These variable regions correlate strongly with taxonomy and are amplified during sequencing.
Bioinformatics Pipeline (Tools: QIIME2, Mothur)
QC & Pairing: Raw reads are trimmed, and forward/reverse reads are merged into contigs.
Artifact Removal: Chimeras (sequences merged from two origins) and host/non-target DNA (mitochondrial, chloroplast) are filtered out.
Grouping Reads: Reads are grouped into Operational Taxonomic Units (OTUs), typically based on a 97% sequence similarity threshold, which roughly corresponds to the species level.
Taxonomic Assignment: OTUs are aligned against specialized 16S rRNA gene sequence databases (e.g. SILVA, Greengenes) for identification.
Diversity Metrics
Alpha (\(\alpha\))-Diversity: Measures diversity within a single sample (compared across groups). Rarefaction analysis is used to check whether sequencing depth was sufficient (see the sketch after these metrics).
Beta (\(\beta\))-Diversity: Measures diversity/dissimilarity between sample groups. Principal Coordinates Analysis (PCoA) is a common dimensionality reduction technique used for visualization.
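A small sketch of two common alpha-diversity measures, observed OTU richness and the Shannon index \(H = -\sum_i p_i \ln p_i\), computed from one sample's OTU counts (the counts are invented):

```python
import numpy as np

def alpha_diversity(otu_counts):
    """Return (observed richness, Shannon index) for one sample's OTU counts."""
    counts = np.asarray(otu_counts, dtype=float)
    counts = counts[counts > 0]        # ignore OTUs absent from this sample
    p = counts / counts.sum()          # relative abundances
    richness = len(counts)
    shannon = -np.sum(p * np.log(p))
    return richness, shannon

sample = [120, 80, 40, 10, 5, 0, 0]    # hypothetical column of an OTU table
richness, shannon = alpha_diversity(sample)
print(f"Observed OTUs: {richness}, Shannon H: {shannon:.3f}")
```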
Comprehensive Approach: Full Shotgun DNA Metagenomics
This approach sequences all DNA in a sample, providing both taxonomic and functional characteristics.
Pipeline for Shotgun Data
Preprocessing & Filtering: Includes quality control, demultiplexing, and crucial removal of host DNA (by mapping raw reads to the host genome).
Assembly: Highly complex samples require specialized assemblers (e.g., metaSPAdes, MEGAHIT) to reconstruct fragments of individual genomes (known as Metagenome-Assembled Genomes, MAGs).
Annotation (Functional Assignment): Predicted Open Reading Frames (ORFs) (encoding proteins) are compared against functional databases (e.g. KEGG) to determine the functional potential of the community.
Single-Cell Techniques and Future Directions
Single-Cell Metagenomics: A rapidly developing field with no standardized methodologies. It aims to provide high granularity on individual cell function.
Challenges: Single-cell data (as in scRNA-seq) suffer from sparsity/dropout (a high fraction of observed zeros), which complicates statistical analysis.
Machine Learning/AI: Tools like MetaVelvet-DL use deep learning to improve the accuracy of de Bruijn graph partitioning, leading to better assembly and resolution of individual species.
Future: Focus is on improved sequencing platforms (Nanopore), enhanced computational tools (DL), and using Whole-Genome Sequencing (WGS) data for classification.