6 Sequence Alignment & Search
Acknowledgements
Gemini was used for brainstorming and organization while preparing these notes.
Review
Bioinformatics is an interdisciplinary field at the intersection of biology, mathematics, statistics and computer science.
Central dogma of molecular biology: DNA -> RNA -> Protein
Bioinformatics databases
- NCBI
- UniProt
- Protein Data Bank
- KEGG
Introduction
We have data, today we’ll discuss some ways of learning from those data.
- Sequence alignment
- Advanced database searches
Sequence Alignment Concepts
What is Sequence Alignment?
- Comparison of two or more biological sequences (DNA, RNA, or protein) to identify regions of similarity.
- These similarities may stem from functional, structural, or evolutionary relationships.
- Sequence alignments consider all nucleotides or amino acids, as well as deletions and insertions.
Why sequence alignment?
Identify homology and evolutionary relationships: This is the primary reason. We can meaningfully compare DNA and protein sequences if
- They are homologs - that is, they have a common ancestor
- They are the result of homoplasy - that is they share traits because of evolutionary pressures of a similar environment (Castoe et al. 2009; Castoe, De Koning, and Pollock 2010)
Predicting protein function and structure: By aligning a new sequence to known functional or structural homologs, we can infer its potential function or structure.
Designing Experiments: For example, identifying conserved regions is crucial for designing specific primers for PCR.
Key concepts and terminology
- Homology: a fundamental similarity based on common descent.1.
- It’s important to distinguish homology from mere sequence similarity, as two sequences can be similar by chance without being homologous.
Orthologs: Homologous genes (or proteins) that diverged following speciation events and typically retain the same function in different species
- Example: the hemoglobin gene in humans and mice
Paralogs: Homologous genes (or proteins) that diverged as a consequence of gene duplication within the same species
- In prokaryotes, multiple copies of a gene with the same function are usually not tolerated, so paralogs often evolve to have different functions.
Matches: Identical characters (nucleotides or amino acids) at corresponding positions in aligned sequences
Mismatches: Different characters at corresponding positions
Gaps / Indels: Insertions or deletions of characters in one sequence relative to another
- Indels are indicated in alignments by hyphens (-)
Scoring matrices / substitution matrices: Tables that assign numerical scores to matches and mismatches based on how likely one character is to substitute for another over evolutionary time
- BLOSUM (BLOcks SUbstitution Matrix) and PAM (Percent Accepted Mutation) are common scoring matrices used for protein alignment.
- BLOSUM matrices are based on local alignments of conserved protein blocks.
- PAM matrices are based on observed mutations over a given evolutionary distance.
Gap Penalties: Scores subtracted from the total alignment score when introducing or extending a gap
- An insertion or deletion event is generally less frequent than a single point mutation.
Pairwise sequence alignment
Comparison of two sequences to determine their degree of similarity can help infer potential evolutionary or functional relationships.
Dynamic programming algorithms
These methods guarantee finding the mathematically optimal alignment given a specific scoring system.
- Needleman-Wunsch Algorithm: Used for global alignments
- Aligns sequences across their entire length
- Suitable when you expect the two sequences to be homologous over their full extent
- Example: Comparing two bacterial 16S rRNA genes from closely related species, where the entire gene is expected to have evolved from a common ancestor without large rearrangements
- Smith-Waterman Algorithm: Used for local alignments
- Identifies the most similar regions within two sequences, even if the overall sequences are very different.
- Example: Finding a conserved protein domain within two otherwise divergent proteins
- Example: Identifying a short gene fragment within a larger genomic sequence.
Heuristic algorithms
While dynamic programming is exhaustive and guarantees optimality, it is computationally very intensive for long sequences and impractical for searching large databases.
- Greedy algorithms: Pick the best local choice at each step and hope for a good overall solution.
- Example: BLAST is a multi-step algorithm that has greedy components.
- Simulated annealing: Start with a (usually random) solution and improve by making small changes.
- Used for optimization problems
- Will occasionally accept a worse solution to avoid local optima
- Example: This is an efficient approach used in multiple sequence alignment (MSA)
Searching databases with BLAST
BLAST (Basic Local Alignment Search Tool) is arguably the most frequently used bioinformatics program.
Purpose: Rapidly compare a query sequence against a large database of sequences to find regions of local similarity.
BLAST is a heuristic algorithm based on local alignments
- Uses the k-tuple/word method to achieve its speed
- Does not guarantee the mathematically optimal alignment, but it provides a very good approximation quickly.
BLAST overview
- Word list compilation
- Query sequence is broken down into small, overlapping “words” (k-tuples) of a specified length
- Database search
- The target database is quickly scanned for these words for exact or near-exact matches
- High-Scoring Pair (HSP) extension
- Matches are extended in both directions from that “seed” to find an HSP.
- Penalties are added for mismatches, allowing some differences.
- The extension stops when the score of the alignment drops below a certain threshold.
- Statistical significance:
- Statistical significance of these HSPs is evaluated to measure the likelihood of true biological relationships (as opposed to random matches).
BLAST implementations

- BLASTn: Compares a nucleotide query sequence against a nucleotide sequence database. *Example: Checking if your newly sequenced bacterial gene is present in GenBank
- Example: Identifying a microorganism based on its 16S rRNA gene sequence
- BLASTp: Compares a protein query sequence against a protein sequence database.
- Example: Finding homologous proteins in other organisms for a newly predicted protein sequence
- BLASTx: Translates a nucleotide query sequence into all six possible reading frames and then compares these translated protein sequences against a protein sequence database.
- Example: Identifying potential protein products and their functions from an unannotated DNA sequence.
- tBLASTn: Compares a protein query sequence against a dynamically translated nucleotide sequence database in all six frames.
- Example: Finding potential genes in an unannotated genome that might encode a known protein.
Interpretation
- E-value: Indicates the number of alignments with scores as good as or better than the observed score that are expected to occur by chance in a database of a given size
- A smaller E-value indicates a more statistically significant match and a higher likelihood of true homology.
- E-values less than 0.001 or 1e-05 are typically considered significant.
- Score: Represents the raw alignment score, normalized to be independent of the scoring matrix and database size.
- Higher scores indicate better alignments.
- Percent Identity: Percentage of identical characters between the query and subject sequences within the aligned region
BLAST Output Sections
- Graphic Summary: visual representation of hits
- Descriptions: summary table of hits
- Alignments: detailed pairwise alignments between your query and each subject sequence
- When performing a protein search, BLAST will search one sequence per representative cluster.
- Clusters typically have with >90% sequence similarity across the cluster.
- Cluster composition lists the number of sequences and species in the cluster
A common strategy to identify orthologous genes between two species is the Reciprocal Best Hits (RBH) approach. This helps filter out paralogs which might be highly similar but don’t represent the direct evolutionary counterpart.
- If gene A in species 1 is the best BLAST hit for gene B in species 2
- AND gene B in species 2 is the best BLAST hit for gene A in species 1
- then A and B are likely orthologs.
Multiple Sequence Alignment (MSA)
While pairwise alignment compares two sequences, Multiple Sequence Alignment (MSA) simultaneously aligns three or more biological sequences.
Applications
- Identify conserved regions: Reveals patterns of conservation (e.g., active sites, structural motifs, regulatory elements) across a gene or protein family.
- These conserved regions often indicate functional importance.
- Infer evolutionary relationships (phylogeny): MSAs are a prerequisite for constructing robust phylogenetic trees
- Phylogenetic trees illustrate the evolutionary history and relationships among a set of sequences.
- This is especially important in microbiology for classifying microorganisms based on their 16S rRNA genes.
- Improve gene prediction and functional annotation: Conserved regions in an MSA can help improve the accuracy of predicting gene structures and assigning functions to new sequences.
Challenges
MSAs are computationally more intensive than pairwise alignments, especially as the number and length of sequences increase.
Common algorithmic approach
Most MSA tools use a progressive alignment strategy, which involves building the alignment iteratively:
Perform all pairwise alignments among the sequences.
Construct a “guide tree” (e.g., using Neighbor-Joining) based on these pairwise similarities, reflecting approximate evolutionary relationships.
Align the most similar sequences first, then progressively add more divergent sequences or groups of sequences according to the guide tree.
Clustal
One of the most widely known and used progressive alignment tools.
ClustalW and ClustalX: Popular desktop versions with graphical user interfaces.
ClustalOmega: A newer, faster, and more scalable version for larger datasets.
Other Popular Tools:
There are other widely used tools that often provide faster or more accurate alignments, especially for large and diverse datasets.
- MUSCLE (MUltiple Sequence Comparison by Log-Expectation)
- MAFFT
Visualizing MSAs
Software like BIOEDIT or web tools (e.g., from EBI) allow you to view MSAs, highlighting conserved residues, mismatches, and gaps.
References
https://www.dictionary.com/browse/homology↩︎