13  Structural bioinformatics and protein modeling

Author: Dr Randy Johnson

Affiliation: Hood College

Published: October 22, 2025

Acknowledgments

Gemini and NotebookLM were used to find and summarize reference material included in these notes.

Introduction

  • Structural Bioinformatics specifically focuses on
    • Understanding protein structure

    • Modeling interactions

    • Exploring structure-function relationships

Protein structure

  • Proteins are essential molecules performing vital roles

    • Providing structural scaffolding
    • Participating in metabolism
    • Biosignaling
    • Gene regulation
  • The three-dimensional (3D) structure of a protein is fundamentally linked to its function

  • Understanding the structure provides insight into protein function

Protein Structure Levels

  • Primary structure
    • The unique sequence of amino acids
  • Secondary structure
    • Local folding patterns, typically comprising \(\alpha\)-helices and \(\beta\)-sheets
    • Loop and turn regions link these structures
  • 3D structure (tertiary/quaternary)
    • Complete folding pattern of the protein
    • Essential for analyzing active sites and investigating molecular interactions

Image courtesy of the Cleveland Clinic

Protein Structure Databases and Visualization

Primary structure databases

  • GenBank (Genetic Sequence Databank)
    • Nucleotide sequences with a flat file structure
    • Files are ASCII text files that can be read by humans and computers
    • Contain data such as accession numbers, gene names, and phylogenetic classification
    • GenBank is maintained by the NCBI
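
Because GenBank flat files are plain text with a keyword in the first 12 columns and indented continuation lines, a few lines of code are enough to read the basic fields. The sketch below is only an illustration: `SAMPLE_RECORD` is a hypothetical, abbreviated record (not a real GenBank entry) and `parse_genbank_text` is a made-up helper name. For real work, a library such as Biopython (`Bio.SeqIO`) is the more robust choice.

```python
# Hypothetical, abbreviated record for illustration only (not a real GenBank entry).
SAMPLE_RECORD = """\
LOCUS       DEMO0001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical demonstration record.
ACCESSION   DEMO0001
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank_text(text):
    """Collect keyword fields and the sequence from one flat-file record."""
    fields = {}
    sequence_parts = []
    current_key = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of bases.
            sequence_parts.extend(line.split()[1:])
            continue
        keyword = line[:12].strip()  # keywords live in the first 12 columns
        value = line[12:].strip()
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword:
            current_key = keyword
            fields[current_key] = value
        elif current_key is not None:
            # Continuation line: append to the previous field.
            fields[current_key] += " " + value
    fields["SEQUENCE"] = "".join(sequence_parts)
    return fields


record = parse_genbank_text(SAMPLE_RECORD)
print(record["ACCESSION"], len(record["SEQUENCE"]), "bp")
```

The parser only handles the handful of keywords shown here; the FEATURES table and multi-record files are deliberately left out.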

Secondary structure databases

  • UniProt (Universal Protein Resource)
    • Protein sequences
    • Functional annotations
    • Structural details
  • Swiss-Prot
    • Reviewed and manually annotated section of UniProt
    • Higher quality data
    • Recognized by a golden star in UniProt
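
The reviewed/unreviewed distinction is also visible in UniProtKB FASTA headers, where the prefix `sp` marks a Swiss-Prot (reviewed) entry and `tr` marks a TrEMBL (unreviewed) entry. The following minimal sketch splits such a header into its parts. The function name `parse_uniprot_header` is an assumption for illustration, and the second header uses a hypothetical accession.

```python
def parse_uniprot_header(header):
    """Split a UniProtKB FASTA header of the form >db|ACCESSION|ENTRY_NAME description."""
    if header.startswith(">"):
        header = header[1:]
    id_part, _, description = header.partition(" ")
    db, accession, entry_name = id_part.split("|")
    return {
        "reviewed": db == "sp",  # sp = Swiss-Prot (reviewed), tr = TrEMBL (unreviewed)
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


# Human hemoglobin subunit alpha, a reviewed Swiss-Prot entry.
reviewed = parse_uniprot_header(
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)
# Hypothetical unreviewed entry (the accession is made up for this example).
unreviewed = parse_uniprot_header(
    ">tr|X0X0X0|X0X0X0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606"
)
print(reviewed["accession"], reviewed["reviewed"])
print(unreviewed["accession"], unreviewed["reviewed"])
```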

3D structure databases

  • Protein Data Bank (PDB)
    • Database for 3D structures of biological macromolecules, including proteins and nucleic acids
    • Structures are typically determined experimentally using methods such as X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy
    • Hosted by the Research Collaboratory for Structural Bioinformatics (RCSB)
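
Coordinates in the legacy PDB text format are stored in fixed-width `ATOM` records, so they can be read by column position. The sketch below assumes the standard column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The sample lines are hypothetical, and `parse_atom_line` and `read_atoms` are illustrative names. Libraries such as Biopython (`Bio.PDB`) handle the many edge cases this ignores.

```python
# Hypothetical sample lines in PDB fixed-width format (coordinates are made up).
SAMPLE_PDB_LINES = [
    "ATOM      1  N   ALA A   1      10.000  20.000  30.000  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1      11.500  20.000  30.000  1.00  0.00           C",
    "HETATM    3  O   HOH A 101      15.000  25.000  35.000  1.00  0.00           O",
]


def parse_atom_line(line):
    """Extract a few fields from one ATOM record using 0-based slices."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def read_atoms(lines):
    """Keep only ATOM records (HETATM and other record types are skipped)."""
    return [parse_atom_line(line) for line in lines if line[:6].strip() == "ATOM"]


for atom in read_atoms(SAMPLE_PDB_LINES):
    print(atom["name"], atom["res_name"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```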

Structure quality and validation

  • Ramachandran plot
    • Tool used to check the quality of protein structures
    • Validates the protein backbone (main chain) by comparing each residue's observed \(\phi\) (phi) and \(\psi\) (psi) dihedral angles against the sterically allowed regions, flagging residues with likely errors or steric clashes
    • The Bumbling Biochemist has a good video discussing these (the bumbling biochemist 2023)

Image courtesy of Proteopedia

  • Structure Visualization Tools
    • PyMOL
      • Molecular visualization tool for viewing, rendering, and analyzing 3D structures
    • Swiss-PdbViewer (DeepView)
      • Analysis of multiple proteins simultaneously
      • Includes superposition
      • Can also be used to examine the Ramachandran plot
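
The \(\phi\) and \(\psi\) values shown on a Ramachandran plot are dihedral (torsion) angles, each defined by four consecutive backbone atoms: \(\phi\) by C(i-1), N(i), CA(i), C(i) and \(\psi\) by N(i), CA(i), C(i), N(i+1). As a rough sketch of how such an angle can be computed from coordinates, the function below uses the common atan2 formulation. The name `dihedral_degrees` and the test points are assumptions for illustration, not taken from any particular validation tool.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Four made-up points with the central bond along the z axis.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying this to every residue's backbone atoms gives the (\(\phi\), \(\psi\)) pairs that are then compared against the allowed regions of the plot.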

Computational Protein Modeling Methods

  • Only a fraction of known proteins have experimentally solved 3D structures, because experimental structure determination is costly and time-consuming

  • For most proteins, computational methods are therefore critical for

    • Functional understanding
    • Drug discovery

Modeling strategies

  • Template-Based Modeling (TBM)
    • Builds the structure of a target protein based on the known 3D structure (template) of a homologous protein
    • Assumes that proteins with similar sequences fold into similar structures
    • Most effective when the target and template share more than roughly 30% to 40% sequence identity
  • De novo modeling (Ab Initio)
    • Used when a suitable template structure is not available
    • Estimates the protein’s structure from scratch
    • Calculates the most energetically favorable conformation based on chemical and physical principles
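
The 30% to 40% threshold for template-based modeling is usually checked on an existing pairwise alignment. A minimal sketch of a percent-identity calculation follows. It assumes identity is counted over aligned, non-gap columns only; other conventions divide by the alignment length or by the shorter sequence length. The function name and the two aligned strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gap columns under this convention
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned target and template fragments.
identity = percent_identity("MKT-AYIAK", "MKTGAY-SK")
print(f"{identity:.1f}% identity; template-based modeling plausible: {identity >= 30.0}")
```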

TBM prediction process

  1. Reference Identification:
    • Identify similar amino acid sequences whose structures are resolved
    • Often using tools like BLAST
  2. Template Selection:
    • Choose one or more suitable template structures
    • Consider factors like sharing the same function or belonging to the same protein family
  3. Alignment:
    • Align the target and template sequences
    • Alignment algorithms use
      • Scoring matrices like BLOSUM and PAM
      • Gap penalties for insertions or deletions
  4. Construction:
    • Transfer the template’s coordinates to the target protein to generate the atomic coordinates of the 3D model
  5. Model Validation:
    • Verify the quality of the model by checking for errors, such as
      • Unsuitable torsion angles
      • Steric hindrance
    • Ramachandran plots can be helpful
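
The alignment step can be illustrated with a small global alignment by dynamic programming (Needleman-Wunsch). This is a sketch under simplifying assumptions: a flat match/mismatch score stands in for a substitution matrix such as BLOSUM or PAM, the gap penalty is linear, and the parameter values are arbitrary. It is not the procedure used by any specific modeling server.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap  # leading gaps in a

    def sub(i, j):
        # Stand-in for a substitution-matrix lookup (e.g. BLOSUM62).
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Swapping `sub` for a lookup into a real substitution matrix, and the linear gap for an affine one, moves this toward what production aligners do.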

Modeling tools and servers

  • SWISS-MODEL:
    • Automated homology modeling server that builds 3D models from amino acid sequences
    • Can model oligomeric structures and include essential cofactors or ligands
    • Uses sensitive Hidden Markov Models (HMM) to search for suitable templates
  • Phyre2 (Protein Homology/AnalogY Recognition Engine 2):
    • Web-based program that employs homology modeling to predict structure
    • Finds related proteins in the PDB
    • Uses ab initio modeling when templates are lacking
  • I-TASSER (Iterative Threading Assembly Refinement):
    • Hierarchical system designed for automated protein structure prediction and structure-based function annotation
    • Finds templates using LOMETS (a meta-threading algorithm)
    • Performs assembly and refinement through techniques like replica-exchange Monte Carlo simulations
  • Robetta:
    • Integrates comparative modeling with de novo folding techniques
    • Uses the Rosetta fragment insertion method
  • AlphaFold 2 (Google DeepMind):
    • Employs a deep-learning architecture that learns complex relationships between amino acid residues
    • Predicts inter-residue distances and angles to build the 3D structure

Modeling Interactions and Functional Prediction

Protein-protein interaction (PPI) modeling

  • PPIs are fundamental to biological processes such as

    • Cell signaling
    • Enzyme activity
    • Gene expression control
  • Bioinformatics tools are used to predict and simulate PPIs

  • These tools also help identify protein complexes and binding sites

Docking methods for PPIs

  • Docking is used to model PPIs, predicting the 3D structure of protein complexes
  • Rigid Body Docking assumes that proteins maintain their structure upon binding
  • Energy-Based Docking employs molecular mechanics force fields and scoring functions to estimate the binding free energy
  • Machine learning algorithms can be trained on known PPI data to predict binding affinities and accelerate simulations
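
To make the idea of an energy-based score concrete, the sketch below sums a Lennard-Jones term and a Coulomb term over all atom pairs between two rigid molecules. It is a toy stand-in for real docking scoring functions, not an estimate of binding free energy: the parameter values (`epsilon`, `sigma`, `dielectric`, `cutoff`) are arbitrary assumptions, every atom shares the same Lennard-Jones parameters, and the atoms are made up. The Coulomb constant 332.06 gives energies in kcal/mol when distances are in ångströms and charges in elementary charge units.

```python
import math


def pair_energy(r, q1, q2, epsilon=0.2, sigma=3.5, coulomb_k=332.06, dielectric=4.0):
    """Lennard-Jones 12-6 plus Coulomb energy for one atom pair at distance r > 0."""
    lj = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = coulomb_k * q1 * q2 / (dielectric * r)
    return lj + coulomb


def interaction_score(atoms_a, atoms_b, cutoff=12.0):
    """Sum pair energies between two rigid bodies; atoms are (x, y, z, charge) tuples."""
    total = 0.0
    for xa, ya, za, qa in atoms_a:
        for xb, yb, zb, qb in atoms_b:
            r = math.dist((xa, ya, za), (xb, yb, zb))
            if r < cutoff:  # ignore pairs beyond the cutoff distance
                total += pair_energy(r, qa, qb)
    return total


# Hypothetical two-atom "receptor" and one-atom "ligand" in two candidate poses.
receptor = [(0.0, 0.0, 0.0, 0.5), (1.5, 0.0, 0.0, -0.5)]
pose_near = [(-4.0, 0.0, 0.0, -0.5)]
pose_far = [(-9.0, 0.0, 0.0, -0.5)]
print(interaction_score(receptor, pose_near), interaction_score(receptor, pose_far))
```

A rigid-body docking search would evaluate a score like this over many rotations and translations of one partner and keep the lowest-energy poses.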

Molecular dynamics (MD) simulations

  • Provide a detailed view of protein motion, dynamics, and conformational changes during binding

  • Atomistic MD Simulations mimic the behavior of individual atoms over time using classical force fields

  • Coarse-Grained (CG) MD Simulations

    • Simplify the molecular representation to reduce computational complexity
    • Allow simulation of larger complexes over longer timescales
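
At its core, an MD simulation repeatedly computes forces from a force field and advances positions and velocities by a small time step. The sketch below shows one widely used integration scheme, velocity Verlet, on a deliberately tiny system: a single particle on a one-dimensional harmonic "bond" in arbitrary units. The force constant, mass, and time step are assumptions chosen for illustration; real MD engines apply the same update to every atom (or coarse-grained bead) in three dimensions.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one particle in 1D; returns (list of positions, final velocity)."""
    trajectory = [x]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # forces at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append(x)
    return trajectory, v


def harmonic_force(x, k=1.0, x0=0.0):
    """Restoring force of a spring with force constant k and rest position x0."""
    return -k * (x - x0)


positions, final_v = velocity_verlet(
    x=1.0, v=0.0, force=harmonic_force, mass=1.0, dt=0.01, n_steps=1000
)
energy = 0.5 * final_v ** 2 + 0.5 * positions[-1] ** 2
print(positions[-1], math.cos(10.0), energy)
```

For this oscillator the exact position at time t is cos(t), so the printed values give a quick check that the integrator tracks the analytic solution and keeps the total energy close to its starting value of 0.5.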

Interaction databases

  • STRING (Search Tool for the Retrieval of Interacting Genes/Proteins):
    • Provides quality-controlled protein-protein association networks
  • MINT (Molecular INTeraction Database):
    • A repository containing experimentally supported PPIs
  • Cytoscape:
    • Open-source software used for visualizing and analyzing intricate PPI networks
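
Interaction data from resources like these are typically handled as a scored edge list, which can be filtered by confidence and summarized before visualization. The sketch below assumes a 0 to 1000 score scale with 700 as a "high confidence" cutoff, in the style of STRING's combined score. The protein names and scores are entirely hypothetical, and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical (protein, protein, score) edges; names and scores are made up.
EDGES = [
    ("protA", "protB", 950),
    ("protA", "protC", 720),
    ("protB", "protC", 410),
    ("protC", "protD", 150),
    ("protA", "protD", 880),
]


def build_network(edges, min_score=700):
    """Undirected adjacency sets, keeping only edges at or above min_score."""
    network = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score:
            network[a].add(b)
            network[b].add(a)
    return dict(network)


def degree_ranking(network):
    """(degree, protein) pairs sorted from most to least connected."""
    return sorted(((len(nbrs), node) for node, nbrs in network.items()), reverse=True)


network = build_network(EDGES)
for degree, protein in degree_ranking(network):
    print(protein, degree)
```

Highly connected nodes ("hubs") found this way are a common starting point for closer inspection in a network viewer.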

Functional annotation and pathway analysis

  • Gene Ontology (GO):
    • Hierarchical, standardized vocabulary describing protein functions across three categories:
      • Cellular Component
      • Biological Process
      • Molecular Function
  • InterPro:
    • Combines signatures from several member databases (e.g., Pfam, PROSITE) to predict
      • Protein domains
      • Families
      • Functional sites
  • KEGG (Kyoto Encyclopedia of Genes and Genomes):
    • Database providing pathway maps that integrate genomic, chemical, and functional information
  • ScanNet:
    • A geometric deep-learning model designed to learn features directly from protein structures
    • Can be used to predict functional sites, such as binding sites for small molecules

Summary

  • Bioinformatics facilitates
    • Predicting protein structure
    • Simulating and investigating protein-protein interactions
    • Clarifying structure-function relationships
  • The accuracy of protein structure prediction, exemplified by tools like AlphaFold, is advancing rapidly

Future Directions

  • AI, AI, AI …
    • Deep learning (DL) models capable of integrating information across molecular, cellular, tissue, and organism levels are expected to lead to a more complete understanding of biological processes
  • Research direction: Integrating multi-omics data

References

the bumbling biochemist. 2023. “Understanding & Exploring & Ramachandran Plots & Phi/Psi Angles w/o Getting Bogged down in Details.” https://www.youtube.com/watch?v=QGBxXHqdC9E.