5 Intro to Bioinformatics

Author: Dr Randy Johnson

Affiliation: Hood College

Published: August 25, 2025

Acknowledgements

Creation of these notes included brainstorming and proofreading with Gemini.

Introduction to Bioinformatics

Bioinformatics is essentially big-data analysis applied to biological data sets. It relies on computational and statistical methods to extract meaning from biological data, so the term also covers the development of the software and methods used for that purpose. It is an interdisciplinary field where biology, computer science, and statistics meet (Fios Genomics 2021).

Bioinformatics is an interdisciplinary field

Bioinformatics sits at the intersection of biology, computer science, mathematics and statistics.

Bioinformatics is in a constant state of growth

The field has experienced rapid development due to the exponential increase in both sequence information accumulated in databases and available computer processing power.

Why is Bioinformatics Important?

Some answers could include:

It supports the transition of biology into a data-driven science. This is especially important with advancements in technologies like next-generation sequencing and mass spectrometry.
Bioinformatics is essential for processing, analyzing, and interpreting the massive amounts of biological data generated, including genetic information, protein amino acid sequences, and protein structures.
It helps organize, retrieve, store, and integrate diverse protein data from various sources, making it accessible for comprehensive analysis.
Bioinformatics facilitates a deeper understanding of cellular mechanisms, disease progression, and the intricate relationship between genotype and phenotype.

The Central Dogma of Molecular Biology

Original concept: The central dogma of molecular biology is a theory stating that genetic information flows only in one direction (Crick 1958):

DNA is transcribed into RNA
RNA is translated into protein

This view has become less strict as exceptions have been recognized. For example, some viruses carry their genome as RNA rather than DNA, and retroviruses such as HIV copy that RNA back into DNA (reverse transcription). Prions are infectious proteins that propagate without DNA or RNA by inducing normal copies of the same protein to misfold. Some examples include:

RNA viruses
- HIV
- SARS-CoV-2 (the virus that causes COVID-19)
Prion diseases
- Mad cow disease
- Creutzfeldt-Jakob disease
- Chronic wasting disease

DNA nucleotides

All DNA is made up of 4 different nucleotides:

Adenine (A)
Guanine (G)
Thymine (T)
Cytosine (C)

https://www.daviddarling.info/encyclopedia/N/nucleotide.html

Transcription

During transcription, RNA polymerase binds to the DNA at a promoter and synthesizes an RNA copy of the gene. It is often assisted by transcription factors, which help recruit RNA polymerase to specific promoter regions.

Translation

During translation, the transcribed mRNA enters the cytoplasm (in eukaryotes) and is translated into protein by a ribosome, one codon at a time, until a stop codon is encountered on the mRNA strand.

Codons

Each triplet of nucleotides (a codon) specifies a particular amino acid; amino acids are the fundamental building blocks of proteins. Side note: RNA contains uracil (U) in place of thymine (T), so codons in mRNA are written with U instead of T.
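The flow from DNA to mRNA to protein can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: it assumes the input is the coding (sense) strand, uses the standard genetic code, and the function names and the 18-base example sequence are made up for this sketch.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Translate mRNA one codon at a time, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE.get(codon, "X")  # X marks an unknown codon
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGATTATCAAGTGTAA")  # hypothetical example sequence
print(mrna)             # AUGGAUUAUCAAGUGUAA
print(translate(mrna))  # MDYQV
```

Amino acids are reported with their one-letter codes, the same alphabet used in protein sequence files.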

Amino acid sequences

The prediction of protein sequences from DNA sequences in bioinformatics is based on this dogma.

Information flows from DNA to protein, not the other way around.

Roughly one-third of protein-coding genes in bacterial genomes cannot yet be annotated with a function (Price et al. 2018), so sequence-based predictions are best treated as strong hypotheses for further experimental validation.

These amino acid sequences twist and fold into proteins that perform a multitude of biological functions. Here is a protein structure of human CCR5:

https://www.rcsb.org/structure/4MBS

CCR5 is a co-receptor that HIV commonly uses, together with CD4, to enter T cells.

Genome annotation

One major task in bioinformatics is genome annotation:

Identifying and labeling relevant features
Predicting coding genes
Predicting non-coding RNA genes, such as those for ribosomal RNA (rRNA)

This process often begins with identifying open reading frames (ORFs): stretches of DNA that begin with a start codon and can be translated into amino acids without interruption until a stop codon is reached.
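A simplified ORF scan might look like the following Python sketch. It makes several assumptions that real gene finders do not: it searches only the forward strand, treats ATG as the only start codon, and uses a hypothetical `min_codons` cutoff. The function name and example sequence are illustrative only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop ORF on the forward strand.

    Positions are 0-based, and end is exclusive (it includes the stop codon).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # close the ORF and keep scanning
    return orfs

print(find_orfs("CCATGGATTATTAAGG"))  # [(2, 14, 'ATGGATTATTAA')]
```

A fuller scan would also check the three reading frames of the reverse complement strand.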

Functional annotation

Functional annotation further connects these predicted genes to their biological processes by assigning Gene Ontology (GO) terms across three functional categories:

Molecular function
Biological process
Cellular component

Biological Databases

Due to the size and complexity of biological data, specialized databases are necessary for organizing and interacting with data:

Storage
Management
Searching & retrieval

FAIR data

For reliability, databases must be continuously updated and their documentation easily accessible. They are expected to adhere to the FAIR principles, making data:

Findable
Accessible
Interoperable
Reusable

Data sources

There are hundreds of biological databases
- Primary Databases: contain raw experimental data
- Secondary Databases: are derived from primary data
  - Added interpretation
  - Quality control
  - Reduced redundancy

Primary database examples

GenBank: The most widely used and well-known primary database for DNA sequences. It is managed by the National Center for Biotechnology Information (NCBI). GenBank is an open-access, annotated collection of publicly available nucleotide sequences.

European Nucleotide Archive (ENA): The primary database for nucleotide sequences in Europe. It incorporates the former EMBL nucleotide sequence database and is hosted by the European Bioinformatics Institute (EMBL-EBI) in the UK.

DNA Data Bank of Japan (DDBJ): Hosted by the National Institute of Genetics (NIG) in Japan.

International Nucleotide Sequence Database Collaboration (INSDC): GenBank, ENA, and DDBJ collectively form the INSDC, collaborating to exchange deposited data daily in standardized formats.

Sequence Read Archive (SRA): A primary database from NCBI for storing raw sequence reads generated by high-throughput experimental techniques.

Secondary databases

UniProt (Universal Protein Resource): A major secondary protein database that includes highly curated entries (known as “Swiss-Prot”) and computationally derived ones (TrEMBL). It offers comprehensive protein sequences, functional annotations, and structural/domain information.

Protein Data Bank (PDB): A repository for experimentally determined 3D structures of proteins and nucleic acids, typically obtained through methods like X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy.

InterPro: An integrated resource that combines data from various sources (e.g., Pfam, PROSITE, PRINTS, SMART) to predict protein domains, families, and functional regions.

KEGG (Kyoto Encyclopedia of Genes and Genomes): A comprehensive database that integrates genetic, chemical, and functional information, providing pathway maps and functional annotations for genes and proteins.

Navigating and retrieving data

Accessibility: Most major bioinformatics databases like NCBI, UniProt, and PDB are freely accessible via web interfaces.
Search Mechanisms:
- Users can search these databases using keywords or specific accession numbers, which are unique identifiers for sequences.
- Many databases also provide APIs that can be queried programmatically, for example from R or Python.
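As one illustration of programmatic access, the sketch below builds a request for NCBI's E-utilities `efetch` endpoint using only the Python standard library. The helper names are made up for this example, the accession is the CCR5 protein record used elsewhere in these notes, and the fetch function needs network access. NCBI publishes usage guidelines (including request rate limits) that should be checked before running anything in bulk.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(accession, db="protein"):
    """Build an efetch URL that requests one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)

def fetch_fasta(accession, db="protein", timeout=30):
    """Download a record as FASTA text (requires an internet connection)."""
    with urlopen(build_efetch_url(accession, db), timeout=timeout) as response:
        return response.read().decode("utf-8")

print(build_efetch_url("NP_000570.1"))
# To actually download the record: print(fetch_fasta("NP_000570.1"))
```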

FASTA format

A very common and simple text-based format for representing nucleotide or amino acid sequences.

First line of a sequence starts with >, followed by a sequence identifier/name
Subsequent lines contain the actual sequence

For example, here is the amino acid sequence of the human CCR5 protein:

>NP_000570.1 C-C chemokine receptor type 5 [Homo sapiens]
MDYQVSSPIYDINYYTSEPCQKINVKQIAARLLPPLYSLVFIFGFVGNMLVILILINCKRLKSMTDIYLL
NLAISDLFFLLTVPFWAHYAAAQWDFGNTMCQLLTGLYFIGFFSGIFFIILLTIDRYLAVVHAVFALKAR
TVTFGVVTSVITWVVAVFASLPGIIFTRSQKEGLHYTCSSHFPYSQYQFWKNFQTLKIVILGLVLPLLVM
VICYSGILKTLLRCRNEKKRHRAVRLIFTIMIVYFLFWAPYNIVLLLNTFQEFFGLNNCSSSNRLDQAMQ
VTETLGMTHCCINPIIYAFVGEKFRNYLLVFFQKHIAKRFCKCCSIFQQEAPERASSVYTRSTGEQEISV
GL
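A minimal sketch of reading this format in Python is shown here. The function name and the two toy records are assumptions for illustration; it simply treats every line beginning with `>` as a new header and joins the lines that follow into one sequence.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}

# Toy input with two short, made-up records.
example = """>seq1 first toy record
MDYQVSSPIY
DINYY
>seq2 second toy record
ACGTACGT
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```

For real work, an established parser such as Biopython's `SeqIO.parse(handle, "fasta")` handles more edge cases than this sketch.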

BLAST

The Basic Local Alignment Search Tool is the most frequently used bioinformatics program for comparing a query sequence (DNA or protein) against all sequences in a chosen database.

BLAST is a fast heuristic algorithm for finding regions of local sequence similarity, which can point to homologs (sequences that share a common ancestor).

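Its output includes a statistic called the E-value (expect value): the number of alignments with a score at least as good that would be expected by chance when searching a database of that size. The closer the E-value is to zero, the less likely the match is to be a chance hit.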
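For intuition, the E-value is usually defined as the expected number of chance hits scoring at least as well as the one observed, and it can be approximated from a normalized bit score S' as E = m × n × 2^(−S'), where m is the query length and n is the total database length. The Python sketch below applies that relationship. It is a simplification: BLAST itself uses length-corrected "effective" values for m and n, and the function names are made up for this example.

```python
import math

def evalue_from_bit_score(bit_score, query_len, db_len):
    """Approximate E-value: E = m * n * 2**(-S') for bit score S'."""
    return query_len * db_len * 2.0 ** (-bit_score)

def prob_at_least_one_hit(evalue):
    """Probability of one or more chance hits this good: P = 1 - exp(-E)."""
    return -math.expm1(-evalue)

# A 352-residue query against a hypothetical database of one million residues.
e = evalue_from_bit_score(50, 352, 1_000_000)
print(e, prob_at_least_one_hit(e))
```

For small values the E-value and the probability are nearly identical, which is why the two are often used interchangeably in practice even though they are different quantities.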

Applications of bioinformatics in molecular biology

More than just data management - powerful computational methods to investigate
- protein structures
- protein-protein interactions
- complex relationships between structure and function

Key Application Areas

Protein ID: processing mass spectrometry data to identify and quantify peptides and proteins.
- MASCOT
- SEQUEST (Eng, McCormack, and Yates 1994)
- MaxQuant
- MSFragger
- PEAKS

Protein structure: prediction of three-dimensional protein structures from amino acid sequence.
- AlphaFold
- Phyre2
- Swiss-Model
- I-TASSER
- Robetta

Protein-Protein Interaction (PPI): computational docking methods predict the 3D structures of protein complexes, often informed by experimental data. Known and predicted interactions are collected in databases such as:
- STRING
- BioGRID
- I2D

Functional annotation and relationships: identify conserved domains, functional sites, and motifs based on sequence and structural analysis.
- Gene Ontology (GO)

Drug discovery:
- Virtual screening
- Molecular dynamics simulations
- Ligand docking
- Prediction of mutation impact on therapeutic effectiveness

Disease diagnostics: analysis of diverse data types (e.g., clinical symptoms, laboratory test results, medical images, genomic sequences) to make accurate disease predictions.

Gene expression analysis: analysis of differentially expressed genes using RNA-Seq and single-cell RNA-Seq data analysis pipelines, which involve:
- Quality control
- Read mapping
- Expression quantification
- Differential analysis
- Clustering / cell type inference
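To make the quantification and differential analysis steps concrete, here is a deliberately simplified Python sketch. It assumes read counts per gene are already available, normalizes them to counts per million (CPM), and computes a log2 fold change with a pseudocount. The gene names and counts are invented, and real pipelines use dedicated statistical packages with replicates and significance testing rather than a bare fold change.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}

def log2_fold_change(cpm_control, cpm_treated, pseudocount=1.0):
    """log2(treated / control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2(
            (cpm_treated[gene] + pseudocount) / (cpm_control[gene] + pseudocount)
        )
        for gene in cpm_control
    }

# Invented counts for two hypothetical genes in two samples.
control = {"geneA": 250_000, "geneB": 750_000}
treated = {"geneA": 500_000, "geneB": 500_000}

lfc = log2_fold_change(counts_per_million(control), counts_per_million(treated))
for gene, value in lfc.items():
    print(gene, round(value, 2))
```

A positive value means the gene is more highly expressed in the treated sample, and a negative value means it is lower.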

Reconstructing biological networks: reconstruction of intricate protein interaction and signaling networks.
- Cytoscape is commonly used for visualizing and analyzing these complex networks
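Underneath the visualization, an interaction network is just a graph: proteins are nodes and interactions are edges. The short Python sketch below stores a toy network as an adjacency map and counts each protein's interaction partners (its degree). The protein labels A to D and the interactions are placeholders, not real data.

```python
from collections import defaultdict

def build_network(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)

def degrees(network):
    """Number of interaction partners for each protein."""
    return {protein: len(partners) for protein, partners in network.items()}

# Placeholder interactions between four hypothetical proteins.
toy_interactions = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
network = build_network(toy_interactions)
node_degrees = degrees(network)
hub = max(node_degrees, key=node_degrees.get)
print(node_degrees, hub)
```

Highly connected proteins ("hubs") found this way are often a starting point for follow-up analysis.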

Future Directions

AI/ML integration
Multi-omics data integration

Challenges and opportunities

Interpretability of AI models
Data-hungry AI models
- Amount of data
- Quality of data
Expanding Scope
Collaboration

References

Crick, F. H. 1958. “On Protein Synthesis.” Symposia of the Society for Experimental Biology 12: 138–63.

Eng, Jimmy K., Ashley L. McCormack, and John R. Yates. 1994. “An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database.” Journal of the American Society for Mass Spectrometry 5 (11): 976–89. https://doi.org/10.1016/1044-0305(94)80016-2.

Fios Genomics. 2021. “What Is Bioinformatics.” LinkedIn. https://www.linkedin.com/pulse/what-bioinformatics-fios-genomics-ltd/.

Price, Morgan N., Kelly M. Wetmore, R. Jordan Waters, Mark Callaghan, Jayashree Ray, Hualan Liu, Jennifer V. Kuehl, et al. 2018. “Mutant Phenotypes for Thousands of Bacterial Genes of Unknown Function.” Nature 557 (7706): 503–9. https://doi.org/10.1038/s41586-018-0124-0.