bio-alignment-pairwise

Name: bio-alignment-pairwise
Author: FreedomIntelligence

# Install this skill:

npx skills add FreedomIntelligence/OpenClaw-Medical-Skills --skill "bio-alignment-pairwise"

Install specific skill from multi-skill repository

# Description

Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.

# SKILL.md

name: bio-alignment-pairwise
description: Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
tool_type: python
primary_tool: Bio.Align

Version Compatibility

Reference examples tested with: Biopython 1.83+

Before using code patterns, verify that the installed version matches. If versions differ:
- Python: run pip show biopython, then help(Bio.Align.PairwiseAligner) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
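
The version check described above can be scripted. The following is a minimal sketch using only the standard library; the helper names installed_version and describe_callable are hypothetical and are not part of Biopython.

```python
import importlib
import inspect
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def describe_callable(module_name, attr_name):
    """Return the call signature of module_name.attr_name as a string."""
    module = importlib.import_module(module_name)
    return str(inspect.signature(getattr(module, attr_name)))


# Demonstrated on the standard library so the sketch runs anywhere.
print(installed_version("no-such-distribution-xyz"))  # None
print(describe_callable("json", "dumps"))
```

For this skill, the equivalent calls would be installed_version("biopython") and describe_callable("Bio.Align", "PairwiseAligner").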

Pairwise Sequence Alignment

"Align two sequences" → Compute an optimal alignment between a pair of sequences using dynamic programming.
- Python: PairwiseAligner() (Biopython Bio.Align)
- CLI: needle (global) or water (local) from EMBOSS
- R: pairwiseAlignment() (Biostrings)

Align two sequences using dynamic programming algorithms (Needleman-Wunsch for global, Smith-Waterman for local).
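
To make the dynamic programming idea concrete, here is a minimal sketch of the Needleman-Wunsch score recurrence in plain Python. It assumes a single linear gap penalty (PairwiseAligner supports separate open and extend scores), and the function name is illustrative. It is meant for understanding the recurrence, not as a replacement for PairwiseAligner.

```python
def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=-10):
    """Global alignment score with a linear gap penalty (illustrative only)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# 11 matches (+2 each) and 2 mismatches (-1 each), no gaps
print(needleman_wunsch_score("ACCGGTAACGTAG", "ACCGTTAACGAAG"))  # 20
```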

Required Import

Goal: Load modules needed for pairwise alignment operations.

Approach: Import the PairwiseAligner class along with sequence and I/O utilities from Biopython.

from Bio.Align import PairwiseAligner
from Bio.Seq import Seq
from Bio import SeqIO

Core Concepts

| Mode | Algorithm | Use Case |
|------|-----------|----------|
| `global` | Needleman-Wunsch | Full-length alignment, similar-length sequences |
| `local` | Smith-Waterman | Find best matching regions, different-length sequences |
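
The difference between the two modes shows up in the recurrence itself. As a rough sketch, local (Smith-Waterman) scoring clamps every cell at zero and reports the best cell anywhere in the matrix, so flanking sequence is ignored. The code below assumes a linear gap penalty and uses an illustrative function name.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-10):
    """Local alignment score with a linear gap penalty (illustrative only)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The 7-base query occurs exactly inside the longer target: 7 matches * 2
print(smith_waterman_score("TTTTACGTACGTTTT", "ACGTACG"))  # 14
```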

Creating an Aligner

Goal: Configure a PairwiseAligner with appropriate scoring for the sequence type.

Approach: Instantiate PairwiseAligner with mode, scoring parameters, or a substitution matrix depending on DNA vs protein input.

# Basic aligner with defaults
aligner = PairwiseAligner()

# Configure mode and scoring
aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

# For protein alignment with substitution matrix
from Bio.Align import substitution_matrices
aligner = PairwiseAligner(mode='global', substitution_matrix=substitution_matrices.load('BLOSUM62'))

Performing Alignments

"Align two sequences" → Compute optimal alignment(s) between a pair of sequences, returning alignment objects or a score.

Goal: Align two sequences and retrieve the optimal alignment(s) or score.

Approach: Call aligner.align() for full alignment objects or aligner.score() for score-only (faster for large sequences).

seq1 = Seq('ACCGGTAACGTAG')
seq2 = Seq('ACCGTTAACGAAG')

# Get all optimal alignments
alignments = aligner.align(seq1, seq2)
print(f'Found {len(alignments)} optimal alignments')
print(alignments[0])  # Print first alignment

# Get score only (faster for large sequences)
score = aligner.score(seq1, seq2)
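
As a quick consistency check, the score returned by aligner.score() should equal the score attribute of any alignment returned by aligner.align(). The sketch below sets every scoring parameter explicitly so the result does not depend on library defaults; it passes plain strings, which PairwiseAligner accepts alongside Seq objects.

```python
from Bio.Align import PairwiseAligner

aligner = PairwiseAligner(
    mode="global",
    match_score=2,
    mismatch_score=-1,
    open_gap_score=-10,
    extend_gap_score=-0.5,
)

target = "ACCGGTAACGTAG"
query = "ACCGTTAACGAAG"

score_only = aligner.score(target, query)  # no alignment objects are built
best = aligner.align(target, query)[0]     # first optimal alignment

print(score_only, best.score)
```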

Alignment Output Format

target            0 ACCGGTAACGTAG 13
                  0 ||||.|||||.|| 13
query             0 ACCGTTAACGAAG 13
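
In the middle line of this display, | marks an identical column, . marks a mismatch, and - marks a gap. The following sketch rebuilds that line from two gapped strings; match_line is a hypothetical helper written for illustration, not a Biopython function.

```python
def match_line(target_aligned, query_aligned):
    """Build the |/./- annotation line for two equal-length gapped strings."""
    if len(target_aligned) != len(query_aligned):
        raise ValueError("aligned strings must have the same length")
    symbols = []
    for t, q in zip(target_aligned, query_aligned):
        if t == "-" or q == "-":
            symbols.append("-")  # gap in either row
        elif t == q:
            symbols.append("|")  # identity
        else:
            symbols.append(".")  # mismatch
    return "".join(symbols)


print(match_line("ACCGGTAACGTAG", "ACCGTTAACGAAG"))  # ||||.|||||.||
```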

Accessing Alignment Data

Goal: Extract alignment properties including score, shape, aligned sequences, and coordinate mappings.

Approach: Access alignment object attributes and indexing to retrieve per-sequence aligned strings and coordinate arrays.

alignment = alignments[0]

# Basic properties
print(alignment.score)                    # Alignment score
print(alignment.shape)                    # (num_seqs, alignment_length)
print(len(alignment))                     # Number of sequences (2); alignment.shape[1] is the column count

# Get aligned sequences with gaps
target_aligned = alignment[0, :]          # First sequence (target) with gaps
query_aligned = alignment[1, :]           # Second sequence (query) with gaps

# Get coordinate mapping
print(alignment.aligned)                  # Array of aligned segment coordinates
print(alignment.coordinates)              # Full coordinate array
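
alignment.coordinates stores, for each sequence, the positions at which aligned blocks and gaps begin and end. As a rough illustration of how aligned segments can be derived from such an array, the sketch below assumes forward-strand coordinates and uses a hand-written example rather than output captured from Biopython; aligned_blocks is a hypothetical helper.

```python
def aligned_blocks(coordinates):
    """Return ((t_start, t_end), (q_start, q_end)) for each gap-free block."""
    target_row, query_row = coordinates
    blocks = []
    for k in range(len(target_row) - 1):
        t0, t1 = target_row[k], target_row[k + 1]
        q0, q1 = query_row[k], query_row[k + 1]
        # A block is aligned only if both sequences advance; if just one
        # advances, the other row contains a gap over that stretch.
        if t1 > t0 and q1 > q0:
            blocks.append(((t0, t1), (q0, q1)))
    return blocks


# Hand-written coordinates for target GAACT aligned to query GA-T-
example = [[0, 2, 3, 4, 5],
           [0, 2, 2, 3, 3]]
print(aligned_blocks(example))  # [((0, 2), (0, 2)), ((3, 4), (2, 3))]
```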

Alignment Counts (Identities, Mismatches, Gaps)

Goal: Quantify identities, mismatches, and gaps in an alignment to calculate percent identity.

Approach: Use the .counts() method on the alignment object and derive percent identity from identity and mismatch totals.

alignment = alignments[0]
counts = alignment.counts()

print(f'Identities: {counts.identities}')
print(f'Mismatches: {counts.mismatches}')
print(f'Gaps: {counts.gaps}')

# Calculate percent identity over aligned (non-gap) columns
total_aligned = counts.identities + counts.mismatches
percent_identity = counts.identities / total_aligned * 100 if total_aligned else 0.0
print(f'Percent identity: {percent_identity:.1f}%')
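
Percent identity depends on the denominator chosen. The sketch below computes it directly from two gapped strings, either over aligned columns only (as in the calculation above) or over all columns including gaps. The function name and the include_gaps flag are illustrative assumptions.

```python
def percent_identity(target_aligned, query_aligned, include_gaps=False):
    """Percent identity from two equal-length gapped strings."""
    identities = mismatches = gaps = 0
    for t, q in zip(target_aligned, query_aligned):
        if t == "-" or q == "-":
            gaps += 1
        elif t == q:
            identities += 1
        else:
            mismatches += 1
    denominator = identities + mismatches + (gaps if include_gaps else 0)
    # Guard against alignments with no comparable columns.
    return 100.0 * identities / denominator if denominator else 0.0


print(round(percent_identity("ACCGGTAACGTAG", "ACCGTTAACGAAG"), 1))  # 84.6
print(percent_identity("AC-T", "ACGT", include_gaps=True))           # 75.0
```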

Common Scoring Configurations

DNA/RNA Alignment

aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

Protein Alignment

from Bio.Align import substitution_matrices
blosum62 = substitution_matrices.load('BLOSUM62')
aligner = PairwiseAligner(mode='global', substitution_matrix=blosum62, open_gap_score=-11, extend_gap_score=-1)

Local Alignment (Find Best Region)

aligner = PairwiseAligner(mode='local', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

Semiglobal (Overlap/Extension)

# Make gaps in the query free at both ends, so unaligned target flanks are not penalized (useful for primer alignment)
aligner = PairwiseAligner(mode='global')
aligner.query_left_open_gap_score = 0
aligner.query_left_extend_gap_score = 0
aligner.query_right_open_gap_score = 0
aligner.query_right_extend_gap_score = 0
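
To show what free end gaps change in the underlying recurrence, here is a plain-Python sketch of semiglobal scoring in which the whole query must align but the target may overhang on either side at no cost. It assumes a linear gap penalty and is intended to mirror the idea behind the configuration above, not to reproduce Biopython's implementation; the function name is hypothetical.

```python
def semiglobal_score(target, query, match=2, mismatch=-1, gap=-10):
    """Best score for aligning all of query somewhere inside target."""
    # Row 0 is all zeros: skipping a target prefix costs nothing.
    prev = [0] * (len(target) + 1)
    for i in range(1, len(query) + 1):
        curr = [i * gap]  # query bases with no target partner are penalized
        for j in range(1, len(target) + 1):
            diag = prev[j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    # Taking the maximum over the last row makes the target suffix free too.
    return max(prev)


# A 4-base primer found exactly inside a longer target: 4 matches * 2
print(semiglobal_score("TTTTACGTTTTT", "ACGT"))  # 8
```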

Available Substitution Matrices

Goal: Load and select substitution matrices for protein alignment scoring.

Approach: Call substitution_matrices.load() with no arguments to list the available matrices, then load a specific one by name (BLOSUM62 for general use, BLOSUM80 for close homologs, PAM250 for distant homologs).

from Bio.Align import substitution_matrices
print(substitution_matrices.load())  # List all available matrices

# Common matrices
blosum62 = substitution_matrices.load('BLOSUM62')  # General protein
blosum80 = substitution_matrices.load('BLOSUM80')  # Closely related proteins
pam250 = substitution_matrices.load('PAM250')      # Distantly related proteins
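
A loaded matrix can be indexed by a pair of residue letters, which is a convenient way to sanity-check that the expected matrix was loaded. A short sketch, assuming Biopython is installed:

```python
from Bio.Align import substitution_matrices

available = substitution_matrices.load()           # list of matrix names
blosum62 = substitution_matrices.load("BLOSUM62")

print("BLOSUM62" in available)
# Identical residues score positively; rare residues such as W score highest.
print(blosum62["A", "A"], blosum62["W", "W"], blosum62["A", "W"])
```

In BLOSUM62, A/A scores 4, W/W scores 11, and A/W scores -3.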

Working with SeqRecord Objects

Goal: Align sequences loaded from FASTA files rather than hardcoded strings.

Approach: Parse SeqRecord objects from a FASTA file and pass their .seq attributes to the aligner.

from Bio import SeqIO

records = list(SeqIO.parse('sequences.fasta', 'fasta'))
seq1, seq2 = records[0].seq, records[1].seq

aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1)
alignments = aligner.align(seq1, seq2)
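
For a self-contained variant of the pattern above, the sketch below parses FASTA text from an in-memory handle instead of a file and checks that at least two records are present before aligning. The sequence IDs and the gap scores are made-up values for illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Align import PairwiseAligner

# Hypothetical FASTA content standing in for sequences.fasta
fasta_text = ">seq_a\nACCGGTAACGTAG\n>seq_b\nACCGTTAACGAAG\n"

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
if len(records) < 2:
    raise ValueError("need at least two sequences to align")

aligner = PairwiseAligner(
    mode="global",
    match_score=1,
    mismatch_score=-1,
    open_gap_score=-5,
    extend_gap_score=-1,
)
score = aligner.score(records[0].seq, records[1].seq)
print(records[0].id, records[1].id, score)
```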

Iterating Over Multiple Alignments

# align() yields alignments lazily, so stop early instead of collecting them all.
# PairwiseAligner does not expose a max_alignments attribute; assigning one raises AttributeError.

for i, alignment in enumerate(alignments):
    print(f'Alignment {i+1}: score={alignment.score}')
    if i >= 4:
        break
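
An equivalent way to cap the number of alignments examined is itertools.islice, which stops consuming the iterator after a fixed count. This is a minimal sketch; the sequences and scores are chosen only so that several equally good alignments exist.

```python
from itertools import islice

from Bio.Align import PairwiseAligner

aligner = PairwiseAligner(
    mode="global",
    match_score=1,
    mismatch_score=-1,
    open_gap_score=-1,
    extend_gap_score=-1,
)

# Two query bases can pair with any two of the four target bases,
# so there are several optimal alignments with the same score.
alignments = aligner.align("AAAA", "AA")

first_three = list(islice(alignments, 3))  # consume at most three alignments
for i, alignment in enumerate(first_three, start=1):
    print(f"Alignment {i}: score={alignment.score}")
```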

Substitution Matrix from Alignment

Goal: Extract observed substitution frequencies from a completed alignment.

Approach: Access the .substitutions property to get a matrix of observed base/residue substitution counts.

alignment = alignments[0]
substitutions = alignment.substitutions

# View as array (rows=target, cols=query)
print(substitutions)

# Access specific substitution counts
# substitutions['A', 'T'] gives count of A aligned to T
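
To clarify what these counts represent, the sketch below tallies (target, query) letter pairs from two gapped strings with collections.Counter, skipping gap columns. It is an illustration of the concept rather than how Biopython computes the property, and substitution_counts is a hypothetical name.

```python
from collections import Counter


def substitution_counts(target_aligned, query_aligned):
    """Count (target_letter, query_letter) pairs over non-gap columns."""
    return Counter(
        (t, q)
        for t, q in zip(target_aligned, query_aligned)
        if t != "-" and q != "-"
    )


counts = substitution_counts("ACCGGTAACGTAG", "ACCGTTAACGAAG")
print(counts[("A", "A")])  # 4 identical A columns
print(counts[("G", "T")])  # 1 column where target G pairs with query T
```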

Export Alignment to Different Formats

Goal: Convert an alignment to standard bioinformatics file formats for downstream tools.

Approach: Use Python's format() function with format specifiers (fasta, clustal, psl, sam) on the alignment object.

alignment = alignments[0]

# Various output formats
print(format(alignment, 'fasta'))     # FASTA format
print(format(alignment, 'clustal'))   # Clustal format
print(format(alignment, 'psl'))       # PSL format (BLAT)
print(format(alignment, 'sam'))       # SAM format

Quick Reference: Scoring Parameters

| Parameter | Description | Typical DNA | Typical Protein |
|-----------|-------------|-------------|-----------------|
| `match_score` | Score for identical bases | 1 to 2 | Use matrix |
| `mismatch_score` | Penalty for mismatches | -1 to -3 | Use matrix |
| `open_gap_score` | Cost to start a gap | -5 to -15 | -10 to -12 |
| `extend_gap_score` | Cost per gap extension | -0.5 to -2 | -0.5 to -1 |
| `substitution_matrix` | Scoring matrix | N/A | BLOSUM62 |
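
The open and extend scores combine into an affine gap cost. As a worked example, the sketch below assumes the convention in which the open score is charged for the first gap position and the extend score for each additional position; affine_gap_score is a hypothetical helper.

```python
def affine_gap_score(length, open_gap_score=-10.0, extend_gap_score=-0.5):
    """Total score contribution of one gap spanning `length` positions."""
    if length <= 0:
        return 0.0
    # First position pays the open score, the remaining ones the extend score.
    return open_gap_score + (length - 1) * extend_gap_score


print(affine_gap_score(1))  # -10.0
print(affine_gap_score(3))  # -11.0
```

Under these example scores, one gap of length 3 (-11.0) is much cheaper than three separate single-position gaps (-30.0), which is why affine scoring favors fewer, longer gaps.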

Common Errors

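| Error | Cause | Solution |
|-------|-------|----------|
| `OverflowError` | Too many optimal alignments for `len(alignments)` to count | Iterate and stop early instead of calling `len()`; stricter gap scores reduce ties |
| Low scores | Wrong scoring scheme | Use substitution matrix for proteins |
| No alignments in local mode | Scores all negative | Ensure `match_score` > 0 |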

Decision Tree: Choosing Alignment Mode

Need full-length comparison?
├── Yes → Use mode='global'
│   └── Sequences similar length?
│       ├── Yes → Standard global
│       └── No → Consider semiglobal (free end gaps)
└── No → Use mode='local'
    └── Find best matching regions only

Related Skills

- alignment-io - Save alignments to files in various formats
- msa-parsing - Work with multiple sequence alignments
- msa-statistics - Calculate identity, similarity metrics
- sequence-manipulation/motif-search - Pattern matching in sequences
