Alignment Validation

Name: Alignment Validation
Rating: 5 (489 reviews)
Author: FreedomIntelligence

by @FreedomIntelligence in AI & LLM

489

# Install this skill:

npx skills add FreedomIntelligence/OpenClaw-Medical-Skills --skill "Alignment Validation"

Install specific skill from multi-skill repository

# Description

Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification.

# SKILL.md

name: bio-alignment-validation
description: Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification.
tool_type: mixed
primary_tool: samtools
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command

Alignment Validation

Post-alignment quality control to verify alignment quality and identify issues.

Insert Size Distribution

Insert size should match library preparation protocol.

samtools stats

samtools stats input.bam > stats.txt
grep "^IS" stats.txt | cut -f2,3 > insert_sizes.txt

Picard CollectInsertSizeMetrics

java -jar picard.jar CollectInsertSizeMetrics \
    I=input.bam \
    O=insert_metrics.txt \
    H=insert_histogram.pdf

Expected Insert Sizes

Library Type	Expected Size
Standard WGS	300-500 bp
PCR-free	350-550 bp
RNA-seq	150-300 bp
ChIP-seq	150-300 bp
ATAC-seq	Multimodal

Python Insert Size Analysis

import pysam
import numpy as np
import matplotlib.pyplot as plt

def get_insert_sizes(bam_file, max_reads=100000):
    sizes = []
    bam = pysam.AlignmentFile(bam_file, 'rb')
    for i, read in enumerate(bam.fetch()):
        if i >= max_reads:
            break
        if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
            sizes.append(read.template_length)
    bam.close()
    return sizes

sizes = get_insert_sizes('sample.bam')
print(f'Median insert size: {np.median(sizes):.0f}')
print(f'Mean insert size: {np.mean(sizes):.0f}')
print(f'Std dev: {np.std(sizes):.0f}')

plt.hist(sizes, bins=100, range=(0, 1000))
plt.xlabel('Insert Size')
plt.ylabel('Count')
plt.savefig('insert_size_dist.pdf')

Proper Pairing Rate

Percentage of reads correctly paired.

samtools flagstat

samtools flagstat input.bam

samtools flagstat input.bam | grep "properly paired"

Calculate Pairing Rate

proper=$(samtools view -c -f 2 input.bam)
mapped=$(samtools view -c -F 4 input.bam)
rate=$(echo "scale=4; $proper / $mapped * 100" | bc)
echo "Proper pairing rate: ${rate}%"

Expected Rates

Metric	Good	Marginal	Poor
Proper pair	> 90%	80-90%	< 80%
Mapped	> 95%	90-95%	< 90%
Singletons	< 5%	5-10%	> 10%

GC Bias

GC content correlation with coverage.

Picard CollectGcBiasMetrics

java -jar picard.jar CollectGcBiasMetrics \
    I=input.bam \
    O=gc_bias_metrics.txt \
    CHART=gc_bias_chart.pdf \
    S=gc_summary.txt \
    R=reference.fa

deepTools computeGCBias

computeGCBias \
    -b input.bam \
    --effectiveGenomeSize 2913022398 \
    -g hg38.2bit \
    -o gc_bias.txt \
    --biasPlot gc_bias.pdf

Interpret GC Bias

Issue	Symptom
Under-representation	Low GC coverage drops
Over-representation	High GC coverage elevated
PCR bias	Strong correlation

Strand Balance

Forward and reverse strand should be balanced.

Calculate Strand Ratio

forward=$(samtools view -c -F 16 input.bam)
reverse=$(samtools view -c -f 16 input.bam)
echo "Forward: $forward"
echo "Reverse: $reverse"
ratio=$(echo "scale=4; $forward / $reverse" | bc)
echo "F/R ratio: $ratio"

Check Strand Bias per Chromosome

for chr in chr1 chr2 chr3; do
    fwd=$(samtools view -c -F 16 input.bam $chr)
    rev=$(samtools view -c -f 16 input.bam $chr)
    echo "$chr: F=$fwd R=$rev ratio=$(echo "scale=2; $fwd/$rev" | bc)"
done

Mapping Quality Distribution

Extract MAPQ Distribution

samtools view input.bam | cut -f5 | sort -n | uniq -c | sort -k2 -n

Calculate Mean MAPQ

samtools view input.bam | awk '{sum+=$5; count++} END {print "Mean MAPQ:", sum/count}'

MAPQ Thresholds

MAPQ	Meaning
0	Multi-mapper
1-10	Low confidence
20-30	Moderate
40+	High confidence
60	Unique (BWA)

Chromosome Coverage Balance

Calculate Per-Chromosome Coverage

samtools idxstats input.bam | awk '{print $1, $3/$2}' | head -25

Check for Aneuploidy/Contamination

samtools idxstats input.bam | awk '$3 > 0 {
    sum += $3
    len[$1] = $2
    reads[$1] = $3
} END {
    for (chr in reads) {
        expected = len[chr] / sum * reads[chr]
        ratio = reads[chr] / expected
        if (ratio < 0.8 || ratio > 1.2) print chr, ratio
    }
}'

Mismatch Rate

Picard CollectAlignmentSummaryMetrics

java -jar picard.jar CollectAlignmentSummaryMetrics \
    I=input.bam \
    R=reference.fa \
    O=alignment_summary.txt

Key Metrics

Metric	Description	Good Value
PCT_PF_READS_ALIGNED	Mapped %	> 95%
PF_MISMATCH_RATE	Mismatches	< 1%
PF_INDEL_RATE	Indels	< 0.1%
STRAND_BALANCE	Strand ratio	~0.5

Comprehensive Validation Script

#!/bin/bash
BAM=$1
REF=$2
NAME=$(basename $BAM .bam)
OUTDIR=${3:-qc}

mkdir -p $OUTDIR

echo "=== Alignment Validation: $NAME ===" | tee $OUTDIR/report.txt

echo -e "\n--- Flagstat ---" | tee -a $OUTDIR/report.txt
samtools flagstat $BAM | tee -a $OUTDIR/report.txt

echo -e "\n--- Mapping Rate ---" | tee -a $OUTDIR/report.txt
mapped=$(samtools view -c -F 4 $BAM)
total=$(samtools view -c $BAM)
rate=$(echo "scale=2; $mapped / $total * 100" | bc)
echo "Mapping rate: ${rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Proper Pairing ---" | tee -a $OUTDIR/report.txt
proper=$(samtools view -c -f 2 $BAM)
pair_rate=$(echo "scale=2; $proper / $mapped * 100" | bc)
echo "Proper pairing: ${pair_rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Insert Size ---" | tee -a $OUTDIR/report.txt
samtools stats $BAM | grep "insert size average" | tee -a $OUTDIR/report.txt

echo -e "\n--- Strand Balance ---" | tee -a $OUTDIR/report.txt
fwd=$(samtools view -c -F 16 $BAM)
rev=$(samtools view -c -f 16 $BAM)
strand_ratio=$(echo "scale=3; $fwd / $rev" | bc)
echo "Forward: $fwd, Reverse: $rev, Ratio: $strand_ratio" | tee -a $OUTDIR/report.txt

echo -e "\n--- Chromosome Coverage ---" | tee -a $OUTDIR/report.txt
samtools idxstats $BAM | head -25 | tee -a $OUTDIR/report.txt

echo -e "\nReport: $OUTDIR/report.txt"

Python Validation Module

import pysam
import numpy as np
from collections import Counter

class AlignmentValidator:
    def __init__(self, bam_file):
        self.bam = pysam.AlignmentFile(bam_file, 'rb')

    def mapping_rate(self):
        stats = self.bam.get_index_statistics()
        mapped = sum(s.mapped for s in stats)
        unmapped = sum(s.unmapped for s in stats)
        return mapped / (mapped + unmapped) * 100

    def proper_pair_rate(self, sample_size=100000):
        proper = 0
        paired = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if read.is_paired:
                paired += 1
                if read.is_proper_pair:
                    proper += 1
        return proper / paired * 100 if paired > 0 else 0

    def mapq_distribution(self, sample_size=100000):
        mapqs = []
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                mapqs.append(read.mapping_quality)
        return Counter(mapqs)

    def strand_balance(self, sample_size=100000):
        forward = 0
        reverse = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                if read.is_reverse:
                    reverse += 1
                else:
                    forward += 1
        return forward / (forward + reverse) if (forward + reverse) > 0 else 0.5

    def report(self):
        print(f'Mapping rate: {self.mapping_rate():.1f}%')
        print(f'Proper pairing: {self.proper_pair_rate():.1f}%')
        print(f'Strand balance: {self.strand_balance():.3f}')

        mapq_dist = self.mapq_distribution()
        high_qual = sum(v for k, v in mapq_dist.items() if k >= 30)
        total = sum(mapq_dist.values())
        print(f'High MAPQ (>=30): {high_qual/total*100:.1f}%')

    def close(self):
        self.bam.close()

validator = AlignmentValidator('sample.bam')
validator.report()
validator.close()

Quality Thresholds Summary

Metric	Good	Warning	Fail
Mapping rate	> 95%	90-95%	< 90%
Proper pairing	> 90%	80-90%	< 80%
Duplicate rate	< 10%	10-20%	> 20%
Strand balance	0.48-0.52	0.45-0.55	Outside
Mean MAPQ	> 40	30-40	< 30
GC bias	< 1.2x	1.2-1.5x	> 1.5x

bam-statistics - Basic flagstat and depth
duplicate-handling - Mark/remove duplicates
alignment-filtering - Filter low-quality reads
chip-seq/chipseq-qc - ChIP-specific metrics

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.

Alignment Validation

# Description

# SKILL.md

Alignment Validation

Insert Size Distribution

samtools stats

Picard CollectInsertSizeMetrics

Expected Insert Sizes

Python Insert Size Analysis

Proper Pairing Rate

samtools flagstat

Calculate Pairing Rate

Expected Rates

GC Bias

Picard CollectGcBiasMetrics

deepTools computeGCBias

Interpret GC Bias

Strand Balance

Calculate Strand Ratio

Check Strand Bias per Chromosome

Mapping Quality Distribution

Extract MAPQ Distribution

Calculate Mean MAPQ

MAPQ Thresholds

Chromosome Coverage Balance

Calculate Per-Chromosome Coverage

Check for Aneuploidy/Contamination

Mismatch Rate

Picard CollectAlignmentSummaryMetrics

Key Metrics

Comprehensive Validation Script

Python Validation Module

Quality Thresholds Summary

Related Skills

# Related Skills

# Supported AI Coding Agents

Confirm

Submit a Skill