Refactor high-complexity React components in Dify frontend. Use when `pnpm analyze-component...
npx skills add FreedomIntelligence/OpenClaw-Medical-Skills --skill "Alignment Validation"
Install specific skill from multi-skill repository
# Description
Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification.
# SKILL.md
name: bio-alignment-validation
description: Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification.
tool_type: mixed
primary_tool: samtools
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
Alignment Validation
Post-alignment quality control to verify alignment quality and identify issues.
Insert Size Distribution
Insert size should match library preparation protocol.
samtools stats
samtools stats input.bam > stats.txt
grep "^IS" stats.txt | cut -f2,3 > insert_sizes.txt
Picard CollectInsertSizeMetrics
java -jar picard.jar CollectInsertSizeMetrics \
I=input.bam \
O=insert_metrics.txt \
H=insert_histogram.pdf
Expected Insert Sizes
| Library Type | Expected Size |
|---|---|
| Standard WGS | 300-500 bp |
| PCR-free | 350-550 bp |
| RNA-seq | 150-300 bp |
| ChIP-seq | 150-300 bp |
| ATAC-seq | Multimodal |
Python Insert Size Analysis
import pysam
import numpy as np
import matplotlib.pyplot as plt
def get_insert_sizes(bam_file, max_reads=100000):
sizes = []
bam = pysam.AlignmentFile(bam_file, 'rb')
for i, read in enumerate(bam.fetch()):
if i >= max_reads:
break
if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
sizes.append(read.template_length)
bam.close()
return sizes
sizes = get_insert_sizes('sample.bam')
print(f'Median insert size: {np.median(sizes):.0f}')
print(f'Mean insert size: {np.mean(sizes):.0f}')
print(f'Std dev: {np.std(sizes):.0f}')
plt.hist(sizes, bins=100, range=(0, 1000))
plt.xlabel('Insert Size')
plt.ylabel('Count')
plt.savefig('insert_size_dist.pdf')
Proper Pairing Rate
Percentage of reads correctly paired.
samtools flagstat
samtools flagstat input.bam
samtools flagstat input.bam | grep "properly paired"
Calculate Pairing Rate
proper=$(samtools view -c -f 2 input.bam)
mapped=$(samtools view -c -F 4 input.bam)
rate=$(echo "scale=4; $proper / $mapped * 100" | bc)
echo "Proper pairing rate: ${rate}%"
Expected Rates
| Metric | Good | Marginal | Poor |
|---|---|---|---|
| Proper pair | > 90% | 80-90% | < 80% |
| Mapped | > 95% | 90-95% | < 90% |
| Singletons | < 5% | 5-10% | > 10% |
GC Bias
GC content correlation with coverage.
Picard CollectGcBiasMetrics
java -jar picard.jar CollectGcBiasMetrics \
I=input.bam \
O=gc_bias_metrics.txt \
CHART=gc_bias_chart.pdf \
S=gc_summary.txt \
R=reference.fa
deepTools computeGCBias
computeGCBias \
-b input.bam \
--effectiveGenomeSize 2913022398 \
-g hg38.2bit \
-o gc_bias.txt \
--biasPlot gc_bias.pdf
Interpret GC Bias
| Issue | Symptom |
|---|---|
| Under-representation | Low GC coverage drops |
| Over-representation | High GC coverage elevated |
| PCR bias | Strong correlation |
Strand Balance
Forward and reverse strand should be balanced.
Calculate Strand Ratio
forward=$(samtools view -c -F 16 input.bam)
reverse=$(samtools view -c -f 16 input.bam)
echo "Forward: $forward"
echo "Reverse: $reverse"
ratio=$(echo "scale=4; $forward / $reverse" | bc)
echo "F/R ratio: $ratio"
Check Strand Bias per Chromosome
for chr in chr1 chr2 chr3; do
fwd=$(samtools view -c -F 16 input.bam $chr)
rev=$(samtools view -c -f 16 input.bam $chr)
echo "$chr: F=$fwd R=$rev ratio=$(echo "scale=2; $fwd/$rev" | bc)"
done
Mapping Quality Distribution
Extract MAPQ Distribution
samtools view input.bam | cut -f5 | sort -n | uniq -c | sort -k2 -n
Calculate Mean MAPQ
samtools view input.bam | awk '{sum+=$5; count++} END {print "Mean MAPQ:", sum/count}'
MAPQ Thresholds
| MAPQ | Meaning |
|---|---|
| 0 | Multi-mapper |
| 1-10 | Low confidence |
| 20-30 | Moderate |
| 40+ | High confidence |
| 60 | Unique (BWA) |
Chromosome Coverage Balance
Calculate Per-Chromosome Coverage
samtools idxstats input.bam | awk '{print $1, $3/$2}' | head -25
Check for Aneuploidy/Contamination
samtools idxstats input.bam | awk '$3 > 0 {
sum += $3
len[$1] = $2
reads[$1] = $3
} END {
for (chr in reads) {
expected = len[chr] / sum * reads[chr]
ratio = reads[chr] / expected
if (ratio < 0.8 || ratio > 1.2) print chr, ratio
}
}'
Mismatch Rate
Picard CollectAlignmentSummaryMetrics
java -jar picard.jar CollectAlignmentSummaryMetrics \
I=input.bam \
R=reference.fa \
O=alignment_summary.txt
Key Metrics
| Metric | Description | Good Value |
|---|---|---|
| PCT_PF_READS_ALIGNED | Mapped % | > 95% |
| PF_MISMATCH_RATE | Mismatches | < 1% |
| PF_INDEL_RATE | Indels | < 0.1% |
| STRAND_BALANCE | Strand ratio | ~0.5 |
Comprehensive Validation Script
#!/bin/bash
BAM=$1
REF=$2
NAME=$(basename $BAM .bam)
OUTDIR=${3:-qc}
mkdir -p $OUTDIR
echo "=== Alignment Validation: $NAME ===" | tee $OUTDIR/report.txt
echo -e "\n--- Flagstat ---" | tee -a $OUTDIR/report.txt
samtools flagstat $BAM | tee -a $OUTDIR/report.txt
echo -e "\n--- Mapping Rate ---" | tee -a $OUTDIR/report.txt
mapped=$(samtools view -c -F 4 $BAM)
total=$(samtools view -c $BAM)
rate=$(echo "scale=2; $mapped / $total * 100" | bc)
echo "Mapping rate: ${rate}%" | tee -a $OUTDIR/report.txt
echo -e "\n--- Proper Pairing ---" | tee -a $OUTDIR/report.txt
proper=$(samtools view -c -f 2 $BAM)
pair_rate=$(echo "scale=2; $proper / $mapped * 100" | bc)
echo "Proper pairing: ${pair_rate}%" | tee -a $OUTDIR/report.txt
echo -e "\n--- Insert Size ---" | tee -a $OUTDIR/report.txt
samtools stats $BAM | grep "insert size average" | tee -a $OUTDIR/report.txt
echo -e "\n--- Strand Balance ---" | tee -a $OUTDIR/report.txt
fwd=$(samtools view -c -F 16 $BAM)
rev=$(samtools view -c -f 16 $BAM)
strand_ratio=$(echo "scale=3; $fwd / $rev" | bc)
echo "Forward: $fwd, Reverse: $rev, Ratio: $strand_ratio" | tee -a $OUTDIR/report.txt
echo -e "\n--- Chromosome Coverage ---" | tee -a $OUTDIR/report.txt
samtools idxstats $BAM | head -25 | tee -a $OUTDIR/report.txt
echo -e "\nReport: $OUTDIR/report.txt"
Python Validation Module
import pysam
import numpy as np
from collections import Counter
class AlignmentValidator:
def __init__(self, bam_file):
self.bam = pysam.AlignmentFile(bam_file, 'rb')
def mapping_rate(self):
stats = self.bam.get_index_statistics()
mapped = sum(s.mapped for s in stats)
unmapped = sum(s.unmapped for s in stats)
return mapped / (mapped + unmapped) * 100
def proper_pair_rate(self, sample_size=100000):
proper = 0
paired = 0
for i, read in enumerate(self.bam.fetch()):
if i >= sample_size:
break
if read.is_paired:
paired += 1
if read.is_proper_pair:
proper += 1
return proper / paired * 100 if paired > 0 else 0
def mapq_distribution(self, sample_size=100000):
mapqs = []
for i, read in enumerate(self.bam.fetch()):
if i >= sample_size:
break
if not read.is_unmapped:
mapqs.append(read.mapping_quality)
return Counter(mapqs)
def strand_balance(self, sample_size=100000):
forward = 0
reverse = 0
for i, read in enumerate(self.bam.fetch()):
if i >= sample_size:
break
if not read.is_unmapped:
if read.is_reverse:
reverse += 1
else:
forward += 1
return forward / (forward + reverse) if (forward + reverse) > 0 else 0.5
def report(self):
print(f'Mapping rate: {self.mapping_rate():.1f}%')
print(f'Proper pairing: {self.proper_pair_rate():.1f}%')
print(f'Strand balance: {self.strand_balance():.3f}')
mapq_dist = self.mapq_distribution()
high_qual = sum(v for k, v in mapq_dist.items() if k >= 30)
total = sum(mapq_dist.values())
print(f'High MAPQ (>=30): {high_qual/total*100:.1f}%')
def close(self):
self.bam.close()
validator = AlignmentValidator('sample.bam')
validator.report()
validator.close()
Quality Thresholds Summary
| Metric | Good | Warning | Fail |
|---|---|---|---|
| Mapping rate | > 95% | 90-95% | < 90% |
| Proper pairing | > 90% | 80-90% | < 80% |
| Duplicate rate | < 10% | 10-20% | > 20% |
| Strand balance | 0.48-0.52 | 0.45-0.55 | Outside |
| Mean MAPQ | > 40 | 30-40 | < 30 |
| GC bias | < 1.2x | 1.2-1.5x | > 1.5x |
Related Skills
- bam-statistics - Basic flagstat and depth
- duplicate-handling - Mark/remove duplicates
- alignment-filtering - Filter low-quality reads
- chip-seq/chipseq-qc - ChIP-specific metrics
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents:
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.