Install the specific skill from the multi-skill repository:

npx skills add yzlnew/infra-skills --skill "megatron-memory-estimator"
# SKILL.md
name: megatron-memory-estimator
description: Estimate GPU memory usage for Megatron-based MoE (Mixture of Experts) and dense models. Use when users need to (1) estimate memory from HuggingFace model configs (DeepSeek-V3, Qwen, etc.), (2) plan GPU resource allocation for training, (3) compare different parallelism strategies (TP/PP/EP/CP), (4) determine if a model fits in available GPU memory, or (5) optimize training configurations for memory efficiency.
# Megatron Memory Estimator
Estimate GPU memory usage for Megatron-based models directly from HuggingFace configs or custom specifications.
## Quick Start

### Option 1: From HuggingFace Model (Recommended)
Estimate directly from HuggingFace model paths:
# DeepSeek-V3 (61 layers, requires layer distribution when pp>1)
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
--tp 4 --pp 4 --ep 8 --num-gpus 128 --num-layers-in-last-pipeline-stage 16
# Qwen 3
python scripts/estimate_from_hf.py Qwen/Qwen3-235B-A22B \
--tp 8 --pp 4 --ep 4 --num-gpus 128
### Option 2: From Local HF Config
python scripts/estimate_from_hf.py /path/to/config.json \
--tp 2 --pp 2 --num-gpus 8
### Option 3: Quick Parameter Testing
# Test different parallelism strategies
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
--tp 8 --pp 2 --ep 16 --num-layers-in-last-pipeline-stage 31 # Strategy 1 (30+31=61)
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
--tp 4 --pp 4 --ep 8 --num-layers-in-last-pipeline-stage 16 # Strategy 2 (15+15+15+16=61)
# Test different batch sizes
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
--tp 4 --pp 4 --ep 8 --micro-batch-size 2 --num-layers-in-last-pipeline-stage 16
## Available Scripts

### estimate_from_hf.py (Primary Script)
Automatically converts HuggingFace configs to Megatron format and estimates memory.
Key Arguments:
- model_path: HF model path or local config.json path
- --tp N: Tensor parallel size (default: 1)
- --pp N: Pipeline parallel size (default: 1)
- --ep N: Expert parallel size (default: 1, for MoE)
- --cp N: Context parallel size (default: 1)
- --etp N: Expert tensor parallel size (optional)
- --vpp N: Virtual pipeline parallel size (optional)
- --micro-batch-size N: Micro batch size (default: 1)
- --seq-length N: Sequence length (default: 4096)
- --num-gpus N: Total GPU count (default: 8)
- --recompute-granularity {full,selective}: Enable activation checkpointing
- --num-layers-in-first-pipeline-stage N: Number of layers in the first pipeline stage (use when model layers cannot be evenly divided by --pp)
- --num-layers-in-last-pipeline-stage N: Number of layers in the last pipeline stage (use when model layers cannot be evenly divided by --pp)
- --verbose: Show detailed model breakdown
- --json: Output as JSON
Examples:
# Basic estimation
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 --num-gpus 64
# With memory optimization
python scripts/estimate_from_hf.py Qwen/Qwen3-235B-A22B \
--tp 8 --pp 4 --ep 4 \
--recompute-granularity full \
--recompute-method uniform \
--num-gpus 128
# Verbose output
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
--tp 4 --pp 4 --ep 8 --verbose --num-layers-in-last-pipeline-stage 16
# JSON output for automation
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
--tp 4 --pp 4 --ep 8 --json --num-layers-in-last-pipeline-stage 16 > result.json
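If you want to post-process this output, note that the JSON schema is not documented in this skill. A minimal sketch is to load result.json and pretty-print it so you can see which fields the estimator actually emits before wiring it into automation:

```python
# Minimal sketch: load the --json output and pretty-print it.
# The estimator's JSON schema is not documented here, so inspect the keys
# before depending on specific field names in automation.
import json

with open("result.json") as f:
    result = json.load(f)

print(json.dumps(result, indent=2, ensure_ascii=False))
```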
## Common Workflows

### Find Optimal Parallelism for a Model
# Start with model path
MODEL="deepseek-ai/DeepSeek-V3"
GPUS=128
# Test different strategies
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 4 --ep 8 --num-gpus $GPUS --num-layers-in-last-pipeline-stage 16
python scripts/estimate_from_hf.py $MODEL --tp 8 --pp 2 --ep 8 --num-gpus $GPUS --num-layers-in-last-pipeline-stage 31
# Choose strategy that fits GPU memory with best efficiency
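To compare more than a couple of candidates, you can drive the estimator from a small script. The sketch below is illustrative only: the (tp, pp, ep, last-stage-layers) tuples are example picks for the 61-layer DeepSeek-V3 on 128 GPUs, not recommendations.

```python
# Sketch: run the estimator once per candidate parallelism strategy.
# The tuples below are illustrative choices for a 61-layer model on 128 GPUs.
import subprocess

MODEL = "deepseek-ai/DeepSeek-V3"
NUM_GPUS = "128"

strategies = [
    # (tp, pp, ep, layers in the last pipeline stage)
    ("4", "4", "8", "16"),   # 15 + 15 + 15 + 16 = 61 layers
    ("8", "2", "16", "31"),  # 30 + 31 = 61 layers
]

for tp, pp, ep, last_stage in strategies:
    print(f"=== TP={tp} PP={pp} EP={ep} ===", flush=True)
    subprocess.run(
        ["python", "scripts/estimate_from_hf.py", MODEL,
         "--tp", tp, "--pp", pp, "--ep", ep,
         "--num-gpus", NUM_GPUS,
         "--num-layers-in-last-pipeline-stage", last_stage],
        check=True,
    )
```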
### Optimize for Memory Efficiency
Progressive memory reduction:
# 1. Baseline
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 2 --num-gpus 16
# 2. Add recomputation
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 2 --num-gpus 16 \
--recompute-granularity full
# 3. Increase expert parallelism (MoE only)
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 2 --ep 4 --num-gpus 16 \
--recompute-granularity full
# 4. Increase pipeline parallelism
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 4 --ep 4 --num-gpus 16 \
--recompute-granularity full
# 5. Last resort: reduce batch size
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 4 --ep 4 --num-gpus 16 \
--recompute-granularity full --micro-batch-size 1
### Check if Model Fits Available GPUs
# Check if DeepSeek-V3 fits in 128x A100 80GB
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
--tp 4 --pp 4 --ep 8 --num-gpus 128 --num-layers-in-last-pipeline-stage 16
# Output will show peak memory per GPU
# If < 80 GB: ✅ Fits
# If > 80 GB: Need more parallelism or optimization
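This check can be automated with the --json output, but the name of the peak-memory field is not documented here. The sketch below therefore searches the JSON for any numeric field whose key mentions "peak"; treat it as a starting point and adjust it once you have inspected a real result.json.

```python
# Sketch: automate the "does it fit?" check from --json output.
# The peak-memory field name is not documented here, so this scans for a
# numeric value under a key containing "peak"; adjust after inspecting
# a real result.json.
import json

BUDGET_GB = 80 * 0.85  # A100 80GB with ~15% headroom (see Notes)

def find_peak(obj):
    """Recursively look for a numeric value under a key mentioning 'peak'."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if "peak" in key.lower() and isinstance(value, (int, float)):
                return value
            found = find_peak(value)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for item in obj:
            found = find_peak(item)
            if found is not None:
                return found
    return None

with open("result.json") as f:
    peak = find_peak(json.load(f))

if peak is None:
    print("No peak-memory field found; inspect result.json manually.")
else:
    verdict = "fits" if peak < BUDGET_GB else "needs more parallelism or optimization"
    print(f"Peak memory per GPU: {peak:.2f} GB -> {verdict} (budget {BUDGET_GB:.0f} GB)")
```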
## Understanding Output
The estimator shows:
================================================================================
CONFIGURATION SUMMARY
================================================================================
Model Type: deepseek_v3
Architecture: 61L-7168H
MoE: 256 experts, top-8
Parallelism:
TP=4, PP=4, EP=8, CP=1
Training:
Micro Batch Size: 1
Sequence Length: 4096
Total GPUs: 128
================================================================================
MEMORY ESTIMATION RESULTS
================================================================================
Pipeline Stage 0:
Parameters: 3.15B
Activations: 1.23B
Memory Breakdown:
- Weights + Gradients: 18.90 GB
- Weights + Gradients + Optimizer: 37.80 GB
- Activations: 2.46 GB
- Total: 40.26 GB
================================================================================
Peak Memory per GPU: 40.26 GB
✅ Fits in: A100 80GB, H100
================================================================================
Memory Components (see the back-of-envelope sketch below):
- Weights + Gradients: parameters and gradients (2 + 2 = 4 bytes/param in FP16)
- Optimizer States: Adam momentum + variance (8 bytes/param)
- Activations: forward-pass activations stored for the backward pass
GPU Fit Guidelines:
- < 40 GB: A100 40GB, A100 80GB, H100
- < 80 GB: A100 80GB, H100 80GB
- > 80 GB: H200 141GB, or use more parallelism or a smaller batch size
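As a rough sanity check of these guidelines, the per-parameter byte counts above can be applied directly to the number of parameters resident on one GPU. This back-of-envelope sketch uses a placeholder parameter count and ignores activation memory and distributed-optimizer sharding, so the script's estimate will differ:

```python
# Rough sketch of the static-memory arithmetic above (not the estimator itself):
#   weights + gradients: 2 + 2 = 4 bytes/param in FP16
#   Adam optimizer states: 8 bytes/param
# Ignores activations and distributed-optimizer sharding.
GB = 1e9

def static_memory_gb(params_per_gpu: float) -> tuple[float, float]:
    weights_grads = params_per_gpu * 4 / GB
    with_optimizer = weights_grads + params_per_gpu * 8 / GB
    return weights_grads, with_optimizer

# Placeholder: 5B parameters resident on one GPU after TP/PP/EP sharding.
wg, total = static_memory_gb(5e9)
print(f"weights+grads ~= {wg:.0f} GB, with optimizer states ~= {total:.0f} GB")
```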
## Memory Optimization Techniques

Ranked by effectiveness:
1. Enable Distributed Optimizer (included by default)
   - Shards optimizer states across data-parallel ranks
   - Saves ~6 bytes/param
2. Activation Recomputation (--recompute-granularity full)
   - 50-70% activation memory reduction
   - Trades compute for memory
3. Increase Expert Parallelism (MoE only, --ep N)
   - Linear memory reduction for expert layers
   - Minimal performance impact
4. Increase Pipeline Parallelism (--pp N)
   - Splits the model across more stages
   - Some pipeline bubble overhead
5. Reduce Batch Size (--micro-batch-size 1)
   - Direct activation memory reduction
   - Impacts throughput
## Supported Models
The script automatically handles:
- DeepSeek: DeepSeek-V2, DeepSeek-V3
- Qwen: Qwen2.5, Qwen3 (dense and MoE)
- Moonlight: Kimi models
- Any HuggingFace model with config.json
## Setup & Troubleshooting
Because this tool relies on Megatron-LM components, you need to add both the tool directory and Megatron-LM to your PYTHONPATH.
Recommended Setup:
# Add current directory and Megatron-LM to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd):/path/to/Megatron-LM
If you encounter ImportError: No module named 'megatron_memory_estimator', ensure the root directory of this skill is in your PYTHONPATH.
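To verify the setup, a small import check can confirm that the estimator package and its dependencies resolve on the current PYTHONPATH. Only megatron_memory_estimator is named explicitly above; the other module names below are assumptions based on the package names listed under Dependencies.

```python
# Sketch: confirm the required modules import after setting PYTHONPATH.
# Module names other than megatron_memory_estimator are assumed from the
# pip package names (megatron-core -> megatron.core, etc.).
import importlib

for mod in ("megatron_memory_estimator", "megatron.core", "mbridge", "transformers"):
    try:
        importlib.import_module(mod)
        print(f"OK   {mod}")
    except ImportError as exc:
        print(f"FAIL {mod}: {exc}")
```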
## Dependencies
Required:
- mbridge: HuggingFace to Megatron config bridge
- transformers: HuggingFace transformers library
- torch: PyTorch (CPU version sufficient)
- megatron-core: Megatron core library
Installation:
pip install mbridge transformers torch megatron-core==0.13.0
For full Megatron-LM support (optional):
pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0
## Reference Documentation
For detailed configuration options:
- references/configuration_guide.md: All configuration parameters
- references/parallelism_strategies.md: Parallelism strategy guide
## Notes

- Estimates are theoretical, based on model architecture
- Actual memory may vary ±10-15% due to framework overhead
- Always leave 10-20% memory headroom for safety
- Test at small scale before full deployment
- MoE models: expert parallelism (EP) is critical for memory efficiency