# megatron-memory-estimator

A skill by yzlnew.

# Install this skill:
npx skills add yzlnew/infra-skills --skill "megatron-memory-estimator"

Installs the specified skill from a multi-skill repository.

# Description

Estimate GPU memory usage for Megatron-based MoE (Mixture of Experts) and dense models. Use when users need to (1) estimate memory from HuggingFace model configs (DeepSeek-V3, Qwen, etc.), (2) plan GPU resource allocation for training, (3) compare different parallelism strategies (TP/PP/EP/CP), (4) determine if a model fits in available GPU memory, or (5) optimize training configurations for memory efficiency.

# SKILL.md


name: megatron-memory-estimator
description: Estimate GPU memory usage for Megatron-based MoE (Mixture of Experts) and dense models. Use when users need to (1) estimate memory from HuggingFace model configs (DeepSeek-V3, Qwen, etc.), (2) plan GPU resource allocation for training, (3) compare different parallelism strategies (TP/PP/EP/CP), (4) determine if a model fits in available GPU memory, or (5) optimize training configurations for memory efficiency.


Megatron Memory Estimator

Estimate GPU memory usage for Megatron-based models directly from HuggingFace configs or custom specifications.

Quick Start

Option 1: From HuggingFace Model Path

Estimate directly from HuggingFace model paths:

# DeepSeek-V3 (61 layers, requires layer distribution when pp>1)
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --num-gpus 128 --num-layers-in-last-pipeline-stage 16

# Qwen3
python scripts/estimate_from_hf.py Qwen/Qwen3-235B-A22B \
    --tp 8 --pp 4 --ep 4 --num-gpus 128

Option 2: From Local HF Config

python scripts/estimate_from_hf.py /path/to/config.json \
    --tp 2 --pp 2 --num-gpus 8

Option 3: Quick Parameter Testing

# Test different parallelism strategies
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 8 --pp 2 --ep 16 --num-layers-in-last-pipeline-stage 31  # Strategy 1 (30+31=61)

python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --num-layers-in-last-pipeline-stage 16   # Strategy 2 (15+15+15+16=61)

# Test different batch sizes
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --micro-batch-size 2 --num-layers-in-last-pipeline-stage 16
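
The layer-count comments above show how uneven splits are chosen. A quick way to compute the last-stage size (a minimal sketch, assuming every earlier stage gets floor(layers/pp) and the remainder goes to the last stage):

# Put the leftover layers in the last pipeline stage (61 layers, pp=4 -> 15/15/15/16)
TOTAL_LAYERS=61; PP=4
BASE=$(( TOTAL_LAYERS / PP ))
LAST=$(( TOTAL_LAYERS - BASE * (PP - 1) ))
echo "--pp $PP --num-layers-in-last-pipeline-stage $LAST"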

Available Scripts

estimate_from_hf.py (Primary Script)

Automatically converts HuggingFace configs to Megatron format and estimates memory.

Key Arguments:
- model_path: HF model path or local config.json path
- --tp N: Tensor parallel size (default: 1)
- --pp N: Pipeline parallel size (default: 1)
- --ep N: Expert parallel size (default: 1, for MoE)
- --cp N: Context parallel size (default: 1)
- --etp N: Expert tensor parallel size (optional)
- --vpp N: Virtual pipeline parallel size (optional)
- --micro-batch-size N: Micro batch size (default: 1)
- --seq-length N: Sequence length (default: 4096)
- --num-gpus N: Total GPU count (default: 8)
- --recompute-granularity {full,selective}: Enable activation checkpointing
- --recompute-method METHOD: Recomputation method to use with full granularity (e.g. uniform, as in the example below)
- --num-layers-in-first-pipeline-stage N: Number of layers in the first pipeline stage (use when model layers cannot be evenly divided by --pp)
- --num-layers-in-last-pipeline-stage N: Number of layers in the last pipeline stage (use when model layers cannot be evenly divided by --pp)
- --verbose: Show detailed model breakdown
- --json: Output as JSON

Examples:

# Basic estimation
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 --num-gpus 64

# With memory optimization
python scripts/estimate_from_hf.py Qwen/Qwen3-235B-A22B \
    --tp 8 --pp 4 --ep 4 \
    --recompute-granularity full \
    --recompute-method uniform \
    --num-gpus 128

# Verbose output
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --verbose --num-layers-in-last-pipeline-stage 16

# JSON output for automation
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --json --num-layers-in-last-pipeline-stage 16 > result.json
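
The JSON output can feed other tooling directly. For example (the key name below is an assumption; inspect result.json for the fields the script actually emits):

# Pull the headline number out of the saved JSON (key name is illustrative)
jq '.peak_memory_gb' result.json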

Common Workflows

Find Optimal Parallelism for a Model

# Start with model path
MODEL="deepseek-ai/DeepSeek-V3"
GPUS=128

# Test different strategies
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 4 --ep 8 --num-gpus $GPUS --num-layers-in-last-pipeline-stage 16
python scripts/estimate_from_hf.py $MODEL --tp 8 --pp 2 --ep 8 --num-gpus $GPUS --num-layers-in-last-pipeline-stage 31


# Choose strategy that fits GPU memory with best efficiency
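
To compare several strategies in one pass, a small loop can collect the JSON output per configuration (a sketch reusing MODEL and GPUS from above; the tp/pp/ep/last-stage tuples mirror the two commands):

# Sweep candidate (tp, pp, ep, last-stage-layers) tuples, one JSON file per run
for CFG in "4 4 8 16" "8 2 8 31"; do
    set -- $CFG
    python scripts/estimate_from_hf.py $MODEL \
        --tp $1 --pp $2 --ep $3 --num-gpus $GPUS \
        --num-layers-in-last-pipeline-stage $4 --json > "est_tp${1}_pp${2}_ep${3}.json"
done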

Optimize for Memory Efficiency

Progressive memory reduction:

# 1. Baseline
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 2 --num-gpus 16

# 2. Add recomputation
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 2 --num-gpus 16 \
    --recompute-granularity full

# 3. Increase expert parallelism (MoE only)
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 2 --ep 4 --num-gpus 16 \
    --recompute-granularity full

# 4. Increase pipeline parallelism
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 4 --ep 4 --num-gpus 16 \
    --recompute-granularity full

# 5. Last resort: reduce batch size
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 4 --ep 4 --num-gpus 16 \
    --recompute-granularity full --micro-batch-size 1

Check if Model Fits Available GPUs

# Check if DeepSeek-V3 fits in 128x A100 80GB
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --num-gpus 128 --num-layers-in-last-pipeline-stage 16

# Output will show peak memory per GPU
# If < 80 GB: ✓ Fits
# If > 80 GB: Need more parallelism or optimization
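
The same check can be scripted (a sketch; the JSON key name is an assumption, and the 80 GB budget matches the A100 example above):

# Scripted fit check against an 80 GB budget (key name is illustrative)
PEAK=$(python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --num-gpus 128 --num-layers-in-last-pipeline-stage 16 \
    --json | jq '.peak_memory_gb')
awk -v p="$PEAK" 'BEGIN { if (p < 80) print "fits in 80 GB"; else print "needs more parallelism or optimization" }'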

Understanding Output

The estimator shows:

================================================================================
CONFIGURATION SUMMARY
================================================================================

Model Type: deepseek_v3
Architecture: 61L-7168H
MoE: 256 experts, top-8

Parallelism:
  TP=4, PP=4, EP=8, CP=1

Training:
  Micro Batch Size: 1
  Sequence Length: 4096
  Total GPUs: 128

================================================================================
MEMORY ESTIMATION RESULTS
================================================================================

Pipeline Stage 0:
  Parameters: 3.15B
  Activations: 1.23B
  Memory Breakdown:
    - Weights + Gradients: 18.90 GB
    - Weights + Gradients + Optimizer: 37.80 GB
    - Activations: 2.46 GB
    - Total: 40.26 GB

================================================================================
Peak Memory per GPU: 40.26 GB
✓ Fits in: A100 80GB, H100
================================================================================

Memory Components:
- Weights + Gradients: Parameters and gradients (2+2=4 bytes/param in FP16)
- Optimizer States: Adam momentum + variance (8 bytes/param)
- Activations: Forward pass activations stored for backward
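
These per-parameter costs allow a rough cross-check of the Stage 0 example above (back-of-the-envelope arithmetic only; the script itself accounts for more detail):

# 3.15B params x (4 + 8) bytes/param -> weights + gradients + optimizer states
echo "3.15 * (4 + 8)" | bc    # 37.80 GB
# 1.23B activation elements x 2 bytes -> activations
echo "1.23 * 2" | bc          # 2.46 GB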

GPU Fit Guidelines:
- < 40 GB: A100 40GB, A100 80GB, H100
- < 80 GB: A100 80GB, H100 80GB
- > 80 GB: H200 141GB, or use more parallelism / a smaller batch size

Memory Optimization Techniques

Ranked by effectiveness:

  1. Enable Distributed Optimizer (included by default)
     - Shards optimizer states across data parallel ranks
     - ~6 bytes/param saving

  2. Activation Recomputation (--recompute-granularity full)
     - 50-70% activation memory reduction
     - Trade compute for memory

  3. Increase Expert Parallelism (MoE only) (--ep N)
     - Linear memory reduction for expert layers
     - Minimal performance impact

  4. Increase Pipeline Parallelism (--pp N)
     - Splits model across more stages
     - Some pipeline bubble overhead

  5. Reduce Batch Size (--micro-batch-size 1)
     - Direct activation memory reduction
     - Impacts throughput

Supported Models

The script automatically handles:

  • DeepSeek: DeepSeek-V2, DeepSeek-V3
  • Qwen: Qwen2.5, Qwen3 (dense and MoE)
  • Moonlight: Kimi models
  • Any HuggingFace model with config.json

Setup & Troubleshooting

Because this tool relies on Megatron-LM components, you need to add both the tool directory and Megatron-LM to your PYTHONPATH.

Recommended Setup:

# Add current directory and Megatron-LM to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd):/path/to/Megatron-LM

If you encounter ImportError: No module named 'megatron_memory_estimator', ensure the root directory of this skill is in your PYTHONPATH.
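
A quick import check confirms both paths are picked up before running the estimator (module names taken from the error message above and the megatron-core dependency):

# Sanity check: both megatron-core and this skill's package should import cleanly
python -c "import megatron.core, megatron_memory_estimator; print('imports OK')"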

Dependencies

Required:
- mbridge: HuggingFace to Megatron config bridge
- transformers: HuggingFace transformers library
- torch: PyTorch (CPU version sufficient)
- megatron-core: Megatron core library

Installation:

pip install mbridge transformers torch megatron-core==0.13.0

For full Megatron-LM support (optional):

pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0

Reference Documentation

For detailed configuration options:
- references/configuration_guide.md: All configuration parameters
- references/parallelism_strategies.md: Parallelism strategy guide

Notes

  • Estimates are theoretical, based on model architecture
  • Actual memory may vary ±10-15% due to framework overhead
  • Always leave 10-20% memory headroom for safety
  • Test on small scale before full deployment
  • MoE models: Expert parallelism (EP) is critical for memory efficiency

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.