# modularml/max-best-practices

```sh
# Install this specific skill from the multi-skill repository
npx skills add modularml/agent-skills --skill "max-best-practices"
```

# Description

> MAX AI inference framework best practices from Modular. Use when deploying models with MAX Serve, building graphs with MAX Graph API, or optimizing inference performance. Covers multi-GPU, quantization, and production deployment. Supports both stable (v25.7) and nightly (v26.1).

# SKILL.md


```yaml
---
name: max-best-practices
description: >
  MAX AI inference framework best practices from Modular. Use when deploying
  models with MAX Serve, building graphs with MAX Graph API, or optimizing
  inference performance. Covers multi-GPU, quantization, and production deployment.
  Supports both stable (v25.7) and nightly (v26.1).
---
```

## MAX Best Practices

Best practices for the MAX AI inference framework. 33+ rules across 8 categories.

## Version Support

This skill supports both stable and nightly MAX versions:

| Version | MAX | Rules Directory |
|---------|-----|-----------------|
| Stable | v25.7 | `rules/` + `rules/stable/` |
| Nightly | v26.1 | `rules/` + `rules/nightly/` |

Detect your version: run `max version` or check `pixi list | grep max`.
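For completeness, both checks as shell commands (the `pixi` variant assumes a pixi-managed project):

```sh
# Print the installed MAX version directly
max version

# Or list the pinned max package in a pixi-managed project
pixi list | grep max
```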

Key differences:

| Feature | Stable (v25.7) | Nightly (v26.1) |
|---------|----------------|-----------------|
| Batch size semantics | Aggregate across replicas | Per-replica with DP |
| Driver API | `max.driver.Tensor` | `max.driver.Buffer` |
| Prefill chunk size | `prefill_chunk_size` | `max_batch_input_tokens` |
| Max context length | `max_batch_context_length` | `max_batch_total_tokens` |
| CE batch size CLI | `--max-ce-batch-size` | Deprecated → `--max-batch-size` |
| Scheduling | Default | `--kvcache-ce-watermark` (new) |
| Llama 3.2 Vision | Supported | Removed |
| Gemma3 Vision | Not available | Supported (12B, 27B) |
| V1 layer classes | Deprecated | Removed |
| Apple silicon | `accelerator_count()` = 0 | Returns non-zero |
| Streams | Blocking option | All non-blocking |
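As a minimal before/after sketch of the CE batch-size rename above — the model path is a placeholder, and `--model-path` itself is an assumption to verify against `max serve --help` on your version:

```sh
# Stable (v25.7): context encoding has a dedicated batch-size flag
max serve --model-path <org>/<model> --max-ce-batch-size 16

# Nightly (v26.1): the CE flag is deprecated in favor of the unified
# flag, which with data parallelism now applies per replica
max serve --model-path <org>/<model> --max-batch-size 16
```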

stable changelog | nightly changelog | [breaking changes](reference/breaking-changes.md)

Related: `mojo-best-practices` for Mojo language and GPU kernel development.

## Quick Decision Guide

| Goal | Category | Key Rules |
|------|----------|-----------|
| Deploy model endpoint | MAX Serve | `serve-batch-config`, `serve-kv-cache-strategy` |
| Multi-GPU inference | Parallelism | `multigpu-tensor-parallel`, `multigpu-batch-semantics` |
| Build custom model | MAX Graph | `graph-construction`, `graph-modules` |
| Optimize latency | Performance | `perf-prefix-caching`, `perf-chunked-prefill` |
| Production deployment | Deployment | `deploy-container`, `deploy-kubernetes` |
| Write custom kernels | Engine + Mojo | `engine-custom-ops` + mojo `gpu-*` rules |

## Rule Categories

| Priority | Category | Count | Prefix |
|----------|----------|-------|--------|
| CRITICAL | MAX Serve Configuration | 7 | `serve-` |
| CRITICAL | Multi-GPU & Parallelism | 5 | `multigpu-` |
| HIGH | MAX Engine | 4 | `engine-` |
| HIGH | MAX Graph API | 4 | `graph-` |
| HIGH | Model Loading | 2 | `model-` |
| MEDIUM | Performance Optimization | 3 | `perf-` |
| MEDIUM | Deployment | 3 | `deploy-` |

### MAX Serve (CRITICAL)

| Rule | Pattern |
|------|---------|
| `serve-batch-config` | `--max-batch-size`, `--max-batch-input-tokens` |
| `serve-kv-cache-strategy` | PAGED with `--kv-cache-page-size` (multiple of 128) |
| `serve-prefix-caching` | `--enable-prefix-caching` for common prefixes |
| `serve-structured-output` | `--enable-structured-output`, JSON schemas |
| `serve-function-calling` | Tool use, OpenAI-compatible format |
| `serve-streaming` | SSE chunked responses for TTFT |
| `serve-health-endpoints` | `/health` for readiness checks |
| `serve-metrics` | Prometheus metrics, TTFT, ITL |
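A sketch of one endpoint launch combining several of these flags; the model path is a placeholder and the values are illustrative, not tuned:

```sh
# Paged KV cache (page size must be a multiple of 128) plus
# prefix caching and structured output on one endpoint
max serve --model-path <org>/<model> \
  --max-batch-size 32 \
  --kv-cache-page-size 128 \
  --enable-prefix-caching \
  --enable-structured-output
```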

### Multi-GPU (CRITICAL)

| Rule | Pattern |
|------|---------|
| `multigpu-tensor-parallel` | `--data-parallel-degree N`, `--devices gpu:0,1,...` |
| `multigpu-batch-semantics` | Per-replica batch size (v26.1+ change) |
| `multigpu-device-selection` | `--devices gpu:0,1,2,3` (comma-separated) |
| `multigpu-amd-mi300` | MI300X/MI325X/MI355X support |
| `multigpu-nvidia-hopper` | H100/H200/B200 optimizations |
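A hedged sketch of the two parallelism styles; device indices, the parallel degree, and the batch size are illustrative:

```sh
# Tensor parallelism: shard one model replica across four explicit devices
max serve --model-path <org>/<model> --devices gpu:0,1,2,3

# Data parallelism (nightly semantics): --max-batch-size is per replica,
# so two replicas at 16 serve an aggregate batch of 32
max serve --model-path <org>/<model> --devices gpu:0,1,2,3 \
  --data-parallel-degree 2 --max-batch-size 16
```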

### MAX Engine (HIGH)

| Rule | Pattern |
|------|---------|
| `engine-inference-session` | `InferenceSession(devices=[Accelerator()])` |
| `engine-custom-ops` | `@compiler.register`, `InputTensor`, `OutputTensor` |
| `engine-graph-caching` | Kernel caching (28% faster compilation) |
| `engine-subgraphs` | `Graph.add_subgraph()` for device-aware scheduling |

### MAX Graph API (HIGH)

| Rule | Pattern |
|------|---------|
| `graph-construction` | `Graph(TensorType(...))`, `graph.output()` |
| `graph-modules` | `max.nn.Module`, `Sequential`, `state_dict()` |
| `graph-quantization` | `Graph.quantize()`, `qmatmul()` |
| `graph-symbolic-dims` | `AlgebraicDim("batch")` for dynamic shapes |

### Performance (MEDIUM)

| Rule | Pattern |
|------|---------|
| `serve-prefix-caching` | 10-50% throughput improvement |
| `perf-kv-swapping` | `--enable-kvcache-swapping-to-host` |
| `perf-chunked-prefill` | `--max-batch-input-tokens` |
| `engine-graph-caching` | 28% faster with kernel caching |
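These flags compose in a single launch; a sketch with illustrative values to benchmark against your own workload:

```sh
# Prefix caching, host KV-cache swapping, and chunked prefill together
max serve --model-path <org>/<model> \
  --enable-prefix-caching \
  --enable-kvcache-swapping-to-host \
  --max-batch-input-tokens 2048
```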

### Deployment (MEDIUM)

| Rule | Pattern |
|------|---------|
| `deploy-container` | `modular/max-nvidia-full:latest` |
| `deploy-kubernetes` | Helm charts, readiness probes |
| `deploy-benchmark` | `max benchmark`, `benchmark_serving.py` |
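A container launch sketch using the image from `deploy-container`; the port mapping and the trailing serve arguments are assumptions to adapt from the MAX container documentation:

```sh
# Serve an OpenAI-compatible endpoint from the NVIDIA container image
docker run --rm --gpus all -p 8000:8000 \
  modular/max-nvidia-full:latest \
  --model-path <org>/<model>
```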

## Cross-References with Mojo

For GPU kernel development, see `mojo-best-practices`:

- Custom ops → `engine-custom-ops` + mojo `gpu-fundamentals`
- GPU memory → mojo `gpu-memory-optimization`
- Tensor cores → mojo `gpu-tensor-core-sm90-sm100`
- Warp primitives → mojo `gpu-warp-primitives`

## File Structure

```
skills/max-best-practices/
├── SKILL.md               # Quick reference (this file)
├── AGENTS.md              # Auto-generated rule index
├── metadata.json          # Skill metadata
├── CHANGELOG.md           # Skill version history
├── reference/
│   ├── breaking-changes.md
│   └── cli-flags.md
└── rules/                 # Version-agnostic rules (~30+)
    ├── serve-*.md
    ├── multigpu-*.md
    ├── engine-*.md
    ├── graph-*.md
    ├── perf-*.md
    ├── deploy-*.md
    ├── stable/            # Stable-only rules (v25.7)
    │   ├── multigpu-batch-semantics.md
    │   └── driver-tensor-api.md
    └── nightly/           # Nightly-only rules (v26.1)
        ├── multigpu-batch-semantics.md
        ├── driver-buffer-api.md
        ├── serve-kvcache-watermark.md
        └── model-vision-changes.md
```

## Local Implementation Notes

When using this skill in a project, agents should collect implementation notes locally within that project, not globally. This ensures project-specific learnings stay with the project.

Where to store notes:

```
your-project/
├── IMPLEMENTATION_NOTES.md    # Project-specific learnings
├── .cursor/
│   └── rules/                 # Project-specific rules
└── ...
```

What to capture:
- Model-specific configuration that worked
- Performance tuning for your hardware (GPU type, memory)
- Batch size optimizations for your workload
- Deployment configuration decisions
- Integration patterns with your infrastructure

Usage: agents should check for and update `IMPLEMENTATION_NOTES.md` in the project root when they discover new patterns or resolve issues.

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.