Install a specific skill from a multi-skill repository:

`npx skills add modularml/agent-skills --skill "max-best-practices"`
# Description
MAX AI inference framework best practices from Modular. Use when deploying models with MAX Serve, building graphs with MAX Graph API, or optimizing inference performance. Covers multi-GPU, quantization, and production deployment. Supports both stable (v25.7) and nightly (v26.1).
# SKILL.md
```yaml
name: max-best-practices
description: >
  MAX AI inference framework best practices from Modular. Use when deploying
  models with MAX Serve, building graphs with MAX Graph API, or optimizing
  inference performance. Covers multi-GPU, quantization, and production deployment.
  Supports both stable (v25.7) and nightly (v26.1).
```
# MAX Best Practices
Best practices for the MAX AI inference framework. 33+ rules across 8 categories.
## Version Support
This skill supports both stable and nightly MAX versions:
| Version | MAX | Rules Directory |
|---|---|---|
| Stable | v25.7 | `rules/` + `rules/stable/` |
| Nightly | v26.1 | `rules/` + `rules/nightly/` |
**Detect your version:** run `max version` or check `pixi list | grep max`.
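Both commands come straight from the line above; the sample outputs in the comments are assumptions, since exact formatting varies by build.

```bash
# Which MAX is installed? (output format is an assumption)
max version            # stable builds report a 25.7.x version
pixi list | grep max   # if MAX is managed through pixi
```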
Key differences:
| Feature | Stable (v25.7) | Nightly (v26.1) |
|---|---|---|
| Batch size semantics | Aggregate across replicas | Per-replica with DP |
| Driver API | `max.driver.Tensor` | `max.driver.Buffer` |
| Prefill chunk size | `prefill_chunk_size` | `max_batch_input_tokens` |
| Max context length | `max_batch_context_length` | `max_batch_total_tokens` |
| CE batch size CLI | `--max-ce-batch-size` | Deprecated → `--max-batch-size` |
| Scheduling | Default | `--kvcache-ce-watermark` (new) |
| Llama 3.2 Vision | Supported | Removed |
| Gemma3 Vision | Not available | Supported (12B, 27B) |
| V1 layer classes | Deprecated | Removed |
| Apple silicon | `accelerator_count()` returns 0 | Returns non-zero |
| Streams | Blocking option | All non-blocking |
For full details, see the stable changelog, the nightly changelog, and the breaking changes reference (`reference/breaking-changes.md`).
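To make the renames concrete, here is a hedged sketch of the same endpoint launched under each version. Only flags named in the table above are used; the model id and all values are placeholders, not recommendations.

```bash
MODEL="org/model-name"  # placeholder model id

# Stable (v25.7): context encoding has a dedicated batch-size flag
max serve --model-path "$MODEL" --max-ce-batch-size 32

# Nightly (v26.1): --max-ce-batch-size is deprecated in favor of
# --max-batch-size; --kvcache-ce-watermark tunes the new scheduler
max serve --model-path "$MODEL" --max-batch-size 32 \
  --kvcache-ce-watermark 0.95
```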
**Related:** `mojo-best-practices` for Mojo language and GPU kernel development.
## Quick Decision Guide

| Goal | Category | Key Rules |
|---|---|---|
| Deploy model endpoint | MAX Serve | `serve-batch-config`, `serve-kv-cache-strategy` |
| Multi-GPU inference | Parallelism | `multigpu-tensor-parallel`, `multigpu-batch-semantics` |
| Build custom model | MAX Graph | `graph-construction`, `graph-modules` |
| Optimize latency | Performance | `perf-prefix-caching`, `perf-chunked-prefill` |
| Production deployment | Deployment | `deploy-container`, `deploy-kubernetes` |
| Write custom kernels | Engine + Mojo | `engine-custom-ops` + mojo `gpu-*` rules |
## Rule Categories

| Priority | Category | Count | Prefix |
|---|---|---|---|
| CRITICAL | MAX Serve Configuration | 7 | `serve-` |
| CRITICAL | Multi-GPU & Parallelism | 5 | `multigpu-` |
| HIGH | MAX Engine | 4 | `engine-` |
| HIGH | MAX Graph API | 4 | `graph-` |
| HIGH | Model Loading | 2 | `model-` |
| MEDIUM | Performance Optimization | 3 | `perf-` |
| MEDIUM | Deployment | 3 | `deploy-` |
## MAX Serve (CRITICAL)

| Rule | Pattern |
|---|---|
| `serve-batch-config` | `--max-batch-size`, `--max-batch-input-tokens` |
| `serve-kv-cache-strategy` | `PAGED` with `--kv-cache-page-size` (multiple of 128) |
| `serve-prefix-caching` | `--enable-prefix-caching` for common prefixes |
| `serve-structured-output` | `--enable-structured-output`, JSON schemas |
| `serve-function-calling` | Tool use, OpenAI-compatible format |
| `serve-streaming` | SSE chunked responses for TTFT |
| `serve-health-endpoints` | `/health` for readiness checks |
| `serve-metrics` | Prometheus metrics, TTFT, ITL |
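A hedged sketch tying several of these rules into one launch. The flags are the ones listed in the table; the model id, values, and port are placeholders rather than recommendations.

```bash
MODEL="org/model-name"  # placeholder model id

# serve-batch-config + serve-kv-cache-strategy + serve-prefix-caching
# + serve-structured-output, with illustrative values
max serve --model-path "$MODEL" \
  --max-batch-size 64 \
  --max-batch-input-tokens 8192 \
  --kv-cache-page-size 128 \
  --enable-prefix-caching \
  --enable-structured-output

# serve-health-endpoints: readiness probe (default port assumed)
curl -f http://localhost:8000/health
```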
## Multi-GPU (CRITICAL)

| Rule | Pattern |
|---|---|
| `multigpu-tensor-parallel` | `--data-parallel-degree N --devices gpu:0,1,...` |
| `multigpu-batch-semantics` | Per-replica batch size (v26.1+ change) |
| `multigpu-device-selection` | `--devices gpu:0,1,2,3` (comma-separated) |
| `multigpu-amd-mi300` | MI300X/MI325X/MI355X support |
| `multigpu-nvidia-hopper` | H100/H200/B200 optimizations |
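For example, a hedged four-GPU launch built only from the flags above. Remember the v26.1 semantics: `--max-batch-size` is per replica, so effective concurrency scales with the data-parallel degree.

```bash
MODEL="org/model-name"  # placeholder model id

# multigpu-device-selection + data-parallel flags from the table;
# on v26.1+, batch size applies per replica (multigpu-batch-semantics)
max serve --model-path "$MODEL" \
  --devices gpu:0,1,2,3 \
  --data-parallel-degree 2 \
  --max-batch-size 32
```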
## MAX Engine (HIGH)

| Rule | Pattern |
|---|---|
| `engine-inference-session` | `InferenceSession(devices=[Accelerator()])` |
| `engine-custom-ops` | `@compiler.register`, `InputTensor`, `OutputTensor` |
| `engine-graph-caching` | Kernel caching (28% faster compilation) |
| `engine-subgraphs` | `Graph.add_subgraph()` for device-aware scheduling |
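A minimal Python sketch of `engine-inference-session`, assuming a `max.graph.Graph` already built as `my_graph`; exact signatures differ between v25.7 and v26.1, so treat this as a shape rather than a recipe.

```python
from max.driver import Accelerator
from max.engine import InferenceSession

# One accelerator; compilation happens at load() (kernel caching applies)
session = InferenceSession(devices=[Accelerator()])
model = session.load(my_graph)   # my_graph: an assumed, pre-built Graph
outputs = model.execute(inputs)  # inputs: device tensors (see Driver API row)
```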
## MAX Graph API (HIGH)

| Rule | Pattern |
|---|---|
| `graph-construction` | `Graph(TensorType(...))`, `graph.output()` |
| `graph-modules` | `max.nn.Module`, `Sequential`, `state_dict()` |
| `graph-quantization` | `Graph.quantize()`, `qmatmul()` |
| `graph-symbolic-dims` | `AlgebraicDim("batch")` for dynamic shapes |
## Performance (MEDIUM)

| Rule | Pattern |
|---|---|
| `serve-prefix-caching` | 10-50% throughput improvement |
| `perf-kv-swapping` | `--enable-kvcache-swapping-to-host` |
| `perf-chunked-prefill` | `--max-batch-input-tokens` |
| `engine-graph-caching` | 28% faster with kernel caching |
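Stacked together, a hedged launch enabling the optimizations above; real gains depend on the workload, with prefix caching paying off most when requests share long system prompts.

```bash
MODEL="org/model-name"  # placeholder model id

# perf-prefix-caching + perf-kv-swapping + perf-chunked-prefill
# (token budget value is illustrative)
max serve --model-path "$MODEL" \
  --enable-prefix-caching \
  --enable-kvcache-swapping-to-host \
  --max-batch-input-tokens 8192
```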
## Deployment (MEDIUM)

| Rule | Pattern |
|---|---|
| `deploy-container` | `modular/max-nvidia-full:latest` |
| `deploy-kubernetes` | Helm charts, readiness probes |
| `deploy-benchmark` | `max benchmark`, `benchmark_serving.py` |
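A hedged `deploy-container` sketch. It assumes the image entrypoint wraps `max serve` and that the server listens on port 8000; verify both against the image documentation.

```bash
MODEL="org/model-name"  # placeholder model id

# GPU passthrough via --gpus; arguments after the image name are
# assumed to be forwarded to the serving entrypoint
docker run --rm --gpus all -p 8000:8000 \
  modular/max-nvidia-full:latest \
  --model-path "$MODEL"
```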
## Cross-References with Mojo

For GPU kernel development, see `mojo-best-practices`:

- Custom ops → `engine-custom-ops` + mojo `gpu-fundamentals`
- GPU memory → mojo `gpu-memory-optimization`
- Tensor cores → mojo `gpu-tensor-core-sm90-sm100`
- Warp primitives → mojo `gpu-warp-primitives`
## File Structure

```
skills/max-best-practices/
├── SKILL.md              # Quick reference (this file)
├── AGENTS.md             # Auto-generated rule index
├── metadata.json         # Skill metadata
├── CHANGELOG.md          # Skill version history
├── reference/
│   ├── breaking-changes.md
│   └── cli-flags.md
└── rules/                # Version-agnostic rules (~30+)
    ├── serve-*.md
    ├── multigpu-*.md
    ├── engine-*.md
    ├── graph-*.md
    ├── perf-*.md
    ├── deploy-*.md
    ├── stable/           # Stable-only rules (v25.7)
    │   ├── multigpu-batch-semantics.md
    │   └── driver-tensor-api.md
    └── nightly/          # Nightly-only rules (v26.1)
        ├── multigpu-batch-semantics.md
        ├── driver-buffer-api.md
        ├── serve-kvcache-watermark.md
        └── model-vision-changes.md
```
## Local Implementation Notes
When using this skill in a project, agents should collect implementation notes locally within that project, not globally. This ensures project-specific learnings stay with the project.
Where to store notes:
```
your-project/
├── IMPLEMENTATION_NOTES.md   # Project-specific learnings
├── .cursor/
│   └── rules/                # Project-specific rules
└── ...
```
What to capture:
- Model-specific configuration that worked
- Performance tuning for your hardware (GPU type, memory)
- Batch size optimizations for your workload
- Deployment configuration decisions
- Integration patterns with your infrastructure
**Usage:** Agents should check for and update `IMPLEMENTATION_NOTES.md` in the project root when discovering new patterns or resolving issues.
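A hypothetical skeleton for that file, only to make the capture list above concrete; adapt it freely.

```
# IMPLEMENTATION_NOTES.md (hypothetical template)

## Hardware
- GPU type / memory: ...

## Serving config that worked
- Batch sizes, KV-cache settings: ...

## Deployment decisions
- Container / Kubernetes notes: ...
```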
## Navigation

- CLI flags? See `reference/cli-flags.md`
- Breaking changes? See `reference/breaking-changes.md`
- Full rule index? See `AGENTS.md`
- Mojo/GPU kernels? See `mojo-best-practices`
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.