# modularml/max-best-practices

```sh
# Install this specific skill from the multi-skill repository
npx skills add modularml/agent-skills --skill "max-best-practices"
```

# Description

> MAX AI inference framework best practices from Modular. Use when deploying models with MAX Serve, building graphs with MAX Graph API, or optimizing inference performance. Covers multi-GPU, quantization, and production deployment. Supports both stable (v25.7) and nightly (v26.1).

# SKILL.md


```yaml
---
name: max-best-practices
description: >
  MAX AI inference framework best practices from Modular. Use when deploying
  models with MAX Serve, building graphs with MAX Graph API, or optimizing
  inference performance. Covers multi-GPU, quantization, and production deployment.
  Supports both stable (v25.7) and nightly (v26.1).
---
```

## MAX Best Practices

Best practices for the MAX AI inference framework. 33+ rules across 8 categories.

## Version Support

This skill supports both stable and nightly MAX versions:

| Version | MAX | Rules Directory |
|---------|-----|-----------------|
| Stable | v25.7 | `rules/` + `rules/stable/` |
| Nightly | v26.1 | `rules/` + `rules/nightly/` |

Detect your version: run `max version` or check `pixi list | grep max`.
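For completeness, both checks as shell commands (the `pixi` variant assumes a pixi-managed project):

```sh
# Print the installed MAX version directly
max version

# Or list the pinned max package in a pixi-managed project
pixi list | grep max
```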

Key differences:

| Feature | Stable (v25.7) | Nightly (v26.1) |
|---------|----------------|-----------------|
| Batch size semantics | Aggregate across replicas | Per-replica with DP |
| Driver API | `max.driver.Tensor` | `max.driver.Buffer` |
| Prefill chunk size | `prefill_chunk_size` | `max_batch_input_tokens` |
| Max context length | `max_batch_context_length` | `max_batch_total_tokens` |
| CE batch size CLI | `--max-ce-batch-size` | Deprecated → `--max-batch-size` |
| Scheduling | Default | `--kvcache-ce-watermark` (new) |
| Llama 3.2 Vision | Supported | Removed |
| Gemma3 Vision | Not available | Supported (12B, 27B) |
| V1 layer classes | Deprecated | Removed |
| Apple silicon | `accelerator_count()` = 0 | Returns non-zero |
| Streams | Blocking option | All non-blocking |
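As a minimal before/after sketch of the CE batch-size rename above — the model path is a placeholder, and `--model-path` itself is an assumption to verify against `max serve --help` on your version:

```sh
# Stable (v25.7): context encoding has a dedicated batch-size flag
max serve --model-path <org>/<model> --max-ce-batch-size 16

# Nightly (v26.1): the CE flag is deprecated in favor of the unified
# flag, which with data parallelism now applies per replica
max serve --model-path <org>/<model> --max-batch-size 16
```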

stable changelog | nightly changelog | [breaking changes](reference/breaking-changes.md)

Related: `mojo-best-practices` for Mojo language and GPU kernel development.

## Quick Decision Guide

| Goal | Category | Key Rules |
|------|----------|-----------|
| Deploy model endpoint | MAX Serve | `serve-batch-config`, `serve-kv-cache-strategy` |
| Multi-GPU inference | Parallelism | `multigpu-tensor-parallel`, `multigpu-batch-semantics` |
| Build custom model | MAX Graph | `graph-construction`, `graph-modules` |
| Optimize latency | Performance | `perf-prefix-caching`, `perf-chunked-prefill` |
| Production deployment | Deployment | `deploy-container`, `deploy-kubernetes` |
| Write custom kernels | Engine + Mojo | `engine-custom-ops` + mojo `gpu-*` rules |

## Rule Categories

| Priority | Category | Count | Prefix |
|----------|----------|-------|--------|
| CRITICAL | MAX Serve Configuration | 7 | `serve-` |
| CRITICAL | Multi-GPU & Parallelism | 5 | `multigpu-` |
| HIGH | MAX Engine | 4 | `engine-` |
| HIGH | MAX Graph API | 4 | `graph-` |
| HIGH | Model Loading | 2 | `model-` |
| MEDIUM | Performance Optimization | 3 | `perf-` |
| MEDIUM | Deployment | 3 | `deploy-` |

### MAX Serve (CRITICAL)

| Rule | Pattern |
|------|---------|
| `serve-batch-config` | `--max-batch-size`, `--max-batch-input-tokens` |
| `serve-kv-cache-strategy` | PAGED with `--kv-cache-page-size` (multiple of 128) |
| `serve-prefix-caching` | `--enable-prefix-caching` for common prefixes |
| `serve-structured-output` | `--enable-structured-output`, JSON schemas |
| `serve-function-calling` | Tool use, OpenAI-compatible format |
| `serve-streaming` | SSE chunked responses for TTFT |
| `serve-health-endpoints` | `/health` for readiness checks |
| `serve-metrics` | Prometheus metrics, TTFT, ITL |
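A sketch of one endpoint launch combining several of these flags; the model path is a placeholder and the values are illustrative, not tuned:

```sh
# Paged KV cache (page size must be a multiple of 128) plus
# prefix caching and structured output on one endpoint
max serve --model-path <org>/<model> \
  --max-batch-size 32 \
  --kv-cache-page-size 128 \
  --enable-prefix-caching \
  --enable-structured-output
```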

### Multi-GPU (CRITICAL)

| Rule | Pattern |
|------|---------|
| `multigpu-tensor-parallel` | `--data-parallel-degree N`, `--devices gpu:0,1,...` |
| `multigpu-batch-semantics` | Per-replica batch size (v26.1+ change) |
| `multigpu-device-selection` | `--devices gpu:0,1,2,3` (comma-separated) |
| `multigpu-amd-mi300` | MI300X/MI325X/MI355X support |
| `multigpu-nvidia-hopper` | H100/H200/B200 optimizations |
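A hedged sketch of the two parallelism styles; device indices, the parallel degree, and the batch size are illustrative:

```sh
# Tensor parallelism: shard one model replica across four explicit devices
max serve --model-path <org>/<model> --devices gpu:0,1,2,3

# Data parallelism (nightly semantics): --max-batch-size is per replica,
# so two replicas at 16 serve an aggregate batch of 32
max serve --model-path <org>/<model> --devices gpu:0,1,2,3 \
  --data-parallel-degree 2 --max-batch-size 16
```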

### MAX Engine (HIGH)

| Rule | Pattern |
|------|---------|
| `engine-inference-session` | `InferenceSession(devices=[Accelerator()])` |
| `engine-custom-ops` | `@compiler.register`, `InputTensor`, `OutputTensor` |
| `engine-graph-caching` | Kernel caching (28% faster compilation) |
| `engine-subgraphs` | `Graph.add_subgraph()` for device-aware scheduling |

### MAX Graph API (HIGH)

| Rule | Pattern |
|------|---------|
| `graph-construction` | `Graph(TensorType(...))`, `graph.output()` |
| `graph-modules` | `max.nn.Module`, `Sequential`, `state_dict()` |
| `graph-quantization` | `Graph.quantize()`, `qmatmul()` |
| `graph-symbolic-dims` | `AlgebraicDim("batch")` for dynamic shapes |

### Performance (MEDIUM)

| Rule | Pattern |
|------|---------|
| `serve-prefix-caching` | 10-50% throughput improvement |
| `perf-kv-swapping` | `--enable-kvcache-swapping-to-host` |
| `perf-chunked-prefill` | `--max-batch-input-tokens` |
| `engine-graph-caching` | 28% faster with kernel caching |
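These flags compose in a single launch; a sketch with illustrative values to benchmark against your own workload:

```sh
# Prefix caching, host KV-cache swapping, and chunked prefill together
max serve --model-path <org>/<model> \
  --enable-prefix-caching \
  --enable-kvcache-swapping-to-host \
  --max-batch-input-tokens 2048
```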

### Deployment (MEDIUM)

| Rule | Pattern |
|------|---------|
| `deploy-container` | `modular/max-nvidia-full:latest` |
| `deploy-kubernetes` | Helm charts, readiness probes |
| `deploy-benchmark` | `max benchmark`, `benchmark_serving.py` |
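A container launch sketch using the image from `deploy-container`; the port mapping and the trailing serve arguments are assumptions to adapt from the MAX container documentation:

```sh
# Serve an OpenAI-compatible endpoint from the NVIDIA container image
docker run --rm --gpus all -p 8000:8000 \
  modular/max-nvidia-full:latest \
  --model-path <org>/<model>
```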

## Cross-References with Mojo

For GPU kernel development, see `mojo-best-practices`:

- Custom ops → `engine-custom-ops` + mojo `gpu-fundamentals`
- GPU memory → mojo `gpu-memory-optimization`
- Tensor cores → mojo `gpu-tensor-core-sm90-sm100`
- Warp primitives → mojo `gpu-warp-primitives`

## File Structure

```
skills/max-best-practices/
├── SKILL.md               # Quick reference (this file)
├── AGENTS.md              # Auto-generated rule index
├── metadata.json          # Skill metadata
├── CHANGELOG.md           # Skill version history
├── reference/
│   ├── breaking-changes.md
│   └── cli-flags.md
└── rules/                 # Version-agnostic rules (~30+)
    ├── serve-*.md
    ├── multigpu-*.md
    ├── engine-*.md
    ├── graph-*.md
    ├── perf-*.md
    ├── deploy-*.md
    ├── stable/            # Stable-only rules (v25.7)
    │   ├── multigpu-batch-semantics.md
    │   └── driver-tensor-api.md
    └── nightly/           # Nightly-only rules (v26.1)
        ├── multigpu-batch-semantics.md
        ├── driver-buffer-api.md
        ├── serve-kvcache-watermark.md
        └── model-vision-changes.md
```

## Local Implementation Notes

When using this skill in a project, agents should collect implementation notes locally within that project, not globally. This ensures project-specific learnings stay with the project.

Where to store notes:

```
your-project/
├── IMPLEMENTATION_NOTES.md    # Project-specific learnings
├── .cursor/
│   └── rules/                 # Project-specific rules
└── ...
```

What to capture:
- Model-specific configuration that worked
- Performance tuning for your hardware (GPU type, memory)
- Batch size optimizations for your workload
- Deployment configuration decisions
- Integration patterns with your infrastructure

Usage: agents should check for and update `IMPLEMENTATION_NOTES.md` in the project root when they discover new patterns or resolve issues.

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.