Install this specific skill from the multi-skill repository:

```bash
npx skills add yzlnew/infra-skills --skill "tilelang-developer"
```
# Description
Write, optimize, and debug high-performance AI compute kernels using TileLang (a Python DSL for GPU programming). Use when the user requests: (1) Writing custom GPU kernels for AI workloads (GEMM, Attention, MLA, etc.), (2) Optimizing existing TileLang code for NVIDIA, AMD, or Ascend hardware, (3) Implementing non-standard operators (like DeepSeek MLA, FlashAttention variants), (4) Debugging TileLang compilation or runtime errors, or (5) Cross-platform kernel development targeting multiple GPU vendors.
# SKILL.md
```yaml
name: tilelang-developer
description: "Write, optimize, and debug high-performance AI compute kernels using TileLang (a Python DSL for GPU programming). Use when the user requests: (1) Writing custom GPU kernels for AI workloads (GEMM, Attention, MLA, etc.), (2) Optimizing existing TileLang code for NVIDIA, AMD, or Ascend hardware, (3) Implementing non-standard operators (like DeepSeek MLA, FlashAttention variants), (4) Debugging TileLang compilation or runtime errors, or (5) Cross-platform kernel development targeting multiple GPU vendors."
```
# TileLang Developer

Write high-performance AI compute kernels using TileLang, a tile-based programming model that bridges the gap between CUDA's low-level control and high-level abstractions.
## When to Use This Skill
Use this skill when the user needs to:
- Implement custom GPU kernels for AI operations (matrix multiplication, attention mechanisms, etc.)
- Optimize performance-critical operators for modern GPUs (NVIDIA Ampere/Hopper, AMD MI300X, Ascend NPU)
- Debug TileLang code or resolve performance issues
- Port kernels across different hardware platforms
- Understand or modify existing TileLang implementations
## Kernel Development Workflow
Follow these steps when writing a TileLang kernel:
### Step 1: Analyze Requirements
Gather essential information:
Input/Output Specifications:
- Tensor shapes (M, N, K dimensions)
- Data types (float16, float32, bfloat16, int8)
- Memory layout (row-major, column-major)
Hardware Target:
- NVIDIA GPU (Ampere A100, Hopper H100, etc.)
- AMD GPU (MI300X, etc.)
- Huawei Ascend NPU
Performance Goals:
- Target throughput or latency
- Memory bandwidth constraints
- Comparison baseline (cuBLAS, vendor libraries)
Ask clarifying questions if details are missing.
### Step 2: Set Up Kernel Structure
Create the basic kernel scaffold:
```python
import tilelang
import tilelang.language as T

@tilelang.jit(target="cuda", out_idx=[2])  # Specify output indices
def kernel_name(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16")
    ):
        # Kernel logic will go here
        pass

    return main
```
Key decisions:
- target: "cuda" (NVIDIA), "hip" (AMD), or "cpu"
- out_idx: List indices of output parameters
- Block dimensions: Typical values are 64, 128, or 256
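As a brief illustration of how `out_idx` affects the call signature (this sketch assumes TileLang's JIT convention, where buffers listed in `out_idx` are allocated by the runtime and returned rather than passed in; `kernel_name` is the scaffold above once its body is filled in):

```python
import torch

# Compile the scaffold above, which was decorated with out_idx=[2]
func = kernel_name(1024, 1024, 1024, block_M=128, block_N=128, block_K=32)

A = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
B = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")

# The buffer listed in out_idx is allocated by the runtime and returned,
# so it is not passed as an argument:
C = func(A, B)

# Without out_idx, all three buffers would be passed explicitly: func(A, B, C)
```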
### Step 3: Define Grid and Memory Hierarchy
Set up computation grid and allocate memory:
```python
# Define grid dimensions
with T.Kernel(
    T.ceildiv(N, block_N),  # Grid X
    T.ceildiv(M, block_M),  # Grid Y
    threads=128
) as (bx, by):
    # Allocate shared memory (L1 cache)
    A_shared = T.alloc_shared((block_M, block_K), "float16")
    B_shared = T.alloc_shared((block_K, block_N), "float16")

    # Allocate register fragments (accumulators)
    C_local = T.alloc_fragment((block_M, block_N), "float32")

    # CRITICAL: Apply swizzle layout to avoid bank conflicts
    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared),
        B_shared: T.make_swizzled_layout(B_shared)
    })
```
Memory hierarchy:
- Global Memory (HBM): Input/output tensors, slowest
- Shared Memory (L1): Explicitly managed cache, ~164KB on A100
- Registers: Fastest, used for accumulators and temporaries
Critical optimization: Always apply T.make_swizzled_layout to shared memory to eliminate bank conflicts.
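Before compiling, it helps to sanity-check that the chosen tiles fit the shared-memory budget. A rough back-of-envelope sketch, assuming fp16 tiles for A and B and one copy of both tiles per pipeline stage (the 164 KB figure is the A100 per-SM limit quoted below):

```python
def smem_bytes(block_M, block_N, block_K, num_stages, dtype_bytes=2):
    """Rough estimate: one A tile plus one B tile per pipeline stage."""
    a_tile = block_M * block_K * dtype_bytes
    b_tile = block_K * block_N * dtype_bytes
    return num_stages * (a_tile + b_tile)

# 128x128x32 fp16 tiles with 3-stage pipelining -> 48 KB, well under A100's ~164 KB
print(smem_bytes(128, 128, 32, num_stages=3) / 1024, "KB")
```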
### Step 4: Implement Computation Logic
Use TileLang primitives for data movement and computation:
```python
    # Initialize accumulator
    T.clear(C_local)

    # Main computation loop with software pipelining
    for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
        # Load tiles from global to shared memory
        T.copy(A[by * block_M, k * block_K], A_shared)
        T.copy(B[k * block_K, bx * block_N], B_shared)

        # Compute using Tensor Cores
        T.gemm(A_shared, B_shared, C_local, transpose_B=False)

    # Write results back
    T.copy(C_local, C[by * block_M, bx * block_N])
```
Key primitives:
- T.copy: Intelligent data transfer (auto-selects cp.async, TMA, etc.)
- T.gemm: Matrix multiplication using Tensor Cores
- T.Pipelined: Software pipelining to overlap compute and memory transfer
- T.Parallel: Element-wise parallel operations
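For example, T.Parallel distributes an element-wise loop across the block's threads. A minimal sketch of a fragment that would sit inside the kernel body from Step 3 (the in-place scaling here is illustrative, not part of the GEMM):

```python
# Element-wise operation over the accumulator tile, parallelized across threads
for i, j in T.Parallel(block_M, block_N):
    C_local[i, j] = C_local[i, j] * 0.5
```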
Pipeline stages:
- num_stages=2: Double buffering
- num_stages=3: Triple buffering (recommended for most workloads)
- num_stages=4+: Diminishing returns, increases shared memory usage
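Putting Steps 2 through 4 together, the scaffold, memory setup, and compute loop combine into a single kernel. This is a sketch assembled from the snippets above (the 128×128×32 blocks and three pipeline stages follow the A100 guidance later in this document); for the complete, tuned version see EXAMPLES.md:

```python
import tilelang
import tilelang.language as T


@tilelang.jit(target="cuda", out_idx=[2])
def matmul(M, N, K, block_M=128, block_N=128, block_K=32):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16")
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), "float16")
            B_shared = T.alloc_shared((block_K, block_N), "float16")
            C_local = T.alloc_fragment((block_M, block_N), "float32")

            # Swizzled shared-memory layout avoids bank conflicts (Step 3)
            T.annotate_layout({
                A_shared: T.make_swizzled_layout(A_shared),
                B_shared: T.make_swizzled_layout(B_shared)
            })

            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local, transpose_B=False)

            T.copy(C_local, C[by * block_M, bx * block_N])

    return main
```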
### Step 5: Validate and Test
Generate test code to verify correctness:
```python
import torch

# Example instantiation
func = kernel_name(
    M=1024, N=1024, K=1024,
    block_M=128, block_N=128, block_K=32
)

# Test against a reference implementation
A = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
B = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
C_reference = A @ B

# With out_idx=[2], the compiled kernel allocates and returns C
C_tilelang = func(A, B)

# Verify with appropriate tolerance for FP16
torch.testing.assert_close(C_tilelang, C_reference, rtol=1e-3, atol=1e-3)
```
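If a baseline comparison against cuBLAS is needed (Step 1 mentions one), a simple CUDA-event timing loop gives a first estimate. This is a plain-PyTorch sketch rather than any TileLang-specific profiler, reusing `func`, `A`, and `B` from the test above:

```python
import torch

def time_ms(fn, iters=100, warmup=10):
    """Average wall-clock time of fn() in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

M = N = K = 1024
tl_ms = time_ms(lambda: func(A, B))   # TileLang kernel; out_idx allocates C on each call
ref_ms = time_ms(lambda: A @ B)       # cuBLAS via PyTorch, as a baseline
tflops = 2 * M * N * K / (tl_ms * 1e-3) / 1e12
print(f"TileLang: {tl_ms:.3f} ms ({tflops:.1f} TFLOPS)  |  cuBLAS: {ref_ms:.3f} ms")
```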
### Step 6: Optimize Performance
Apply advanced optimizations if performance is suboptimal:
Block Size Tuning:
- A100: Try 128×128×32 or 64×64×32
- H100: Can use larger blocks (256×128×32)
- MI300X: May need smaller blocks due to 64KB shared memory limit
Pipeline Depth:
- Increase num_stages if memory-bound
- Decrease if shared memory is exhausted
Warp Policy (for advanced cases):

```python
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullRow)  # For attention
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullCol)  # For MLA decode
```

Block-level swizzle:

```python
T.use_swizzle(panel_size=10)  # Improves L2 cache hit rate
```
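When hand-picked block sizes underperform, a small brute-force sweep over candidate configurations is often enough before reaching for a full autotuner. A minimal sketch, reusing the `time_ms` helper sketched after Step 5 and the `kernel_name`, `A`, `B` names from earlier sections (the candidate list follows the per-architecture suggestions above):

```python
candidates = [
    (128, 128, 32),
    (128, 64, 32),
    (64, 64, 32),
    (256, 128, 32),  # Hopper-sized blocks; may exceed shared memory elsewhere
]

best = None
for bm, bn, bk in candidates:
    try:
        k = kernel_name(M=1024, N=1024, K=1024, block_M=bm, block_N=bn, block_K=bk)
        ms = time_ms(lambda: k(A, B))
    except Exception as exc:  # e.g. shared memory exceeded for this config
        print(f"{bm}x{bn}x{bk}: skipped ({exc})")
        continue
    print(f"{bm}x{bn}x{bk}: {ms:.3f} ms")
    if best is None or ms < best[1]:
        best = ((bm, bn, bk), ms)

print("Best config:", best)
```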
## Common Kernel Patterns

### Matrix Multiplication (GEMM)
Most fundamental kernel. See EXAMPLES.md for complete implementation.
Key features:
- 3-stage pipelining
- Swizzle layout for shared memory
- Float32 accumulator for precision
### FlashAttention
Memory-efficient attention with online softmax. See EXAMPLES.md for complete implementation.
Key features:
- O(N) memory complexity
- Online softmax statistics
- Fused kernel (no intermediate materialization)
### DeepSeek MLA
Multi-Head Latent Attention with KV compression. See EXAMPLES.md for complete implementation.
Key features:
- Split-KV parallelization
- Non-standard dimensions
- FullCol warp policy for narrow matrices
## Reference Documentation
When you need specific information:
- API details (parameters, signatures, options): Read API_REFERENCE.md
- Complete code examples (GEMM, Attention, MLA): Read EXAMPLES.md
- Troubleshooting (errors, performance issues): Read DEBUGGING.md
## Critical Performance Guidelines
Always include these optimizations:

- Swizzle layout for shared memory:

  ```python
  T.annotate_layout({ A_shared: T.make_swizzled_layout(A_shared) })
  ```

- Software pipelining:

  ```python
  for k in T.Pipelined(num_blocks, num_stages=3):
  ```

- Float32 accumulators:

  ```python
  C_local = T.alloc_fragment((M, N), "float32")  # Not float16
  ```

- Aligned block_K:

  ```python
  block_K = 32  # Or 16, must align for Tensor Core
  ```

- Initialize accumulators:

  ```python
  T.clear(C_local)
  ```
## Hardware-Specific Considerations

### NVIDIA GPUs
- Ampere (A100): Use cp.async, num_stages=3, block_K=32
- Hopper (H100): Can use TMA, larger blocks (256×128), num_stages=4
- Shared memory: 164KB (A100), 228KB (H100)
### AMD GPUs
- MI300X: Use target="hip", smaller blocks, 64KB shared memory limit
- Test with both HIP and CUDA backends for compatibility
### Huawei Ascend
- More experimental backend
- May require specific block sizes
- Consult Ascend-specific documentation
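The notes above can be condensed into a small starting-point table. These defaults are a hedged summary of the guidance in this section, not vendor-verified numbers:

```python
# Hedged starting points per target, summarizing the notes above (not vendor-verified).
HARDWARE_DEFAULTS = {
    "a100":   {"target": "cuda", "block": (128, 128, 32), "num_stages": 3, "smem_kb": 164},
    "h100":   {"target": "cuda", "block": (256, 128, 32), "num_stages": 4, "smem_kb": 228},
    "mi300x": {"target": "hip",  "block": (64, 64, 32),   "num_stages": 2, "smem_kb": 64},
}
```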
## Code Quality Standards
When generating TileLang code:
- Add explanatory comments for non-obvious optimizations
- Specify hardware assumptions (e.g., "optimized for A100")
- Include usage examples showing instantiation
- Document block size choices and tuning rationale
- Provide performance expectations (e.g., "~90% of cuBLAS")
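A hypothetical header comment illustrating these standards (the file name, numbers, and performance claim are placeholders, not measured results):

```python
# gemm_fp16_a100.py: FP16 GEMM, optimized for NVIDIA A100 (sm_80).
#
# Block size 128x128x32: three pipeline stages fit in ~48 KB of shared memory
# and map cleanly onto Tensor Core tile shapes.
# Expected performance: roughly comparable to cuBLAS for large square shapes
# (illustrative claim; verify on your hardware).
#
# Usage:
#   func = matmul(4096, 4096, 4096, block_M=128, block_N=128, block_K=32)
#   C = func(A, B)
```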
## Example Kernel Request Flow

User: "Write an FP16 matrix multiplication kernel for A100"
Response:
1. Clarify dimensions (if not specified)
2. Generate complete kernel code with:
- Proper structure (@tilelang.jit, @T.prim_func)
- Swizzle layouts
- 3-stage pipelining
- Appropriate block sizes (128×128×32)
3. Add usage example
4. Explain key optimizations:
- "Swizzle layout eliminates bank conflicts"
- "3-stage pipeline overlaps memory and compute"
- "Float32 accumulator prevents overflow"
5. Suggest testing approach
## Troubleshooting Quick Reference
Compilation errors:
- Shared memory exceeded → Reduce block size or num_stages
- Shape mismatch → Verify dimension alignment in T.gemm
Runtime errors:
- Results all zeros → Check T.clear() and out_idx in decorator
- NaN/Inf → Use float32 accumulator, add epsilon in division
Performance issues:
- Low throughput → Verify swizzle layout and pipelining enabled
- Low occupancy → Reduce shared memory usage or block size
- Bank conflicts → Apply T.make_swizzled_layout
For detailed solutions, consult DEBUGGING.md.
# Supported AI Coding Agents
This skill follows the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.