Install this specific skill from the multi-skill repository:

```bash
npx skills add yzlnew/infra-skills --skill "tilelang-developer"
```
# Description
Write, optimize, and debug high-performance AI compute kernels using TileLang (a Python DSL for GPU programming). Use when the user requests: (1) Writing custom GPU kernels for AI workloads (GEMM, Attention, MLA, etc.), (2) Optimizing existing TileLang code for NVIDIA, AMD, or Ascend hardware, (3) Implementing non-standard operators (like DeepSeek MLA, FlashAttention variants), (4) Debugging TileLang compilation or runtime errors, or (5) Cross-platform kernel development targeting multiple GPU vendors.
# SKILL.md
```yaml
name: tilelang-developer
description: "Write, optimize, and debug high-performance AI compute kernels using TileLang (a Python DSL for GPU programming). Use when the user requests: (1) Writing custom GPU kernels for AI workloads (GEMM, Attention, MLA, etc.), (2) Optimizing existing TileLang code for NVIDIA, AMD, or Ascend hardware, (3) Implementing non-standard operators (like DeepSeek MLA, FlashAttention variants), (4) Debugging TileLang compilation or runtime errors, or (5) Cross-platform kernel development targeting multiple GPU vendors."
```
# TileLang Developer

Write high-performance AI compute kernels using TileLang, a tile-based programming model that bridges the gap between CUDA's low-level control and high-level abstractions.
## When to Use This Skill
Use this skill when the user needs to:
- Implement custom GPU kernels for AI operations (matrix multiplication, attention mechanisms, etc.)
- Optimize performance-critical operators for modern GPUs (NVIDIA Ampere/Hopper, AMD MI300X, Ascend NPU)
- Debug TileLang code or resolve performance issues
- Port kernels across different hardware platforms
- Understand or modify existing TileLang implementations
## Kernel Development Workflow
Follow these steps when writing a TileLang kernel:
### Step 1: Analyze Requirements
Gather essential information:
Input/Output Specifications:
- Tensor shapes (M, N, K dimensions)
- Data types (float16, float32, bfloat16, int8)
- Memory layout (row-major, column-major)
Hardware Target:
- NVIDIA GPU (Ampere A100, Hopper H100, etc.)
- AMD GPU (MI300X, etc.)
- Huawei Ascend NPU
Performance Goals:
- Target throughput or latency
- Memory bandwidth constraints
- Comparison baseline (cuBLAS, vendor libraries)
Ask clarifying questions if details are missing.
### Step 2: Set Up Kernel Structure
Create the basic kernel scaffold:
```python
import tilelang
import tilelang.language as T

@tilelang.jit(target="cuda", out_idx=[2])  # Specify output indices
def kernel_name(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16")
    ):
        # Kernel logic will go here
        pass

    return main
```
Key decisions:
- target: "cuda" (NVIDIA), "hip" (AMD), or "cpu"
- out_idx: List indices of output parameters
- Block dimensions: Typical values are 64, 128, or 256
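As a brief illustration of how `out_idx` affects the call signature (this sketch assumes TileLang's JIT convention, where buffers listed in `out_idx` are allocated by the runtime and returned rather than passed in; `kernel_name` is the scaffold above once its body is filled in):

```python
import torch

# Compile the scaffold above, which was decorated with out_idx=[2]
func = kernel_name(1024, 1024, 1024, block_M=128, block_N=128, block_K=32)

A = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
B = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")

# The buffer listed in out_idx is allocated by the runtime and returned,
# so it is not passed as an argument:
C = func(A, B)

# Without out_idx, all three buffers would be passed explicitly: func(A, B, C)
```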
### Step 3: Define Grid and Memory Hierarchy
Set up computation grid and allocate memory:
```python
# Define grid dimensions
with T.Kernel(
    T.ceildiv(N, block_N),  # Grid X
    T.ceildiv(M, block_M),  # Grid Y
    threads=128
) as (bx, by):
    # Allocate shared memory (L1 cache)
    A_shared = T.alloc_shared((block_M, block_K), "float16")
    B_shared = T.alloc_shared((block_K, block_N), "float16")

    # Allocate register fragments (accumulators)
    C_local = T.alloc_fragment((block_M, block_N), "float32")

    # CRITICAL: Apply swizzle layout to avoid bank conflicts
    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared),
        B_shared: T.make_swizzled_layout(B_shared)
    })
```
Memory hierarchy:
- Global Memory (HBM): Input/output tensors, slowest
- Shared Memory (L1): Explicitly managed cache, ~164KB on A100
- Registers: Fastest, used for accumulators and temporaries
Critical optimization: Always apply T.make_swizzled_layout to shared memory to eliminate bank conflicts.
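Before compiling, it helps to sanity-check that the chosen tiles fit the shared-memory budget. A rough back-of-envelope sketch, assuming fp16 tiles for A and B and one copy of both tiles per pipeline stage (the 164 KB figure is the A100 per-SM limit quoted below):

```python
def smem_bytes(block_M, block_N, block_K, num_stages, dtype_bytes=2):
    """Rough estimate: one A tile plus one B tile per pipeline stage."""
    a_tile = block_M * block_K * dtype_bytes
    b_tile = block_K * block_N * dtype_bytes
    return num_stages * (a_tile + b_tile)

# 128x128x32 fp16 tiles with 3-stage pipelining -> 48 KB, well under A100's ~164 KB
print(smem_bytes(128, 128, 32, num_stages=3) / 1024, "KB")
```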
### Step 4: Implement Computation Logic
Use TileLang primitives for data movement and computation:
```python
    # Initialize accumulator
    T.clear(C_local)

    # Main computation loop with software pipelining
    for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
        # Load tiles from global to shared memory
        T.copy(A[by * block_M, k * block_K], A_shared)
        T.copy(B[k * block_K, bx * block_N], B_shared)

        # Compute using Tensor Cores
        T.gemm(A_shared, B_shared, C_local, transpose_B=False)

    # Write results back
    T.copy(C_local, C[by * block_M, bx * block_N])
```
Key primitives:
- T.copy: Intelligent data transfer (auto-selects cp.async, TMA, etc.)
- T.gemm: Matrix multiplication using Tensor Cores
- T.Pipelined: Software pipelining to overlap compute and memory transfer
- T.Parallel: Element-wise parallel operations
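For example, T.Parallel distributes an element-wise loop across the block's threads. A minimal sketch of a fragment that would sit inside the kernel body from Step 3 (the in-place scaling here is illustrative, not part of the GEMM):

```python
# Element-wise operation over the accumulator tile, parallelized across threads
for i, j in T.Parallel(block_M, block_N):
    C_local[i, j] = C_local[i, j] * 0.5
```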
Pipeline stages:
- num_stages=2: Double buffering
- num_stages=3: Triple buffering (recommended for most workloads)
- num_stages=4+: Diminishing returns, increases shared memory usage
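Putting Steps 2 through 4 together, the scaffold, memory setup, and compute loop combine into a single kernel. This is a sketch assembled from the snippets above (the 128×128×32 blocks and three pipeline stages follow the A100 guidance later in this document); for the complete, tuned version see EXAMPLES.md:

```python
import tilelang
import tilelang.language as T


@tilelang.jit(target="cuda", out_idx=[2])
def matmul(M, N, K, block_M=128, block_N=128, block_K=32):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16")
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), "float16")
            B_shared = T.alloc_shared((block_K, block_N), "float16")
            C_local = T.alloc_fragment((block_M, block_N), "float32")

            # Swizzled shared-memory layout avoids bank conflicts (Step 3)
            T.annotate_layout({
                A_shared: T.make_swizzled_layout(A_shared),
                B_shared: T.make_swizzled_layout(B_shared)
            })

            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local, transpose_B=False)

            T.copy(C_local, C[by * block_M, bx * block_N])

    return main
```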
### Step 5: Validate and Test
Generate test code to verify correctness:
```python
import torch

# Example instantiation
func = kernel_name(
    M=1024, N=1024, K=1024,
    block_M=128, block_N=128, block_K=32
)

# Test against a reference implementation
A = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
B = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
C_reference = A @ B

# With out_idx=[2], the compiled kernel allocates and returns C
C_tilelang = func(A, B)

# Verify with appropriate tolerance for FP16
torch.testing.assert_close(C_tilelang, C_reference, rtol=1e-3, atol=1e-3)
```
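If a baseline comparison against cuBLAS is needed (Step 1 mentions one), a simple CUDA-event timing loop gives a first estimate. This is a plain-PyTorch sketch rather than any TileLang-specific profiler, reusing `func`, `A`, and `B` from the test above:

```python
import torch

def time_ms(fn, iters=100, warmup=10):
    """Average wall-clock time of fn() in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

M = N = K = 1024
tl_ms = time_ms(lambda: func(A, B))   # TileLang kernel; out_idx allocates C on each call
ref_ms = time_ms(lambda: A @ B)       # cuBLAS via PyTorch, as a baseline
tflops = 2 * M * N * K / (tl_ms * 1e-3) / 1e12
print(f"TileLang: {tl_ms:.3f} ms ({tflops:.1f} TFLOPS)  |  cuBLAS: {ref_ms:.3f} ms")
```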
### Step 6: Optimize Performance
Apply advanced optimizations if performance is suboptimal:
Block Size Tuning:
- A100: Try 128×128×32 or 64×64×32
- H100: Can use larger blocks (256×128×32)
- MI300X: May need smaller blocks due to 64KB shared memory limit
Pipeline Depth:
- Increase num_stages if memory-bound
- Decrease if shared memory is exhausted
Warp Policy (for advanced cases):

```python
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullRow)  # For attention
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullCol)  # For MLA decode
```

Block-level swizzle:

```python
T.use_swizzle(panel_size=10)  # Improves L2 cache hit rate
```
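When hand-picked block sizes underperform, a small brute-force sweep over candidate configurations is often enough before reaching for a full autotuner. A minimal sketch, reusing the `time_ms` helper sketched after Step 5 and the `kernel_name`, `A`, `B` names from earlier sections (the candidate list follows the per-architecture suggestions above):

```python
candidates = [
    (128, 128, 32),
    (128, 64, 32),
    (64, 64, 32),
    (256, 128, 32),  # Hopper-sized blocks; may exceed shared memory elsewhere
]

best = None
for bm, bn, bk in candidates:
    try:
        k = kernel_name(M=1024, N=1024, K=1024, block_M=bm, block_N=bn, block_K=bk)
        ms = time_ms(lambda: k(A, B))
    except Exception as exc:  # e.g. shared memory exceeded for this config
        print(f"{bm}x{bn}x{bk}: skipped ({exc})")
        continue
    print(f"{bm}x{bn}x{bk}: {ms:.3f} ms")
    if best is None or ms < best[1]:
        best = ((bm, bn, bk), ms)

print("Best config:", best)
```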
## Common Kernel Patterns

### Matrix Multiplication (GEMM)
Most fundamental kernel. See EXAMPLES.md for complete implementation.
Key features:
- 3-stage pipelining
- Swizzle layout for shared memory
- Float32 accumulator for precision
### FlashAttention
Memory-efficient attention with online softmax. See EXAMPLES.md for complete implementation.
Key features:
- O(N) memory complexity
- Online softmax statistics
- Fused kernel (no intermediate materialization)
### DeepSeek MLA
Multi-Head Latent Attention with KV compression. See EXAMPLES.md for complete implementation.
Key features:
- Split-KV parallelization
- Non-standard dimensions
- FullCol warp policy for narrow matrices
## Reference Documentation
When you need specific information:
- API details (parameters, signatures, options): Read API_REFERENCE.md
- Complete code examples (GEMM, Attention, MLA): Read EXAMPLES.md
- Troubleshooting (errors, performance issues): Read DEBUGGING.md
## Critical Performance Guidelines
Always include these optimizations:

- Swizzle layout for shared memory:

  ```python
  T.annotate_layout({ A_shared: T.make_swizzled_layout(A_shared) })
  ```

- Software pipelining:

  ```python
  for k in T.Pipelined(num_blocks, num_stages=3):
  ```

- Float32 accumulators:

  ```python
  C_local = T.alloc_fragment((M, N), "float32")  # Not float16
  ```

- Aligned block_K:

  ```python
  block_K = 32  # Or 16, must align for Tensor Core
  ```

- Initialize accumulators:

  ```python
  T.clear(C_local)
  ```
## Hardware-Specific Considerations

### NVIDIA GPUs
- Ampere (A100): Use cp.async, num_stages=3, block_K=32
- Hopper (H100): Can use TMA, larger blocks (256×128), num_stages=4
- Shared memory: 164KB (A100), 228KB (H100)
### AMD GPUs
- MI300X: Use target="hip", smaller blocks, 64KB shared memory limit
- Test with both HIP and CUDA backends for compatibility
### Huawei Ascend
- More experimental backend
- May require specific block sizes
- Consult Ascend-specific documentation
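The notes above can be condensed into a small starting-point table. These defaults are a hedged summary of the guidance in this section, not vendor-verified numbers:

```python
# Hedged starting points per target, summarizing the notes above (not vendor-verified).
HARDWARE_DEFAULTS = {
    "a100":   {"target": "cuda", "block": (128, 128, 32), "num_stages": 3, "smem_kb": 164},
    "h100":   {"target": "cuda", "block": (256, 128, 32), "num_stages": 4, "smem_kb": 228},
    "mi300x": {"target": "hip",  "block": (64, 64, 32),   "num_stages": 2, "smem_kb": 64},
}
```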
## Code Quality Standards
When generating TileLang code:
- Add explanatory comments for non-obvious optimizations
- Specify hardware assumptions (e.g., "optimized for A100")
- Include usage examples showing instantiation
- Document block size choices and tuning rationale
- Provide performance expectations (e.g., "~90% of cuBLAS")
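A hypothetical header comment illustrating these standards (the file name, numbers, and performance claim are placeholders, not measured results):

```python
# gemm_fp16_a100.py: FP16 GEMM, optimized for NVIDIA A100 (sm_80).
#
# Block size 128x128x32: three pipeline stages fit in ~48 KB of shared memory
# and map cleanly onto Tensor Core tile shapes.
# Expected performance: roughly comparable to cuBLAS for large square shapes
# (illustrative claim; verify on your hardware).
#
# Usage:
#   func = matmul(4096, 4096, 4096, block_M=128, block_N=128, block_K=32)
#   C = func(A, B)
```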
## Example Kernel Request Flow

User: "Write an FP16 matrix multiplication kernel for A100"
Response:
1. Clarify dimensions (if not specified)
2. Generate complete kernel code with:
- Proper structure (@tilelang.jit, @T.prim_func)
- Swizzle layouts
- 3-stage pipelining
- Appropriate block sizes (128×128×32)
3. Add usage example
4. Explain key optimizations:
- "Swizzle layout eliminates bank conflicts"
- "3-stage pipeline overlaps memory and compute"
- "Float32 accumulator prevents overflow"
5. Suggest testing approach
## Troubleshooting Quick Reference
Compilation errors:
- Shared memory exceeded → Reduce block size or num_stages
- Shape mismatch → Verify dimension alignment in T.gemm
Runtime errors:
- Results all zeros → Check T.clear() and out_idx in decorator
- NaN/Inf → Use float32 accumulator, add epsilon in division
Performance issues:
- Low throughput → Verify swizzle layout and pipelining enabled
- Low occupancy → Reduce shared memory usage or block size
- Bank conflicts → Apply T.make_swizzled_layout
For detailed solutions, consult DEBUGGING.md.
# Supported AI Coding Agents
This skill follows the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.