# aiperf-benchmark

A Claude Code skill by hnt2601.

# Install this skill:
npx skills add hnt2601/claude-skills

Or install just this skill:

npx add-skill https://github.com/hnt2601/claude-skills/tree/main/SKILLS/aiperf-benchmark

# Description

Benchmarking AI models with Nvidia Aiperf and analyzing benchmark results. Use when the user wants to run performance benchmarks on LLM inference endpoints, analyze Aiperf CSV/JSON benchmark output files, generate performance reports from benchmark data, compare model performance metrics like TTFT, ITL, throughput, or set up benchmark configurations for vLLM, TGI, or other inference servers. Triggers on keywords like aiperf, benchmark, TTFT, ITL, throughput, inference performance, model benchmark.

# SKILL.md


---
name: aiperf-benchmark
description: "Benchmarking AI models with Nvidia Aiperf and analyzing benchmark results. Use when the user wants to run performance benchmarks on LLM inference endpoints, analyze Aiperf CSV/JSON benchmark output files, generate performance reports from benchmark data, compare model performance metrics like TTFT, ITL, throughput, or set up benchmark configurations for vLLM, TGI, or other inference servers. Triggers on keywords like aiperf, benchmark, TTFT, ITL, throughput, inference performance, model benchmark."
---

Aiperf Benchmark Skill

Aiperf (AI Performance) is a comprehensive benchmarking tool from NVIDIA's ai-dynamo project that measures performance of generative AI models served by inference solutions.

Installation

pip install aiperf --break-system-packages

Quick Start Commands

Basic Chat Benchmarking

aiperf profile --model <model-name> --url <server-url> --endpoint-type chat --streaming

Concurrency-Based Benchmarking

aiperf profile --model <model-name> --url <server-url> --concurrency 10 --request-count 100

Request Rate Benchmarking (Poisson Distribution)

aiperf profile --model <model-name> --url <server-url> --request-rate 5.0 --benchmark-duration 60
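
Under the hood, a Poisson arrival pattern means the gaps between requests are exponentially distributed around the target rate. A quick illustrative sketch in plain Python (not part of Aiperf) of what 5 requests/second looks like:

```python
import random

# Poisson arrivals at 5 req/s correspond to exponentially distributed
# inter-request gaps with mean 1/5 = 0.2 s. Sample a few gaps to see the burstiness.
rate = 5.0
gaps = [random.expovariate(rate) for _ in range(10)]
print([round(g, 3) for g in gaps])
print(f"mean gap: {sum(gaps) / len(gaps):.3f} s (expected {1 / rate:.3f} s)")
```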

Multi-Turn Conversations with ShareGPT

aiperf profile --model <model-name> --url <server-url> --public-dataset sharegpt --num-sessions 50

Key CLI Options

See references/cli_options.md for the complete CLI reference.

Essential Parameters

| Parameter | Description |
|---|---|
| -m, --model | Model name(s) to benchmark (required) |
| -u, --url | Server URL (default: localhost:8000) |
| --endpoint-type | API type: chat, completions, embeddings, etc. |
| --streaming | Enable streaming responses for TTFT/ITL metrics |

Load Configuration

| Parameter | Description |
|---|---|
| --concurrency | Number of concurrent requests to maintain |
| --request-rate | Target requests per second |
| --request-count | Maximum number of requests to send |
| --benchmark-duration | Maximum benchmark runtime in seconds |
| --arrival-pattern | constant, poisson (default), gamma, concurrency_burst |
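
When deciding between --concurrency and --request-rate, Little's law is a quick sanity check: the average number of in-flight requests equals the arrival rate multiplied by the mean request latency. A small sketch with assumed example numbers:

```python
# Little's law: in-flight requests = arrival rate x mean latency.
# The numbers below are illustrative assumptions, not measured values.
request_rate = 10.0   # target requests/second (e.g. --request-rate 10)
mean_latency_s = 2.5  # observed mean end-to-end request latency
expected_in_flight = request_rate * mean_latency_s
print(f"expected in-flight requests: {expected_in_flight:.0f}")
```

So a server averaging 2.5 s per request needs to sustain roughly 25 concurrent requests to keep up with 10 req/s.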

Input Configuration

| Parameter | Description |
|---|---|
| --isl | Mean input sequence length (tokens) |
| --isl-stddev | Standard deviation for input length |
| --osl | Mean output sequence length (tokens) |
| --osl-stddev | Standard deviation for output length |
| --input-file | Custom dataset path (JSONL; see the sketch below) |
| --public-dataset | Use public dataset (e.g., sharegpt) |
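
For --input-file, a small custom JSONL dataset can be generated with Python. This is only a sketch: the record schema (the "text" field below is an assumption) depends on your Aiperf version, so verify it against the upstream input-file documentation:

```python
import json

# Hypothetical prompts; replace with your own workload.
prompts = [
    "Summarize the benefits of paged attention in two sentences.",
    "Explain tensor parallelism to a new engineer.",
]

# NOTE: the "text" field name is an assumption, not a confirmed Aiperf schema.
with open("custom_dataset.jsonl", "w") as f:
    for prompt in prompts:
        f.write(json.dumps({"text": prompt}) + "\n")
```

Pass the resulting file with --input-file custom_dataset.jsonl.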

Output Configuration

| Parameter | Description |
|---|---|
| --artifact-dir | Output directory (default: artifacts) |
| --export-level | summary, records (default), or raw |
| --slice-duration | Duration for time-sliced analysis |

Output Files

Aiperf generates several output files in the artifact directory:

- profile_export_aiperf.csv - Summary metrics in CSV
- profile_export_aiperf.json - Summary with metadata
- profile_export.jsonl - Per-request metrics (see the loading sketch below)
- profile_export_raw.jsonl - Raw request/response data (if --export-level raw)
- *_timeslices.csv - Time-windowed metrics (if --slice-duration set)
- *_gpu_telemetry.jsonl - GPU metrics (if --gpu-telemetry enabled)
- *_server_metrics.* - Server-side Prometheus metrics
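
The per-request JSONL export can be loaded directly with pandas for ad-hoc analysis. A minimal sketch, assuming the default artifact layout; adjust the path to your run's artifact directory and inspect the columns first, since field names vary by Aiperf version and --export-level:

```python
import pandas as pd

# Load per-request records (one JSON object per line).
records = pd.read_json("artifacts/profile_export.jsonl", lines=True)

# Check which columns this Aiperf version actually emits before analyzing.
print(records.columns.tolist())
print(f"requests recorded: {len(records)}")
```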

Analyzing Benchmark Results

Use scripts/analyze_benchmark.py to analyze CSV output:

python scripts/analyze_benchmark.py /path/to/profile_export_aiperf.csv

Key Metrics in Output

| Metric | Description |
|---|---|
| time_to_first_token_s | Time to first token (TTFT) |
| inter_token_latency_s | Inter-token latency (ITL) |
| request_latency_s | End-to-end request latency |
| output_token_throughput_per_request | Tokens/second per request |
| input_tokens, output_tokens | Token counts |
| successful_requests, failed_requests | Request status |

CSV Analysis Workflow

1. Load the CSV with pandas
2. Filter by successful requests
3. Calculate percentiles (p50, p90, p95, p99) for latency metrics
4. Compute aggregate throughput
5. Generate comparison charts if there are multiple runs (see the comparison sketch below)

import pandas as pd

df = pd.read_csv('profile_export_aiperf.csv')
# Filter successful requests
df_success = df[df['request_output_error'].isna()]

# Key metrics
print(f"TTFT p50: {df_success['time_to_first_token_s'].quantile(0.5):.3f}s")
print(f"TTFT p99: {df_success['time_to_first_token_s'].quantile(0.99):.3f}s")
print(f"ITL p50: {df_success['inter_token_latency_s'].quantile(0.5)*1000:.2f}ms")
print(f"Throughput: {df_success['output_token_throughput_per_request'].mean():.1f} tok/s")

Visualization

Use aiperf plot to generate visualizations:

aiperf plot --paths ./artifacts --output ./plots

Or launch interactive dashboard:

aiperf plot --dashboard --port 8050

Common Benchmark Scenarios

Latency-Focused (Interactive Use)

aiperf profile --model <model> --url <url> --streaming \
  --concurrency 1 --request-count 100 --isl 512 --osl 256

Throughput-Focused (Batch Processing)

aiperf profile --model <model> --url <url> \
  --concurrency 32 --request-rate 10 --benchmark-duration 300

Goodput with SLOs

aiperf profile --model <model> --url <url> --streaming \
  --concurrency 16 --goodput "request_latency:250 inter_token_latency:10"
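
Goodput counts only the requests that meet every SLO. To cross-check it offline from the per-request CSV, a sketch like the following works, assuming the column names from the analysis example above and that the SLO thresholds in the command are milliseconds:

```python
import pandas as pd

# SLO thresholds mirroring the --goodput example above (assumed to be in ms).
SLO_REQUEST_LATENCY_S = 0.250
SLO_INTER_TOKEN_LATENCY_S = 0.010

df = pd.read_csv("artifacts/profile_export_aiperf.csv")
ok = df[df["request_output_error"].isna()]

meets_slo = (
    (ok["request_latency_s"] <= SLO_REQUEST_LATENCY_S)
    & (ok["inter_token_latency_s"] <= SLO_INTER_TOKEN_LATENCY_S)
)

# Reported here as the fraction of successful requests within SLO; Aiperf's
# own goodput metric remains the authoritative number.
print(f"requests within SLO: {meets_slo.mean():.1%} ({meets_slo.sum()}/{len(ok)})")
```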

KV Cache Testing

aiperf profile --model <model> --url <url> --streaming \
  --num-prefix-prompts 10 --prefix-prompt-length 2048 \
  --isl 512 --osl 128 --concurrency 8
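
A common extension of these scenarios is sweeping a load parameter and writing each run to its own artifact directory. A sketch using Python's subprocess and the flags documented above; the model name, URL, and sweep values are placeholders:

```python
import subprocess

MODEL = "my-model"             # placeholder model name
URL = "http://localhost:8000"  # placeholder server URL

for concurrency in (1, 4, 16, 64):
    cmd = [
        "aiperf", "profile",
        "--model", MODEL,
        "--url", URL,
        "--streaming",
        "--concurrency", str(concurrency),
        "--request-count", "200",
        "--artifact-dir", f"artifacts/sweep_c{concurrency}",
    ]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```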

Endpoint Types

| Type | Description |
|---|---|
| chat | OpenAI Chat Completions (default) |
| completions | OpenAI Completions (legacy) |
| embeddings | Vector embeddings generation |
| rankings | Passage reranking |
| image_generation | Image generation (FLUX.1, etc.) |
| huggingface_generate | HuggingFace TGI API |

# README.md

Claude Code Skills Library for AIOps Engineers

A curated collection of Claude Code agents, skills, and commands for building and operating enterprise AI products.


MCP Server Setup

Kubernetes

claude mcp add k8s -e KUBECONFIG=~/.kube/config -- npx -y @modelcontextprotocol/server-kubernetes --read-only

Example prompts:
- List all pods in llms namespace and their status
- Debug pod nginx-abc123 in default namespace. Check status, logs, events, and resource usage
- Fix CrashLoopBackOff in pod app-xyz namespace staging. Check previous logs, deployment spec, events; patch resources
- List all Helm releases in prod namespace and their status
- Troubleshoot why Helm chart nginx failed to deploy. Check deployments, pods, logs, and events

Docker

claude mcp add docker -- npx -y @modelcontextprotocol/server-docker

Agents

Specialized Claude agents for each phase of the AI product lifecycle.

Usage: Use <agent-name> to <task>

Design & Architecture

| Agent | Model | Description |
|---|---|---|
| docs-architect | opus | Technical documentation generation |
| tdd-orchestrator | opus | TDD workflow orchestration, test-first development |

Planning

| Agent | Model | Description |
|---|---|---|
| kubernetes-architect | opus | K8s/GitOps architecture, EKS/AKS/GKE, service mesh, platform engineering |

Development

| Agent | Model | Description |
|---|---|---|
| bash-pro | sonnet | Shell scripting, automation |
| cpp-pro | sonnet | C++ development, performance optimization |
| rust-engineer | sonnet | Rust development, memory safety |
| mcp-developer | sonnet | MCP server development |

Review & Quality

| Agent | Model | Description |
|---|---|---|
| code-reviewer | opus | Code quality, security vulnerabilities, performance analysis |
| architect-reviewer | opus | System design validation, architectural patterns, scalability analysis |
| qa-expert | opus | Testing strategies, quality assurance |

Operations

| Agent | Model | Description |
|---|---|---|
| debugger | sonnet | Root cause analysis, systematic debugging |
| devops-troubleshooter | sonnet | Infrastructure issue diagnosis |
| refactoring-specialist | sonnet | Code improvement, technical debt reduction |
| git-workflow-manager | sonnet | Git operations, branching strategies |
| prompt-engineer | sonnet | Prompt optimization, LLM tuning |

Agent-Skill Integration

Recommended skills for each agent to maximize effectiveness.

| Agent | Recommended Skills |
|---|---|
| docs-architect | generating-documentation, writing-plans, langchain-architecture |
| tdd-orchestrator | python-testing-patterns, writing-plans, python-design-patterns |
| kubernetes-architect | helm-chart-scaffolding, k8s-manifest-generator, k8s-security-policies, implementing-gitops, planning-disaster-recovery |
| bash-pro | writing-dockerfiles, implementing-gitops |
| cpp-pro | high-performance-inference, flash-attention, debug-cuda-crash |
| rust-engineer | high-performance-inference, async-python-patterns, qdrant |
| mcp-developer | langchain-architecture, prompt-engineering-patterns, python-error-handling |
| code-reviewer | python-design-patterns, python-testing-patterns, python-error-handling, k8s-security-policies |
| architect-reviewer | llm-serving-patterns, implementing-mlops, planning-disaster-recovery, slo-implementation |
| qa-expert | python-testing-patterns, evaluating-llms-harness, slo-implementation |
| debugger | debug-cuda-crash, python-error-handling, python-testing-patterns |
| devops-troubleshooter | operating-kubernetes, prometheus-configuration, grafana-dashboards, debug-cuda-crash, implementing-gitops |
| refactoring-specialist | python-design-patterns, python-testing-patterns, async-python-patterns |
| git-workflow-manager | implementing-gitops, writing-plans |
| prompt-engineer | prompt-engineering-patterns, langsmith, evaluating-llms-harness, langchain-architecture |

Skills

Domain-specific knowledge bases for AI product development.

Usage: /<skill-name> <task>

Planning & Design

| Skill | Description |
|---|---|
| brainstorming | Ideation and exploration techniques |
| writing-plans | Implementation planning with TDD |
| notebooklm | Query Google NotebookLM for research |
| planning-disaster-recovery | DR planning and resilience |

Python Development

| Skill | Description |
|---|---|
| async-python-patterns | Async/await, concurrency patterns |
| python-design-patterns | Design patterns in Python |
| python-error-handling | Exception handling, error recovery |
| python-testing-patterns | pytest, mocking, test strategies |

LLM Serving & Inference

| Skill | Description |
|---|---|
| llm-serving-patterns | Architecture patterns for LLM APIs |
| vllm | High-throughput LLM serving with PagedAttention |
| serving-llms-vllm | Production vLLM deployment |
| sglang | Structured generation, constrained decoding |
| tensorrt-llm | NVIDIA TensorRT-LLM optimization |
| high-performance-inference | Inference optimization strategies |
| awq | Activation-aware weight quantization |
| flash-attention | Efficient attention mechanisms |
| helm-chart-vllm | Helm chart vLLM deployment |
| aiperf-benchmark | LLM performance benchmarking |

AI/ML Engineering

| Skill | Description |
|---|---|
| implementing-mlops | End-to-end MLOps: MLflow, feature stores, model serving |
| evaluating-llms-harness | LLM evaluation with lm-evaluation-harness |
| langchain-architecture | LangChain/LangGraph patterns |
| langsmith | LLM observability and tracing |
| prompt-engineering-patterns | Prompt design, few-shot, chain-of-thought |
| qdrant | Vector database operations |
| rag-implementation | RAG systems, semantic search |

Kubernetes & Infrastructure

| Skill | Description |
|---|---|
| helm-chart-scaffolding | Helm chart development |
| k8s-manifest-generator | Kubernetes manifest generation |
| k8s-security-policies | RBAC, network policies, pod security |
| operating-kubernetes | K8s cluster operations |
| writing-dockerfiles | Dockerfile best practices |

Monitoring & Observability

| Skill | Description |
|---|---|
| grafana-dashboards | Grafana dashboard design |
| prometheus-configuration | Prometheus setup and alerting |
| slo-implementation | SLO/SLI patterns, error budgets |

GitOps & Documentation

| Skill | Description |
|---|---|
| implementing-gitops | ArgoCD, Flux, GitOps workflows |
| guiding-users | User guidance and onboarding |
| generating-documentation | Auto-generate technical docs |

Debugging

| Skill | Description |
|---|---|
| debug-cuda-crash | CUDA debugging, GPU troubleshooting |

Commands

Slash commands for common development tasks.

| Command | Description |
|---|---|
| /commit | Create git commits with conventional format |
| /tech-debt | Analyze and remediate technical debt |
| /refactor-clean | Refactor code for quality and maintainability |
| /langchain-agent | Create LangGraph-based agents |
| /prompt-optimize | Optimize prompts for production LLMs |

Workflow Examples

Troubleshoot Production Issues

1. use debugger to analyze error logs and stack traces
2. use devops-troubleshooter to check infrastructure
3. /debug-cuda-crash if GPU-related issues
4. use code-reviewer to identify root cause in code
5. /commit fix with conventional commit message

End-to-End LLM Deployment with K8s & Helm

A comprehensive workflow from ideation to production deployment of an LLM serving infrastructure.

Phase 1: Research & Ideation

1. /notebooklm query research notebooks for LLM serving best practices
2. /brainstorming explore deployment requirements and constraints

Phase 2: Architecture & Design

3. use docs-architect to create system design documentation
4. use architect-reviewer to validate architecture decisions

Phase 3: Planning

5. /writing-plans create implementation plan with TDD approach
6. /planning-disaster-recovery define RTO/RPO and backup strategies

Phase 4: Implementation

7. /vllm configure model serving with tensor parallelism
8. /high-performance-inference optimize with AWQ quantization
9. /k8s-manifest-generator create Deployment, Service, ConfigMap
10. /helm-chart-scaffolding or /helm-chart-vllm package as reusable Helm chart

Phase 5: Review & Quality

11. use code-reviewer to analyze security and performance
12. use refactoring-specialist if code improvements needed

Phase 6: Benchmarking & Debugging

13. Deploy to staging cluster with helm install
14. use devops-troubleshooter if pod issues occur
15. use debugger for application-level errors
16. /debug-cuda-crash if GPU-related issues
17. /aiperf-benchmark to benchmark model performance

Phase 7: Documentation

18. /generating-documentation create deployment runbook and API docs
19. /commit document changes with conventional commit

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.