# aiperf-benchmark

A Claude Code skill by hnt2601.

# Install this skill:
npx skills add hnt2601/claude-skills

Or install just this skill:

npx add-skill https://github.com/hnt2601/claude-skills/tree/main/SKILLS/aiperf-benchmark

# Description

Benchmarking AI models with Nvidia Aiperf and analyzing benchmark results. Use when the user wants to run performance benchmarks on LLM inference endpoints, analyze Aiperf CSV/JSON benchmark output files, generate performance reports from benchmark data, compare model performance metrics like TTFT, ITL, throughput, or set up benchmark configurations for vLLM, TGI, or other inference servers. Triggers on keywords like aiperf, benchmark, TTFT, ITL, throughput, inference performance, model benchmark.

# SKILL.md


---
name: aiperf-benchmark
description: "Benchmarking AI models with Nvidia Aiperf and analyzing benchmark results. Use when the user wants to run performance benchmarks on LLM inference endpoints, analyze Aiperf CSV/JSON benchmark output files, generate performance reports from benchmark data, compare model performance metrics like TTFT, ITL, throughput, or set up benchmark configurations for vLLM, TGI, or other inference servers. Triggers on keywords like aiperf, benchmark, TTFT, ITL, throughput, inference performance, model benchmark."
---

Aiperf Benchmark Skill

Aiperf (AI Performance) is a comprehensive benchmarking tool from NVIDIA's ai-dynamo project that measures performance of generative AI models served by inference solutions.

Installation

pip install aiperf --break-system-packages

Quick Start Commands

Basic Chat Benchmarking

aiperf profile --model <model-name> --url <server-url> --endpoint-type chat --streaming

Concurrency-Based Benchmarking

aiperf profile --model <model-name> --url <server-url> --concurrency 10 --request-count 100

Request Rate Benchmarking (Poisson Distribution)

aiperf profile --model <model-name> --url <server-url> --request-rate 5.0 --benchmark-duration 60
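
Under the hood, a Poisson arrival pattern means the gaps between requests are exponentially distributed around the target rate. A quick illustrative sketch in plain Python (not part of Aiperf) of what 5 requests/second looks like:

```python
import random

# Poisson arrivals at 5 req/s correspond to exponentially distributed
# inter-request gaps with mean 1/5 = 0.2 s. Sample a few gaps to see the burstiness.
rate = 5.0
gaps = [random.expovariate(rate) for _ in range(10)]
print([round(g, 3) for g in gaps])
print(f"mean gap: {sum(gaps) / len(gaps):.3f} s (expected {1 / rate:.3f} s)")
```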

Multi-Turn Conversations with ShareGPT

aiperf profile --model <model-name> --url <server-url> --public-dataset sharegpt --num-sessions 50

Key CLI Options

See references/cli_options.md for the complete CLI reference.

Essential Parameters

| Parameter | Description |
|---|---|
| -m, --model | Model name(s) to benchmark (required) |
| -u, --url | Server URL (default: localhost:8000) |
| --endpoint-type | API type: chat, completions, embeddings, etc. |
| --streaming | Enable streaming responses for TTFT/ITL metrics |

Load Configuration

| Parameter | Description |
|---|---|
| --concurrency | Number of concurrent requests to maintain |
| --request-rate | Target requests per second |
| --request-count | Maximum number of requests to send |
| --benchmark-duration | Maximum benchmark runtime in seconds |
| --arrival-pattern | constant, poisson (default), gamma, concurrency_burst |
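
When deciding between --concurrency and --request-rate, Little's law is a quick sanity check: the average number of in-flight requests equals the arrival rate multiplied by the mean request latency. A small sketch with assumed example numbers:

```python
# Little's law: in-flight requests = arrival rate x mean latency.
# The numbers below are illustrative assumptions, not measured values.
request_rate = 10.0   # target requests/second (e.g. --request-rate 10)
mean_latency_s = 2.5  # observed mean end-to-end request latency
expected_in_flight = request_rate * mean_latency_s
print(f"expected in-flight requests: {expected_in_flight:.0f}")
```

So a server averaging 2.5 s per request needs to sustain roughly 25 concurrent requests to keep up with 10 req/s.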

Input Configuration

| Parameter | Description |
|---|---|
| --isl | Mean input sequence length (tokens) |
| --isl-stddev | Standard deviation for input length |
| --osl | Mean output sequence length (tokens) |
| --osl-stddev | Standard deviation for output length |
| --input-file | Custom dataset path (JSONL; see the sketch below) |
| --public-dataset | Use public dataset (e.g., sharegpt) |
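
For --input-file, a small custom JSONL dataset can be generated with Python. This is only a sketch: the record schema (the "text" field below is an assumption) depends on your Aiperf version, so verify it against the upstream input-file documentation:

```python
import json

# Hypothetical prompts; replace with your own workload.
prompts = [
    "Summarize the benefits of paged attention in two sentences.",
    "Explain tensor parallelism to a new engineer.",
]

# NOTE: the "text" field name is an assumption, not a confirmed Aiperf schema.
with open("custom_dataset.jsonl", "w") as f:
    for prompt in prompts:
        f.write(json.dumps({"text": prompt}) + "\n")
```

Pass the resulting file with --input-file custom_dataset.jsonl.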

Output Configuration

| Parameter | Description |
|---|---|
| --artifact-dir | Output directory (default: artifacts) |
| --export-level | summary, records (default), or raw |
| --slice-duration | Duration for time-sliced analysis |

Output Files

Aiperf generates several output files in the artifact directory:

- profile_export_aiperf.csv - Summary metrics in CSV
- profile_export_aiperf.json - Summary with metadata
- profile_export.jsonl - Per-request metrics (see the loading sketch below)
- profile_export_raw.jsonl - Raw request/response data (if --export-level raw)
- *_timeslices.csv - Time-windowed metrics (if --slice-duration set)
- *_gpu_telemetry.jsonl - GPU metrics (if --gpu-telemetry enabled)
- *_server_metrics.* - Server-side Prometheus metrics
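
The per-request JSONL export can be loaded directly with pandas for ad-hoc analysis. A minimal sketch, assuming the default artifact layout; adjust the path to your run's artifact directory and inspect the columns first, since field names vary by Aiperf version and --export-level:

```python
import pandas as pd

# Load per-request records (one JSON object per line).
records = pd.read_json("artifacts/profile_export.jsonl", lines=True)

# Check which columns this Aiperf version actually emits before analyzing.
print(records.columns.tolist())
print(f"requests recorded: {len(records)}")
```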

Analyzing Benchmark Results

Use scripts/analyze_benchmark.py to analyze CSV output:

python scripts/analyze_benchmark.py /path/to/profile_export_aiperf.csv

Key Metrics in Output

| Metric | Description |
|---|---|
| time_to_first_token_s | Time to first token (TTFT) |
| inter_token_latency_s | Inter-token latency (ITL) |
| request_latency_s | End-to-end request latency |
| output_token_throughput_per_request | Tokens/second per request |
| input_tokens, output_tokens | Token counts |
| successful_requests, failed_requests | Request status |

CSV Analysis Workflow

1. Load the CSV with pandas
2. Filter by successful requests
3. Calculate percentiles (p50, p90, p95, p99) for latency metrics
4. Compute aggregate throughput
5. Generate comparison charts if there are multiple runs (see the comparison sketch below)

import pandas as pd

df = pd.read_csv('profile_export_aiperf.csv')
# Filter successful requests
df_success = df[df['request_output_error'].isna()]

# Key metrics
print(f"TTFT p50: {df_success['time_to_first_token_s'].quantile(0.5):.3f}s")
print(f"TTFT p99: {df_success['time_to_first_token_s'].quantile(0.99):.3f}s")
print(f"ITL p50: {df_success['inter_token_latency_s'].quantile(0.5)*1000:.2f}ms")
print(f"Throughput: {df_success['output_token_throughput_per_request'].mean():.1f} tok/s")

Visualization

Use aiperf plot to generate visualizations:

aiperf plot --paths ./artifacts --output ./plots

Or launch interactive dashboard:

aiperf plot --dashboard --port 8050

Common Benchmark Scenarios

Latency-Focused (Interactive Use)

aiperf profile --model <model> --url <url> --streaming \
  --concurrency 1 --request-count 100 --isl 512 --osl 256

Throughput-Focused (Batch Processing)

aiperf profile --model <model> --url <url> \
  --concurrency 32 --request-rate 10 --benchmark-duration 300

Goodput with SLOs

aiperf profile --model <model> --url <url> --streaming \
  --concurrency 16 --goodput "request_latency:250 inter_token_latency:10"
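
Goodput counts only the requests that meet every SLO. To cross-check it offline from the per-request CSV, a sketch like the following works, assuming the column names from the analysis example above and that the SLO thresholds in the command are milliseconds:

```python
import pandas as pd

# SLO thresholds mirroring the --goodput example above (assumed to be in ms).
SLO_REQUEST_LATENCY_S = 0.250
SLO_INTER_TOKEN_LATENCY_S = 0.010

df = pd.read_csv("artifacts/profile_export_aiperf.csv")
ok = df[df["request_output_error"].isna()]

meets_slo = (
    (ok["request_latency_s"] <= SLO_REQUEST_LATENCY_S)
    & (ok["inter_token_latency_s"] <= SLO_INTER_TOKEN_LATENCY_S)
)

# Reported here as the fraction of successful requests within SLO; Aiperf's
# own goodput metric remains the authoritative number.
print(f"requests within SLO: {meets_slo.mean():.1%} ({meets_slo.sum()}/{len(ok)})")
```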

KV Cache Testing

aiperf profile --model <model> --url <url> --streaming \
  --num-prefix-prompts 10 --prefix-prompt-length 2048 \
  --isl 512 --osl 128 --concurrency 8
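
A common extension of these scenarios is sweeping a load parameter and writing each run to its own artifact directory. A sketch using Python's subprocess and the flags documented above; the model name, URL, and sweep values are placeholders:

```python
import subprocess

MODEL = "my-model"             # placeholder model name
URL = "http://localhost:8000"  # placeholder server URL

for concurrency in (1, 4, 16, 64):
    cmd = [
        "aiperf", "profile",
        "--model", MODEL,
        "--url", URL,
        "--streaming",
        "--concurrency", str(concurrency),
        "--request-count", "200",
        "--artifact-dir", f"artifacts/sweep_c{concurrency}",
    ]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```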

Endpoint Types

| Type | Description |
|---|---|
| chat | OpenAI Chat Completions (default) |
| completions | OpenAI Completions (legacy) |
| embeddings | Vector embeddings generation |
| rankings | Passage reranking |
| image_generation | Image generation (FLUX.1, etc.) |
| huggingface_generate | HuggingFace TGI API |

# README.md

Claude Code Skills Library for AIOps Engineers

A curated collection of Claude Code agents, skills, and commands for building and operating enterprise AI products.


MCP Server Setup

Kubernetes

claude mcp add k8s -e KUBECONFIG=~/.kube/config -- npx -y @modelcontextprotocol/server-kubernetes --read-only

Example prompts:
- List all pods in llms namespace and their status
- Debug pod nginx-abc123 in default namespace. Check status, logs, events, and resource usage
- Fix CrashLoopBackOff in pod app-xyz namespace staging. Check previous logs, deployment spec, events; patch resources
- List all Helm releases in prod namespace and their status
- Troubleshoot why Helm chart nginx failed to deploy. Check deployments, pods, logs, and events

Docker

claude mcp add docker -- npx -y @modelcontextprotocol/server-docker

Agents

Specialized Claude agents for each phase of the AI product lifecycle.

Usage: Use <agent-name> to <task>

Design & Architecture

| Agent | Model | Description |
|---|---|---|
| docs-architect | opus | Technical documentation generation |
| tdd-orchestrator | opus | TDD workflow orchestration, test-first development |

Planning

| Agent | Model | Description |
|---|---|---|
| kubernetes-architect | opus | K8s/GitOps architecture, EKS/AKS/GKE, service mesh, platform engineering |

Development

| Agent | Model | Description |
|---|---|---|
| bash-pro | sonnet | Shell scripting, automation |
| cpp-pro | sonnet | C++ development, performance optimization |
| rust-engineer | sonnet | Rust development, memory safety |
| mcp-developer | sonnet | MCP server development |

Review & Quality

| Agent | Model | Description |
|---|---|---|
| code-reviewer | opus | Code quality, security vulnerabilities, performance analysis |
| architect-reviewer | opus | System design validation, architectural patterns, scalability analysis |
| qa-expert | opus | Testing strategies, quality assurance |

Operations

| Agent | Model | Description |
|---|---|---|
| debugger | sonnet | Root cause analysis, systematic debugging |
| devops-troubleshooter | sonnet | Infrastructure issue diagnosis |
| refactoring-specialist | sonnet | Code improvement, technical debt reduction |
| git-workflow-manager | sonnet | Git operations, branching strategies |
| prompt-engineer | sonnet | Prompt optimization, LLM tuning |

Agent-Skill Integration

Recommended skills for each agent to maximize effectiveness.

| Agent | Recommended Skills |
|---|---|
| docs-architect | generating-documentation, writing-plans, langchain-architecture |
| tdd-orchestrator | python-testing-patterns, writing-plans, python-design-patterns |
| kubernetes-architect | helm-chart-scaffolding, k8s-manifest-generator, k8s-security-policies, implementing-gitops, planning-disaster-recovery |
| bash-pro | writing-dockerfiles, implementing-gitops |
| cpp-pro | high-performance-inference, flash-attention, debug-cuda-crash |
| rust-engineer | high-performance-inference, async-python-patterns, qdrant |
| mcp-developer | langchain-architecture, prompt-engineering-patterns, python-error-handling |
| code-reviewer | python-design-patterns, python-testing-patterns, python-error-handling, k8s-security-policies |
| architect-reviewer | llm-serving-patterns, implementing-mlops, planning-disaster-recovery, slo-implementation |
| qa-expert | python-testing-patterns, evaluating-llms-harness, slo-implementation |
| debugger | debug-cuda-crash, python-error-handling, python-testing-patterns |
| devops-troubleshooter | operating-kubernetes, prometheus-configuration, grafana-dashboards, debug-cuda-crash, implementing-gitops |
| refactoring-specialist | python-design-patterns, python-testing-patterns, async-python-patterns |
| git-workflow-manager | implementing-gitops, writing-plans |
| prompt-engineer | prompt-engineering-patterns, langsmith, evaluating-llms-harness, langchain-architecture |

Skills

Domain-specific knowledge bases for AI product development.

Usage: /<skill-name> <task>

Planning & Design

| Skill | Description |
|---|---|
| brainstorming | Ideation and exploration techniques |
| writing-plans | Implementation planning with TDD |
| notebooklm | Query Google NotebookLM for research |
| planning-disaster-recovery | DR planning and resilience |

Python Development

| Skill | Description |
|---|---|
| async-python-patterns | Async/await, concurrency patterns |
| python-design-patterns | Design patterns in Python |
| python-error-handling | Exception handling, error recovery |
| python-testing-patterns | pytest, mocking, test strategies |

LLM Serving & Inference

| Skill | Description |
|---|---|
| llm-serving-patterns | Architecture patterns for LLM APIs |
| vllm | High-throughput LLM serving with PagedAttention |
| serving-llms-vllm | Production vLLM deployment |
| sglang | Structured generation, constrained decoding |
| tensorrt-llm | NVIDIA TensorRT-LLM optimization |
| high-performance-inference | Inference optimization strategies |
| awq | Activation-aware weight quantization |
| flash-attention | Efficient attention mechanisms |
| helm-chart-vllm | Helm chart vLLM deployment |
| aiperf-benchmark | LLM performance benchmarking |

AI/ML Engineering

| Skill | Description |
|---|---|
| implementing-mlops | End-to-end MLOps: MLflow, feature stores, model serving |
| evaluating-llms-harness | LLM evaluation with lm-evaluation-harness |
| langchain-architecture | LangChain/LangGraph patterns |
| langsmith | LLM observability and tracing |
| prompt-engineering-patterns | Prompt design, few-shot, chain-of-thought |
| qdrant | Vector database operations |
| rag-implementation | RAG systems, semantic search |

Kubernetes & Infrastructure

| Skill | Description |
|---|---|
| helm-chart-scaffolding | Helm chart development |
| k8s-manifest-generator | Kubernetes manifest generation |
| k8s-security-policies | RBAC, network policies, pod security |
| operating-kubernetes | K8s cluster operations |
| writing-dockerfiles | Dockerfile best practices |

Monitoring & Observability

| Skill | Description |
|---|---|
| grafana-dashboards | Grafana dashboard design |
| prometheus-configuration | Prometheus setup and alerting |
| slo-implementation | SLO/SLI patterns, error budgets |

GitOps & Documentation

| Skill | Description |
|---|---|
| implementing-gitops | ArgoCD, Flux, GitOps workflows |
| guiding-users | User guidance and onboarding |
| generating-documentation | Auto-generate technical docs |

Debugging

| Skill | Description |
|---|---|
| debug-cuda-crash | CUDA debugging, GPU troubleshooting |

Commands

Slash commands for common development tasks.

| Command | Description |
|---|---|
| /commit | Create git commits with conventional format |
| /tech-debt | Analyze and remediate technical debt |
| /refactor-clean | Refactor code for quality and maintainability |
| /langchain-agent | Create LangGraph-based agents |
| /prompt-optimize | Optimize prompts for production LLMs |

Workflow Examples

Troubleshoot Production Issues

1. use debugger to analyze error logs and stack traces
2. use devops-troubleshooter to check infrastructure
3. /debug-cuda-crash if GPU-related issues
4. use code-reviewer to identify root cause in code
5. /commit fix with conventional commit message

End-to-End LLM Deployment with K8s & Helm

A comprehensive workflow from ideation to production deployment of an LLM serving infrastructure.

Phase 1: Research & Ideation

1. /notebooklm query research notebooks for LLM serving best practices
2. /brainstorming explore deployment requirements and constraints

Phase 2: Architecture & Design

3. use docs-architect to create system design documentation
4. use architect-reviewer to validate architecture decisions

Phase 3: Planning

5. /writing-plans create implementation plan with TDD approach
6. /planning-disaster-recovery define RTO/RPO and backup strategies

Phase 4: Implementation

7. /vllm configure model serving with tensor parallelism
8. /high-performance-inference optimize with AWQ quantization
9. /k8s-manifest-generator create Deployment, Service, ConfigMap
10. /helm-chart-scaffolding or /helm-chart-vllm package as reusable Helm chart

Phase 5: Review & Quality

11. use code-reviewer to analyze security and performance
12. use refactoring-specialist if code improvements needed

Phase 6: Benchmarking & Debugging

13. Deploy to staging cluster with helm install
14. use devops-troubleshooter if pod issues occur
15. use debugger for application-level errors
16. /debug-cuda-crash if GPU-related issues
17. /aiperf-benchmark to benchmark model performance

Phase 7: Documentation

18. /generating-documentation create deployment runbook and API docs
19. /commit document changes with conventional commit

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.