Use when optimizing C++ code for latency, throughput, memory, or cache behavior, or when reviewing performance-sensitive C++ changes.
npx skills add omarkilani/perf-hints
Or install specific skill: npx add-skill https://github.com/omarkilani/perf-hints
# Description
Improve C++ (C++11+) performance using Abseil “Performance Hints” (Jeff Dean & Sanjay Ghemawat): estimation + profiling, API/bulk design, algorithmic wins, cache-friendly memory layout, fewer allocations, fast paths, caching, and compiler-friendly hot loops. Use for performance code reviews, refactors, and profiling-driven optimizations. Keywords: performance, latency, throughput, cache, allocation, memory layout, InlinedVector, string_view, Span, flat_hash_map, pprof, perf.
# SKILL.md
name: perf-hints
description: 'Improve C++ (C++11+) performance using Abseil “Performance Hints” (Jeff Dean & Sanjay Ghemawat): estimation + profiling, API/bulk design, algorithmic wins, cache-friendly memory layout, fewer allocations, fast paths, caching, and compiler-friendly hot loops. Use for performance code reviews, refactors, and profiling-driven optimizations. Keywords: performance, latency, throughput, cache, allocation, memory layout, InlinedVector, string_view, Span, flat_hash_map, pprof, perf.'
license: Apache-2.0
compatibility: Skills-compatible coding agents working on modern C++ codebases; Abseil optional.
metadata:
  upstream_url: "https://abseil.io/fast/hints"
  upstream_authors: "Jeff Dean; Sanjay Ghemawat"
  upstream_original_version: "2023-07-27"
  upstream_last_updated: "2025-12-16"
C++ Performance Hints (Jeff Dean & Sanjay Ghemawat style)
This skill packages key ideas from Abseil’s Performance Hints document.
Use it to:
- review C++ code for performance risks
- propose high-impact optimizations with explicit tradeoffs
- design APIs/data structures that keep future optimizations possible
- write an experiment plan (profile + microbenchmark) to validate changes
Scope and guardrails
- Scope: single-process / single-binary performance (CPU, memory, allocations, cache behavior).
- Do not: change externally observable behavior unless the user explicitly agrees.
- Do not: introduce undefined behavior, data races, or brittle “clever” micro-opts without evidence.
- Default philosophy: choose the faster alternative when it doesn’t materially harm readability or complexity; otherwise, measure first.
When to apply
Use this skill when the task involves any of:
- reducing latency or improving throughput
- cutting memory footprint or allocation rate
- improving cache locality / reducing cache misses
- designing performant APIs (bulk ops, view types, threading model)
- reviewing performance-sensitive C++ changes
- interpreting a flat profile and finding next steps
What to ask for (minimum inputs)
If you don’t have enough information, ask for the smallest set that changes your recommendation quality:
- Goal: latency vs throughput vs memory (and the SLO if any)
- Where: hot path vs init vs test-only (and typical input sizes)
- Evidence: profile / flame graph / perf counters / allocation profile (if available)
- Constraints: correctness constraints, ABI/API constraints, thread-safety requirements
If no such evidence exists yet, proceed with static analysis plus a “what to measure first” plan.
Workflow: how an agent should use these hints
Step 1 — classify the code
- Test code: mostly care about asymptotic complexity and test runtime.
- Application code: separate init/cold vs hot path.
- Library code: prefer “safe, low-complexity” performance techniques because you can’t predict callers.
Step 2 — do a back-of-the-envelope estimate
Before implementing changes, estimate what might dominate:
- Count expensive operations (seeks, round-trips, allocations, bytes touched, comparisons, etc.)
- Multiply by rough cost.
- If latency matters and there is concurrency, consider overlap.
Reference latency table (rough order-of-magnitude)
| Operation | Approx time |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 3 ns |
| Branch mispredict | 5 ns |
| Mutex lock/unlock (uncontended) | 15 ns |
| Main memory reference | 50 ns |
| Compress 1K bytes with Snappy | 1,000 ns |
| Read 4KB from SSD | 20,000 ns |
| Round trip within same datacenter | 50,000 ns |
| Read 1MB sequentially from memory | 64,000 ns |
| Read 1MB over 100 Gbps network | 100,000 ns |
| Read 1MB from SSD | 1,000,000 ns (1 ms) |
| Disk seek | 5,000,000 ns (5 ms) |
| Read 1MB sequentially from disk | 10,000,000 ns (10 ms) |
| Send packet CA→Netherlands→CA | 150,000,000 ns (150 ms) |
Estimation examples (templates)
Quicksort a billion 4-byte integers (very rough):
- Memory transfer: ~30 passes × 4GB ÷ 16GB/s ≈ 7.5s
- Branch mispredicts: ~15B misses × 5ns ≈ 75s
- Total: ≈ 82.5s (dominant cost is mispredicts; cache refinements reduce memory-transfer term)
Web page with 30 thumbnails from 1MB images:
- On disk serial: 30 × (5ms seek + 10ms transfer) ≈ 450ms
- Spread across enough disks to read in parallel: ≈ one seek + one transfer ≈ 15ms (ignoring variance)
- Single SSD: 30 × (20µs + 1ms) ≈ 30ms
Step 3 — measure before paying complexity
When you can, measure to validate impact:
- Start with high-level CPU profiling (pprof) and consider perf for deeper counter-based analysis.
- Build optimized binaries with useful debug info.
- Prefer a microbenchmark when iterating on a specific routine, but treat microbenchmarks carefully (they can mislead).
- Watch for lock contention: contention can lower CPU usage and hide the “real” bottleneck.
Step 4 — pick the biggest lever first
Prioritize in this order unless evidence suggests otherwise:
- Algorithmic complexity wins (O(N²) → O(N log N) or O(N))
- Data structure choice / memory layout (cache locality; fewer cache lines)
- Allocation reduction (fewer allocs, better reuse)
- Avoid unnecessary work (fast paths, precompute, defer)
- Compiler friendliness (simplify hot loops, reduce abstraction overhead)
Step 5 — produce an LLM-friendly output
When you respond to the user, use this structure:
- Hot path hypothesis (what you think dominates, and why)
- Top issues (ranked): issue → evidence/estimate → proposed fix → expected impact
- Patch sketch (minimal code changes or pseudocode)
- Tradeoffs & risks (correctness, memory, ABI, complexity)
- Measurement plan (what to profile/benchmark and success criteria)
Techniques and patterns
1) API design for performance
Use bulk APIs to amortize overhead
When: callers do N similar operations (lookups, deletes, updates, decoding, locking).
Why: reduce boundary crossings and repeated fixed costs (locks, dispatch, decoding, syscalls).
Patterns:
- Prefer `DeleteRefs(Span<const Ref> refs)` over calling `DeleteRef(Ref)` in a loop if it lets you lock once (sketch below).
- Provide bulk “build” APIs (e.g., Floyd heap construction is O(N) vs inserting N items one at a time at O(N log N)).
Pragmatic migration: if you can’t change callers quickly, use the bulk API internally and cache results for future non-bulk calls.
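A minimal sketch of the bulk-delete pattern, assuming a hypothetical `RefRegistry` guarded by a single mutex (names are illustrative, not from the source):

```cpp
#include "absl/base/thread_annotations.h"
#include "absl/container/flat_hash_set.h"
#include "absl/synchronization/mutex.h"
#include "absl/types/span.h"

struct Ref { int id; };

class RefRegistry {
 public:
  // Per-call API: pays one lock acquisition per ref.
  void DeleteRef(const Ref& ref) {
    absl::MutexLock lock(&mu_);
    live_.erase(ref.id);
  }

  // Bulk API: one lock acquisition amortized across all refs.
  void DeleteRefs(absl::Span<const Ref> refs) {
    absl::MutexLock lock(&mu_);
    for (const Ref& ref : refs) live_.erase(ref.id);
  }

 private:
  absl::Mutex mu_;
  absl::flat_hash_set<int> live_ ABSL_GUARDED_BY(mu_);
};
```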
Prefer view types for function arguments
When: you don’t need ownership transfer.
Use:
- `std::string_view` / `absl::string_view` instead of `const std::string&` when you can accept any string-backed buffer.
- `absl::Span<T>` or `std::span<T>` for contiguous sequences.
- `absl::FunctionRef<R(Args...)>` for callbacks when you don’t need ownership.
Why: reduces copies and lets callers use efficient containers (including inlined/chunked types).
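A minimal sketch of view-type parameters (`std::string_view` needs C++17; `absl::string_view` works earlier; functions are illustrative):

```cpp
#include <string_view>
#include "absl/types/span.h"

// Accepts std::string, string literals, or substrings without copying.
bool HasScheme(std::string_view url, std::string_view scheme) {
  return url.size() > scheme.size() &&
         url.substr(0, scheme.size()) == scheme &&
         url[scheme.size()] == ':';
}

// Accepts std::vector<int>, std::array<int, N>, or plain C arrays.
long Sum(absl::Span<const int> xs) {
  long total = 0;
  for (int x : xs) total += x;
  return total;
}
```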
Accept pre-allocated / pre-computed arguments
When: a low-level routine is called frequently and it would otherwise allocate temporaries or recompute something the caller already has.
Pattern: add overloads that accept caller-owned scratch buffers or already-known timestamps/values.
Thread-compatible vs thread-safe types
- Default many generally-used types to thread-compatible (externally synchronized) so single-threaded or already-synchronized callers don’t pay for internal locking.
- If typical usage requires synchronization, internalize it so you can later optimize the mechanism (e.g., sharding to reduce contention) without changing callers.
2) Algorithmic improvements
The rare-but-massive wins.
Reduce complexity class
Common transformations:
- O(N²) → O(N log N) or O(N)
- O(N log N) sorted-list intersection → O(N) using a hash set
- O(log N) interval-map lookup → O(1) using hash lookup (if adjacency/coalescing can still be supported)
Patterns to look for (code review heuristics)
- nested loops over large N (especially over “all pairs”)
- repeated sorting/intersection where hashing would do
- maps/trees used for “lookup by key” when ordering is unused
- hash tables with poor hash functions (accidental O(N))
Concrete examples from the source document
- Graph initialization / cycle detection: instead of adding nodes/edges one-by-one to a cycle detector (expensive per edge), add the entire graph in reverse post-order so cycle detection becomes trivial.
- Deadlock detection: replacing a deadlock detection algorithm with a dynamic topological sort approach yields ~50× speedup and scales to very large graphs. In a DEBUG benchmark at the old 2K-node limit, the new algorithm is about 0.5µs per InsertEdge vs 22µs previously.
- IntervalMap → hash table: if coalescing adjacent blocks can be done via hash lookups, replace O(log N) lookup structures with O(1) hash lookups; the example mentions an allocator improving by ~4× (along with other changes).
- Sorted intersection → hash lookups: build a hash set for one list and probe with the other; an example compilation benchmark improved from ~28.5s to ~22.4s.
- Good hashing matters: invest in a hash function that avoids collisions and keeps expected operations O(1).
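A minimal sketch of the sorted-intersection → hash-lookup transformation mentioned above (assumes Abseil; `std::unordered_set` works the same way):

```cpp
#include <algorithm>
#include <vector>
#include "absl/container/flat_hash_set.h"

// O(N + M) expected, versus O(N log N + M log M) for sort-then-merge.
std::vector<int> Intersect(const std::vector<int>& a, const std::vector<int>& b) {
  absl::flat_hash_set<int> in_a(a.begin(), a.end());
  std::vector<int> out;
  out.reserve(std::min(a.size(), b.size()));
  for (int x : b) {
    if (in_a.erase(x) > 0) out.push_back(x);  // erase() also deduplicates the output
  }
  return out;
}
```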
3) Better memory representation and cache locality
Memory layout often dominates when working sets are large.
Compact data structures
- Prefer compact representations for frequently accessed data or data comprising a large fraction of memory.
- Watch for cache-line contention when multiple threads write to nearby memory.
Memory layout checklist
For a frequently used struct/class:
- Reorder fields to reduce padding.
- Use smaller numeric types when safe.
- Use sized enums when appropriate: `enum class OpType : uint8_t { ... }` (see the sketch after this checklist).
- Group fields commonly accessed together.
- Separate hot mutable fields from hot read-only fields to avoid eviction.
- Move cold data to the end or behind indirection.
- Consider bit/byte packing only inside well-tested modules and validate with benchmarks.
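A minimal sketch of the reorder / sized-type items above (types are illustrative; sizes are typical for a 64-bit ABI, not guaranteed):

```cpp
#include <cstdint>

enum class OpType : uint8_t { kRead, kWrite };

// Before: narrow fields interleaved with wide ones force padding.
struct OpBefore {
  OpType type;      // 1 byte + 7 bytes padding
  int64_t offset;   // 8 bytes
  bool done;        // 1 byte + 3 bytes padding
  int32_t length;   // 4 bytes
};                  // commonly 24 bytes

// After: wide fields first, narrow fields grouped at the end.
struct OpAfter {
  int64_t offset;   // 8 bytes
  int32_t length;   // 4 bytes
  OpType type;      // 1 byte
  bool done;        // 1 byte + 2 bytes padding
};                  // commonly 16 bytes

static_assert(sizeof(OpAfter) <= sizeof(OpBefore),
              "reordering should not grow the struct");
```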
Indices instead of pointers
If you have pointer-heavy structures:
- consider using 32-bit indices into contiguous `T[]` storage
- benefits: smaller references, better locality, fewer cache lines touched
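A minimal sketch of index-based links into contiguous storage (illustrative tree type, not from the source):

```cpp
#include <cstdint>
#include <vector>

struct Node {
  static constexpr uint32_t kNull = 0xFFFFFFFFu;  // "no child" sentinel
  int32_t value = 0;
  uint32_t left = kNull;    // 4-byte index instead of an 8-byte pointer
  uint32_t right = kNull;
};

class Tree {
 public:
  uint32_t Add(int32_t value) {
    Node n;
    n.value = value;
    nodes_.push_back(n);
    return static_cast<uint32_t>(nodes_.size() - 1);
  }
  Node& At(uint32_t index) { return nodes_[index]; }

 private:
  std::vector<Node> nodes_;  // one contiguous allocation, not one per node
};
```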
Batched / flat storage
Avoid “one allocation per element” structures:
- prefer `std::vector` and flat/chunked hash sets/maps over `std::map`/`std::unordered_map` when possible
- chunking can preserve asymptotics while improving cache footprint
Inlined storage
When containers are usually small and frequently constructed:
- use `absl::InlinedVector<T, N>` (or similar small-buffer-optimized containers)
- caveat: avoid if `sizeof(T)` is large (the inlined backing store becomes too big)
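A minimal sketch (assumes Abseil): a usually-small, frequently constructed result that avoids heap allocation in the common case:

```cpp
#include "absl/container/inlined_vector.h"

// Up to 4 elements live inside the object itself; larger results spill to the heap.
absl::InlinedVector<int, 4> Digits(int n) {
  absl::InlinedVector<int, 4> out;
  do {
    out.push_back(n % 10);
    n /= 10;
  } while (n > 0);
  return out;
}
```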
Maps: when to flatten vs when to nest
There are two opposite-but-related patterns:
1) Avoid unnecessarily nested maps by using a compound key (fewer allocations, better cache footprint):
- `btree_map<A, btree_map<B, C>>` → `btree_map<pair<A, B>, C>`
2) Introduce hierarchy / nesting when a large first-level key is repeated across many entries:
- if a string “path” appears in ~1000 keys on average, a two-level map can reduce repeated string storage by ~1000× and improve locality
- an example reports a 76% microbenchmark improvement from splitting a single-level map into a two-level structure
Arenas
Arenas reduce allocation overhead and can pack related objects together:
- fewer cache lines touched
- less destruction overhead
- often a big win for complex object graphs
Caveat: don’t put many short-lived objects into long-lived arenas (memory bloat).
Arrays instead of maps
If the key domain is a small integer range or enum (or the map is tiny):
- replace maps with arrays/vectors
- e.g., payload type → clock rate as `int map[128]` (sketch below)
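A minimal sketch of replacing a map with a flat array when keys fit in a small integer range (the wrapper type and values are illustrative):

```cpp
#include <array>
#include <cstdint>

class ClockRates {
 public:
  void Set(uint8_t payload_type, int clock_rate_hz) {
    rates_[payload_type & 0x7F] = clock_rate_hz;
  }
  int Get(uint8_t payload_type) const {
    return rates_[payload_type & 0x7F];  // one indexed load; no hashing, no probing
  }

 private:
  std::array<int, 128> rates_{};  // 0 means "unknown payload type"
};
```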
Bit vectors / bit matrices instead of sets
If the key domain is representable as small integers:
- replace sets with a bit vector (e.g., `InlinedBitVector<256>`); a sketch follows below
- use bitwise ops for union/intersection
- for reachability/transitive relationships, consider a bit matrix after assigning dense IDs
One example replacing a set with a bit-vector reports ~26–32% improvement across various benchmark sizes.
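A minimal sketch of the same idea using `std::bitset` (the `InlinedBitVector` named in the source is not a public type):

```cpp
#include <bitset>

using IdSet = std::bitset<256>;  // one bit per possible dense ID

inline void Insert(IdSet& s, int id) { s.set(static_cast<size_t>(id)); }
inline bool Contains(const IdSet& s, int id) { return s.test(static_cast<size_t>(id)); }

// Set algebra becomes a handful of word-wide bitwise operations.
inline IdSet Union(const IdSet& a, const IdSet& b) { return a | b; }
inline IdSet Intersection(const IdSet& a, const IdSet& b) { return a & b; }
```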
4) Reduce allocations (and allocator overhead)
Allocations cost time in the allocator, touch new cache lines, and incur init/destruction overhead.
Avoid unnecessary allocations
Common patterns:
- use statically allocated buffers when size is usually bounded (e.g., static “zero buffers” up to a typical max)
- prefer stack allocation when lifetime is bounded by scope (watch stack size)
- avoid `shared_ptr` when ownership is clear
- for “empty” shared objects, consider reusing a static empty instance
Resize or reserve containers
When you know the expected max size:
- use `reserve()` before repeated `push_back()`/`emplace_back()`
- or `resize()` and fill via pointer/index if that avoids repeated growth checks
Caveats:
- don’t grow one element at a time via repeated `reserve`/`resize` (can go quadratic)
- if construction is expensive, prefer `reserve` + `push_back` over `resize` (avoids constructing elements twice)
Avoid copying when possible
- prefer move (`std::move`) for large structures
- store pointers/indices in transient structures instead of copying large objects
- sort indices instead of sorting large objects when moves/copies are expensive (sketch below)
- prefer `std::sort` over `std::stable_sort` unless stability is required
Example: avoiding an extra copy when receiving ~400KB tensors via gRPC improved a benchmark by ~10–15%.
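A minimal sketch of sorting indices instead of large records (the record type is illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

struct Record { int key; std::string blob; };  // blob may be large

std::vector<size_t> SortedOrder(const std::vector<Record>& records) {
  std::vector<size_t> order(records.size());
  std::iota(order.begin(), order.end(), 0);   // 0, 1, 2, ...
  std::sort(order.begin(), order.end(),       // moves 8-byte indices, not Records
            [&](size_t a, size_t b) { return records[a].key < records[b].key; });
  return order;
}
```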
Reuse temporary objects
Hoist loop-local allocations outside the loop:
- reuse `std::string` via `clear()`
- reuse protobuf objects via `Clear()`
- reuse serialization buffers by passing a caller-owned scratch string
Caveat: many containers grow to their max-ever size; periodically reconstruct after N uses if peak sizes are rare.
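A minimal sketch of hoisting a loop-local string and reusing it (names are illustrative):

```cpp
#include <string>
#include <vector>

void ProcessAll(const std::vector<std::string>& names) {
  std::string scratch;               // allocated (at most) once
  for (const std::string& name : names) {
    scratch.clear();                 // keeps capacity from earlier iterations
    scratch.append("prefix/");
    scratch.append(name);
    // ... use scratch ...
  }
}
```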
5) Avoid unnecessary work
Often the biggest category of wins after “big-O”.
Fast paths for common cases
Structure code so the common case stays in the I-cache and does minimal branching:
- extend fast ASCII scanning in UTF-8 parsing to handle trailing bytes
- fast path 1-byte varints; tail-call slow path for multi-byte
- skip error tracking loops when no errors occurred
- simplify InlinedVector resize paths (separate destroy vs in-place grow vs overflow)
- specialize common tensor initialization shapes (1D–4D)
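A minimal sketch of the 1-byte-varint fast path with a tail-called slow path (illustrative helpers, not the protobuf implementation):

```cpp
#include <cstddef>
#include <cstdint>

uint64_t ParseVarintSlow(const uint8_t* p, size_t size, size_t* consumed);

inline uint64_t ParseVarint(const uint8_t* p, size_t size, size_t* consumed) {
  if (size > 0 && p[0] < 0x80) {  // common case: value fits in one byte
    *consumed = 1;
    return p[0];
  }
  return ParseVarintSlow(p, size, consumed);  // rare case kept out of the hot path
}

uint64_t ParseVarintSlow(const uint8_t* p, size_t size, size_t* consumed) {
  uint64_t result = 0;
  size_t i = 0;
  for (; i < size && i < 10; ++i) {
    result |= static_cast<uint64_t>(p[i] & 0x7F) << (7 * i);
    if (p[i] < 0x80) { ++i; break; }  // high bit clear: last byte of the varint
  }
  *consumed = i;
  return result;
}
```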
Precompute expensive information
Patterns:
- precompute boolean flags or bitfields instead of recomputing predicates
- build small lookup tables (e.g., 256-entry byte tables)
- compute stats on demand instead of updating eagerly on hot operations
A simple example: reducing a preallocated pool size from 200 nodes to 10 cut a web server’s CPU usage by 7.5%.
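A minimal sketch of the small-lookup-table pattern above (the predicate is hypothetical):

```cpp
#include <array>
#include <cstdint>

// 256-entry table built once; each call is a single indexed load instead of a
// chain of range comparisons.
inline bool IsTokenChar(uint8_t c) {
  static const std::array<bool, 256> kTable = [] {
    std::array<bool, 256> t{};
    for (int i = 0; i < 256; ++i) {
      t[i] = (i >= 'a' && i <= 'z') || (i >= 'A' && i <= 'Z') ||
             (i >= '0' && i <= '9') || i == '_' || i == '-';
    }
    return t;
  }();
  return kTable[c];
}
```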
Move expensive computations outside loops
- hoist bounds checks, shape/dimension lookups, invariant pointer fetches
Defer expensive computations until needed
- check cheap conditions first
- avoid computing expensive values for branches/operands that are rarely taken
Example: deferring an expensive GetSubSharding call until needed reduced CPU time from 43s to 2s in a workload.
Search order matters
If searching two tiers and the second is a subset of the first:
- it can be cheaper to search the larger tier first if success allows skipping the smaller tier
An example reports 19% throughput improvement from changing the search order.
Specialize hot code paths
- custom formatting/logging instead of general formatting
- prefix matching instead of regex
- `StrCat`-style composition for known formats (e.g., IP addresses)
- specialize logging levels (e.g., VLOG constants) to enable constant propagation
Cache to avoid repeated work
- cache results keyed by a precomputed fingerprint
- decode blocks once and cache decoded entries
6) Make the compiler’s job easier (hot loops)
Only do this in truly hot code.
Techniques:
- Avoid function calls in the hottest loops when they force frame setup.
- Move slow paths into separate tail-called functions.
- Copy small data into locals to improve alias analysis and auto-vectorization.
- Hand-unroll very hot loops (e.g., CRC processing 16 bytes at a time).
- Replace span/view abstractions with raw pointers inside the innermost loop when needed.
- In rare hot paths, replace `ABSL_LOG(FATAL)` with `ABSL_DCHECK(false)` to avoid frame setup costs.
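A minimal sketch of the “copy into locals” and hand-unrolling items above (only worth it in genuinely hot loops; verify with a benchmark):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

uint32_t XorAll(const std::vector<uint32_t>& v) {
  const uint32_t* p = v.data();      // raw-pointer local: simpler aliasing
  const size_t n = v.size();         // size read once, not per iteration
  uint32_t a = 0, b = 0, c = 0, d = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {       // hand-unrolled by 4
    a ^= p[i];
    b ^= p[i + 1];
    c ^= p[i + 2];
    d ^= p[i + 3];
  }
  uint32_t acc = a ^ b ^ c ^ d;
  for (; i < n; ++i) acc ^= p[i];    // remainder
  return acc;
}
```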
7) Reduce stats and logging costs
Stats and logging can be surprisingly expensive in hot paths — even when "disabled."
Stats collection
- drop stats that aren't used
- prefer sampling over per-event accounting
- use power-of-two sampling with bitwise AND (`x & 31`) rather than modulus (see the sketch below)
One example reduces alarm-setting time from 771 ns to 271 ns as part of a broader stats-related cleanup.
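A minimal sketch of 1-in-32 sampling driven by a counter and a bitwise AND (the counter type is illustrative):

```cpp
#include <cstdint>

class SampledCounter {
 public:
  void RecordEvent(int64_t bytes) {
    // Cheap per-event cost: one increment plus a mask test.
    if ((++events_ & 31) == 0) {     // every 32nd event; power of two, no division
      sampled_bytes_ += bytes * 32;  // scale the sample back up to an estimate
    }
  }
  int64_t EstimatedBytes() const { return sampled_bytes_; }

 private:
  uint64_t events_ = 0;
  int64_t sampled_bytes_ = 0;
};
```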
Logging in hot paths
Even disabled logging has cost:
- `VLOG(n)` requires at least a load + compare on every call, even when the level is not enabled
- the presence of logging code can inhibit compiler optimizations (larger stack frames, blocked inlining)
- formatting arguments may be evaluated even if the log is ultimately suppressed
Mitigations:
- Remove logging entirely from innermost hot loops when possible
- Precompute `VLOG_IS_ON(n)` outside the loop and pass it as a parameter to helper functions
- Avoid building log message strings in hot paths; if you must log, gate the entire block:

```cpp
if (VLOG_IS_ON(2)) {
  VLOG(2) << ExpensiveDebugString();
}
```
8) Code size considerations
Large generated code increases compile/link time, icache pressure, and memory footprint.
Why it matters
- more icache misses in hot paths
- slower builds and larger binaries
- can push hot code out of L1 instruction cache
Techniques
- Small inline fast path + `NOINLINE` slow path: keep the common case tiny; move complex/rare handling to a separate function marked `ABSL_ATTRIBUTE_NOINLINE` or `[[gnu::noinline]]` (sketch after this list).
- Specialize for compile-time constants: e.g., VLOG levels 0–5 can have dedicated `SlowIsEnabled0()` … `SlowIsEnabled5()` functions, avoiding an extra integer parameter at each call site.
- Avoid heavy template instantiation in headers included by many translation units.
- Watch macro expansion: logging and assertion macros can silently bloat code.
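A minimal sketch of the small-inline-fast-path pattern (hypothetical `ByteSink`; the macro stands in for `ABSL_ATTRIBUTE_NOINLINE`):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

#if defined(__GNUC__) || defined(__clang__)
#define PERF_NOINLINE __attribute__((noinline))
#else
#define PERF_NOINLINE
#endif

class ByteSink {
 public:
  // Inline fast path: a compare, a store, and an increment at every call site.
  void Append(uint8_t b) {
    if (pos_ < kBufSize) {
      buf_[pos_++] = b;
      return;
    }
    AppendSlow(b);  // rare path kept out of line so call sites stay small
  }

 private:
  PERF_NOINLINE void AppendSlow(uint8_t b) {
    Flush();
    buf_[pos_++] = b;
  }
  void Flush() {
    std::fwrite(buf_, 1, pos_, stdout);  // placeholder sink
    pos_ = 0;
  }

  static constexpr size_t kBufSize = 4096;
  uint8_t buf_[kBufSize];
  size_t pos_ = 0;
};
```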
Flat-profile playbook
If no single hotspot dominates:
- Don’t discount many small wins (twenty 1% improvements can matter).
- Look for loops closer to the top of call stacks (flame graphs help).
- Consider structural refactors (one-shot construction instead of incremental mutation).
- Replace overly general abstractions with specialized code.
- Reduce allocations (allocation profiles help).
- Use hardware counters (cache misses, branch misses) to find invisible costs.
Examples (inputs → outputs)
These are intentionally small; in real work, pair them with a profile or a microbenchmark.
Example A — avoid repeated vector growth
Input (common anti-pattern):
```cpp
std::vector<Widget> out;
for (const auto& x : xs) {
  out.push_back(MakeWidget(x));
}
```
Agent output (sketch):
- Issue: repeated growth checks and reallocations possible.
- Fix: `out.reserve(xs.size())` if `xs.size()` is known/cheap; or `out.resize(n)` + fill by index/pointer if it simplifies hot code (patched sketch below).
- Expected impact: fewer reallocations, fewer copies/moves, better cache behavior.
- Measurement: microbench the loop for realistic `xs` sizes; also check the allocation profile.
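Patched sketch (assuming `xs.size()` is known and cheap):

```cpp
std::vector<Widget> out;
out.reserve(xs.size());            // one allocation up front
for (const auto& x : xs) {
  out.push_back(MakeWidget(x));    // no reallocation once capacity is reserved
}
```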
Example B — “nested map” choice depends on key duplication
Input: a single-level hash map keyed by (path_string, subkey_int) where each path_string repeats many times.
Agent output (sketch):
- Hypothesis: the repeated `path_string` dominates memory and cache footprint.
- Fix: split into `map<path_string, map<subkey_int, Value>>` so each path is stored once (sketch below).
- Why: reduces repeated storage and can improve locality when accessing many subkeys for the same path.
- Caveat: if the first-level key is huge and rarely repeats, the added indirection may not help.
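A minimal sketch of the two-level layout (names are illustrative; assumes Abseil containers):

```cpp
#include <string>
#include "absl/container/flat_hash_map.h"

struct Value { int v; };

// Before: flat_hash_map<std::pair<std::string, int>, Value>
//         stores a copy of the path string in every entry.
// After: each distinct path string is stored once, in the outer map.
using PerPath = absl::flat_hash_map<int, Value>;               // subkey -> value
using PathIndex = absl::flat_hash_map<std::string, PerPath>;   // path -> subkeys

inline void Insert(PathIndex& idx, const std::string& path, int subkey, Value v) {
  idx[path][subkey] = v;
}
```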
Example C — defer expensive computation
Input: compute an expensive value for every loop iteration, but only a small subset uses it.
Agent output (sketch):
- Fix: restructure to check the cheap condition first, then compute the expensive value only on the matching path.
- Expected impact: proportional to the fraction of iterations that skip the expensive work.
- Measurement: add a microbenchmark or count how often the condition is true in production.
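A minimal self-contained sketch of the restructuring (the expensive computation is a stand-in):

```cpp
#include <cmath>
#include <vector>

struct Item { double weight; double value; };

double ExpensiveScore(const Item& item) {    // stand-in for a costly computation
  return std::log1p(item.value) * std::sqrt(item.weight);
}

double TotalScore(const std::vector<Item>& items, double threshold) {
  double total = 0;
  for (const Item& item : items) {
    if (item.weight < threshold) continue;   // cheap check first, filters most items
    total += ExpensiveScore(item);           // expensive work only for survivors
  }
  return total;
}
```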
Quick review checklist
When reviewing a performance-sensitive change, scan for:
- [ ] Any O(N²) behavior on realistic N?
- [ ] Repeated sorting/intersection where hashing would do?
- [ ] Hash tables with weak hash functions or pathological keys?
- [ ] Per-element allocations (`new` inside loops, per-node maps/sets)?
- [ ] Missing `reserve()` / avoidable `resize()` patterns?
- [ ] Unnecessary copying of large objects (missing `std::move`, passing by value)?
- [ ] “Cold” data living next to “hot” data in structs?
- [ ] Pointer-heavy graphs that could be index-based arrays?
- [ ] Unnecessary locking or lock contention (thread-safe when thread-compatible would do)?
- [ ] Logging/stats in the hottest loops? (even disabled `VLOG` has load+compare cost)
- [ ] Missing fast paths for common cases?
- [ ] Repeated expensive computation that could be cached, hoisted, or deferred?
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.