npx skills add yihangchen1205/mdp-designer-skill
Or install specific skill: npx add-skill https://github.com/yihangchen1205/mdp-designer-skill
# Description
Designs/edits MDP terms (observations, rewards, terminations, goals/commands, randomization) and wires them into configs and logging. Use when improving an RL environment’s MDP definition for better learning.
# SKILL.md
name: mdp-designer
description: Designs/edits MDP terms (observations, rewards, terminations, goals/commands, randomization) and wires them into configs and logging. Use when improving an RL environment’s MDP definition for better learning.
MDP Designer
Use this skill when a task requires changing what the policy sees (observations), optimizes (rewards), how goals are produced (commands/targets), how episodes end (terminations), or how physics/assets are perturbed (randomization).
This skill is framework-agnostic: it applies to Gymnasium-style envs, simulator-backed RL envs, and custom training loops as long as there is a well-defined observation/reward/termination interface.
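For orientation, a minimal Gymnasium-style term boundary might look like the sketch below; the toy env and its helper names (`_compute_obs`, `_compute_reward`, `_check_termination`) are purely illustrative stand-ins for whatever structure the target codebase already has.

```python
import numpy as np
import gymnasium as gym


class PointGoalEnv(gym.Env):
    """Toy env whose step() delegates each MDP term to a small, testable function."""

    def __init__(self, goal_tolerance: float = 0.1, max_steps: int = 200):
        self.goal_tolerance = goal_tolerance
        self.max_steps = max_steps
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = self.np_random.uniform(-1, 1, size=2).astype(np.float32)
        self.goal = self.np_random.uniform(-1, 1, size=2).astype(np.float32)  # goal/command term
        self.t = 0
        return self._compute_obs(), {}

    def _compute_obs(self):
        # Observation term: what the policy sees.
        return np.concatenate([self.pos, self.goal - self.pos]).astype(np.float32)

    def _compute_reward(self):
        # Reward term: what the policy optimizes (negative distance to the goal).
        return -float(np.linalg.norm(self.goal - self.pos))

    def _check_termination(self):
        # Termination term: success ends the episode, the timeout truncates it.
        terminated = np.linalg.norm(self.goal - self.pos) < self.goal_tolerance
        truncated = self.t >= self.max_steps
        return bool(terminated), bool(truncated)

    def step(self, action):
        self.pos = self.pos + 0.05 * np.clip(action, -1.0, 1.0)
        self.t += 1
        obs = self._compute_obs()
        reward = self._compute_reward()
        terminated, truncated = self._check_termination()
        return obs, reward, terminated, truncated, {}
```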
What This Skill Produces
- A concrete list of files/functions to edit and the expected dataflow (config → env construction → MDP terms → rollout buffers/logging).
- Safe, minimal code changes that follow the target codebase’s existing patterns and abstractions.
- A sanity-check plan (smoke run / unit test / invariant checks) to catch silent failures.
Operating Procedure
- Locate the term boundary
  - Find where the env computes and returns the core MDP outputs (commonly `obs`, `reward`, `terminated`/`truncated`, `info`).
  - Identify whether the change belongs to:
    - Observations (feature construction, normalization, history/stacking, sensors)
    - Rewards (component definition, scaling, aggregation, per-component stats)
    - Terminations (done logic, timeouts, safety constraints, reset conditions)
    - Goals/commands (how targets are generated, curricula, reference trajectories)
    - Randomization (domain randomization, noise injection, parameter perturbations)
  - Prefer changing the smallest “term module” rather than scattering logic across the env step loop.
- Follow naming + logging conventions
  - Ensure new reward components and diagnostics use stable, collision-free keys.
  - Prefer emitting scalar or low-dimensional summaries (means, mins/maxes, rates) rather than raw tensors.
  - If the codebase has a logger convention (e.g., `info["stats"]`, `episode/*`, TensorBoard keys), follow it exactly so metrics surface without extra glue (see the reward-term sketch after this procedure).
- Make the change “config-driven”
  - Add config knobs for weights, toggles, thresholds, and schedules (curriculum, annealing).
  - Preserve backwards-compatible defaults unless the task explicitly wants behavior changes.
  - Keep config names and units explicit (meters vs centimeters, radians vs degrees).
- Guardrails
  - Validate shapes and dtypes at module boundaries (single env vs vectorized env; CPU vs GPU tensors).
  - Ensure rewards are finite and well-scaled; clamp values, and use safe division wherever a divisor could be zero.
  - Keep termination conditions mutually consistent (avoid contradictory “never terminate” or “always terminate” behavior).
  - Avoid creating hidden state unless you reset it correctly on episode reset.
- Minimal verification
  - Add a fast smoke test (instantiate the env, step N times, reset a few times) or reuse an existing test harness (see the smoke-test sketch after this procedure).
  - Confirm `obs` shapes match policy expectations, rewards are finite, and terminations occur at plausible rates.
  - Sanity-check that key metrics move in the expected direction (e.g., increasing a survival reward should not decrease episode length).
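To illustrate the naming/logging, config, and guardrail steps above, here is a minimal sketch of one possible reward component; `SlipPenaltyCfg`, `lateral_slip_penalty`, and the `info["stats"]` key layout are hypothetical and should be replaced with the target codebase’s own conventions.

```python
from dataclasses import dataclass

import numpy as np

EPS = 1e-8  # guard for safe division


@dataclass
class SlipPenaltyCfg:
    # Config knobs: toggle, weight, and normalization threshold, with explicit units.
    enabled: bool = True
    weight: float = -0.1        # per-step weight; negative makes this a penalty
    max_slip_mps: float = 2.0   # meters per second, used to normalize the raw slip


def lateral_slip_penalty(lateral_vel_mps: np.ndarray, cfg: SlipPenaltyCfg, info: dict) -> np.ndarray:
    """Reward component for a (possibly vectorized) env; returns one value per env."""
    if not cfg.enabled:
        return np.zeros_like(lateral_vel_mps)

    # Guardrails: safe-divide, clamp to keep the term well-scaled, and check finiteness.
    slip = np.abs(lateral_vel_mps) / max(cfg.max_slip_mps, EPS)
    slip = np.clip(slip, 0.0, 1.0)
    term = cfg.weight * slip
    assert np.all(np.isfinite(term)), "reward term must stay finite"

    # Diagnostics: low-dimensional summaries under stable, collision-free keys.
    stats = info.setdefault("stats", {})
    stats["reward/lateral_slip_mean"] = float(term.mean())
    stats["reward/lateral_slip_min"] = float(term.min())
    return term
```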
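For the verification step, a smoke test along these lines is usually enough to catch silent failures. It assumes a Gymnasium-style five-tuple `step` API; `make_env` is a placeholder for however the project constructs its env.

```python
import numpy as np


def smoke_test(make_env, n_steps: int = 200, n_resets: int = 3) -> None:
    """Cheap invariant check: obs shape/dtype, finite rewards, plausible termination rate."""
    env = make_env()
    episode_ends = 0
    for _ in range(n_resets):
        obs, info = env.reset()
        assert env.observation_space.contains(obs), "obs does not match the declared space"
        for _ in range(n_steps):
            obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
            assert env.observation_space.contains(obs), "obs does not match the declared space"
            assert np.isfinite(reward), "reward must be finite"
            if terminated or truncated:
                episode_ends += 1
                obs, info = env.reset()
    print(f"episodes ended {episode_ends} times over {n_resets * n_steps} steps")
```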
Common Pitfalls Checklist
- Observation mismatch between env output and policy input wiring (missing keys, wrong order, wrong normalization).
- Reward component is computed but never added to the scalar reward, or its weight has the wrong sign (see the sketch after this list).
- Reward magnitudes are off by orders of magnitude, causing value loss instability or saturated policies.
- Diagnostics are emitted under unexpected keys so they never appear in logs.
- Termination triggers too frequently (learning collapses) or never (episodes never reset).
- Randomization is applied after values are cached/compiled, resulting in no effect.
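To make the reward-aggregation pitfall concrete, a hypothetical aggregation function might look like the sketch below; `forward_progress`, `alive_bonus`, and `slip_term` are illustrative component names.

```python
def aggregate_reward(forward_progress: float, alive_bonus: float, slip_term: float) -> float:
    # Buggy variant: slip_term was computed (and even logged) upstream but never folded in:
    #     return forward_progress + alive_bonus

    # Fixed: every component flows into the scalar reward; its sign comes from the config
    # weight (a negative weight makes it a penalty), so logged stats match what is optimized.
    return forward_progress + alive_bonus + slip_term
```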
How To Adapt To Any Codebase
- Start from the rollout boundary: where a batch of transitions is produced for training.
- Trace backward to the source of each term: observation construction, reward aggregation, termination checks, goal generation.
- Prefer “one responsibility per module” so term logic is testable without running long rollouts.
- Keep a short mapping doc: config knob → code location → logged metric.
Example Prompts (Copy/Paste)
- “Add a new reward term that penalizes lateral slip and log its component stats.”
- “Change termination to end the episode when the agent tilts beyond a threshold, configurable via config.”
- “Add an optional exteroceptive observation (e.g., height samples) behind a config flag.”
- “Refactor reward computation into components with clear scaling and unit tests for each component.”
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.