Install a specific skill from a multi-skill repository:

`npx skills add miles-knowbl/orchestrator --skill "quality-eval-design"`
# Description
Design content and UX quality evaluation frameworks. Creates scoring rubrics, defines evaluation methodology, and produces eval documentation that can be used for ongoing quality assessment.
# SKILL.md
name: quality-eval-design
description: "Design content and UX quality evaluation frameworks. Creates scoring rubrics, defines evaluation methodology, and produces eval documentation that can be used for ongoing quality assessment."
phase: REVIEW
category: core
version: "1.0.0"
depends_on: [pipeline-discovery, ui-pipeline-discovery]
tags: [audit, quality, evaluation, content, ux]
# Quality Eval Design
Design content and UX quality evaluation frameworks.
## When to Use

- After pipeline discovery → Runs in REVIEW phase once pipelines are known
- Creating quality standards → Define what "good" looks like for this system
- Enabling ongoing evaluation → Produce reusable eval frameworks
- When you say: "design quality evals", "create scoring rubrics", "define quality standards"
## Reference Requirements

MUST read before applying this skill:

| Reference | Why Required |
|---|---|
| content-dimensions.md | Common content quality dimensions |
| ux-dimensions.md | UX quality dimensions |

Read if applicable:

| Reference | When Needed |
|---|---|
| eval-methodology.md | How to run evaluations |
**Verification:** Eval frameworks are specific to what the system produces.
## Required Deliverables

| Deliverable | Location | Condition |
|---|---|---|
| CONTENT-QUALITY-EVALS.md | Project root | If system generates content |
| UX-QUALITY-EVALS.md | Project root | Always (a UI exists) |
## Core Concept
Quality Eval Design answers: "What does 'good' look like for this system?"
Evaluations should be:
- System-specific → Tailored to what this system produces
- Measurable → 5-point scale with clear rubrics
- Evidence-based → Examples of each score level
- Reusable → Can be run repeatedly over time
## Eval Design Process

1. **Identify outputs** → What does this system produce?
2. **Determine categories** → Content (text, images, data)? UX (always yes for interactive systems)?
3. **Select dimensions** → What matters for each output type?
4. **Define weights** → How important is each dimension?
5. **Create rubrics** → What does 1-5 look like for each dimension?
6. **Set floors** → What's the minimum acceptable score?
7. **Document examples** → Concrete examples of each score level
## Content Quality Dimensions
Common dimensions for content-generating systems:
| Dimension | Description | Applies To |
|---|---|---|
| voice_fidelity | Matches intended voice/persona | All generated text |
| topic_relevance | Addresses intended topic | All generated text |
| engagement | Interesting, holds attention | Social, marketing |
| accuracy | Factually correct | Informational |
| clarity | Easy to understand | Documentation, instructions |
| coherence | Logical flow and structure | Long-form content |
| originality | Fresh, not generic | Creative content |
## UX Quality Dimensions
Common dimensions for interactive systems:
| Dimension | Description | Weight |
|---|---|---|
| responsiveness | UI responds quickly | 25% |
| feedback_clarity | Clear loading/success/error states | 25% |
| error_recovery | User can recover from errors | 20% |
| state_consistency | UI shows current state | 20% |
| accessibility | Keyboard, screen reader, contrast | 10% |
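With these weights, one evaluation pass reduces to a single number. For example, illustrative scores of 4, 3, 4, 3, and 5 on the five dimensions above would yield a weighted score of 0.25·4 + 0.25·3 + 0.20·4 + 0.20·3 + 0.10·5 = 3.65.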
## Rubric Template
### dimension_name
- **Weight:** NN%
- **Floor:** N.N
- **Description:** What this dimension measures
#### Scoring Rubric
| Score | Label | Definition |
|-------|-------|------------|
| 5 | Excellent | Exceeds expectations; delightful |
| 4 | Good | Meets expectations; minor issues |
| 3 | Acceptable | Functional; has rough edges |
| 2 | Poor | Below expectations; frustrating |
| 1 | Failed | Does not meet basic requirements |
#### Evidence Examples
- Score 5: "{Concrete example of excellent}"
- Score 3: "{Concrete example of acceptable}"
- Score 1: "{Concrete example of failed}"
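The template maps directly to data, which is what makes the evals reusable. A minimal Python sketch of one dimension; the `Dimension` structure and its field names are illustrative assumptions, not part of this skill's specification:

```python
from dataclasses import dataclass, field

@dataclass
class Dimension:
    """One rubric dimension. Names here are hypothetical, for illustration only."""
    name: str
    weight: float                                 # fraction of category total, e.g. 0.40 for 40%
    floor: float                                  # minimum acceptable score on the 1-5 scale
    description: str
    rubric: dict = field(default_factory=dict)    # score (1-5) -> definition
    examples: dict = field(default_factory=dict)  # score -> concrete evidence example

voice_fidelity = Dimension(
    name="voice_fidelity",
    weight=0.40,
    floor=2.5,
    description="Generated content matches the intended voice and persona",
    rubric={5: "Indistinguishable from target voice", 3: "Generally correct voice", 1: "No resemblance"},
    examples={5: "Thread captures creator's signature humor naturally"},
)
```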
## Output Format

### CONTENT-QUALITY-EVALS.md
# Content Quality Evaluations
Measures the quality of AI-generated content in [system name].
## Content Types
| Type | Pipeline | Description |
|------|----------|-------------|
| Twitter Thread | P2 | Generated tweet threads |
| Article | P2 | Long-form articles |
| Caption | P3 | Image captions |
## Dimensions
### voice_fidelity
- **Weight:** 40%
- **Floor:** 2.5
- **Description:** Generated content matches the intended voice and persona
#### Scoring Rubric
| Score | Label | Definition |
|-------|-------|------------|
| 5 | Excellent | Indistinguishable from target voice; natural, authentic |
| 4 | Good | Clearly in target voice; minor inconsistencies |
| 3 | Acceptable | Generally correct voice; occasionally generic |
| 2 | Poor | Voice inconsistent; often wrong |
| 1 | Failed | No resemblance to target voice |
#### Evidence Examples
- Score 5: "Thread captures creator's signature humor and catchphrases naturally"
- Score 3: "Thread is professional but lacks creator's distinctive style"
- Score 1: "Thread sounds like corporate press release despite casual persona"
[... more dimensions ...]
## Evaluation Methodology
1. Sample 5 outputs from each content type
2. Score each dimension 1-5
3. Calculate weighted average (a sketch follows this list)
4. Compare against quality gates
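Step 3 is a few lines of arithmetic. A minimal sketch, assuming the hypothetical `Dimension` structure from the rubric-template section above; `evaluate` is an illustrative name:

```python
def evaluate(dimensions, scores):
    """Weighted average plus per-dimension floor check.

    dimensions: list of Dimension (weights should sum to 1.0)
    scores: dict mapping dimension name -> score (1-5) for one sampled output
    """
    weighted = sum(d.weight * scores[d.name] for d in dimensions)
    floor_failures = [d.name for d in dimensions if scores[d.name] < d.floor]
    return weighted, floor_failures

# Illustrative usage for one sampled output:
# weighted, failures = evaluate([voice_fidelity], {"voice_fidelity": 4})
```

A floor failure blocks an output even when the weighted average looks healthy, which is the point of setting floors per dimension.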
### UX-QUALITY-EVALS.md
# UX Quality Evaluations
Measures the user experience of [system name].
## Dimensions
### responsiveness
- **Weight:** 25%
- **Floor:** 3.0
- **Description:** UI responds quickly to user actions
#### Scoring Rubric
| Score | Label | Definition |
|-------|-------|------------|
| 5 | Excellent | Instant response (<100ms); feels native |
| 4 | Good | Quick response (<500ms); smooth |
| 3 | Acceptable | Noticeable delay (<2s); usable |
| 2 | Poor | Slow response (2-5s); frustrating |
| 1 | Failed | Very slow or hangs (>5s) |
[... more dimensions ...]
## Evaluation Methodology
1. Walk through each U-series pipeline
2. Score each dimension at each step
3. Note specific issues as evidence
4. Calculate weighted average
## Quality Gates

Define ship thresholds:

| Weighted Score | Status | Action |
|---|---|---|
| >= 4.0 | Ship | Ready to launch |
| 3.0 to < 4.0 | Polish then ship | Address gaps first |
| < 3.0 | Fix before ship | Significant issues |
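A sketch of these thresholds as a gate check; the function name is illustrative and the status strings simply mirror the table:

```python
def quality_gate(weighted_score: float) -> str:
    """Map a weighted category score to a ship decision per the table above."""
    if weighted_score >= 4.0:
        return "Ship"
    if weighted_score >= 3.0:
        return "Polish then ship"
    return "Fix before ship"
```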
## Validation Checklist
- [ ] All system outputs identified
- [ ] Relevant categories selected (content/UX)
- [ ] Dimensions tailored to this system
- [ ] Weights sum to 100% per category (see the check after this list)
- [ ] Rubrics have clear score definitions
- [ ] Floors set for each dimension
- [ ] Examples provided where helpful
- [ ] Methodology documented
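The weights check is easy to automate. A sketch, again assuming the hypothetical `Dimension` structure from earlier:

```python
def check_weights(dimensions, tolerance=1e-9):
    """Verify that the dimension weights in one category sum to 100%."""
    total = sum(d.weight for d in dimensions)
    assert abs(total - 1.0) <= tolerance, f"weights sum to {total:.2%}, expected 100%"
```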
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.