failure-mode-analysis

by @miles-knowbl in AI & LLM

# Install this skill:

npx skills add miles-knowbl/orchestrator --skill "failure-mode-analysis"

Install specific skill from multi-skill repository

# Description

Apply MECE failure mode taxonomy to backend pipelines. Classifies failures by Location (L1-L4), Type (T1-T5), and Severity (S1-S4). Produces comprehensive failure mode inventory with validation status and coverage calculation.

# SKILL.md

---
name: failure-mode-analysis
description: "Apply MECE failure mode taxonomy to backend pipelines. Classifies failures by Location (L1-L4), Type (T1-T5), and Severity (S1-S4). Produces comprehensive failure mode inventory with validation status and coverage calculation."
phase: REVIEW
category: core
version: "1.0.0"
depends_on: [pipeline-discovery]
tags: [audit, failure-modes, mece, backend, analysis]
---

Failure Mode Analysis

Apply MECE failure mode taxonomy to backend pipelines.

When to Use

- After pipeline discovery: runs in the REVIEW phase on the identified P-series pipelines
- Systematic failure identification: find all the ways each pipeline can fail
- Coverage assessment: calculate what percentage of failure modes are validated
- When you say: "analyze failure modes", "what can break?", "find the risks"

Reference Requirements

MUST read before applying this skill:

| Reference | Why Required |
|-----------|--------------|
| `mece-taxonomy.md` | The three classification dimensions |
| `failure-mode-template.md` | How to document each failure mode |

Read if applicable:

| Reference | When Needed |
|-----------|-------------|
| `validation-checking.md` | How to verify whether a mode is validated |

Verification: Every pipeline step has been analyzed for failure modes.

Required Deliverables

| Deliverable | Location | Condition |
|-------------|----------|-----------|
| `PIPELINE-FAILURE-MODES.md` | Project root | Always |
| Coverage update | `audit-state.json` | Always |

Core Concept

Failure Mode Analysis answers: "What are all the ways this pipeline can fail?"

MECE = Mutually Exclusive, Collectively Exhaustive
- Every failure belongs to exactly one category
- All possible failures are covered

The three dimensions:
1. Location (WHERE) — Where in the pipeline does it fail?
2. Type (WHAT) — What kind of failure is it?
3. Severity (HOW BAD) — How serious is the impact?

The MECE Taxonomy

Dimension 1: Location (WHERE)

| Code | Location | Description | Examples |
|------|----------|-------------|----------|
| L1 | Input | Data entering pipeline is invalid/missing | Bad request, missing field |
| L2 | Processing | Transformation logic fails | Algorithm error, edge case |
| L3 | Output | Storage or delivery fails | DB write error, API failure |
| L4 | Integration | Cross-pipeline handoff fails | Missing dependency, format mismatch |

Dimension 2: Type (WHAT)

| Code | Type | Description | Examples |
|------|------|-------------|----------|
| T1 | Data | Missing, malformed, stale data | Null field, wrong format |
| T2 | Logic | Algorithm error, wrong branch | Edge case, off-by-one |
| T3 | Infrastructure | Auth, network, timeout | Rate limit, connection lost |
| T4 | Quality | Succeeds but output is bad | Low quality, wrong tone |
| T5 | UX | Works but experience is broken | Confusing, no feedback |

Dimension 3: Severity (HOW BAD)

| Code | Severity | Description | User Impact |
|------|----------|-------------|-------------|
| S1 | Silent | Fails without error | Bad data persists, user unaware |
| S2 | Partial | Some data correct, some wrong | Partial results, confusing |
| S3 | Visible | Error surfaced but unclear | User sees error, doesn't understand |
| S4 | Blocking | Cannot complete | User stuck, needs help |
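
One way to keep the three dimensions mutually exclusive in code is to model each as a closed union of codes. The sketch below is illustrative only: the codes and labels come from the tables above, while the `ClassifiedFailure` shape and the `describeClassification` helper are hypothetical names, not part of the skill.

```typescript
// Closed unions: a code outside the taxonomy is a compile-time error.
type LocationCode = "L1" | "L2" | "L3" | "L4";
type TypeCode = "T1" | "T2" | "T3" | "T4" | "T5";
type SeverityCode = "S1" | "S2" | "S3" | "S4";

// Record<...> forces exactly one label per code (labels from the tables above).
const LOCATION_LABELS: Record<LocationCode, string> = {
  L1: "Input", L2: "Processing", L3: "Output", L4: "Integration",
};
const TYPE_LABELS: Record<TypeCode, string> = {
  T1: "Data", T2: "Logic", T3: "Infrastructure", T4: "Quality", T5: "UX",
};
const SEVERITY_LABELS: Record<SeverityCode, string> = {
  S1: "Silent", S2: "Partial", S3: "Visible", S4: "Blocking",
};

// Hypothetical record shape: exactly one code per dimension.
interface ClassifiedFailure {
  location: LocationCode;
  type: TypeCode;
  severity: SeverityCode;
}

// Formats the codes in the "L2-Processing" style used in failure mode tables.
function describeClassification(f: ClassifiedFailure): string {
  return [
    `${f.location}-${LOCATION_LABELS[f.location]}`,
    `${f.type}-${TYPE_LABELS[f.type]}`,
    `${f.severity}-${SEVERITY_LABELS[f.severity]}`,
  ].join(" / ");
}

const exampleFailure: ClassifiedFailure = { location: "L2", type: "T4", severity: "S1" };
console.log(describeClassification(exampleFailure));
// L2-Processing / T4-Quality / S1-Silent
```

Because each field accepts a single code, a failure cannot be filed under two locations, types, or severities at once.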

Analysis Process

┌─────────────────────────────────────────────────────────────┐
│              FAILURE MODE ANALYSIS PROCESS                  │
│                                                             │
│  FOR EACH PIPELINE:                                         │
│                                                             │
│  1. WALK THROUGH EACH STEP                                  │
│     └─→ What can go wrong at this step?                    │
│                                                             │
│  2. FOR EACH POTENTIAL FAILURE                              │
│     ├─→ Assign Location (L1-L4)                            │
│     ├─→ Assign Type (T1-T5)                                │
│     ├─→ Assign Severity (S1-S4)                            │
│     └─→ Generate ID: P{N}-{NNN}                            │
│                                                             │
│  3. CHECK VALIDATION STATUS                                 │
│     ├─→ Is there explicit error handling?                  │
│     ├─→ Is there test coverage?                            │
│     └─→ Mark VALIDATED or UNVALIDATED                      │
│                                                             │
│  4. CALCULATE COVERAGE                                      │
│     └─→ (Validated / Total) × 100%                         │
└─────────────────────────────────────────────────────────────┘
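
The ID scheme in step 2 can be sketched as a small helper. This is a minimal illustration, not the skill's implementation: `formatFailureModeId` and `assignIds` are hypothetical names, and the three-digit zero padding is inferred from IDs such as `P2-007`.

```typescript
// Builds an ID of the form P{N}-{NNN}, e.g. pipeline 2, failure 7 -> "P2-007".
function formatFailureModeId(pipelineNumber: number, sequence: number): string {
  const isPositiveInt = (n: number): boolean =>
    isFinite(n) && n >= 1 && Math.floor(n) === n;
  if (!isPositiveInt(pipelineNumber)) {
    throw new RangeError("pipelineNumber must be a positive integer");
  }
  if (!isPositiveInt(sequence) || sequence > 999) {
    throw new RangeError("sequence must be an integer in 1..999");
  }
  // Left-pad the sequence to three digits.
  return `P${pipelineNumber}-${("00" + sequence).slice(-3)}`;
}

// Assigns sequential IDs to the failures found while walking one pipeline's steps.
function assignIds(
  pipelineNumber: number,
  descriptions: string[],
): { id: string; description: string }[] {
  return descriptions.map((description, index) => ({
    id: formatFailureModeId(pipelineNumber, index + 1),
    description,
  }));
}

console.log(formatFailureModeId(2, 7)); // P2-007
console.log(assignIds(1, ["File type not supported"])[0].id); // P1-001
```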

Failure Mode Documentation

### P2-007: Template selection too narrow

| Attribute | Value |
|-----------|-------|
| **ID** | P2-007 |
| **Pipeline** | P2: Content Generation |
| **Step** | 4. Template selection |
| **Location** | L2-Processing |
| **Type** | T4-Quality |
| **Severity** | S1-Silent |
| **Description** | Only 3 templates available, causing repetitive output |
| **Impact** | Generated content lacks variety |
| **Detection** | None (silent) |
| **Status** | UNVALIDATED |
| **Test Spec** | TEST-P2-007 |
| **Fix** | Expand template pool to 10+ variations |
| **Effort** | M |
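
If failure modes are kept as structured records, the table above can be generated rather than written by hand. The following is a sketch under that assumption: the `FailureModeDoc` interface and `renderFailureMode` function are hypothetical, with fields chosen to mirror the attribute rows of the example. It does not escape `|` characters inside values.

```typescript
// Hypothetical record mirroring the attribute rows of the example above.
interface FailureModeDoc {
  id: string;
  title: string;
  pipeline: string;
  step: string;
  location: string;
  type: string;
  severity: string;
  description: string;
  impact: string;
  detection: string;
  status: "VALIDATED" | "UNVALIDATED";
  testSpec: string;
  fix: string;
  effort: string;
}

// Renders one failure mode as a heading plus a two-column attribute table.
function renderFailureMode(m: FailureModeDoc): string {
  const rows: [string, string][] = [
    ["ID", m.id],
    ["Pipeline", m.pipeline],
    ["Step", m.step],
    ["Location", m.location],
    ["Type", m.type],
    ["Severity", m.severity],
    ["Description", m.description],
    ["Impact", m.impact],
    ["Detection", m.detection],
    ["Status", m.status],
    ["Test Spec", m.testSpec],
    ["Fix", m.fix],
    ["Effort", m.effort],
  ];
  const lines = [
    `### ${m.id}: ${m.title}`,
    "",
    "| Attribute | Value |",
    "|-----------|-------|",
  ];
  for (const [label, value] of rows) {
    lines.push(`| **${label}** | ${value} |`);
  }
  return lines.join("\n");
}
```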

Output Format

PIPELINE-FAILURE-MODES.md

# Pipeline Failure Modes

## Summary

| Pipeline | Failure Modes | Validated | Coverage |
|----------|---------------|-----------|----------|
| P1: Source Ingestion | 9 | 4 | 44% |
| P2: Content Generation | 12 | 3 | 25% |
| P3: Publishing | 8 | 5 | 63% |
| **Total** | **29** | **12** | **41%** |

## Risk Distribution

| Severity | Count | Validated | Risk Level |
|----------|-------|-----------|------------|
| S1-Silent | 8 | 2 | HIGHEST |
| S2-Partial | 7 | 3 | HIGH |
| S3-Visible | 9 | 5 | MEDIUM |
| S4-Blocking | 5 | 2 | LOW |

## P1: Source Ingestion

### P1-001: File type not supported

| Attribute | Value |
|-----------|-------|
| **Location** | L1-Input |
| **Type** | T1-Data |
| **Severity** | S3-Visible |
| **Description** | User uploads unsupported file type |
| **Status** | VALIDATED (try-catch in upload handler) |

[... more failure modes ...]

Coverage Calculation

A failure mode is Validated only if:

Explicit handling exists:

```typescript
try {
  await processFile(file);
} catch (error) {
  // Specific handling for this failure
}
```

Test coverage exists:

```typescript
test('handles unsupported file type', () => {
  expect(() => upload(badFile)).toThrow();
});
```

Coverage formula:

Coverage = (Validated Modes / Total Modes) × 100%
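
As a worked example, the formula can be applied to the counts from the sample summary table. This is a minimal sketch: `coveragePercent` and `PipelineCounts` are hypothetical names, and two choices are assumptions rather than rules stated by the skill: rounding to the nearest whole percent, and reporting 0% for a pipeline with no failure modes.

```typescript
// Hypothetical shape for one row of the summary table.
interface PipelineCounts {
  pipeline: string;
  total: number;
  validated: number;
}

// Coverage = (Validated Modes / Total Modes) x 100%, rounded to a whole percent.
function coveragePercent(validated: number, total: number): number {
  if (total === 0) return 0; // assumption: no modes identified -> 0%
  return Math.round((validated / total) * 100);
}

// Counts taken from the sample summary table.
const summaryRows: PipelineCounts[] = [
  { pipeline: "P1: Source Ingestion", total: 9, validated: 4 },
  { pipeline: "P2: Content Generation", total: 12, validated: 3 },
  { pipeline: "P3: Publishing", total: 8, validated: 5 },
];

for (const row of summaryRows) {
  console.log(`${row.pipeline}: ${coveragePercent(row.validated, row.total)}%`);
}

// The overall figure is computed from summed counts, not by averaging percentages.
const totals = summaryRows.reduce(
  (acc, row) => ({
    total: acc.total + row.total,
    validated: acc.validated + row.validated,
  }),
  { total: 0, validated: 0 },
);
console.log(`Total: ${coveragePercent(totals.validated, totals.total)}%`);
```

With these inputs the per-pipeline results are 44%, 25%, and 63%, and the total is 12 of 29, or 41%, matching the sample table.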

Priority Order for Fixes

1. S1 (Silent): highest risk, since these fail without anyone knowing
2. L4 (Integration): high risk, since cross-pipeline handoffs are often 0% validated
3. S2 (Partial): medium risk, since data corruption is possible
4. S3 (Visible): lower risk, since at least the user knows
5. S4 (Blocking): usually already handled, since these failures are obvious
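
The ordering above mixes a severity rule with a location rule, so a sort needs a tie-break for modes that match more than one line. The sketch below assumes the first matching rule wins (an S1 failure at L4 ranks as S1); `fixPriority` and `sortByFixPriority` are hypothetical helpers, not part of the skill.

```typescript
type LocationCode = "L1" | "L2" | "L3" | "L4";
type SeverityCode = "S1" | "S2" | "S3" | "S4";

// Hypothetical minimal record: only the fields the ordering needs.
interface PrioritizedMode {
  id: string;
  location: LocationCode;
  severity: SeverityCode;
}

// Lower number = fix sooner. Rules are checked in the order listed above.
function fixPriority(mode: PrioritizedMode): number {
  if (mode.severity === "S1") return 1; // silent failures first
  if (mode.location === "L4") return 2; // then integration handoffs
  if (mode.severity === "S2") return 3;
  if (mode.severity === "S3") return 4;
  return 5; // S4: usually already handled
}

// Returns a sorted copy and leaves the input array untouched.
function sortByFixPriority(modes: PrioritizedMode[]): PrioritizedMode[] {
  return modes.slice().sort((a, b) => fixPriority(a) - fixPriority(b));
}

const backlog: PrioritizedMode[] = [
  { id: "P1-001", location: "L1", severity: "S4" },
  { id: "P1-002", location: "L4", severity: "S3" },
  { id: "P2-007", location: "L2", severity: "S1" },
  { id: "P3-004", location: "L3", severity: "S2" },
];
console.log(sortByFixPriority(backlog).map((m) => m.id).join(", "));
// P2-007, P1-002, P3-004, P1-001
```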

Validation Checklist

- [ ] Every pipeline step analyzed
- [ ] Each failure mode has L/T/S codes
- [ ] Unique ID assigned (P{N}-{NNN})
- [ ] Validation status checked against code
- [ ] Coverage percentages calculated
- [ ] Summary table generated
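
The mechanical items on this checklist (valid L/T/S codes, well-formed and unique IDs) lend themselves to an automated check. The sketch below is one possible approach, assuming the inventory has been parsed into plain string fields; `InventoryEntry` and `checkInventory` are hypothetical names. It does not cover the items that need human judgment, such as whether every pipeline step was analyzed.

```typescript
// Hypothetical inventory entry with codes held as plain strings.
interface InventoryEntry {
  id: string;
  location: string;
  type: string;
  severity: string;
}

const ID_PATTERN = /^P[1-9]\d*-\d{3}$/; // P{N}-{NNN}
const LOCATION_PATTERN = /^L[1-4]$/;
const TYPE_PATTERN = /^T[1-5]$/;
const SEVERITY_PATTERN = /^S[1-4]$/;

// Returns a list of human-readable problems; an empty list means the checks passed.
function checkInventory(entries: InventoryEntry[]): string[] {
  const problems: string[] = [];
  const seen: { [id: string]: boolean } = {};
  for (const e of entries) {
    if (!ID_PATTERN.test(e.id)) {
      problems.push(`${e.id}: ID does not match P{N}-{NNN}`);
    }
    if (Object.prototype.hasOwnProperty.call(seen, e.id)) {
      problems.push(`${e.id}: duplicate ID`);
    }
    seen[e.id] = true;
    if (!LOCATION_PATTERN.test(e.location)) {
      problems.push(`${e.id}: invalid location code ${e.location}`);
    }
    if (!TYPE_PATTERN.test(e.type)) {
      problems.push(`${e.id}: invalid type code ${e.type}`);
    }
    if (!SEVERITY_PATTERN.test(e.severity)) {
      problems.push(`${e.id}: invalid severity code ${e.severity}`);
    }
  }
  return problems;
}
```

Running it on the parsed contents of `PIPELINE-FAILURE-MODES.md` before marking the checklist complete would catch duplicate or malformed IDs early.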

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf
