npx skills add YuniorGlez/gemini-elite-core --skill "debug-master"
Install specific skill from multi-skill repository
# Description
Senior Site Reliability Engineer & Debug Architect. Expert in AI-assisted observability, distributed tracing, and autonomous incident remediation in 2026.
# SKILL.md
name: debug-master
id: debug-master
version: 1.1.0
description: "Senior Site Reliability Engineer & Debug Architect. Expert in AI-assisted observability, distributed tracing, and autonomous incident remediation in 2026."
🕵️‍♂️ Skill: Debug Master (v1.1.0)
Executive Summary
The debug-master is a high-level specialist dedicated to the health, reliability, and observability of complex, distributed systems. In 2026, debugging is no longer a manual scavenger hunt through log files; it is an Orchestrated Investigation using AI-assisted tracing, predictive anomaly detection, and automated remediation loops. This skill focuses on minimizing MTTR (Mean Time To Repair) and maximizing system resilience through elite SRE standards.
Table of Contents
- Incident Resolution Protocol
- The "Do Not" List (Anti-Patterns)
- Distributed Tracing (OpenTelemetry)
- Autonomous Remediation (Agentic Loop)
- Predictive Observability
- Fullstack Troubleshooting Layers
- Reference Library
🛠️ Incident Resolution Protocol
Every incident follows the Elite SRE Loop:
- Evidence Collection: Correlate metrics, logs, and traces. Read the "Observability Graph" to find the service in red.
- Impact Analysis: Determine the blast radius. Is it a single user, a region, or the entire tenant base?
- Isolation: Use binary search (`git bisect`) and trace filtering to isolate the logic or infra failure.
- Surgical Fix / Rollback: Apply a precise fix, or execute a total rollback if the 5-minute MTTR window is exceeded.
- Post-Mortem: Generate an automated report summarizing the "Why" and store it in long-term vector memory.
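The Isolation step relies on binary search over history, which is what `git bisect` automates. A minimal sketch of that idea follows, assuming an ordered commit list whose first entry is known good and last entry is known bad. The commit names and the `reproduces_failure` check are hypothetical stand-ins for a real reproduction script.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is
    also bad (the same assumption git bisect makes).
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the failure was introduced at or before mid
        else:
            lo = mid  # the failure was introduced after mid
    return commits[hi]


# Hypothetical history: the regression landed in commit "c5".
history = [f"c{i}" for i in range(8)]
checked = []


def reproduces_failure(commit: str) -> bool:
    """Stand-in for running the failing test against a checkout."""
    checked.append(commit)
    return int(commit[1:]) >= 5


culprit = first_bad_commit(history, reproduces_failure)
print(culprit, "found after", len(checked), "checks")
```

With eight commits the search needs only three checks, which is why bisecting fits inside a short MTTR window where a linear scan might not.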
🚫 The "Do Not" List (Anti-Patterns)
| Anti-Pattern | Why it fails in 2026 | Modern Alternative |
|---|---|---|
| "Guess and Check" | Extremely slow and dangerous. | Use Distributed Tracing. |
| Ignoring Warnings | Leads to "Alert Fatigue" and outages. | Use Dynamic SLO Tracking. |
| Manual Log Scraping | Inefficient for large datasets. | Use AI-Assisted Querying (o3). |
| Hotfixing Production | Bypasses CI/CD and causes drift. | Fix in Feature Branch + Deploy. |
| Disabling RLS/Security | Huge security risk for a "quick fix." | Fix the Capability Scope. |
🕸️ Distributed Tracing (OpenTelemetry)
We use OTel as our source of truth.
- Standard Spans: Every operation must have a traceable span ID.
- Adaptive Sampling: Keep 100% of error traces and 1% of healthy traffic.
- Context Propagation: Mandatory headers for cross-service calls.
See References: Distributed Tracing for setup.
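The sampling and propagation rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the OpenTelemetry SDK: the helper names are hypothetical, the sampling decision assumes the error status is already known when the trace is evaluated (tail-style sampling), and the header follows the W3C Trace Context `traceparent` layout (`version-traceid-parentid-flags`). In a real service the SDK's samplers and propagators would do this work.

```python
import random
import secrets
from typing import Callable, Optional


def should_keep_trace(
    is_error: bool,
    healthy_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep every error trace and roughly healthy_rate of healthy ones."""
    if is_error:
        return True
    return rng() < healthy_rate


def make_traceparent(trace_id: Optional[str] = None, sampled: bool = True) -> str:
    """Build a traceparent value: version 00, 16-byte trace id, 8-byte span id, flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)                # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def trace_id_from(traceparent: str) -> str:
    """Extract the trace id so a downstream span can join the same trace."""
    _version, trace_id, _span_id, _flags = traceparent.split("-")
    return trace_id


# Service A starts a trace and sends the header on its outgoing request.
outgoing = make_traceparent()

# Service B reuses the trace id and mints a new span id for its own work.
downstream = make_traceparent(trace_id=trace_id_from(outgoing))

print(outgoing)
print(downstream)
```

Both headers share the same trace id while carrying different span ids, which is what lets a tracing backend stitch cross-service calls into one trace.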
🤖 Autonomous Remediation
In 2026, AI agents handle the triage.
- Detection: Automatic anomaly triggers.
- Remediation: Agents execute safe actions (scale up, cache clear).
- HITL Gate: Humans approve destructive actions.
See References: Agentic Response for patterns.
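One way to picture the HITL gate is a small dispatcher that runs allow-listed actions automatically and asks a human before anything else. The sketch below is an assumption about how such a gate could look, not the skill's actual implementation. The action names, the `RemediationGate` class, and the `approve` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical allow-list mirroring the "safe actions" named above.
SAFE_ACTIONS = {"scale_up", "cache_clear"}


@dataclass
class RemediationGate:
    # Callback that asks a human; returns True only if the action is approved.
    approve: Callable[[str], bool]
    log: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, action: str, run: Callable[[], None]) -> bool:
        """Run safe actions immediately; everything else needs approval."""
        if action in SAFE_ACTIONS:
            outcome = "auto"
        elif self.approve(action):
            outcome = "approved"
        else:
            self.log.append((action, "blocked"))
            return False
        run()
        self.log.append((action, outcome))
        return True


# Example: the human on call rejects every request.
executed: List[str] = []
gate = RemediationGate(approve=lambda action: False)

gate.handle("cache_clear", lambda: executed.append("cache_clear"))
gate.handle("drop_table", lambda: executed.append("drop_table"))

print(gate.log)
```

Treating anything outside the allow-list as destructive by default means a newly added action cannot bypass the human gate by accident.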
Predictive Observability
Identify failures before they occur.
- Anomaly Detection: Spotting memory leaks or CPU creep.
- Chaos Engineering: Running agentic "stress tests" weekly.
- Dynamic SLOs: Thresholds that adjust based on business importance.
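As a rough illustration of the first and last bullets, the sketch below flags steady memory growth with a least-squares slope and looks up a latency threshold by business tier. The sample values, the 1 MB-per-sample cutoff, and the tier names and limits are assumptions chosen for the example. A production detector would also need to handle seasonality and noise.

```python
from typing import Dict, Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of values against their sample index (needs 2+ values)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(memory_mb: Sequence[float], min_slope_mb: float = 1.0) -> bool:
    """Flag a series whose memory grows by at least min_slope_mb per sample."""
    return len(memory_mb) >= 2 and slope_per_sample(memory_mb) >= min_slope_mb


# Hypothetical tiers: stricter latency targets for more important services.
LATENCY_SLO_MS: Dict[str, int] = {"critical": 200, "standard": 500, "batch": 2000}


def latency_slo_ms(tier: str) -> int:
    """Return the latency threshold for a tier, defaulting to 'standard'."""
    return LATENCY_SLO_MS.get(tier, LATENCY_SLO_MS["standard"])


steady = [512, 515, 510, 514, 511, 513]    # noisy but flat
leaking = [512, 520, 531, 538, 549, 560]   # climbs every sample

print(looks_like_leak(steady), looks_like_leak(leaking))
print(latency_slo_ms("critical"), latency_slo_ms("unknown-tier"))
```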
Reference Library
Detailed deep-dives into SRE excellence:
- Distributed Tracing (OTel): Standardizing your observability.
- Agentic Incident Response: The autonomous remediation loop.
- Predictive Observability: Hardening systems for the future.
- Fullstack Troubleshooting: Layers of defense.
Updated: January 22, 2026 - 18:30
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.