mcncl

agent-troubleshooting

# Install this skill:
npx skills add mcncl/skill-buildkite --skill "agent-troubleshooting"

Install specific skill from multi-skill repository


# SKILL.md


---
name: agent-troubleshooting
description: |
  Troubleshoots Buildkite agent issues. Use when user asks:
  - "My build is stuck waiting for an agent"
  - "Jobs aren't being picked up"
  - "Why is my build stuck in scheduled?"
  - "Agent not running my job"
  - "Queue issues"
  - "No agents available"
---


# Agent Troubleshooting

Diagnose why jobs aren't being picked up by agents.

## Available MCP Tools

| Tool | Purpose |
|------|---------|
| `buildkite_get_build` | Get job details including agent requirements |
| `buildkite_list_clusters` | List available clusters |
| `buildkite_list_cluster_queues` | List queues in a cluster |
| `buildkite_get_cluster_queue` | Get queue stats (agent count, jobs waiting) |

## Input Parsing

The user typically describes a symptom:

| Input | Likely Issue |
|-------|--------------|
| "build stuck" | Job in scheduled state |
| "waiting for agent" | No matching agents |
| "job not starting" | Agent configuration mismatch |
| "queue problem" | Queue doesn't exist or has no agents |

Ask for the build number or URL so you can investigate.
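Pulling the build reference out of the user's message can be sketched as follows. This is a minimal illustration, not part of the skill: it assumes the usual `https://buildkite.com/<org>/<pipeline>/builds/<number>` URL shape, and `BuildRef` and `parse_build_ref` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class BuildRef(NamedTuple):
    """Hypothetical container for whatever the user gave us."""
    org: Optional[str]
    pipeline: Optional[str]
    number: int


# Assumed URL shape: https://buildkite.com/<org>/<pipeline>/builds/<number>
_BUILD_URL = re.compile(r"buildkite\.com/([^/\s]+)/([^/\s]+)/builds/(\d+)")
# Fallback: a bare build number such as "#456" or "456"
_BUILD_NUMBER = re.compile(r"#?(\d+)\b")


def parse_build_ref(text: str) -> Optional[BuildRef]:
    """Return the build reference found in text, or None if there is none."""
    match = _BUILD_URL.search(text)
    if match:
        return BuildRef(match.group(1), match.group(2), int(match.group(3)))
    match = _BUILD_NUMBER.search(text)
    if match:
        # Only a number was given; org and pipeline still have to be asked for.
        return BuildRef(None, None, int(match.group(1)))
    return None
```

If the function returns `None`, or returns a reference without an org and pipeline, ask the user for the missing details before calling any tool.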

## Approach

1. Get the build with `buildkite_get_build`
   - Find the stuck job
   - Note its state (scheduled, assigned, etc.)
   - Extract agent query rules (queue, tags)
2. Check cluster/queue configuration
   - List clusters with `buildkite_list_clusters`
   - List queues with `buildkite_list_cluster_queues`
   - Get queue stats with `buildkite_get_cluster_queue`
3. Compare requirements vs availability
   - What does the job require?
   - What agents/queues exist?
   - Where's the mismatch?
4. Provide diagnosis and fix
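The first step can be sketched as below. The shape of `build` is an assumption: it presumes the `buildkite_get_build` result is a dict with a `jobs` list whose entries carry `name`, `state`, and `agent_query_rules` as `"key=value"` strings. Check the real tool output before relying on those field names; `parse_rules` and `stuck_jobs` are hypothetical helpers.

```python
# States in which a job is still waiting on an agent.
WAITING_STATES = {"scheduled", "assigned", "accepted"}


def parse_rules(rules):
    """Turn ["queue=deploy", "docker=true"] into {"queue": "deploy", "docker": "true"}."""
    parsed = {}
    for rule in rules:
        key, sep, value = rule.partition("=")
        if sep:  # skip entries that are not key=value pairs
            parsed[key.strip()] = value.strip()
    return parsed


def stuck_jobs(build):
    """Return the jobs that have not started, with their agent requirements."""
    result = []
    for job in build.get("jobs", []):
        if job.get("state") in WAITING_STATES:
            result.append({
                "name": job.get("name"),
                "state": job["state"],
                "requires": parse_rules(job.get("agent_query_rules", [])),
            })
    return result
```

Each returned entry gives the requirements to compare against the queues and agents found in step 2.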

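## Job States for Agent Issues

| State | Meaning | Indicates |
|-------|---------|-----------|
| scheduled | Waiting for an agent | No matching agent is available |
| assigned | Assigned to an agent, awaiting acceptance | An agent was chosen but has not picked the job up |
| accepted | Agent accepted the job | Should start running soon |

A job stuck in `scheduled` indicates an agent matching or capacity problem.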

## Common Issues

### 1. Queue Mismatch

Symptom: Job stuck in scheduled
Cause: Job requires queue that doesn't exist or has no agents

# Pipeline requires:
agents:
  queue: "deploy"

# But no agents are in the "deploy" queue

Diagnosis:

Job requires: queue=deploy
Available queues: default (5 agents), build (10 agents)
❌ No "deploy" queue exists

Fix: Add agents to the `deploy` queue, or change the pipeline to use an existing queue.

### 2. Tag Mismatch

Symptom: Job stuck in scheduled
Cause: Job requires tags no agent has

# Pipeline requires:
agents:
  queue: "default"
  docker: "true"
  os: "linux"

# Agents have docker=true but os=macos

Diagnosis:

Job requires: queue=default, docker=true, os=linux
Available agents in default:
  - agent-1: docker=true, os=macos
  - agent-2: docker=true, os=macos
❌ No agent matches os=linux

Fix: Add Linux agents, or remove the os requirement.
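The comparison behind this diagnosis can be sketched as a tag-by-tag check. This is a simplified illustration that assumes exact `key=value` matching only; wildcard values in agent query rules are not handled, and the function names are hypothetical.

```python
def unmet_requirements(required, agent_tags):
    """Return the required tags this agent is missing or has with another value."""
    return {key: value for key, value in required.items()
            if agent_tags.get(key) != value}


def explain_mismatch(required, agents):
    """Map each agent name to the requirements it fails (empty dict means a match)."""
    return {name: unmet_requirements(required, tags)
            for name, tags in agents.items()}


def matching_agents(required, agents):
    """Names of agents that satisfy every requirement."""
    return [name for name, unmet in explain_mismatch(required, agents).items()
            if not unmet]


required = {"queue": "default", "docker": "true", "os": "linux"}
agents = {
    "agent-1": {"queue": "default", "docker": "true", "os": "macos"},
    "agent-2": {"queue": "default", "docker": "true", "os": "macos"},
}
print(explain_mismatch(required, agents))
# {'agent-1': {'os': 'linux'}, 'agent-2': {'os': 'linux'}}
```

Reporting the unmet tags per agent, rather than a plain yes/no, makes it easy to state which requirement to drop or which kind of agent to add.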

### 3. No Agents Running

Symptom: Job stuck in scheduled
Cause: Queue exists but no agents connected

Diagnosis:

Job requires: queue=deploy
Queue "deploy" exists but has 0 connected agents

Fix: Start agents, check agent host health, verify network connectivity.

### 4. All Agents Busy

Symptom: Job stuck in scheduled longer than usual
Cause: Agents exist but at capacity

Diagnosis:

Job requires: queue=default
Queue "default": 3 agents, 15 jobs waiting
Average wait time: 12 minutes

Fix: Scale up agents, reduce parallelism, or wait.
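Issues 1, 3, and 4 can be told apart from queue stats alone, roughly as sketched here. The `agents` and `jobs_waiting` field names are assumptions standing in for whatever `buildkite_get_cluster_queue` actually returns, and treating "more waiting jobs than agents" as busy is a heuristic, not a Buildkite rule.

```python
def diagnose_queue(required_queue, queues):
    """Classify the queue-level cause for a job stuck in scheduled.

    queues maps a queue name to a dict with hypothetical keys
    "agents" (connected agent count) and "jobs_waiting".
    """
    stats = queues.get(required_queue)
    if stats is None:
        return "queue_missing"   # Issue 1: the queue does not exist
    if stats["agents"] == 0:
        return "no_agents"       # Issue 3: queue exists, nothing connected
    if stats["jobs_waiting"] > stats["agents"]:
        return "agents_busy"     # Issue 4: heuristic for a backlog
    return "ok"                  # Queue looks healthy; check tags next
```

A result of `"ok"` does not mean the job will run: agents in the queue may still lack a required tag, which is the Tag Mismatch case.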

### 5. Agent Assigned But Not Starting

Symptom: Job stuck in assigned state
Cause: The job was handed to an agent, but the agent has not accepted or started it

Possible causes:
- Agent hooks failing (environment, pre-command)
- Plugin installation failing
- Disk space issues
- Agent process problems

Fix: Check agent logs on the host machine.

## Response Format

## Agent Issue Diagnosed

**Build**: #456
**Stuck Job**: "Run Tests"
**State**: scheduled (waiting for agent)

### Job Requirements
- Queue: `deploy`
- Tags: `docker=true`

### Available Resources
- Queue `deploy`: ❌ Does not exist
- Queue `default`: 5 agents (none match)

### Root Cause
The job requires `queue=deploy` but no such queue exists in your cluster.

### Fix
**Immediate**: Change the pipeline to use `queue=default`:
```yaml
agents:
  queue: "default"
  docker: "true"
```

**Long-term**: Create a `deploy` queue and add dedicated agents for deployments.

## Diagnostic Commands

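When explaining fixes, reference these commands (the `systemctl` and `journalctl` examples assume the agent runs as a systemd service):

# Check whether the agent service is running
systemctl status buildkite-agent

# Start an agent with a specific queue and tags
buildkite-agent start --tags "queue=deploy,docker=true"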

# Check agent logs
journalctl -u buildkite-agent

## Example Interaction

User: My build is stuck waiting for an agent

1. Ask for build URL/number
2. Fetch build, find stuck job in "scheduled" state
3. Extract agent requirements: queue=special, gpu=true
4. List queues - "special" exists with 2 agents
5. Check queue details - agents have gpu=false
6. Explain: "Job needs gpu=true but queue agents don't have GPU tag"
7. Suggest: Add GPU agents or modify job requirements
