Search: evaluation | AgentSkillsRepo

agent-evaluation 0.00

404kidwiz / agent-skills-backup-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.00

ngxtm / devkit-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent ai automation claude

evaluating-candidates 0.00

RefoundAI / lenny-skills-evaluating-candidates exact

Help users make better hiring decisions. Use when someone is evaluating job candidates, making hiring decisions, conducting reference checks, reviewing work samples or take-homes, calibrating...

★ 30 ai

ai-agents ai-assistant claude claude-code

cofounder-evaluator 0.00

shipshitdev / library-cofounder-evaluator exact

Use this skill when users need to evaluate potential co-founders, assess founder compatibility, design equity splits, or navigate co-founder relationships. Activates for "should I work with this...

★ 4 ai

claude-code codex commands skills

evaluator 0.00

philoserf / claude-code-setup-evaluator exact

Quick structural validation of Claude Code customizations. Checks YAML syntax, required fields, naming conventions, and file organization. Use for fast correctness checks; use specialized *-audit...

★ 9 ai

claude-code

tech-stack-evaluator 0.00

matteocervelli / llms-tech-stack-evaluator exact

Auto-activates during requirements analysis to evaluate technical stack

★ 16 tools

evaluating-llms-harness 0.00

ovachiever / droid-tings-evaluating-llms-harness exact

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking...

★ 19 ai

evaluating-llms-harness 0.00

zechenzhangAGI / ai-research-skills-evaluating-llms-harness exact

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking...

★ 1,712 ai

ai ai-research claude claude-code

evaluating-code-models 0.00

zechenzhangAGI / ai-research-skills-evaluating-code-models exact

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language...

★ 1,712 ai

ai ai-research claude claude-code

evaluate-presets 0.00

mikeyobrien / ralph-orchestrator-evaluate-presets exact

Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.

★ 1,473 ai

ai ai-agents ai-agents-framework ai-developer-tools

llm-evaluation 0.00

phrazzld / claude-config-llm-evaluation exact

★ 2 ai

preop-evaluation 0.00

elijahc / agent-skills-preop-evaluation exact

A skill to help anesthesiologists risk stratify a patient according to their comorbidities and compute important perioperative variables

★ 0 tools

evaluating-trade-offs 0.00

RefoundAI / lenny-skills-evaluating-trade-offs exact

Help users make better decisions between competing options. Use when someone is weighing pros and cons, comparing alternatives, struggling with a difficult choice, deciding between speed and...

★ 30 ai

ai-agents ai-assistant claude claude-code

team-chemistry-evaluator 0.00

onewave-ai / claude-skills-team-chemistry-evaluator exact

Analyze roster fit and personality dynamics. Leadership assessment, role clarity, locker room culture, trade/signing impact.

★ 31 tools

eval 0.00

mikeyobrien / ralph-orchestrator-eval exact

EvalKit is a conversational evaluation framework for AI agents that guides you through creating robust evaluations using the Strands Evals SDK. Through natural conversation, you can plan...

★ 1,473 ai

ai ai-agents ai-agents-framework ai-developer-tools

research 0.00

samhvw8 / dot-claude-research exact

Technical research methodology with YAGNI/KISS/DRY principles. Phases: scope definition, information gathering, analysis, synthesis, recommendation. Capabilities: technology evaluation,...

★ 5 ai

planning 0.00

samhvw8 / dot-claude-planning exact

Technical implementation planning and architecture design. Capabilities: feature planning, system architecture, technical evaluation, implementation roadmaps, requirement breakdown, trade-off...

★ 5 ai

backend-design-review 0.00

DauQuangThanh / hanoi-rainbow-backend-design-review exact

Conducts comprehensive backend design reviews covering API design quality, database architecture validation, microservices patterns assessment, integration strategies evaluation, security design...

★ 2 web

agentic-ai ai process software-development

google-adk-python 0.00

samhvw8 / dot-claude-google-adk-python exact

Google Agent Development Kit (ADK) for Python. Capabilities: AI agent building, multi-agent systems, workflow agents (sequential/parallel/loop), tool integration (Google Search, Code Execution),...

★ 5 ai

peer-review 0.00

K-Dense-AI / claude-scientific-skills-peer-review exact

Structured manuscript/grant review with checklist-based evaluation. Use when writing formal peer reviews with specific criteria methodology assessment, statistical validity, reporting standards...

★ 6,907 development

ai-scientist bioinformatics chemoinformatics claude
