Search: evaluation | AgentSkillsRepo

advanced-evaluation 0.00

itsAR-VR / goatedskills-advanced-evaluation exact

Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or...

★ 0 ai

evaluation-rubrics 0.00

lyndonkl / claude-evaluation-rubrics exact

Use when need explicit quality criteria and scoring scales to evaluate work consistently, compare alternatives objectively, set acceptance thresholds, reduce subjective bias, or when user mentions...

★ 15 devops

agent-evaluation 0.00

omer-metin / skills-for-antigravity-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 5 ai

ai-agents antigravity antigravity-ide skills

dspy-evaluation-suite 0.00

OmidZamani / dspy-skills-dspy-evaluation-suite exact

This skill should be used when the user asks to "evaluate a DSPy program", "test my DSPy module", "measure performance", "create evaluation metrics", "use answer_exact_match or SemanticF1",...

★ 20 ai

agent-skills claude-code claude-skills dspy

evaluating-candidates 0.00

liqiongyu / lenny-skills-plus-evaluating-candidates exact

Make an evidence-based hiring decision and produce a Candidate Evaluation Decision Pack (criteria + scorecard, signal log, work sample/trial plan + rubric, reference check script + summary,...

★ 14 ai

agent-skills ai-agents automation claude

evaluating-new-technology 0.00

RefoundAI / lenny-skills-evaluating-new-technology exact

Help users evaluate emerging technologies. Use when someone is assessing new tools, making build vs buy decisions, evaluating AI vendors, or deciding on technical architecture.

★ 30 ai

ai-agents ai-assistant claude claude-code

nemo-evaluator-sdk 0.00

zechenzhangAGI / ai-research-skills-nemo-evaluator-sdk exact

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or...

★ 1,712 ai

ai ai-research claude claude-code

evaluating-new-technology 0.00

liqiongyu / lenny-skills-plus-evaluating-new-technology exact

Create a Technology Evaluation Pack (problem framing, options matrix, build vs buy, pilot plan, risk review, decision memo). Use for evaluating new tech, emerging technology, AI tools, vendor...

★ 14 ai

agent-skills ai-agents automation claude

hugging-face-evaluation 0.00

huggingface / skills-hugging-face-evaluation exact

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model...

★ 1,015 ai

evaluating-trade-offs 0.00

liqiongyu / lenny-skills-plus-evaluating-trade-offs exact

Evaluate trade-offs and produce a Trade-off Evaluation Pack (trade-off brief, options+criteria matrix, all-in cost/opportunity cost table, impact ranges, recommendation, stop/continue triggers)....

★ 14 ai

agent-skills ai-agents automation claude

Model Evaluator 0.00

eddiebe147 / claude-settings-model-evaluator exact

Evaluate and compare ML model performance with rigorous testing methodologies

★ 8 ai

hugging-face-evaluation-manager 0.00

eugenepyvovarov / mcpbundler-agent-skills-marketplace-hugging-face-evaluation-manager exact

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model...

★ 5 ai

agent-skill agent-skills claude codex

scholar-evaluation 0.00

K-Dense-AI / claude-scientific-skills-scholar-evaluation exact

Systematically evaluate scholarly work using the ScholarEval framework, providing structured assessment across research quality dimensions including problem formulation, methodology, analysis, and...

★ 6,907 ai

ai-scientist bioinformatics chemoinformatics claude

agent-evaluation 0.00

Ianfr13 / claude-code-plugins-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.00

automindtechnologie-jpg / ultimate-skill-md-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.00

404kidwiz / agent-skills-backup-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.00

ngxtm / devkit-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent ai automation claude

agent-evaluation 0.00

shishiv / gsd-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.00

halay08 / fullstack-agent-skills-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.00

sickn33 / antigravity-awesome-skills-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 2,844 ai

agentic-skills ai-agents antigravity autonomous-coding

Confirm

Submit a Skill