2145 results (24.2ms) page 2 / 108
itsAR-VR / goatedskills-advanced-evaluation exact

Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or...

lyndonkl / claude-evaluation-rubrics exact

Use when need explicit quality criteria and scoring scales to evaluate work consistently, compare alternatives objectively, set acceptance thresholds, reduce subjective bias, or when user mentions...

omer-metin / skills-for-antigravity-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

OmidZamani / dspy-skills-dspy-evaluation-suite exact

This skill should be used when the user asks to "evaluate a DSPy program", "test my DSPy module", "measure performance", "create evaluation metrics", "use answer_exact_match or SemanticF1",...

liqiongyu / lenny-skills-plus-evaluating-candidates exact

Make an evidence-based hiring decision and produce a Candidate Evaluation Decision Pack (criteria + scorecard, signal log, work sample/trial plan + rubric, reference check script + summary,...

RefoundAI / lenny-skills-evaluating-new-technology exact

Help users evaluate emerging technologies. Use when someone is assessing new tools, making build vs buy decisions, evaluating AI vendors, or deciding on technical architecture.

zechenzhangAGI / ai-research-skills-nemo-evaluator-sdk exact

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or...

liqiongyu / lenny-skills-plus-evaluating-new-technology exact

Create a Technology Evaluation Pack (problem framing, options matrix, build vs buy, pilot plan, risk review, decision memo). Use for evaluating new tech, emerging technology, AI tools, vendor...

huggingface / skills-hugging-face-evaluation exact

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model...

liqiongyu / lenny-skills-plus-evaluating-trade-offs exact

Evaluate trade-offs and produce a Trade-off Evaluation Pack (trade-off brief, options+criteria matrix, all-in cost/opportunity cost table, impact ranges, recommendation, stop/continue triggers)....

eddiebe147 / claude-settings-model-evaluator exact

Evaluate and compare ML model performance with rigorous testing methodologies

eugenepyvovarov / mcpbundler-agent-skills-marketplace-hugging-face-evaluation-manager exact

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model...

K-Dense-AI / claude-scientific-skills-scholar-evaluation exact

Systematically evaluate scholarly work using the ScholarEval framework, providing structured assessment across research quality dimensions including problem formulation, methodology, analysis, and...

Ianfr13 / claude-code-plugins-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

automindtechnologie-jpg / ultimate-skill-md-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

404kidwiz / agent-skills-backup-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

ngxtm / devkit-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

shishiv / gsd-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

halay08 / fullstack-agent-skills-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

sickn33 / antigravity-awesome-skills-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...