40 results (79.4ms) page 1 / 2
cosmix / loom-model-evaluation exact

Evaluates machine learning models for performance, fairness, and reliability using appropriate metrics and validation techniques. Covers training debugging, hyperparameter tuning, and production...
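For orientation, a minimal sketch of the kind of check such a skill performs (not this skill's actual code; the dataset and the sensitive-attribute column are made up): score a classifier on held-out data, then break the metric down per group as a basic fairness probe.

```python
# Minimal sketch, not the skill's implementation: overall and per-group F1 on a held-out split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
group = np.random.default_rng(0).integers(0, 2, size=len(y))  # hypothetical sensitive attribute

X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(X, y, group, test_size=0.3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("overall F1:", f1_score(y_te, pred))
for g in (0, 1):  # a large gap between groups flags a potential fairness issue
    mask = g_te == g
    print(f"group {g} F1:", f1_score(y_te[mask], pred[mask]))
```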

eddiebe147 / claude-settings-model-evaluator exact

Evaluate and compare ML model performance with rigorous testing methodologies

zechenzhangAGI / ai-research-skills-evaluating-code-models exact

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language...
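For reference, the pass@k figure these benchmarks report is typically the unbiased estimator from the HumanEval paper (Chen et al., 2021), computed from n sampled completions per task of which c pass the tests:

```python
# Standard unbiased pass@k estimator; n = samples per task, c = samples that pass the unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled completions passes) from n samples, c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples per problem, 37 passing -> pass@1 and pass@10
print(pass_at_k(n=200, c=37, k=1), pass_at_k(n=200, c=37, k=10))
```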

Kalyanikhandare29 / agent-skills-for-context-engineering-advanced-evaluation exact

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise...
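As a rough illustration of the pairwise-comparison pattern this skill covers (a sketch only; `call_judge` is a placeholder for whatever model client you actually wire up):

```python
# Minimal pairwise LLM-as-judge sketch; the prompt wording and client are illustrative.
JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}

Answer A: {a}

Answer B: {b}

Which answer is better? Reply with exactly one letter: A, B, or T (tie)."""

def call_judge(prompt: str) -> str:
    raise NotImplementedError("hypothetical: plug in your LLM client here")

def pairwise_compare(question: str, answer_a: str, answer_b: str) -> str:
    verdict = call_judge(JUDGE_PROMPT.format(question=question, a=answer_a, b=answer_b))
    return verdict.strip().upper()[:1]  # "A", "B", or "T"
```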

Kalyanikhandare29 / agent-skills-for-context-engineering-evaluation exact

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge,...

huggingface / skills-hugging-face-evaluation exact

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model...
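Eval results in Hugging Face model cards live in the card's "model-index" metadata. A hedged sketch of writing one with `huggingface_hub` (repo id, dataset, and score are placeholders; check the `EvalResult` fields against your installed version):

```python
# Hedged sketch: build model-card metadata with an eval result and render its YAML.
from huggingface_hub import EvalResult, ModelCard, ModelCardData

card_data = ModelCardData(
    model_name="your-org/your-model",          # hypothetical repo
    eval_results=[
        EvalResult(
            task_type="text-classification",
            dataset_type="imdb",
            dataset_name="IMDB",
            metric_type="accuracy",
            metric_value=0.93,                  # placeholder score
        )
    ],
)
card = ModelCard.from_template(card_data)
print(card.data.to_yaml())                      # renders the model-index block
# card.push_to_hub("your-org/your-model")       # would update the card on the Hub
```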

eugenepyvovarov / mcpbundler-agent-skills-marketplace-hugging-face-evaluation-manager exact

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model...

muratcankoylan / agent-skills-for-context-engineering-advanced-evaluation exact

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise...

guanyang / antigravity-skills-advanced-evaluation exact

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise...

muratcankoylan / agent-skills-for-context-engineering-evaluation exact

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge,...

guanyang / antigravity-skills-evaluation exact

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge,...

daymade / claude-code-skills-promptfoo-evaluation exact

Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing...
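For context, a Promptfoo custom Python assertion is a file referenced from promptfooconfig.yaml (`type: python`, `value: file://length_check.py`) that exposes a `get_assert` function; a minimal sketch under that documented interface (worth double-checking against current Promptfoo docs):

```python
# Sketch of a Promptfoo custom Python assertion; the length threshold is arbitrary.
def get_assert(output: str, context) -> dict:
    """Promptfoo calls get_assert(output, context) and accepts a GradingResult-style dict."""
    ok = len(output.strip()) > 0 and len(output) <= 500
    return {
        "pass": ok,
        "score": 1.0 if ok else 0.0,
        "reason": "non-empty and under 500 characters" if ok else "empty or too long",
    }
```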

itsAR-VR / goatedskills-advanced-evaluation exact

Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or...
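One common bias-mitigation step in that toolbox is position swapping for pairwise judging; a sketch (the `judge_once` client is a placeholder):

```python
# Run the comparison in both orders and only accept verdicts that survive the position swap.
def judge_once(question: str, first: str, second: str) -> str:
    raise NotImplementedError("hypothetical: returns 'A', 'B', or 'T' for (first, second)")

def debiased_compare(question: str, answer_a: str, answer_b: str) -> str:
    v1 = judge_once(question, answer_a, answer_b)      # A presented first
    v2 = judge_once(question, answer_b, answer_a)      # swapped order
    v2_mapped = {"A": "B", "B": "A", "T": "T"}[v2]     # map the swapped verdict back
    return v1 if v1 == v2_mapped else "T"              # disagreement -> treat as a tie
```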

shipshitdev / library-advanced-evaluation exact

Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or...

dkyazzentwatwa / chatgpt-skills-model-comparison-tool exact

Use when asked to compare multiple ML models, perform cross-validation, evaluate metrics, or select the best model for a classification/regression task.
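The core of that workflow is straightforward with scikit-learn; a minimal sketch (dataset and candidate models are illustrative, not what the skill ships):

```python
# Compare candidate models with 5-fold cross-validation on a shared metric.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 {scores.mean():.3f} (+/- {scores.std():.3f})")
```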

zechenzhangAGI / ai-research-skills-evaluating-llms-harness exact

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking...
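This refers to EleutherAI's lm-evaluation-harness; the usual entry point is the CLI (`lm_eval --model hf --model_args pretrained=... --tasks gsm8k,hellaswag`), and there is also a Python API. A hedged sketch of the latter (model and task list are placeholders; verify `simple_evaluate`'s signature against your installed `lm_eval` version):

```python
# Hedged sketch of running the harness from Python on a small model as a smoke test.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                       # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",   # small placeholder model
    tasks=["hellaswag", "gsm8k"],
    batch_size=8,
    limit=50,                                         # subsample for a quick check
)
print(results["results"])                             # per-task metric dictionaries
```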

ovachiever / droid-tings-evaluating-llms-harness exact

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking...

shipshitdev / library-business-model-auditor exact

Use this skill when users need to stress test their business model, identify scale limitations, find bottlenecks, determine if they're trading time for money, or evaluate unit economics. Activates...

michaelboeding / skills-model-council exact

This skill should be used when the user asks for "model council", "multi-model", "compare models", "ask multiple AIs", "consensus across models", "run on different models", or wants to get...
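The "council" idea reduces to querying several models and aggregating; a sketch with a majority vote (`ask` and the model names are placeholders for whatever clients the skill configures):

```python
# Ask several models the same question and take the most common answer.
from collections import Counter

MODELS = ["model-a", "model-b", "model-c"]   # hypothetical model identifiers

def ask(model: str, question: str) -> str:
    raise NotImplementedError("hypothetical: route the question to the named model")

def council_answer(question: str) -> tuple[str, dict]:
    answers = {m: ask(m, question) for m in MODELS}
    winner, votes = Counter(answers.values()).most_common(1)[0]
    return winner, {"votes": votes, "per_model": answers}
```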

shipshitdev / library-evaluation exact

Build evaluation frameworks for agent systems. Use when testing agent performance, validating context engineering choices, or measuring improvements over time.
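At its simplest, such a framework is a fixed set of tasks, a checker per task, and a pass rate tracked across agent or context-engineering changes; a sketch (`run_agent` is a placeholder for the agent under test):

```python
# Minimal agent evaluation loop: run fixed cases, score with checkers, report pass rate.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]    # True if the agent's output is acceptable

def run_agent(prompt: str) -> str:
    raise NotImplementedError("hypothetical: invoke the agent under test")

def evaluate(cases: List[EvalCase]) -> float:
    passed = sum(1 for case in cases if case.check(run_agent(case.prompt)))
    return passed / len(cases)      # compare this number across versions over time

cases = [EvalCase("Summarize this ticket: ...", lambda out: len(out.split()) <= 100)]
```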