Search: benchmarks | AgentSkillsRepo

ready ~/ agentskillsrepo

login

40 results (10.6ms) page 1 / 2

agentifind-benchmark 0.30

AvivK5498 / agentifind-agentifind-benchmark exact

Create a benchmark to measure CODEBASE.md effectiveness. Sets up hooks to run two parallel agents (one with guide, one without) and compare their efficiency. Requires /agentifind to be run first.

★ 5 development

which-llm 0.18

richard-gyiko / which-llm-which-llm exact

Select optimal LLM(s) for a task based on skill requirements, budget, and constraints. Uses the `which-llm` CLI to query Artificial Analysis benchmarks enriched with capability data from models.dev.

★ 0 ai

agent-skill ai benchmarks cli

agent-evaluation 0.16

Ianfr13 / claude-code-plugins-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.16

ngxtm / devkit-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent ai automation claude

agent-evaluation 0.16

404kidwiz / agent-skills-backup-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.16

halay08 / fullstack-agent-skills-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.16

ramidamolis-alt / agent-skills-workflows-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.16

automindtechnologie-jpg / ultimate-skill-md-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 0 ai

agent-evaluation 0.16

sickn33 / antigravity-awesome-skills-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 2,844 ai

agentic-skills ai-agents antigravity autonomous-coding

agent-evaluation 0.16

cleodin / antigravity-awesome-skills-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 1 ai

agentic-skills ai-agents antigravity antigravity-ide

agent-evaluation 0.15

omer-metin / skills-for-antigravity-agent-evaluation exact

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world...

★ 5 ai

ai-agents antigravity antigravity-ide skills

performance-testing 0.15

cosmix / loom-performance-testing exact

Performance testing and load testing expertise including k6, locust, JMeter, Gatling, artillery, API load testing, database query optimization, benchmarking strategies, profiling techniques,...

★ 6 development

agentic-coding agents claude claude-code

evaluating-llms-harness 0.13

zechenzhangAGI / ai-research-skills-evaluating-llms-harness exact

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking...

★ 1,712 ai

ai ai-research claude claude-code

evaluating-llms-harness 0.13

ovachiever / droid-tings-evaluating-llms-harness exact

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking...

★ 19 ai

nemo-evaluator-sdk 0.13

zechenzhangAGI / ai-research-skills-nemo-evaluator-sdk exact

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or...

★ 1,712 ai

ai ai-research claude claude-code

evaluating-code-models 0.13

zechenzhangAGI / ai-research-skills-evaluating-code-models exact

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language...

★ 1,712 ai

ai ai-research claude claude-code

deepchem 0.13

K-Dense-AI / claude-scientific-skills-deepchem exact

Molecular ML with diverse featurizers and pre-built datasets. Use for property prediction (ADMET, toxicity) with traditional ML or GNNs when you want extensive featurization options and...

★ 6,907 ai

ai-scientist bioinformatics chemoinformatics claude

performance-monitor 0.13

404kidwiz / claude-supercode-skills-performance-monitor exact

Expert in observing, benchmarking, and optimizing AI agents. Specializes in token usage tracking, latency analysis, and quality evaluation metrics. Use when optimizing agent costs, measuring...

★ 6 ai

anysite-content-analytics 0.13

anysiteio / agent-skills-anysite-content-analytics exact

Track and analyze content performance across Instagram, YouTube, LinkedIn, Twitter/X, and Reddit using anysite MCP server. Measure engagement metrics, analyze post effectiveness, benchmark content...

★ 2 development

performance-thinker 0.13

omer-metin / skills-for-antigravity-performance-thinker exact

Performance optimization mindset - knowing when to optimize, how to measure, where bottlenecks hide, and when "fast enough" is the right answerUse when "slow, performance, optimize, profiling,...

★ 5 ai

ai-agents antigravity antigravity-ide skills