agent-evaluation

Name: agent-evaluation
Author: davila7

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a field where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Category: Testing

About this skill

How to use

Install the skill from the davila7 repository (claude-code-templates). The skill requires basic knowledge of testing and of language-model fundamentals.
Define behavioral tests for your agent: specify the behavioral invariants the agent should satisfy regardless of input variation. Avoid happy-path-only tests; add edge cases and failure scenarios.
Run the tests multiple times and analyze the distribution of results. A single run is not enough, because LLM agents can give different answers to the same input. Collect statistics across many runs.
Run adversarial tests: actively try to break the agent's behavior. Do not rely on string matching of the output; instead, evaluate semantics and task completion.
Monitor reliability metrics in production. Watch for agents that score well on benchmarks but fail in real scenarios; this indicates a mismatch between evaluation and real-world use.
Prevent test data from leaking into the agent's training data or prompts. Keep evaluation data separate from training data to avoid false-positive results.
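
The repeated-run and invariant steps above can be sketched in Python as follows. This is a minimal illustration, not the skill's own implementation: the `Agent` and `Invariant` signatures, `make_flaky_agent`, and `answers_five` are hypothetical names introduced here, and the stub agent stands in for a real LLM call. The invariant extracts the final number from the answer rather than comparing whole strings, so differently phrased answers still pass.

```python
import math
import random
import re
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical interfaces (assumptions for this sketch, not part of the skill).
Agent = Callable[[str], str]             # prompt in, answer out
Invariant = Callable[[str, str], bool]   # (prompt, answer) -> does the behavior hold?


def pass_rate(agent: Agent, prompts: Iterable[str], invariant: Invariant,
              runs: int = 20) -> Dict[str, float]:
    """Run each prompt `runs` times; return the fraction of runs satisfying the invariant."""
    rates = {}
    for prompt in prompts:
        passed = sum(invariant(prompt, agent(prompt)) for _ in range(runs))
        rates[prompt] = passed / runs
    return rates


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage)."""
    if total == 0:
        return (0.0, 1.0)
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def make_flaky_agent(seed: int, error_rate: float) -> Agent:
    """Stub agent: answers correctly in varied phrasings, and wrongly at `error_rate`."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        if rng.random() < error_rate:
            return "I think the answer is 6"
        return rng.choice(["5", "The answer is 5.", "2 + 3 = 5"])

    return agent


def answers_five(prompt: str, answer: str) -> bool:
    """Invariant: the last integer in the answer is 5, whatever the phrasing."""
    numbers = re.findall(r"-?\d+", answer)
    return bool(numbers) and numbers[-1] == "5"


if __name__ == "__main__":
    # Input variations of the same task, including an oddly formatted one.
    prompts = ["What is 2 + 3?", "Add two and three.", "2+3=?"]
    rates = pass_rate(make_flaky_agent(seed=0, error_rate=0.1), prompts, answers_five, runs=50)
    for prompt, rate in rates.items():
        low, high = wilson_interval(round(rate * 50), 50)
        print(f"{prompt!r}: pass rate {rate:.2f}, 95% interval ({low:.2f}, {high:.2f})")
```

Reporting an interval alongside the pass rate makes it clearer how much a result from a small number of runs can be trusted. For a real agent, the invariant would typically be replaced by a semantic or task-completion check.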
