evals

Name: evals
Author: danielmiessler

Agent evaluation framework based on Anthropic's best practices. USE WHEN eval, evaluate, test agent, benchmark, verify behavior, regression test, capability test. Includes three grader types (code-based, model-based, human), transcript capture, pass@k/pass^k metrics, and

Installation

Pick a client and clone the repository into its skills directory.

Installation

Quick info

Author: danielmiessler
Category: Testing

GitHub repo

About this skill

How to use

Sprawdź dostępne dostosowania w katalogu ~/.claude/skills/PAI/USER/SKILLCUSTOMIZATIONS/Evals/. Jeśli katalog istnieje, załaduj plik PREFERENCES.md i wszelkie konfiguracje, które tam się znajdują — będą one zastępować domyślne ustawienia.
Aktywuj skill, używając jednej z poleceń: "uruchom evals", "testuj tego agenta", "oceń", "sprawdź jakość" lub "benchmark". Możesz także użyć "test regresji" lub "test możliwości".
Przygotuj transkrypt lub zapis wieloturowej rozmowy agenta, którą chcesz ocenić. Framework będzie analizować wywołania narzędzi i sekwencję interakcji.
Wybierz typ oceniającego odpowiedni do Twoich potrzeb: oceniający oparty na kodzie (automatyczne reguły), oparty na modelu (ocena przez AI) lub człowieka (ręczna weryfikacja).
Uruchom ocenę i przeanalizuj wyniki. Narzędzie wygeneruje metryki pass@k i pass^k, które pokażą wydajność agenta na poszczególnych zadaniach.
Jeśli znaleźliście problemy, możesz utworzyć nowe zadania oceny na podstawie niepowodzeń i powtórzyć proces walidacji przed wdrożeniem agenta.

Related skills

vitest

by antfu

Vitest fast unit testing framework powered by Vite with Jest-compatible API. Use when writing tests, mocking, configuring coverage, or working with test filtering and fixtures.

Testing

1236

qa-tester

by svilupp

Testing

2399

hono

by openstatusHQ

Efficiently develop Hono applications using Hono CLI. Supports documentation search, API reference lookup, request testing, and bundle optimization.

Testing

1257

lean4-theorem-proving

by cameronfreer

Use when developing Lean 4 proofs, facing type class synthesis errors, managing sorries/axioms, or searching mathlib - provides build-first workflow, instance management patterns (haveI/letI), and domain-specific tactics

Testing

9108

webapp-testing

by anthropics

Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs.

Testing

130255

langchain

by zechenzhangAGI

Framework for building LLM-powered applications with agents, chains, and RAG. Supports multiple providers (OpenAI, Anthropic, Google), 500+ integrations, ReAct agents, tool calling, memory management, and vector store retrieval. Use for building chatbots, question-answering

Testing

21123