evaluation

Name: evaluation
Author: muratcankoylan

This skill should be used when the user asks to \

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Author: muratcankoylan
Category: Testing
Views: 27

How to use

Activate the skill when you need to test agent performance, validate context-engineering choices, or measure improvement over time. It is intended for scenarios in which the agent makes dynamic decisions and may find alternative paths to the goal.
Define evaluation dimensions for your agent. Typical ones are factual accuracy, completeness of the answer, source quality, citation accuracy, and efficiency of tool use. Each dimension should have clear criteria.
Configure an evaluation rubric that accounts for the agent reaching the goal by different routes: assess the outcome and the reasonableness of the process rather than looking for a single "correct" answer.
Deploy LLM-as-judge evaluation to scale testing, but supplement it with manual review for edge cases and for validating critical decisions.
Run the evaluation regularly before deployments to catch regressions and compare different agent configurations. Collect metrics over time to track the improvement trend.
Use the results to build quality gates: set acceptance thresholds for each dimension and block deployments that do not meet them.
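
The steps above (per-dimension scores, a quality gate, and a regression check between configurations) can be sketched roughly as follows. This is a minimal illustration, not the skill's own implementation: the dimension keys, the threshold values, the 0-to-1 score scale, and the function names mean_scores, quality_gate, and regressions are all assumptions made for the example. The scores themselves would come from an LLM judge or a manual review, which is not shown here.

```python
from dataclasses import dataclass

# Hypothetical dimensions and acceptance thresholds (illustrative values only).
THRESHOLDS = {
    "factual_accuracy": 0.80,
    "completeness": 0.70,
    "source_quality": 0.60,
    "citation_accuracy": 0.80,
    "tool_efficiency": 0.50,
}


@dataclass
class GateResult:
    passed: bool
    failures: dict  # dimension -> (score, threshold)


def mean_scores(runs):
    """Average each dimension's score (assumed to lie in [0, 1]) across runs."""
    return {dim: sum(run[dim] for run in runs) / len(runs) for dim in THRESHOLDS}


def quality_gate(scores, thresholds=THRESHOLDS):
    """Fail the gate if any dimension scores below its threshold."""
    failures = {
        dim: (scores[dim], limit)
        for dim, limit in thresholds.items()
        if scores[dim] < limit
    }
    return GateResult(passed=not failures, failures=failures)


def regressions(baseline, candidate, tolerance=0.05):
    """List dimensions where the candidate drops more than `tolerance` below baseline."""
    return sorted(
        dim for dim in baseline if candidate[dim] < baseline[dim] - tolerance
    )


if __name__ == "__main__":
    # Two made-up evaluation runs, each scoring every dimension.
    runs = [
        {dim: 1.0 for dim in THRESHOLDS},
        {dim: 0.5 for dim in THRESHOLDS},
    ]
    result = quality_gate(mean_scores(runs))
    print("deploy" if result.passed else f"blocked: {sorted(result.failures)}")
```

Keeping the thresholds per dimension rather than on a single aggregate score means a strong result in one dimension cannot mask a failure in another, which matches the advice to block deployments that miss any threshold.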

Related skills

polymarket-trader

by openclaw

Query Polymarket prediction markets - trending events, crypto, politics, sports, and search

Testing

14142

playwright-browser-automation

by lackeyjb

Complete browser automation with Playwright. Auto-detects dev servers, writes clean test scripts to /tmp. Test pages, fill forms, take screenshots, check responsive design, validate UX, test login flows, check links, automate any browser task. Use when user wants to test

Testing

13130

langgraph-docs

by langchain-ai

Use this skill for requests related to LangGraph in order to fetch relevant documentation to provide accurate, up-to-date guidance.

Testing

23127

testing-workflow

by amo-tech-ai

Comprehensive testing workflow for E2E, integration, and unit tests. Use when testing applications layer-by-layer, validating user journeys, or running test suites.

Testing

1076

textual

by KyleKing

Expert guidance for building TUI (Text User Interface) applications with the Textual framework. Invoke when user asks about Textual development, TUI apps, widgets, screens, CSS styling, reactive programming, or testing Textual applications.

Testing

69192

pair-trade-screener

by tradermonty

Statistical arbitrage tool for identifying and analyzing pair trading opportunities. Detects cointegrated stock pairs within sectors, analyzes spread behavior, calculates z-scores, and provides entry/exit recommendations for market-neutral strategies. Use when user requests pair

Testing

994