promptfoo-evaluation

Name: promptfoo-evaluation
Author: daymade

Configures and runs LLM evaluations using the Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing custom Python assertions, implementing llm-rubric for LLM-as-judge scoring, or managing few-shot examples in prompts. Triggers on

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Category: Testing

How to use

Install Promptfoo by running npx promptfoo@latest init in your project directory. The tool creates the directory structure and a promptfooconfig.yaml file, which serves as the basis of your configuration.
Prepare the prompts to test. Place them in the prompts/ directory; they can be Markdown files (system.md) or JSON files (chat.json). In promptfooconfig.yaml, list the paths to these prompts in the prompts section.
Define the models to compare in the providers section of promptfooconfig.yaml. You can test different versions of Claude, GPT-4, or other available models, assigning each a unique identifier and label.
Prepare test cases in tests/cases.yaml. Each case should contain input data and expected results, which form the basis for evaluating the models' responses.
Add custom evaluation metrics. Write assertions in Python (in scripts/metrics.py) or use the built-in llm-rubric for automated quality scoring. Configure them in the defaultTest section of promptfooconfig.yaml, setting acceptance thresholds (threshold).
Run the evaluation with npx promptfoo@latest eval, then view the results in the browser with npx promptfoo@latest view. Compare model performance and refine your prompts based on the results.
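
The custom-metric step above can be sketched in Python. This is a minimal illustration, not the skill's actual scripts/metrics.py: it assumes Promptfoo's Python assertion convention, in which the referenced file exposes a get_assert(output, context) function that may return a dict with pass, score, and reason keys. The keywords test variable and the MIN_SCORE cutoff are hypothetical names introduced only for this example.

```python
# scripts/metrics.py (file name taken from the steps above; contents are a sketch)
from typing import Any, Dict, List

# Hypothetical cutoff: at least half of the required keywords must appear.
MIN_SCORE = 0.5


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Score a model response by the fraction of required keywords it contains.

    Assumes the test variables arrive under context["vars"] and that each test
    case defines a hypothetical `keywords` list.
    """
    test_vars = context.get("vars") or {}
    keywords: List[str] = test_vars.get("keywords") or []

    # Nothing to check: treat the response as passing.
    if not keywords:
        return {"pass": True, "score": 1.0, "reason": "No keywords required"}

    # Case-insensitive substring match for each keyword.
    text = output.lower()
    hits = [kw for kw in keywords if kw.lower() in text]
    score = len(hits) / len(keywords)

    return {
        "pass": score >= MIN_SCORE,
        "score": score,
        "reason": f"Matched {len(hits)} of {len(keywords)} keywords",
    }
```

A deterministic check like this is cheap to run on every test case, which makes it a reasonable complement to llm-rubric when the expected answer contains known terms.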

Related skills

polymarket-trader

by openclaw

Query Polymarket prediction markets - trending events, crypto, politics, sports, and search

Testing

14142

test-cases

by cexll

This skill should be used when generating comprehensive test cases from PRD documents or user requirements. Triggers when users request test case generation, QA planning, test scenario creation, or need structured test documentation. Produces detailed test cases covering

Testing

2862

code-review-excellence

by wshobson

Master effective code review practices to provide constructive feedback, catch bugs early, and foster knowledge sharing while maintaining team morale. Use when reviewing pull requests, establishing review standards, or mentoring developers.

Testing

1145

ad-creative

by alirezarezvani

When the user needs to generate, iterate, or scale ad creative for paid advertising. Use when they say 'write ad copy,' 'generate headlines,' 'create ad variations,' 'bulk creative,' 'iterate on ads,' 'ad copy validation,' 'RSA headlines,' 'Meta ad copy,' 'LinkedIn ad,' or

Testing

2863

hono

by openstatusHQ

Efficiently develop Hono applications using Hono CLI. Supports documentation search, API reference lookup, request testing, and bundle optimization.

Testing

1257

vitest

by antfu

Vitest fast unit testing framework powered by Vite with Jest-compatible API. Use when writing tests, mocking, configuring coverage, or working with test filtering and fixtures.

Testing

1236