evaluate-environments

Name: evaluate-environments
Author: PrimeIntellect-ai

Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Author: PrimeIntellect-ai
Category: Testing

About this skill

How to use

Install the skill in your Claude/Copilot agent environment by adding it to the MCP servers configuration.
Run a smoke test on the chosen environment to quickly check that it works: prime eval run my-env -m gpt-4.1-mini -n 5. The -n parameter sets the number of samples to test.
If you are testing an environment from the Hub, use the owner/env-slug path instead of the local name: prime eval run owner/my-env -m gpt-4.1-mini -n 5.
Once the smoke test passes, scale the evaluation up to more samples and repetitions: prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s. The -r flag sets the number of repetitions (rollouts) per sample, and -s saves the results.
For convenience, define endpoint aliases in configs/endpoints.toml so that you do not have to repeat the URL and API key parameters. Then refer to them with -m endpoint_id instead of typing -b and -k by hand.
Evaluation results are saved automatically in the Evaluations tab and locally. You can browse them, compare models, and decide on next steps based on the generated summaries.
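The smoke-test and scale-up commands from the steps above can be generated from one place, which helps keep the environment and model identical between the two runs. The following is a minimal Python sketch under stated assumptions: the helper name eval_command and its parameters are hypothetical, it uses only the flags that appear in the steps (-m, -n, -r, -s), and it only builds and prints the command lines without invoking prime.

```python
import shlex


def eval_command(env, model, samples, repeats=None, extra_flags=()):
    """Build an argv list for `prime eval run`.

    Hypothetical helper: it only assembles the flags shown in the steps
    above and does not check them against the installed CLI.
    """
    argv = ["prime", "eval", "run", env, "-m", model, "-n", str(samples)]
    if repeats is not None:
        # -r: number of repetitions, as used in the scale-up step
        argv += ["-r", str(repeats)]
    # Any further flags (for example "-s") are appended unchanged.
    argv += list(extra_flags)
    return argv


# Small smoke test first, then the larger run with the same env and model.
smoke = eval_command("owner/my-env", "gpt-4.1-mini", 5)
full = eval_command("owner/my-env", "gpt-4.1-mini", 200, repeats=3,
                    extra_flags=("-s",))

print(shlex.join(smoke))
print(shlex.join(full))
```

To actually launch a run, the list could be passed to subprocess.run(smoke, check=True), assuming the prime CLI is installed and on the PATH.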
