evaluating-code-models

Name: evaluating-code-models
Author: davila7

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Based on the BigCode Evaluation Harness, a widely used standard from the BigCode Project.

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Category: Testing

About this skill

How to use

Clone the BigCode Evaluation Harness repository and change into the project directory: git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git, then cd bigcode-evaluation-harness.
Install the package with its dependencies (transformers ≥4.25.1, accelerate ≥0.13.2, datasets ≥2.6.1) using pip install -e . and configure the accelerator with accelerate config.
Choose a benchmark to test. The most commonly used are HumanEval (164 coding problems), MBPP (500 crowdsourced tasks), and MultiPL-E (18 programming languages). To list all available tasks, run: python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)".
Run the model evaluation on the chosen benchmark. Example for the starcoder2-7b model on HumanEval: accelerate launch main.py --model bigcode/starcoder2-7b --tasks humaneval --max_length_generation 512 --temperature 0.2 --n_samples 20 --batch_size 10 --allow_code_execution --save_generations. Adjust the parameters: model (model name), tasks (benchmark), temperature (sampling randomness), n_samples (number of samples generated per problem).
Wait for the evaluation to finish. The tool executes the generated code against each problem's unit tests and measures pass@k: the share of problems for which at least one of k generated samples passes. The generated samples are saved to a file when you pass the --save_generations flag.
Analyze the results in the generated report. Compare pass@1, pass@10, or pass@100 between models to judge how reliably each one generates correct code on the chosen benchmarks. Estimating pass@k requires at least k samples per problem, so the example above (n_samples 20) supports pass@1 and pass@10 but not pass@100.
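
To make the pass@k figures in the last two steps concrete, here is a minimal sketch of the unbiased pass@k estimator that was introduced alongside HumanEval: for a problem with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names pass_at_k and mean_pass_at_k are hypothetical and exist only for illustration; that the harness reports exactly this quantity is an assumption, not something the listing states.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated for the problem
    c: samples that passed all unit tests
    k: sample budget being scored (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # With fewer than k failing samples, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Benchmark-level score: average the per-problem estimates."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Illustrative counts only (not real results): 3 problems, 20 samples each.
    counts = [0, 20, 5]
    print(mean_pass_at_k(counts, n=20, k=1))
    print(mean_pass_at_k(counts, n=20, k=10))
```

With 20 samples per problem, k can be at most 20, which is why a run with --n_samples 20 cannot produce a pass@100 estimate.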
