evaluating-llms-harness

Name: evaluating-llms-harness
Author: davila7

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs.

Installation

Pick a client and clone the repository into its skills directory.


Quick info

Category: Security

About this skill

How to use

1. Install the tool with pip: pip install lm-eval.
2. Choose a model to test. Any HuggingFace model works, for example meta-llama/Llama-2-7b-hf.
3. Run the evaluation on the selected benchmarks with the lm_eval command, passing the model name, its parameters, and the list of tasks (for example mmlu, gsm8k, hellaswag). Also specify the device (GPU) and the batch size for performance.
4. Browse the available benchmarks with lm_eval --tasks list and pick the ones that fit your needs: reasoning benchmarks (MMLU, GSM8K, HellaSwag), coding benchmarks (HumanEval, MBPP), or a custom set.
5. Wait for the evaluation to finish. The tool computes scores for each benchmark and prints comparative metrics.
6. Analyze the results to compare models, identify weak points, or report training progress in scientific publications.
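The run described in step 3 can be sketched as a small Python helper that assembles the lm_eval command line. This is a minimal sketch, not the skill's own code: the helper names build_lm_eval_command and run_evaluation are hypothetical, and the device cuda:0 and batch size 8 are assumed values, since the steps only say to specify a GPU and a batch size. The model name and task list are taken from the steps above.

```python
from __future__ import annotations

import shlex
import subprocess
from typing import Sequence


def build_lm_eval_command(
    pretrained: str,
    tasks: Sequence[str],
    device: str = "cuda:0",   # assumed default, adjust to your hardware
    batch_size: int = 8,      # assumed default, tune for your GPU memory
) -> list[str]:
    """Assemble the argument list for one lm_eval run on a HuggingFace model."""
    if not tasks:
        raise ValueError("at least one task name is required")
    return [
        "lm_eval",
        "--model", "hf",                              # HuggingFace backend
        "--model_args", f"pretrained={pretrained}",   # which checkpoint to load
        "--tasks", ",".join(tasks),                   # comma-separated task names
        "--device", device,
        "--batch_size", str(batch_size),
    ]


def run_evaluation(command: Sequence[str]) -> int:
    """Run the command and return its exit code (requires lm-eval to be installed)."""
    return subprocess.run(list(command), check=False).returncode


# Build the command for the example model and tasks, then print it for inspection.
command = build_lm_eval_command(
    "meta-llama/Llama-2-7b-hf", ["mmlu", "gsm8k", "hellaswag"]
)
print(shlex.join(command))
```

Printing the command first lets you check the arguments before starting a long GPU job. Pass the list to run_evaluation(command) once lm-eval is installed and the model weights are accessible.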
