debug-distributed

Name: debug-distributed
Author: inclusionAI

Guide for debugging distributed training issues in AReaL. Use when the user encounters hangs, wrong results, OOM, or communication errors.

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Category: Backend

How to use

1. Run the skill when you hit problems in distributed training: hangs, results that differ across nodes, OOM errors, or NCCL communication problems.
2. Apply the minimal-reproduction principle: create the smallest test script that reproduces the problem by removing unrelated model components, shrinking tensor sizes, and reducing the number of GPUs to a minimum (e.g. 2 devices).
3. Enable detailed logging through environment variables: set TORCH_DISTRIBUTED_DEBUG=DETAIL, NCCL_DEBUG=INFO, and NCCL_DEBUG_SUBSYS=ALL to get full diagnostic information.
4. If a process hangs, use py-spy to dump its call stack: find the process ID with ps aux, then run py-spy dump --pid [PID], or py-spy record -o profile.svg --pid [PID] --duration 30 for performance analysis.
5. Check the common causes: mismatched collective operations (one node calls all_reduce, another does not), incorrect process groups, or conflicts with torch.compile. Compare the code on all nodes to make sure every process performs identical distributed operations.
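The logging, stack-dump, and collective-mismatch checks described above can be scripted. The following is a minimal sketch using only the Python standard library, not part of the AReaL skill itself. The helper names (debug_env, py_spy_commands, first_divergence) and the per-rank list of collective names are hypothetical, introduced here for illustration; the environment variables and py-spy arguments are the ones named above.

```python
import os
import shlex

# Debug variables and values exactly as listed in the logging step.
DEBUG_ENV = {
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "ALL",
}


def debug_env(base=None):
    """Return a copy of an environment mapping with the debug variables set.

    Hypothetical helper: `base` defaults to the current process environment
    and is never modified in place.
    """
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_ENV)
    return env


def py_spy_commands(pid, duration=30, output="profile.svg"):
    """Build the two py-spy command lines from the hang step for one PID."""
    return [
        ["py-spy", "dump", "--pid", str(pid)],
        ["py-spy", "record", "-o", output,
         "--pid", str(pid), "--duration", str(duration)],
    ]


def first_divergence(calls_by_rank):
    """Find the first step at which ranks disagree on the collective called.

    `calls_by_rank` maps a rank to the ordered list of collective names it
    issued (an assumed format, e.g. collected from your own logging).
    Returns (step_index, {rank: op_or_None}) or None if all ranks agree.
    A value of None for a rank means it issued no call at that step.
    """
    longest = max((len(calls) for calls in calls_by_rank.values()), default=0)
    for step in range(longest):
        ops = {
            rank: (calls[step] if step < len(calls) else None)
            for rank, calls in calls_by_rank.items()
        }
        if len(set(ops.values())) > 1:
            return step, ops
    return None


if __name__ == "__main__":
    # Print the py-spy commands for a placeholder PID instead of running them.
    for cmd in py_spy_commands(12345):
        print(shlex.join(cmd))
    # Rank 1 skips the broadcast that rank 0 issues, a typical hang cause.
    print(first_divergence({0: ["all_reduce", "broadcast"], 1: ["all_reduce"]}))
```

To launch a reproduction script with the logging enabled, the mapping returned by debug_env() can be passed as the env argument of subprocess.run together with your usual launcher command. The launcher and script name depend on your setup and are not specified by the skill.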
