nemo-curator

Name: nemo-curator
Author: davila7

GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Category: Security

About this skill

How to use

Install nemo-curator with pip (the commands here use uv pip). For text processing with CUDA 12, run: uv pip install "nemo-curator[text_cuda12]". If you work with all media types (text, images, video, audio), use: uv pip install "nemo-curator[all_cuda12]". For a CPU-only environment (slower), install: uv pip install "nemo-curator[cpu]".
Prepare your data as a DataFrame: load the text or other media into the data structure you will process. The tool works with DocumentDataset, so make sure your data is in the appropriate format.
Define a quality filtering function suited to your needs. You can use the built-in quality heuristics or write your own document scoring logic.
Apply ScoreFilter to your dataset to filter out documents that are low quality, contain personal data, or contain NSFW content. The tool automatically scales processing across the available GPUs.
Run the deduplication pipeline. Fuzzy deduplication removes duplicates even when the text differs slightly. For large datasets, this step runs 16 times faster than on CPU.
Export the cleaned data to the format required by your training setup. The tool returns ready-to-use data with duplicates removed, sensitive data redacted, and content filtered.
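
The filter, deduplicate, and export steps above can be illustrated with a small standard-library sketch. This is not the nemo-curator API: every function name here (word_count_score, score_filter, fuzzy_dedup, to_jsonl) is hypothetical, the word-count heuristic and the 0.6 similarity threshold are assumptions chosen for the example, and the near-duplicate check uses plain Jaccard similarity over word 3-grams on CPU as a stand-in for the library's GPU-accelerated fuzzy deduplication.

```python
import json
import re


def word_count_score(text: str) -> int:
    """Hypothetical quality heuristic: number of whitespace-separated words."""
    return len(text.split())


def score_filter(docs, score_fn, min_score):
    """Keep only documents whose score reaches min_score."""
    return [d for d in docs if score_fn(d["text"]) >= min_score]


def shingles(text: str, n: int = 3) -> set:
    """Lowercase word n-grams used to compare two documents."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedup(docs, threshold=0.6):
    """Greedy near-duplicate removal: a document is kept only if it is
    less similar than `threshold` to every document kept so far."""
    kept, kept_shingles = [], []
    for d in docs:
        s = shingles(d["text"])
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(d)
            kept_shingles.append(s)
    return kept


def to_jsonl(docs) -> str:
    """Serialize documents as one JSON object per line."""
    return "\n".join(json.dumps(d, ensure_ascii=False) for d in docs)


# Toy corpus (made up for this example).
docs = [
    {"id": 1, "text": "The quick brown fox jumps over the lazy dog near the river bank"},
    {"id": 2, "text": "The quick brown fox jumps over the lazy dog near the river bank today"},
    {"id": 3, "text": "Too short"},
    {"id": 4, "text": "GPU clusters make large scale text curation practical for model training"},
]

filtered = score_filter(docs, word_count_score, min_score=5)  # drops id 3
deduped = fuzzy_dedup(filtered, threshold=0.6)                # drops id 2
jsonl = to_jsonl(deduped)
print(jsonl)
```

In this toy run, document 3 fails the word-count filter and document 2 is dropped as a near-duplicate of document 1, so documents 1 and 4 are exported. The pairwise comparison is quadratic in the number of documents, which is why real pipelines rely on hashing-based methods and GPU acceleration at scale.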

Related skills

manim

by davila7

Comprehensive guide for Manim Community - Python framework for creating mathematical animations and educational videos with programmatic control

Security

1588

llama-cpp

by zechenzhangAGI

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Security

11252

windows-ui-automation

by martinholovsky

Security

10115

brand-voice

by anthropics

Apply and enforce brand voice, style guide, and messaging pillars across content. Use when reviewing content for brand consistency, documenting a brand voice, adapting tone for different audiences, or checking terminology and style guide compliance.

Security

48158

ui-audit

by openclaw

AI skill for automated UI audits. Evaluate interfaces against proven UX principles for visual hierarchy, accessibility, cognitive load, navigation, and more. Based on Making UX Decisions by Tommy Geoco.

Security

1223

google-analytics

by davila7

Analyze Google Analytics data, review website performance metrics, identify traffic patterns, and suggest data-driven improvements. Use when the user asks about analytics, website metrics, traffic analysis, conversion rates, user behavior, or performance optimization.

Security

1260