huggingface-tokenizers

Name: huggingface-tokenizers
Author: davila7

Fast tokenizers optimized for research and production. The Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers.

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Author: davila7
Category: Data Science
Views: 1

About this skill

How to use

Install the tokenizers library with pip install tokenizers. If you plan to work with transformers models, install both packages: pip install tokenizers transformers.
Load a pretrained tokenizer from the HuggingFace Hub by importing the Tokenizer class and calling from_pretrained() with a model name, for example bert-base-uncased. This method downloads the tokenizer configuration from the repository.
Encode text by passing a string to the encode() method of the loaded tokenizer. The method returns an object containing the list of tokens (tokens) and their numeric identifiers (ids).
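A minimal sketch of the encode() step. So that it runs without network access, it builds a tiny in-memory WordLevel tokenizer from a hypothetical toy vocabulary (an assumption for illustration, not part of the original skill); with Hub access, the commented from_pretrained() call would supply the tokenizer instead.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocabulary, used only so the example runs offline.
vocab = {"[UNK]": 0, "hello": 1, "world": 2, "!": 3}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace and punctuation

# With Hub access, a pretrained tokenizer would be loaded instead:
# tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer.encode("hello world !")
print(encoding.tokens)  # ['hello', 'world', '!']
print(encoding.ids)     # [1, 2, 3]
```

The returned Encoding object exposes the same tokens and ids attributes whichever model backs the tokenizer.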
To train your own tokenizer from scratch, use the BpeTrainer, WordPieceTrainer, or UnigramTrainer class, depending on the chosen algorithm. Pass in the training files and configuration parameters such as the vocabulary size.
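Training with BpeTrainer might look like the sketch below. The corpus, vocabulary size, and special tokens are illustrative assumptions; the example trains from an in-memory list via train_from_iterator() so it is self-contained, while the commented train() call shows the file-based form with a hypothetical path.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical miniature corpus; a real run would use far more text.
corpus = [
    "fast tokenizers for research and production",
    "train custom vocabularies with byte pair encoding",
    "tokenizers track alignments and handle padding",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size caps the learned vocabulary; special tokens receive the first ids.
trainer = BpeTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[PAD]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# File-based training takes a list of paths instead (hypothetical file name):
# tokenizer.train(files=["corpus.txt"], trainer=trainer)

encoding = tokenizer.encode("fast tokenizers")
print(encoding.tokens)
print(tokenizer.get_vocab_size())
```

Swapping BPE and BpeTrainer for WordPiece and WordPieceTrainer, or Unigram and UnigramTrainer, follows the same pattern.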
For advanced use cases, use alignment tracking to follow the mapping between tokens and their positions in the original text, which is useful for information extraction or text analysis.
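Alignment tracking is exposed through the offsets attribute of an Encoding, which holds one (start, end) character span per token. The sketch below uses a hypothetical toy WordLevel tokenizer (an assumption, chosen so the example runs offline) and slices the original string with each span.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocabulary, used only so the example runs offline.
vocab = {"[UNK]": 0, "hello": 1, "world": 2, "!": 3}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

text = "hello world!"
encoding = tokenizer.encode(text)

# Pair each token with its character span in the original text.
spans = list(zip(encoding.tokens, encoding.offsets))
for token, (start, end) in spans:
    print(f"{token!r} -> text[{start}:{end}] = {text[start:end]!r}")
```

Because the offsets index into the original string, a predicted token span (for example, a named entity) can be mapped back to the exact characters it covers.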
Integrate the tokenizer into NLP processing pipelines by combining it with transformers models; the library is optimized for this ecosystem.
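One way to connect a tokenizers object to transformers is to wrap it in PreTrainedTokenizerFast via its tokenizer_object argument. The sketch below assumes the transformers package is installed and uses a hypothetical toy vocabulary; the wrapper then handles batch padding in the form models expect.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Hypothetical toy vocabulary, used only so the example runs offline.
vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
raw_tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
raw_tokenizer.pre_tokenizer = Whitespace()

# Wrap the Rust-backed tokenizer so it exposes the transformers tokenizer API.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

# padding=True pads every sequence to the longest one in the batch.
batch = fast_tokenizer(["hello world", "hello"], padding=True)
print(batch["input_ids"])       # [[2, 3], [2, 0]]
print(batch["attention_mask"])  # [[1, 1], [1, 0]]
```

Passing truncation=True together with max_length would additionally cap the sequence length.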

Related skills

docx

by anthropics

Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content,

Data Science

39142

market-analysis

by xbklairith

Use when analyzing markets or interpreting charts - applies technical indicators (RSI, MACD, Moving Averages), identifies support/resistance, analyzes multi-timeframe trends, checks fundamentals and sentiment. Activates when user says \

Data Science

29144

skill-creator

by anthropics

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.

Data Science

59147

data-storytelling

by wshobson

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive presentations.

Data Science

26105

quant-analyst

by zenobi-us

Expert quantitative analyst specializing in financial modeling, algorithmic trading, and risk analytics. Masters statistical methods, derivatives pricing, and high-frequency trading with focus on mathematical rigor, performance optimization, and profitable strategy development.

Data Science

67217

pdf

by anthropics

Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. When Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.

Data Science

31144