sentencepiece

Name: sentencepiece
Author: davila7

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Category: Security

How to use

Install SentencePiece with pip: run pip install sentencepiece in a terminal. Make sure Python 3.6 or newer is installed.
Prepare a text file containing the training data (e.g. data.txt). The text should be raw: SentencePiece handles Unicode itself and requires neither pre-tokenization nor cleaning.
Train the tokenizer model through the Python API: import the sentencepiece module (conventionally as spm), then call spm.SentencePieceTrainer.train() with the parameters input='data.txt' (path to the training file), model_prefix='m' (prefix for the output file names), vocab_size=8000 (vocabulary size; adjust to your needs), and model_type='bpe' (the BPE algorithm, suitable for most cases).
Training produces two files: m.model (the trained model) and m.vocab (the vocabulary). Store them somewhere safe, since the model file is required for tokenization.
Load the model and tokenize new text: import sentencepiece, create a processor with spm.SentencePieceProcessor(), point its load() method at the path to m.model, then call encode() to convert text into tokens or decode() for the reverse operation.
If you work with multiple languages or with CJK languages, start from the default settings: SentencePiece operates directly on raw Unicode text, so no language-specific pre-tokenization is needed.
