pdf-processing

Name: pdf-processing
Author: Ming-Kai-LC

Comprehensive PDF processing techniques for handling large files that exceed Claude Code's reading limits, including chunking strategies, text/table extraction, and OCR for scanned documents. Use when working with PDFs larger than 10-15MB or more than 30-50 pages.

Installation

Pick a client and clone the repository into its skills directory.

Quick info

Author: Ming-Kai-LC
Category: Data Science
Views: 134

About this skill

How to use

1. Install the required dependencies: Python 3.8 or newer, plus the libraries PyPDF (≥3.0.0), PyMuPDF (≥1.23.0), pdfplumber (≥0.9.0), pdf2image (≥1.16.0), and pytesseract (≥0.3.10). Make sure Tesseract itself is available for the OCR functionality.
2. Before working with a PDF file, check whether its size exceeds the safe limits. Use the is_pdf_too_large() function from the documentation: if the file is larger than 10 MB, go to step 3. If it is smaller, you can read it directly with Claude's Read tool.
3. For large files, extract the text with PyMuPDF (fitz), which is the fastest of the listed libraries. The extract_text_fast() function processes all pages and returns the full text of the document without risking a session crash.
4. If the document contains tables or needs precise structural extraction, use pdfplumber instead of PyMuPDF; it handles tables and structural elements better.
5. For scans, or documents that contain images instead of text, apply OCR via pytesseract. First convert the PDF pages to images (pdf2image), then run text recognition on them.
6. For very large files (more than 50 pages), split the PDF into smaller parts before extraction. The chunking technique described in the documentation lets you process the fragments without exceeding Claude's context limits.
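The steps above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the function names is_pdf_too_large() and extract_text_fast() come from the steps, but their bodies here are assumptions, and page_chunks(), extract_tables(), ocr_pages(), and the two constants are hypothetical helpers introduced only for this example. Third-party libraries are imported lazily, so the size check and chunk planning work even when they are not installed.

```python
import os
from typing import List, Tuple

MAX_BYTES = 10 * 1024 * 1024  # 10 MB threshold mentioned in the steps
PAGES_PER_CHUNK = 50          # assumed chunk size, based on the 50-page guideline


def is_pdf_too_large(path: str, max_bytes: int = MAX_BYTES) -> bool:
    """Return True when the file on disk is larger than max_bytes."""
    return os.path.getsize(path) > max_bytes


def page_chunks(page_count: int, chunk_size: int = PAGES_PER_CHUNK) -> List[Tuple[int, int]]:
    """Plan 0-based, half-open (start, stop) page ranges (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]


def extract_text_fast(path: str) -> str:
    """Return the text of every page, extracted with PyMuPDF."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_tables(path: str, start: int, stop: int) -> list:
    """Collect the tables found on pages [start, stop) with pdfplumber."""
    import pdfplumber

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages[start:stop]:
            tables.extend(page.extract_tables())
    return tables


def ocr_pages(path: str, start: int, stop: int, dpi: int = 200) -> str:
    """OCR pages [start, stop) of a scanned PDF; requires Tesseract and Poppler."""
    from pdf2image import convert_from_path
    import pytesseract

    # pdf2image counts pages from 1 and treats last_page as inclusive.
    images = convert_from_path(path, dpi=dpi, first_page=start + 1, last_page=stop)
    return "\n".join(pytesseract.image_to_string(image) for image in images)
```

For a 120-page scan, for example, one could loop over page_chunks(120) and call ocr_pages(path, start, stop) for each range, so that only one chunk of page images is held in memory at a time.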

Related skills

prompt-optimizer

by solatis

Optimize system prompts for Claude Code agents using proven prompt engineering patterns. Use when users request prompt improvement, optimization, or refinement for agent workflows, tool instructions, or system behaviors.

Data Science

15109

web-artifacts-builder

by anthropics

Suite of tools for creating elaborate, multi-component claude.ai HTML artifacts using modern frontend web technologies (React, Tailwind CSS, shadcn/ui). Use for complex artifacts requiring state management, routing, or shadcn/ui components - not for simple single-file HTML/JSX

Data Science

37124

claude-automation-recommender

by anthropics

Analyze a codebase and recommend Claude Code automations (hooks, subagents, skills, plugins, MCP servers). Use when user asks for automation recommendations, wants to optimize their Claude Code setup, mentions improving Claude Code workflows, asks how to first set up Claude Code

Data Science

1787

pdf

by anthropics

Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. When Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.

Data Science

31144

skill-installer

by openai

Install Codex skills into $CODEX_HOME/skills from a curated list or a GitHub repo path. Use when a user asks to list installable skills, install a curated skill, or install a skill from another repo (including private repos).

Data Science

23118

market-research-reports

by davila7

Generate comprehensive market research reports (50+ pages) in the style of top consulting firms (McKinsey, BCG, Gartner). Features professional LaTeX formatting, extensive visual generation with scientific-schematics and generate-image, deep integration with research-lookup for

Data Science

16115