serving-llms-vllm

Name: serving-llms-vllm
Author: davila7

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8),

Installation

Pick a client and clone the repository into its skills directory.

Installation

Quick info

Author: davila7
Category: Security
Views: 6

GitHub repo

About this skill

How to use

Zainstaluj vLLM poleceniem pip install vllm. Upewnij się, że masz zainstalowane zależności: torch i transformers.
Aby uruchomić serwer kompatybilny z API OpenAI, wykonaj vllm serve meta-llama/Llama-3-8B-Instruct. Serwer będzie dostępny na http://localhost:8000/v1.
Wysyłaj zapytania do serwera za pomocą OpenAI SDK. Utwórz klienta z adresem http://localhost:8000/v1 i kluczem API ustawionym na 'EMPTY', następnie użyj client.chat.completions.create() z nazwą modelu i wiadomościami.
Dla wnioskowania offline bez serwera zaimportuj LLM i SamplingParams z vllm, załaduj model, ustaw parametry (temperatura, max_tokens), a następnie wywołaj llm.generate() z listą promptów.
W produkcji skonfiguruj ustawienia serwera w zależności od rozmiaru modelu (np. dla modeli 7B-13B na jednym GPU dostosuj parametry pamięci i batching'u).
Monitoruj metryki wydajności i przepustowości, aby upewnić się, że osiągasz oczekiwaną optymalizację latencji i wykorzystania zasobów GPU.

Related skills

manim

by davila7

Comprehensive guide for Manim Community - Python framework for creating mathematical animations and educational videos with programmatic control

Security

1588

payload

by payloadcms

Use when working with Payload CMS projects (payload.config.ts, collections, fields, hooks, access control, Payload API). Use when debugging validation errors, security issues, relationship queries, transactions, or hook behavior.

Security

50171

windows-ui-automation

by martinholovsky

Security

10115

llama-cpp

by zechenzhangAGI

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Security

11252

google-analytics

by davila7

Analyze Google Analytics data, review website performance metrics, identify traffic patterns, and suggest data-driven improvements. Use when the user asks about analytics, website metrics, traffic analysis, conversion rates, user behavior, or performance optimization.

Security

1260

content-creator

by alirezarezvani

Create SEO-optimized marketing content with consistent brand voice. Includes brand voice analyzer, SEO optimizer, content frameworks, and social media templates. Use when writing blog posts, creating social media content, analyzing brand voice, optimizing SEO, planning content

Security

25124