serving-llms-vllm
vLLM serves language models with 24x higher throughput than standard transformers through PagedAttention-based block KV cache management and continuous batching of prefill and decode requests. Use it for production LLM API deployments, reducing inference latency while maximizing GPU utilization, with built-in support for OpenAI-compatible endpoints, model quantization methods like AWQ and GPTQ, and multi-GPU tensor parallelism.
git clone --depth 1 https://github.com/moltis-org/moltis /tmp/serving-llms-vllm && cp -r /tmp/serving-llms-vllm/crates/skills/src/assets/mlops/inference/serving-llms-vllm ~/.claude/skills/serving-llms-vllmSKILL.md
# vLLM - High-Performance LLM Serving
## Quick start
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
**Installation**:
```bash
pip install vllm
```
**Basic offline inference**:
```python
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```
**OpenAI-compatible server**:
```bash
vllm serve meta-llama/Llama-3-8B-Instruct
# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
model='meta-llama/Llama-3-8B-Instruct',
messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
```
## Common workflows
### Workflow 1: Production API deployment
Copy this checklist and track progress:
```
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
```
**Step 1: Configure server settings**
Choose configuration based on your model size:
```bash
# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--quantization awq \
--port 8000
# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--enable-metrics \
--metrics-port 9090 \
--port 8000 \
--host 0.0.0.0
```
**Step 2: Test with limited traffic**
Run load test before production:
```bash
# Install load testing tool
pip install locust
# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
```
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
**Step 3: Enable monitoring**
vLLM exposes Prometheus metrics on port 9090:
```bash
curl http://localhost:9090/metrics | grep vllm
```
Key metrics to monitor:
- `vllm:time_to_first_token_seconds` - Latency
- `vllm:num_requests_running` - Active requests
- `vllm:gpu_cache_usage_perc` - KV cache utilization
**Step 4: Deploy to production**
Use Docker for consistent deployment:
```bash
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching
```
**Step 5: Verify performance metrics**
Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs
### Workflow 2: Offline batch inference
For processing large datasets without server overhead.
Copy this checklist:
```
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
```
**Step 1: Prepare input data**
```python
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
```
**Step 2: Configure LLM engine**
```python
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3-8B-Instruct",
tensor_parallel_size=2, # Use 2 GPUs
gpu_memory_utilization=0.9,
max_model_len=4096
)
sampling = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
stop=["</s>", "\n\n"]
)
```
**Step 3: Run batch inference**
vLLM automatically batches requests for efficiency:
```python
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)
# vLLM handles batching internally
# No need to manually chunk prompts
```
**Step 4: Process results**
```python
# Extract generated text
results = []
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
results.append({
"prompt": prompt,
"generated": generated,
"tokens": len(output.outputs[0].token_ids)
})
# Save to file
import json
with open("results.jsonl", "w") as f:
for result in results:
f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
```
### Workflow 3: Quantized model serving
Fit large models in limited GPU memory.
```
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
```
**Step 1: Choose quantization method**
- **AWQ**: Best for 70B models, minimal accuracy loss
- **GPTQ**: Wide model support, good compression
- **FP8**: Fastest on H100 GPUs
**Step 2: Find or create quantized model**
Use pre-quantized models from HuggingFace:
```bash
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
```
**Step 3: Launch with quantization flag**
```bash
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95
# Results: 70B model in ~40GB VRAM
```
**Step 4: Verify accuracy**
Test outputs match expected quality:
```python
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
```
## When to use vs alternatives
**Use vLLM when:**
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput
**Use alternatives instead:**
- **llama.cpp**: CPU/edge inference, single-user
- **HuggingFace transformers**: Research, prCommit all changes, push branch, create/update PR, and run local validation
Manage Apple Notes via the memo CLI on macOS (create, view, search, edit).
Manage Apple Reminders via remindctl CLI (list, add, complete, delete).
Track Apple devices and AirTags via FindMy.app on macOS using AppleScript and screen capture.
Send and receive iMessages/SMS via the imsg CLI on macOS.
Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
Local speech-to-text with the Whisper CLI (no API key).
ElevenLabs text-to-speech with mac-style say UX.