speculative-decoding
This Claude Code skill implements speculative decoding techniques to accelerate language model inference by 1.5 to 3.6 times without degrading output quality. Use it when deploying LLMs with latency constraints, such as real-time chatbots or code generation applications, or when running on hardware with limited computational resources. The skill covers draft model acceleration, Medusa multiple heads, and lookahead decoding with Jacobi iteration for production deployment.
git clone --depth 1 https://github.com/davila7/claude-code-templates /tmp/speculative-decoding && cp -r /tmp/speculative-decoding/cli-tool/components/skills/ai-research/emerging-techniques-speculative-decoding ~/.claude/skills/speculative-decoding
# Speculative Decoding: Accelerating LLM Inference
## When to Use This Skill
Use Speculative Decoding when you need to:
- **Speed up inference** by 1.5-3.6× without quality loss
- **Reduce latency** for real-time applications (chatbots, code generation)
- **Optimize throughput** for high-volume serving
- **Deploy efficiently** on limited hardware
- **Generate faster** without changing model architecture
**Key Techniques**: Draft model speculative decoding, Medusa (multiple heads), Lookahead Decoding (Jacobi iteration)
**Papers**: Medusa (arXiv 2401.10774), Lookahead Decoding (ICML 2024), Speculative Decoding Survey (ACL 2024)
## Installation
```bash
# Standard speculative decoding (transformers)
pip install transformers accelerate
# Medusa (multiple decoding heads)
git clone https://github.com/FasterDecoding/Medusa
cd Medusa
pip install -e .
cd ..  # return to the parent directory before cloning the next repository
# Lookahead Decoding
git clone https://github.com/hao-ai-lab/LookaheadDecoding
cd LookaheadDecoding
pip install -e .
cd ..
# Optional: vLLM with speculative decoding
pip install vllm
```
## Quick Start
### Basic Speculative Decoding (Draft Model)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load target model (large, slow)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load draft model (small, fast)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

# Generate with speculative decoding
prompt = "Explain quantum computing in simple terms:"
# Place inputs on the same device as the target model's first layer
inputs = tokenizer(prompt, return_tensors="pt").to(target_model.device)

# Transformers 4.36+ supports assisted generation
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # Enable speculative decoding
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
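To confirm that the draft model helps on your hardware, time generation with and without `assistant_model`. The helper below is a minimal sketch: `tokens_per_second` and `fake_generate` are hypothetical names, and the dummy generator stands in for a real call such as `target_model.generate(...)`.

```python
import time

def tokens_per_second(generate_fn, n_runs=3):
    """Average throughput of generate_fn, which must return the number of new tokens."""
    total_tokens, total_time = 0, 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        total_tokens += generate_fn()
        total_time += time.perf_counter() - start
    return total_tokens / total_time

# Stand-in for real generation (hypothetical): pretends to emit 50 tokens.
def fake_generate():
    time.sleep(0.01)
    return 50

baseline = tokens_per_second(fake_generate)
print(f"{baseline:.0f} tokens/s")
```

With real models, wrap each configuration in a function that runs `generate` and returns the number of newly generated tokens, then compare the two throughputs on the same prompts.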
### Medusa (Multiple Decoding Heads)
```python
import torch
from transformers import AutoTokenizer
from medusa.model.medusa_model import MedusaModel

# Load Medusa-enhanced model
model = MedusaModel.from_pretrained(
    "FasterDecoding/medusa-vicuna-7b-v1.3",  # Pre-trained with Medusa heads
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("FasterDecoding/medusa-vicuna-7b-v1.3")

# Generate with Medusa (2-3× speedup)
prompt = "Write a Python function to calculate fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# NOTE: check these keyword arguments against the `medusa_generate` signature
# in your installed Medusa version before relying on them.
outputs = model.medusa_generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    posterior_threshold=0.09,  # Acceptance threshold
    posterior_alpha=0.3,  # Tree construction parameter
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
### Lookahead Decoding (Jacobi Iteration)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# NOTE: import path and class name are as given in this guide. The upstream
# repository's README documents a `lade` package (`lade.augment_all()`,
# `lade.config_lade(...)`), so verify against the version you installed.
from lookahead.lookahead_decoding import LookaheadDecoding

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Initialize lookahead decoding
lookahead = LookaheadDecoding(
    model=model,
    tokenizer=tokenizer,
    window_size=15,  # Lookahead window (W)
    ngram_size=5,  # N-gram size (N)
    guess_size=5,  # Number of parallel guesses
)

# Generate (1.5-2.3× speedup)
prompt = "Implement quicksort in Python:"
output = lookahead.generate(prompt, max_new_tokens=256)
print(output)
```
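The Jacobi iteration behind lookahead decoding can be illustrated without a real model. The sketch below is a toy under stated assumptions: `next_token` is a hypothetical deterministic stand-in for greedy decoding, and the n-gram pool and verification branch that lookahead decoding adds on top are omitted. It shows only the core idea: guess all `n` future tokens at once, refine every position in parallel, and stop at a fixed point, which matches the ordinary greedy output.

```python
def next_token(prefix):
    """Toy deterministic 'model' (hypothetical): next token depends on the last two."""
    return (prefix[-1] * 3 + prefix[-2] + 1) % 7

def greedy_decode(prompt, n):
    """Ordinary autoregressive decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n):
    """Refine an n-token guess until it stops changing (a fixed point)."""
    guess = [0] * n  # arbitrary initial guess
    iterations = 0
    while True:
        iterations += 1
        # Every position is recomputed from the *previous* guess, so a real
        # model could evaluate all n positions in one batched forward pass.
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:
            return guess, iterations
        guess = new

prompt = [1, 2]
tokens, iters = jacobi_decode(prompt, 8)
assert tokens == greedy_decode(prompt, 8)
print(tokens, "after", iters, "parallel iterations")
```

Each iteration fixes at least one more leading token, so the loop ends after at most `n + 1` iterations. Any speedup comes from iterations in which several positions become correct at once.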
## Core Concepts
### 1. Speculative Decoding (Draft Model)
**Idea**: Use a small draft model to propose candidate tokens and a large target model to verify them in parallel.
**Algorithm**:
1. Draft model generates K tokens speculatively
2. Target model evaluates all K tokens in parallel (single forward pass)
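3. Accept each draft token with probability min(1, p_target / p_draft); with greedy decoding this reduces to accepting tokens on which the draft and target agree
4. At the first rejection, discard the remaining draft tokens, resample that position from the target model's residual distribution, and continue from there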
```python
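import torch

@torch.no_grad()
def speculative_decode(target_model, draft_model, input_ids, K=4):
    """One round of speculative sampling (batch size 1, shared vocabulary).

    Both models are assumed to be Hugging Face-style causal LMs on the same
    device whose forward pass returns `.logits`. Returns this round's new token ids.
    """
    n = input_ids.shape[1]
    # 1. Draft K tokens one at a time, keeping each draft distribution
    ids, q = input_ids, []
    for _ in range(K):
        q.append(torch.softmax(draft_model(ids).logits[0, -1].float(), dim=-1))
        ids = torch.cat([ids, torch.multinomial(q[-1], 1)[None]], dim=1)
    # 2. Target model scores all K draft tokens in one forward pass (parallel!)
    p = torch.softmax(target_model(ids).logits[0, n - 1:].float(), dim=-1)
    # 3. Accept draft token i with probability min(1, p_target / p_draft)
    accepted = []
    for i in range(K):
        tok = ids[0, n + i]
        if torch.rand(1).item() < min(1.0, (p[i, tok] / q[i][tok]).item()):
            accepted.append(tok.item())
        else:
            # 4. First rejection: resample from the residual max(0, p - q), stop
            residual = torch.clamp(p[i] - q[i], min=0)
            accepted.append(torch.multinomial(residual, 1).item())
            return accepted
    # All K accepted: the same target pass also yields one extra token
    accepted.append(torch.multinomial(p[K], 1).item())
    return accepted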
```
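The accept/reject rule can be checked empirically on a toy vocabulary using only the standard library. In this sketch, `speculative_sample` is a hypothetical helper and the two distributions are made-up values; it draws single tokens and compares the resulting frequencies with the target distribution.

```python
import random

def speculative_sample(p_target, p_draft, rng):
    """Draw one token via the accept/reject rule; returns (token, accepted)."""
    vocab = range(len(p_target))
    # Draft proposes a token from its own distribution
    x = rng.choices(vocab, weights=p_draft)[0]
    # Accept with probability min(1, p_target(x) / p_draft(x))
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x, True
    # Rejected: resample from the residual max(0, p_target - p_draft)
    residual = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    return rng.choices(vocab, weights=residual)[0], False

p_target = [0.6, 0.3, 0.1]  # toy distributions (hypothetical values)
p_draft = [0.3, 0.5, 0.2]
rng = random.Random(0)
n = 100_000
counts = [0, 0, 0]
n_accepted = 0
for _ in range(n):
    tok, ok = speculative_sample(p_target, p_draft, rng)
    counts[tok] += 1
    n_accepted += ok
freqs = [c / n for c in counts]
print("empirical:", [round(f, 2) for f in freqs], "target:", p_target)
print("acceptance rate:", round(n_accepted / n, 2))
```

The empirical frequencies should track `p_target` even though every proposal came from `p_draft`. The expected acceptance rate is the overlap between the two distributions, `sum(min(p, q))`, which is 0.7 for these values.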
**Performance**:
- Speedup: 1.5-2× with a well-matched draft model
- No quality loss: the accept/reject rule reproduces the target model's output distribution exactly
- Best when the draft model is 5-10× smaller than the target
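
A rough way to reason about these numbers is the standard back-of-the-envelope estimate below. It is a sketch under simplifying assumptions that are not from this guide: each draft token is accepted independently with the same probability `alpha`, and one draft step costs `cost_ratio` times one target step. Both function names are hypothetical.

```python
def expected_tokens_per_round(alpha, K):
    """Expected tokens produced per target forward pass (accepted drafts + 1)."""
    if alpha == 1.0:
        return K + 1
    return (1 - alpha ** (K + 1)) / (1 - alpha)

def estimated_speedup(alpha, K, cost_ratio):
    """Speedup over plain decoding; cost_ratio = draft step time / target step time."""
    return expected_tokens_per_round(alpha, K) / (K * cost_ratio + 1)

for K in (2, 4, 8):
    print(K, round(estimated_speedup(alpha=0.8, K=K, cost_ratio=0.1), 2))
```

Under this model, raising `K` eventually stops paying off, because later draft tokens are rarely reached while their drafting cost is always paid. Treat the output as guidance for choosing `K`, since the estimate ignores memory and scheduling overhead.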
### 2. Medusa (Multiple Decoding Heads)
**Source**: arXiv 2401.10774 (2024)
**Innovation**: Add multiple prediction heads to an existing model so that it predicts several future tokens itself, without a separate draft model.
**Architecture**:
```
Input → Base LLM (frozen) → Hidden State
                              ├→ Head 1 (predicts token t+1)
                              ├→ Head 2 (predicts token t+2)
                              ├→ Head 3 (predicts token t+3)
                              └→ Head 4 (predicts token t+4)
```
**Training**:
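- **Medusa-1**: Freeze the base LLM, train only the heads
  - 2.2× speedup, lossless
- **Medusa-2**: Fine-tune the base LLM and the heads together
  - 2.3-3.6× speedup; the heads predict more accurately, but training must take care to preserve the base model's quality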
**Tree-based Attention**:
```python
# Medusa constructs a tree of candidates
# Example: predict 2 steps ahead with top-2 per step
#
#            Root
#           /    \
#        T1a      T1b        (Step 1: 2 candidates)
#       /   \    /   \
#     T2a  T2b  T2c  T2d     (Step 2: 4 candidates total)
#
# A single forward pass evaluates the entire tree!
```
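The tree above can be evaluated in one pass because each candidate attends only to itself and its ancestors. The following is a minimal sketch of that mask for the top-2, two-step example. `build_tree` and `tree_attention_mask` are hypothetical helpers, not Medusa's API, and a real implementation would also let every node attend to the full prompt.

```python
from itertools import product

def build_tree(top_k_per_step):
    """Enumerate tree nodes as paths, e.g. (0,), (1,), (0, 0), (0, 1), ..."""
    nodes = []
    for depth in range(1, len(top_k_per_step) + 1):
        nodes.extend(product(*[range(k) for k in top_k_per_step[:depth]]))
    return nodes

def tree_attention_mask(nodes):
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    return [[node[:len(other)] == other for other in nodes] for node in nodes]

nodes = build_tree([2, 2])  # top-2 at step 1, top-2 at step 2
mask = tree_attention_mask(nodes)
for node, row in zip(nodes, mask):
    print(node, "".join("x" if m else "." for m in row))
```

Here a node is identified by its path from the root, so `(0, 1)` is the second step-2 candidate under the first step-1 candidate (T2b in the diagram). Its mask row allows attention to `(0,)` and to itself only, which keeps sibling branches from seeing each other.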