awq-quantization
AWQ (Activation-aware Weight Quantization) is a 4-bit compression technique that preserves important weights based on activation patterns, delivering 3x inference speedup with minimal accuracy loss. Use AWQ when deploying 7B to 70B parameter models on memory-constrained GPUs, requiring fast inference on instruction-tuned or multimodal models, and working with modern hardware like A100 or H100 GPUs that support Marlin kernels.
git clone --depth 1 https://github.com/Orchestra-Research/AI-Research-SKILLs /tmp/awq-quantization && cp -r /tmp/awq-quantization/10-optimization/awq ~/.claude/skills/awq-quantizationSKILL.md
# AWQ (Activation-aware Weight Quantization)
4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.
## When to use AWQ
**Use AWQ when:**
- Need 4-bit quantization with <5% accuracy loss
- Deploying instruction-tuned or chat models (AWQ generalizes better)
- Want ~2.5-3x inference speedup over FP16
- Using vLLM for production serving
- Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support
**Use GPTQ instead when:**
- Need maximum ecosystem compatibility (more tools support GPTQ)
- Working with ExLlamaV2 backend specifically
- Have older GPUs without Marlin support
**Use bitsandbytes instead when:**
- Need zero calibration overhead (quantize on-the-fly)
- Want to fine-tune with QLoRA
- Prefer simpler integration
## Quick start
### Installation
```bash
# Default (Triton kernels)
pip install autoawq
# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]
# Intel CPU/XPU optimization
pip install autoawq[cpu]
```
**Requirements**: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+
### Load pre-quantized model
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
model = AutoAWQForCausalLM.from_quantized(
model_name,
fuse_layers=True # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Quantize your own model
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantization config
quant_config = {
"zero_point": True, # Use zero-point quantization
"q_group_size": 128, # Group size (128 recommended)
"w_bit": 4, # 4-bit weights
"version": "GEMM" # GEMM for batch, GEMV for single-token
}
# Quantize (uses pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
```
**Timing**: ~10-15 min for 7B, ~1 hour for 70B models.
## AWQ vs GPTQ vs bitsandbytes
| Feature | AWQ | GPTQ | bitsandbytes |
|---------|-----|------|--------------|
| **Speedup (4-bit)** | ~2.5-3x | ~2x | ~1.5x |
| **Accuracy loss** | <5% | ~5-10% | ~5-15% |
| **Calibration** | Minimal (128-1K tokens) | More extensive | None |
| **Overfitting risk** | Low | Higher | N/A |
| **Best for** | Production inference | GPU inference | Easy integration |
| **vLLM support** | Native | Yes | Limited |
**Key insight**: AWQ assumes not all weights are equally important. It protects ~1% of salient weights identified by activation patterns, reducing quantization error without mixed-precision overhead.
## Kernel backends
### GEMM (default, batch inference)
```python
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM" # Best for batch sizes > 1
}
```
### GEMV (single-token generation)
```python
quant_config = {
"version": "GEMV" # 20% faster for batch_size=1
}
```
**Limitation**: Only batch size 1, not good for large context.
### Marlin (Ampere+ GPUs)
```python
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
bits=4,
version="marlin" # 2x faster on A100/H100
)
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-AWQ",
quantization_config=config
)
```
**Requirements**: Compute Capability 8.0+ (A100, H100, RTX 40xx)
### ExLlamaV2 (AMD compatible)
```python
config = AwqConfig(
bits=4,
version="exllama" # Faster prefill, AMD GPU support
)
```
## HuggingFace Transformers integration
### Direct loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/zephyr-7B-alpha-AWQ",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
```
### Fused modules (recommended)
```python
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
bits=4,
fuse_max_seq_len=512, # Max sequence length for fusing
do_fuse=True # Enable fused attention/MLP
)
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-OpenOrca-AWQ",
quantization_config=config
)
```
**Note**: Fused modules cannot combine with FlashAttention2.
## vLLM integration
```python
from vllm import LLM, SamplingParams
# vLLM auto-detects AWQ models
llm = LLM(
model="TheBloke/Llama-2-7B-AWQ",
quantization="awq",
dtype="half"
)
sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
```
## Performance benchmarks
### Memory reduction
| Model | FP16 | AWQ 4-bit | Reduction |
|-------|------|-----------|-----------|
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |
### Inference speed (RTX 4090)
| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|-------|-----------------|----------------|--------|
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |
### Accuracy (perplexity)
| Model | FP16 | AWQ 4-bit | Degradation |
|-------|------|-----------|-------------|
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |
## Custom calibration data
```python
# Use custom dataset for domain-specific models
model.quantize(
tokenizer,
quant_config=quant_config,
calib_data="wiOrchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with clear optimization targets. The outer loop synthesizes results, identifies patterns, and steers research direction. Routes to domain-specific skills for execution, supports continuous agent operation via Claude Code /loop and OpenClaw heartbeat, and produces research presentations and papers. Use when starting a research project, running autonomous experiments, or managing a multi-hypothesis research effort.
Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when need clean model implementations, educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.
State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.
Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.