Skill11.2k repo starsupdated 1mo ago

hqq-quantization

HQQ provides fast, calibration-free weight quantization for large language models at 8/4/3/2/1-bit precision without requiring calibration datasets. Use this skill when quantizing models quickly for deployment with vLLM or HuggingFace Transformers, fine-tuning with LoRA, or experimenting with extreme quantization levels where standard calibration-based methods like GPTQ or AWQ are impractical.

View source Repository: AI-Research-SKILLs

Install in Claude Code

Copy

git clone --depth 1 https://github.com/Orchestra-Research/AI-Research-SKILLs /tmp/hqq-quantization && cp -r /tmp/hqq-quantization/10-optimization/hqq ~/.claude/skills/hqq-quantization

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# HQQ - Half-Quadratic Quantization

Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.

## When to use HQQ

**Use HQQ when:**
- Quantizing models without calibration data (no dataset needed)
- Need fast quantization (minutes vs hours for GPTQ/AWQ)
- Deploying with vLLM or HuggingFace Transformers
- Fine-tuning quantized models with LoRA/PEFT
- Experimenting with extreme quantization (2-bit, 1-bit)

**Key advantages:**
- **No calibration**: Quantize any model instantly without sample data
- **Multiple backends**: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
- **Flexible precision**: 8/4/3/2/1-bit with configurable group sizes
- **Framework integration**: Native HuggingFace and vLLM support
- **PEFT compatible**: Fine-tune quantized models with LoRA

**Use alternatives instead:**
- **AWQ**: Need calibration-based accuracy, production serving
- **GPTQ**: Maximum accuracy with calibration data available
- **bitsandbytes**: Simple 8-bit/4-bit without custom backends
- **llama.cpp/GGUF**: CPU inference, Apple Silicon deployment

## Quick start

### Installation

```bash
pip install hqq

# With specific backend
pip install hqq[torch]      # PyTorch backend
pip install hqq[torchao]    # TorchAO int4 backend
pip install hqq[bitblas]    # BitBlas backend
pip install hqq[marlin]     # Marlin backend
```

### Basic quantization

```python
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch.nn as nn

# Configure quantization
config = BaseQuantizeConfig(
    nbits=4,           # 4-bit quantization
    group_size=64,     # Group size for quantization
    axis=1             # Quantize along output dimension
)

# Quantize a linear layer
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)

# Use normally
output = hqq_linear(input_tensor)
```

### Quantize full model with HuggingFace

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Configure HQQ
quantization_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto"
)

# Model is quantized and ready to use
```

## Core concepts

### Quantization configuration

HQQ uses `BaseQuantizeConfig` to define quantization parameters:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Standard 4-bit config
config_4bit = BaseQuantizeConfig(
    nbits=4,           # Bits per weight (1-8)
    group_size=64,     # Weights per quantization group
    axis=1             # 0=input dim, 1=output dim
)

# Aggressive 2-bit config
config_2bit = BaseQuantizeConfig(
    nbits=2,
    group_size=16,     # Smaller groups for low-bit
    axis=1
)

# Mixed precision per layer type
layer_configs = {
    "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}
```

### HQQLinear layer

The core quantized layer that replaces `nn.Linear`:

```python
from hqq.core.quantize import HQQLinear
import torch

# Create quantized layer
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)

# Access quantized weights
W_q = hqq_layer.W_q           # Quantized weights
scale = hqq_layer.scale       # Scale factors
zero = hqq_layer.zero         # Zero points

# Dequantize for inspection
W_dequant = hqq_layer.dequantize()
```

### Backends

HQQ supports multiple inference backends for different hardware:

```python
from hqq.core.quantize import HQQLinear

# Available backends
backends = [
    "pytorch",          # Pure PyTorch (default)
    "pytorch_compile",  # torch.compile optimized
    "aten",            # Custom CUDA kernels
    "torchao_int4",    # TorchAO int4 matmul
    "gemlite",         # GemLite CUDA kernels
    "bitblas",         # BitBlas optimized
    "marlin",          # Marlin 4-bit kernels
]

# Set backend globally
HQQLinear.set_backend("torchao_int4")

# Or per layer
hqq_layer.set_backend("marlin")
```

**Backend selection guide:**
| Backend | Best For | Requirements |
|---------|----------|--------------|
| pytorch | Compatibility | Any GPU |
| pytorch_compile | Moderate speedup | torch>=2.0 |
| aten | Good balance | CUDA GPU |
| torchao_int4 | 4-bit inference | torchao installed |
| marlin | Maximum 4-bit speed | Ampere+ GPU |
| bitblas | Flexible bit-widths | bitblas installed |

## HuggingFace integration

### Load pre-quantized models

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load HQQ-quantized model from Hub
model = AutoModelForCausalLM.from_pretrained(
    "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Use normally
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
```

### Quantize and save

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)

# Save quantized model
model.save_pretrained("./llama-8b-hqq-4bit")

# Push to Hub
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
```

### Mixed precision quantization

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Different precision per layer type
config = HqqConfig(
    nbits=4,
    group_size=64,
    # Attention layers: higher precision
    # MLP layers: lower precision for memory savings
    dynamic_config={
        "attn": {"nbits": 4, "grou

More from this repository

autoresearchSkill

Orchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with clear optimization targets. The outer loop synthesizes results, identifies patterns, and steers research direction. Routes to domain-specific skills for execution, supports continuous agent operation via Claude Code /loop and OpenClaw heartbeat, and produces research presentations and papers. Use when starting a research project, running autonomous experiments, or managing a multi-hypothesis research effort.

implementing-llms-litgptSkill

Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when need clean model implementations, educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.

mamba-architectureSkill

State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.

nanogptSkill

Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).

rwkv-architectureSkill

RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.

distributed-llm-pretraining-torchtitanSkill

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

huggingface-tokenizersSkill

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

sentencepieceSkill

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.