Skill240 repo starsupdated 10d ago

gemma4-local-deploy

This Claude Code skill deploys Google's Gemma 4 12B model locally on Mac using llama.cpp with GGUF quantization, exposing an OpenAI-compatible API via llama-server. Use it when users request local model deployment, long context windows (32K to 256K tokens), quantized model options (Q4_K_M or QAT Q4_0), or need to compare resource usage between different configurations on Apple Silicon hardware.

View source Repository: spellbook

Install in Claude Code

Copy

git clone --depth 1 https://github.com/majiayu000/spellbook /tmp/gemma4-local-deploy && cp -r /tmp/gemma4-local-deploy/skills/gemma4-local-deploy ~/.claude/skills/gemma4-local-deploy

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Gemma 4 12B 本地部署

目标：把 Gemma 4 12B 的 GGUF 版本部署成本机模型服务。默认用 `llama.cpp` / `llama-server` + Apple Metal + `Q4_K_M` + `tmux` 暴露 OpenAI-compatible API；用户明确要 QAT、256K 或对比演示时，切到 `QAT Q4_0` profile；用户明确要 Ollama 时，再走 Ollama 导入路径。

## 默认选择

- 默认模型仓库：`ggml-org/gemma-4-12B-it-GGUF`
- 默认量化：`Q4_K_M`
- 默认模型名：`gemma-4-12b-it`
- 默认端口：`127.0.0.1:8080`
- 默认上下文：`32768`
- 12B 长上下文：用户明确要求更大上下文时，可改为 `65536` 或原生最高 `131072`
- QAT 仓库：`google/gemma-4-12B-it-qat-q4_0-gguf`
- QAT 量化：`Q4_0`，文件名通常是 `gemma-4-12b-it-qat-q4_0.gguf`
- QAT 上下文：用户要求 QAT、最大上下文或 256K 时，用 `262144`
- 默认后台方式：`tmux` 会话 `gemma4-12b`
- 默认关闭 thinking：`--reasoning off`，避免 OpenAI API 的 `message.content` 为空
- Ollama 路径：只在用户明确要 Ollama、需要接 Ollama 生态，或询问 `ollama pull gemma4:12b` 时使用

如果用户明确要更高质量，优先建议 `Q6_K` 或 `Q8_0`；不要默认上 `bf16`，除非用户接受更大内存和更慢加载。QAT 是训练时模拟量化以降低压缩后的质量损失，不等于无损；关键任务仍要做当前会话验证。

## Profile 选择

先根据用户目标选择 profile。不要把 256K 当作默认值，也不要在用户只要日常本地服务时自动切 QAT。

| Profile | When to choose | Model / quant | Context | Port / alias |
|---|---|---|---:|---|
| `daily-q4km-32k` | 默认日常聊天、编码、低风险本地 API | `ggml-org/...:Q4_K_M` | `32768` | `8080` / `gemma-4-12b-it` |
| `long-q4km-128k` | 用户明确要更长上下文，但仍想保留默认 GGUF 路线 | `ggml-org/...:Q4_K_M` | `65536` or `131072` | `8080` / `gemma-4-12b-it` |
| `qat-q4_0-256k` | 用户说 QAT、Q4_0、256K、Google QAT blog、低内存长上下文 | `google/...qat-q4_0-gguf:Q4_0` | `262144` | `8080` / `gemma-4-12b-it-qat-q4_0` |
| `compare-32k-vs-256k` | 用户要录屏、演示、A/B 对比资源和速度 | left `Q4_K_M`, right `QAT Q4_0` | `32768` + `262144` | `8080` + `8081` |

选择后在最终回复里说清楚 profile、端口、上下文和为什么这么选。

## 执行流程

### 1. 搜索并确认现状

先查已有安装、进程、端口和模型缓存，避免重复部署：

```bash
command -v llama-server || true
llama-server --version || true
tmux has-session -t gemma4-12b 2>/dev/null && tmux display-message -p -t gemma4-12b '#S #{pane_pid}' || true
lsof -nP -iTCP:8080 -sTCP:LISTEN || true
ls -lh "$HOME/Library/Caches/llama.cpp/"*gemma-4-12B-it*Q4_K_M*.gguf 2>/dev/null || true
find "$HOME/Library/Caches/llama.cpp" "$HOME/Models" \( -name '*gemma-4-12b-it-qat-q4_0*.gguf' -o -name '*gemma-4-12B-it-qat-q4_0*.gguf' \) 2>/dev/null || true
```

On Mac, also record hardware:

```bash
system_profiler SPHardwareDataType | sed -n '1,30p'
df -h "$HOME"
```

### 2. Install or upgrade llama.cpp

Use Homebrew on macOS:

```bash
brew install llama.cpp
# If already installed, upgrade only this package when possible.
brew upgrade llama.cpp
llama-server --version
```

Gemma 4 GGUF requires a `llama.cpp` build that recognizes `general.architecture = gemma4`.
If loading fails with:

```text
unknown model architecture: 'gemma4'
```

then upgrade `llama.cpp` and retry. A verified good local build was `9430`; newer stable or HEAD is also acceptable.

### 3. Download/load the model

For the default `daily-q4km-32k` profile, first-run download can be done by `llama-server -hf`:

```bash
llama-server \
  -hf ggml-org/gemma-4-12B-it-GGUF:Q4_K_M \
  --no-mmproj \
  --ctx-size 32768 \
  --gpu-layers 99 \
  --parallel 1 \
  --reasoning off \
  --host 127.0.0.1 \
  --port 8080 \
  --alias gemma-4-12b-it
```

After the model is cached, prefer starting with the local file path. Typical cache path:

```text
$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf
```

For the `qat-q4_0-256k` profile, use the Google QAT GGUF repo:

```bash
llama-server \
  -hf google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 \
  --ctx-size 262144 \
  --gpu-layers 99 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning off \
  --host 127.0.0.1 \
  --port 8080 \
  --alias gemma-4-12b-it-qat-q4_0
```

If the download should be explicit or reusable outside the llama.cpp cache:

```bash
mkdir -p "$HOME/Models/gemma4-qat"
huggingface-cli download google/gemma-4-12B-it-qat-q4_0-gguf \
  gemma-4-12b-it-qat-q4_0.gguf \
  --local-dir "$HOME/Models/gemma4-qat"
```

### 4. Run persistently with tmux

If port `8080` is free and no `gemma4-12b` session exists:

```bash
tmux new-session -d -s gemma4-12b 'llama-server -m "$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf" --ctx-size 32768 --gpu-layers 99 --parallel 1 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it'
```

If `$HOME` is not expanded inside single quotes in the target shell, use the absolute path instead.

For `qat-q4_0-256k` with an explicit local file:

```bash
tmux new-session -d -s gemma4-qat-256k 'llama-server -m "$HOME/Models/gemma4-qat/gemma-4-12b-it-qat-q4_0.gguf" --ctx-size 262144 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it-qat-q4_0'
```

For `compare-32k-vs-256k`, keep separate session names and ports:

```bash
tmux new-session -d -s gemma4-left-32k 'llama-server -m "$HOME/Library/Caches/llama.cpp/ggml-org_gemma-4-12B-it-GGUF_gemma-4-12B-it-Q4_K_M.gguf" --ctx-size 32768 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8080 --alias gemma-4-12b-it'
tmux new-session -d -s gemma4-right-256k 'llama-server -m "$HOME/Models/gemma4-qat/gemma-4-12b-it-qat-q4_0.gguf" --ctx-size 262144 --gpu-layers 99 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --reasoning off --host 127.0.0.1 --port 8081 --alias gemma-4-12b-it-qat-q4_0'
```

Management commands:

```bash
tmux attach -t gemma4-12b
tmux kill-session -t gemma4-12b
tmux kill-session -t gemma4-qat-256k
tmux kill-session -t gemma4-left-32k
tmux kill-session -t gemma4-right-256k
```

### 5. Increase 12B context when requested

Do not tell the user 12B is limited to 32K. `32768` is the conservative default startup value. The 12B GGUF metadata can support a native training context of `131072`.

Use this selection table:

| User need | `--ctx-size` | Notes |
|---|---:|---|
| Fast daily chat / low memory | `32768` | Default. |
| Long coding sessions or medium documents | `65536` | Good balance on 16GB+ Macs if memory pressure is acc