vidgrid

Name: pawvej/vidgrid
Author: pawvej

Give an AI eyes for video — turn a clip into a numbered frame grid + transcript a vision LLM can read.

MCP ServersOfficial Registry0 stars0 forks● PythonMITUpdated today

Install in Claude Code / Claude Desktop

Method: pip / Python · vidgrid

Claude Code CLI

claude mcp add vidgrid -- python -m vidgrid

claude_desktop_config.json (Claude Desktop)

{
  "mcpServers": {
    "vidgrid": {
      "command": "python",
      "args": ["-m", "if"]
    }
  }
}

1. Run the command above in your terminal (Claude Code), or paste the JSON config into claude_desktop_config.json (Claude Desktop).

2. Replace any <placeholder> values with your API keys or paths.

3. Restart Claude. The MCP server and its tools appear automatically.

💡 Install first: pip install vidgrid

Use cases

Creative AI / ML Media

About

MCP Servers overview

# vidgrid

[![PyPI](https://img.shields.io/pypi/v/vidgrid?style=flat-square&color=1f4fd1)](https://pypi.org/project/vidgrid/)
[![Python](https://img.shields.io/pypi/pyversions/vidgrid?style=flat-square)](https://pypi.org/project/vidgrid/)
[![License: MIT](https://img.shields.io/badge/license-MIT-black.svg?style=flat-square)](LICENSE)
[![Downloads](https://img.shields.io/pypi/dm/vidgrid?style=flat-square&color=e03127)](https://pypi.org/project/vidgrid/)
[![Hosted](https://img.shields.io/badge/hosted-vidgrid.site-e03127?style=flat-square)](https://vidgrid.site)

> Convert video clips into annotated image grids for vision LLM analysis.
> **One cell = one second, by default.**

![vidgrid example — a 3×3 grid generated from "Me at the zoo" with numbered cells, top-right timestamps, and real auto-captions burned in](docs/img/hero.jpg)

LLMs can't watch video, but they can analyze a single image. `vidgrid` samples
one frame per second from a video, tiles them into a numbered storyboard with
timestamps, and optionally sends the result to Claude, GPT, or Gemini with a
prompt. The result is something close to "my LLM just watched a video" for
the cost of a handful of image uploads.

**Don't want to install?** Use the hosted version at
[vidgrid.site](https://vidgrid.site) — drop a file, get the grid in the
browser. 3 free renders, $5 lifetime after that. Free for ever on the CLI.

## The model

**One cell = one second, by default.** The auto-picker chooses the smallest
grid (biggest, most-legible cells) whose board count stays under
`--max-boards` (default 10). When that's not enough for a long clip, it
bumps the grid up; as a last resort, it reduces the sampling rate. Override
with `--fps` and `--max-boards` for full control.

- Grid size determines how many seconds fit in one photo
- Default sampling is 1fps; drops below 1fps only when needed to stay under the max-boards cap
- Videos over 5 minutes are rejected (chop them up first)

| Grid | Cells | Seconds per photo | Best for |
|---|---|---|---|
| `2x2` | 4 | 4 | Very short clips (2–4s) |
| `3x3` | **9** | **9** | **Default — best overall readability** |
| `4x4` | 16 | 16 | More compact, cells get smaller |
| `5x5` | 25 | 25 | Experimental — cells small, LLM accuracy drops |

**Quality degrades with bigger grids.** Cells shrink, detail is lost, and the
LLM has a harder time reading fine content like text or UI elements. Stick
with 3×3 unless you specifically need to pack more seconds into one photo.
5×5 exists mostly as a "let me see what happens" option.

## How many photos a video produces

At 1fps sampling, the board count at each grid size:

| Video length | 2×2 | 3×3 | 4×4 | 5×5 |
|---|---|---|---|---|
| 3s    | **1** (partial) | 1 (partial) | 1 (partial) | 1 (partial) |
| 9s    | **3** | 1 | 1 (partial) | 1 (partial) |
| 25s   | **7** | 3 | 2 | 1 |
| 60s   | 15 | **7** | 4 | 3 |
| 186s (3 min) | 47 | 21 | **12** | 8 |
| 300s (5 min, cap) | 75 | 34 | **19**¹ | 12 |

**Bold** = what `auto` picks — the smallest grid (biggest cells) that
keeps the board count under `--max-boards` (default 10).

¹ At the 5-min cap, even 4×4 exceeds 10 boards at 1fps, so auto drops
the sampling rate (≈1 cell per 1.9s) to land at the 10-board limit. Use
`--fps 1.0 --max-boards 20` to preserve 1fps and accept more boards.

Most vision LLMs accept ~10–20 images per request, so auto's default
ceiling of 10 keeps a full video inside a single model call.

## Install

```bash
pip install vidgrid                         # core renderer only
pip install vidgrid[transcribe]             # + faster-whisper for --transcribe
pip install vidgrid[anthropic]              # + Claude support via --ask
pip install vidgrid[llm]                    # + Claude + GPT + Gemini
pip install vidgrid[all]                    # everything
```

Requires Python 3.9+ and `ffmpeg` on your `PATH`.

## Quick start

```bash
# 1. Auto-pick grid and sampling rate — smallest grid that fits in 10 boards
vidgrid clip.mp4 -o grid.png

# 2. Force a specific grid
vidgrid clip.mp4 -o grid.png --grid 4x4

# 3. Force a sampling rate — 0.5fps = 1 cell every 2 seconds
vidgrid long-clip.mp4 -o grid.png --fps 0.5

# 4. Raise the max-boards ceiling (default 10) if you want more boards
vidgrid lecture.mp4 -o grid.png --max-boards 20

# 5. Render + auto-transcribe + send to Claude in one call
vidgrid lecture.mp4 --transcribe --ask "bullet-point summary"

# 6. Use existing Whisper captions, burn them onto the grid
vidgrid interview.mp4 -o grid.png --captions whisper.json --burn-captions

# 7. Let the CLI fall back to python -m if the console script isn't on PATH
python3 -m vidgrid clip.mp4 -o grid.png
```

## Three things you can do with it

### 1. Summarize a talk without watching it

```bash
vidgrid "team-meeting.mp4" \
  --transcribe \
  --ask "list the decisions made and who owns each" \
  --model claude-opus-4-7
```

vidgrid samples one frame per second, runs Whisper on the audio, sends the
grid + transcript to Claude, and prints the answer. The model correlates
frames and words via the burned-in timestamps.

### 2. Find a specific moment in a screen recording

```bash
vidgrid bug-repro.mp4 --grid 3x3 \
  --ask "at which numbered frame does the error dialog appear?" \
  --model gpt-5
```

Because cells are globally numbered (1, 2, 3...) and tagged with timestamps,
the model can point you at the exact moment. No scrubbing.

### 3. Rank a pile of stock footage

```bash
for clip in broll/*.mp4; do
  vidgrid "$clip" -o "grids/$(basename $clip .mp4).png"
done
```

Send the PNGs to Claude in a single request and ask it to rank or reject
clips against your shot list. This is the workflow vidgrid was built for.

## Portrait vs landscape

vidgrid keeps the grid shape square (N×N) regardless of source orientation
and preserves the source aspect inside each cell. Landscape sources produce
wide boards; portrait sources produce tall boards. Cells are never cropped.

## Two-layer captions (default)

The default mode gives the LLM **two correlated inputs**: the rendered grid
image AND the Whisper transcript as separate text. The model correlates them
via the timestamps printed on each cell.

This beats burning captions into the image because:

1. Frames keep their pixels for actual content
2. Text is higher fidelity as tokens than as baked-in pixels
3. The grid stays clean and shareable

Add `--burn-captions` if you want a self-contained image (useful for sharing
or offline analysis).

## Caption file formats

vidgrid reads and writes three caption formats. The `--captions` flag
auto-detects from the file extension. The `--transcript-format` flag
controls what `--transcribe` writes.

| Format | Extension | Size (36 words) | When to use |
|---|---|---|---|
| `json` | `.json` | ~4.8 KB | Remotion pipelines, tools that need word confidence |
| `srt` | `.srt` | ~1.4 KB | Video editors, universal subtitle format |
| `txt` | `.txt` | ~0.4 KB | Smallest, grep-friendly, trivial to parse |

**JSON** (default, Remotion-compatible):
```json
[
  {"text": "hello", "startMs": 0, "endMs": 500, "timestampMs": 0, "confidence": 0.98},
  ...
]
```

**SRT** (SubRip subtitles):
```
1
00:00:00,000 --> 00:00:00,500
hello

2
00:00:00,500 --> 00:00:01,000
world
```

**TXT** (plain timestamped text, one word per line):
```
0.00 hello
0.50 world
```

Use any format as input, output, or both. You can mix — read an `.srt` and
write a `.txt` with `--captions foo.srt --transcript-format txt`.

## Python API

```python
from vidgrid import render

storyboard = render(
    input_path="interview.mp4",
    output_path="grid.png",
    grid="3x3",  # or "2x2", "4x4", "5x5", or None for auto
    transcribe=True,
)

print(storyboard.board_paths)        # ['grid-1.png', 'grid-2.png', ...]
print(storyboard.transcript_path)    # 'grid-transcript.json'
print(storyboard.all_samples)        # list[Sample] with timestamps
```

Modules: `vidgrid.probe`, `vidgrid.sample`, `vidgrid.compose`,
`vidgrid.captions`, `vidgrid.llm`, `vidgrid.presets`.

## Output structure

**Single-board run:**
```
grid.png                 # the storyboard
grid.json                # sidecar: timestamps, layout, source info
grid-transcript.json     # only if --transcribe or --captions was used
```

**Multi-board run:**
```
grid-1.png, grid-2.png, grid-3.png, ...
grid.json                # index covering all boards + global cell numbering
grid-transcript.json
```

Cells are numbered **globally** across boards. A 3-board run has cells 1–27
so the LLM can reference any frame without ambiguity.

## Limits and caveats

- **5-minute hard cap on video length.** Longer videos are rejected. Chop
  them up with `ffmpeg -ss START -t 300 input.mp4 chunk.mp4`.
- **No scene detection.** v1 samples strictly 1 frame per second, uniform.
  No dedupe, no shifting — the spacing is always exactly 1 second.
- **Variable-framerate videos** may have sub-frame seek drift (≤1 frame),
  which is acceptable at 1fps sampling.
- **Bigger grids hurt legibility.** A 5×5 grid has cells ~300px wide; fine
  for people and objects, marginal for dense text or code. Stick with 3×3.
- **LLM integration** uses the official SDKs (anthropic, openai, google-genai)
  and won't be installed unless you request them as extras.

## Prior art

- [IG-VLM](https://arxiv.org/abs/2403.18406) — research paper proving the grid trick works
- [llm-video-frames](https://github.com/simonw/llm-video-frames) — Simon Willison's per-frame approach
- [vcsi](https://github.com/amietn/vcsi) — contact sheets without LLMs
- [byjlw/video-analyzer](https://github.com/byjlw/video-analyzer) — whisper + sequential frames

vidgrid's differentiator: **1 cell = 1 second, numbered cells, simple CLI,
multi-provider LLM integration in one package**.

## License

MIT. The bundled Source Sans 3 font is licensed under [SIL OFL 1.1](vidgrid/assets/fonts/LICENSE-SourceSans3.md).

Topics

ai-agentsclaudeffmpegllmmcpmodel-context-protocoltranscriptvideovisionwhisper

Frequently asked

What people ask about vidgrid

What is pawvej/vidgrid?

pawvej/vidgrid is mcp servers for the Claude AI ecosystem. Give an AI eyes for video — turn a clip into a numbered frame grid + transcript a vision LLM can read. It has 0 GitHub stars and was last updated today.

How do I install vidgrid?

You can install vidgrid by cloning the repository (https://github.com/pawvej/vidgrid) or following the README instructions on GitHub. ClaudeWave also provides quick install blocks on this page.

Is pawvej/vidgrid safe to use?

pawvej/vidgrid has not been audited yet by our security agent. Review the original repository on GitHub before using it in production.

Who maintains pawvej/vidgrid?

pawvej/vidgrid is maintained by pawvej. The last recorded GitHub activity is from today, with 0 open issues.

Are there alternatives to vidgrid?

Yes. On ClaudeWave you can browse similar mcp servers at /categories/mcp, sorted by popularity or recent activity.

1-click deploy

Deploy vidgrid to your cloud

Ship this repo to production in minutes. Each platform spins up its own environment with editable env vars.

Vercel Railway Render

Embeddable badge

Maintain this repo? Add a badge to your README

Drop the badge into your GitHub README to show it's tracked on ClaudeWave. Each badge links back to this page and reflects the live Trust Score.

Markdown (README)

[![Featured on ClaudeWave](https://claudewave.com/api/badge/pawvej-vidgrid)](https://claudewave.com/repo/pawvej-vidgrid)

HTML

<a href="https://claudewave.com/repo/pawvej-vidgrid"><img src="https://claudewave.com/api/badge/pawvej-vidgrid" alt="Featured on ClaudeWave: pawvej/vidgrid" width="320" height="64" /></a>

More MCP Servers

Fair-code workflow automation platform with native AI capabilities. Combine visual building with custom code, self-host or cloud, 400+ integrations.

192.8k58.6kTypeScript

MCP ServersaiapisInstall

open-webui

today

User-friendly AI Interface (Supports Ollama, OpenAI API, ...)

141.9k20.4kPython

MCP ServersaillmInstall

google-gemini

gemini-cli

today

An open-source AI agent that brings the power of Gemini directly into your terminal.

105.3k14.1kTypeScript

MCP Serversaiai-agentsInstall

netdata

today

The fastest path to AI-powered full stack observability, even for lean teams.

79.2k6.5kC

MCP ServersaialertingInstall

D4Vinci

Scrapling

9d ago

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

64.4k6.3kPython

MCP Serversaiai-scrapingInstall

sansan0

TrendRadar

4d ago

⭐AI-driven public opinion & trend monitor with multi-platform aggregation, RSS, and smart alerts.🎯 告别信息过载，你的 AI 舆情监控助手与热点筛选工具！聚合多平台热点 + RSS 订阅，支持关键词精准筛选。AI 智能筛选新闻 + AI 翻译 + AI 分析简报直推手机，也支持接入 MCP 架构，赋能 AI 自然语言对话分析、情感洞察与趋势预测等。支持 Docker ，数据本地/云端自持。集成微信/飞书/钉钉/Telegram/邮件/ntfy/bark/slack 等渠道智能推送。

59.5k24.7kPython

MCP ServersaibarkInstall