Skill · 107 repo stars · updated 4d ago

wjs-transcribing-audio

This skill converts audio or video files into timestamped SRT subtitle files in their source language. For Chinese content, it uses Volcano (豆包) ASR for superior accuracy; for other languages it employs OpenAI Whisper API with word-level timestamps, assembling cues based on punctuation rather than relying on Whisper's problematic default segmentation. Use it when users request transcripts, subtitles, or SRT files without cross-language translation.

Repository: claude-skills

Install in Claude Code

git clone --depth 1 https://github.com/jianshuo/claude-skills /tmp/wjs-transcribing-audio && cp -r /tmp/wjs-transcribing-audio/wjs-transcribing-audio ~/.claude/skills/wjs-transcribing-audio

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# wjs-transcribing-audio

Spoken audio in → timestamped SRT in the same language out. **This skill stops at the source-language SRT.** Translation to another language is the next skill (`/wjs-translating-subtitles`).

## When to use

- User provides a video or audio file and wants a transcript / SRT in the source language.
- User already has a translated SRT and the source SRT is missing.
- User asks "做 SRT" ("make an SRT") / "make subtitles" / "出逐字稿" ("produce a verbatim transcript") with no translation step requested yet.

## When NOT to use

- Source-language SRT already exists → skip straight to `/wjs-translating-subtitles`.
- User wants the transcript in a different language than spoken → run this skill first, then `/wjs-translating-subtitles`.
- User wants only the dub or burn-in → if SRT exists, skip; otherwise run this first.

## Routing: which engine

| Source language | Default engine | Why |
|---|---|---|
| Chinese (zh-CN, zh-HK, zh-TW) | **Volcano (豆包) ASR** | Materially better accuracy than Whisper for Chinese — user's standing preference |
| Any other (es, en, pt, fr, it, ja, ko, …) | **OpenAI Whisper API** with word-level granularity | Whisper's multilingual support is strong; word timestamps let us assemble cues ourselves |
| Offline / no API access | Local `openai-whisper` (medium) | Quality floor; same loop/blob failure modes apply |

For Chinese, do **not** default to Whisper unless the user explicitly asks for it or Volcano is unavailable. This is a deliberate routing decision — see user's memory on Chinese ASR priority.
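
The routing table can be read as a small decision function. The sketch below is illustrative only: `pick_engine`, its parameters, and the returned engine labels are hypothetical names, not part of the skill.

```python
def pick_engine(lang: str, online: bool = True,
                volcano_available: bool = True,
                force_whisper: bool = False) -> str:
    """Return an engine label for a language tag such as "zh-CN" or "es".

    Hypothetical helper: the labels are placeholders for the three rows
    of the routing table.
    """
    if not online:
        return "local-whisper-medium"      # offline quality floor
    is_chinese = lang.lower().split("-")[0] == "zh"
    if is_chinese and volcano_available and not force_whisper:
        return "volcano-asr"               # default for zh-CN / zh-HK / zh-TW
    return "openai-whisper-api"            # other languages, and the Chinese fallback
```

Keeping the Chinese check first among the online cases means Whisper is only reached for Chinese when the caller asks for it or Volcano is unavailable, which is the routing rule stated above.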

## OpenAI Whisper API path (non-Chinese, and Chinese fallback)

**The key principle: do not request `response_format=srt`.** Whisper cue-segmentation fails on long monologues (30-second blob cues) and quiet stretches (loop hallucinations). Request word-level timestamps and assemble cues yourself — the post-processing is deterministic and free.
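
Because the SRT is assembled locally rather than requested from the API, the cues have to be serialized by hand. A minimal sketch follows, assuming cues are `(start, end, text)` tuples in seconds; `srt_time` and `to_srt` are illustrative names, not functions from the skill.

```python
def srt_time(t: float) -> str:
    """Seconds -> 'HH:MM:SS,mmm', the timestamp format used in SRT files."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(cues) -> str:
    """cues: iterable of (start_sec, end_sec, text) tuples, already in order."""
    blocks = [f"{i}\n{srt_time(a)} --> {srt_time(b)}\n{text}\n"
              for i, (a, b, text) in enumerate(cues, 1)]
    return "\n".join(blocks)   # blank line between consecutive cues
```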

### Why not response_format=srt

Two failure modes that wreck `whisper-1` SRT output on long content:

1. **30-second blob cues.** In long monologues, `whisper-1` with `response_format=srt` emits one cue covering Whisper's full 30-second audio window. The transcript is fine; the timing is unusable for on-screen reading.
2. **Loop hallucination on quiet tails.** Greedy `temperature=0` on low-energy audio produces "你如果不把拥抱浪费写在这上面,你很难的" repeated 50 times.

Both stem from letting Whisper decide cue boundaries. Fix: word-level timestamps + your own punctuation-aware assembler.
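
A cheap sanity check for the second failure mode is to look for the same text repeated many times in a row. This is a sketch rather than part of the skill: `longest_repeat_run`, `looks_like_loop`, and the threshold of 5 are assumptions chosen for illustration.

```python
def longest_repeat_run(texts):
    """Length of the longest run of identical consecutive cue texts."""
    best = run = 0
    prev = object()                    # sentinel that never equals a string
    for t in texts:
        t = t.strip()
        run = run + 1 if t == prev else 1
        best = max(best, run)
        prev = t
    return best


def looks_like_loop(texts, threshold=5):
    """Flag a chunk whose cues repeat suspiciously often (threshold is a guess)."""
    return longest_repeat_run(texts) >= threshold
```

A chunk that trips this check is a candidate for re-transcription; the check detects the symptom only and does not replace the word-timestamp approach.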

### Calling the API

```bash
# 1. Compress for upload — 64kbps mono MP3 is plenty for speech.
#    OpenAI limit is 25MB per request; chunk into 10-min pieces
#    (≈4.5MB at 64kbps) for resilience under flaky proxies.
ffmpeg -hide_banner -loglevel error -y \
  -ss <start> -t 600 -i input.mp4 \
  -vn -ac 1 -ar 16000 -c:a libmp3lame -b:a 64k chunk.mp3
```
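
The `<start>` placeholder in the command above has to be filled in once per chunk. The sketch below plans the chunk offsets and builds the matching argument list; `plan_chunks`, `ffmpeg_args`, `approx_bytes`, and the output file name in the test are hypothetical, and the size estimate ignores container overhead.

```python
CHUNK_SEC = 600          # 10-minute pieces, as in the ffmpeg command above
BITRATE_BPS = 64_000     # 64 kbps mono MP3


def plan_chunks(total_sec, chunk_sec=CHUNK_SEC):
    """Return (start, length) pairs in seconds that cover the whole file."""
    chunks, start = [], 0
    while start < total_sec:
        chunks.append((start, min(chunk_sec, total_sec - start)))
        start += chunk_sec
    return chunks


def ffmpeg_args(src, start, out, chunk_sec=CHUNK_SEC):
    """Argument list mirroring the ffmpeg command above for one chunk."""
    return ["ffmpeg", "-hide_banner", "-loglevel", "error", "-y",
            "-ss", str(start), "-t", str(chunk_sec), "-i", src,
            "-vn", "-ac", "1", "-ar", "16000",
            "-c:a", "libmp3lame", "-b:a", "64k", out]


def approx_bytes(seconds, bitrate_bps=BITRATE_BPS):
    """Rough size of a constant-bitrate chunk: seconds * bits per second / 8."""
    return int(seconds * bitrate_bps / 8)
```

Under these assumptions a 79-minute file yields 8 chunks, and a full 600-second chunk comes to about 4.8 million bytes (roughly 4.6 MiB), well under the 25MB request limit.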

```python
# 2. Request word-level timestamps. Do NOT request response_format=srt.
import httpx, os
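data = {
    "model": "whisper-1",
    "language": "es",                        # pin source language; never auto-detect
    "response_format": "verbose_json",
    # "word" is the critical flag; "segment" is requested explicitly so that
    # segments[] is present in the response as well.
    "timestamp_granularities[]": ["word", "segment"],
    "temperature": "0.2",                    # non-zero sampling, to avoid greedy-decoding loops
}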
with open("chunk.mp3", "rb") as f:
    r = httpx.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        data=data,
        files={"file": ("chunk.mp3", f, "audio/mpeg")},
        timeout=600.0,
    )
r.raise_for_status()
j = r.json()
words    = j["words"]      # [{"word": "hola", "start": 0.12, "end": 0.34}, ...]
segments = j["segments"]   # see surprise below
```
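
Timestamps in each response are relative to the start of that chunk, so they need an offset before cues from different chunks can share one timeline. A minimal sketch, assuming chunks were cut at exact multiples of 600 seconds and that `responses` holds the parsed JSON of each chunk in order; `shift_times` and `merge_chunks` are illustrative names.

```python
def shift_times(items, offset):
    """Copy each word/segment dict with start/end moved onto the full-file timeline."""
    return [{**it, "start": it["start"] + offset, "end": it["end"] + offset}
            for it in items]


def merge_chunks(responses, chunk_sec=600):
    """responses: parsed JSON dicts, one per chunk, in chunk order."""
    words, segments = [], []
    for i, j in enumerate(responses):
        offset = i * chunk_sec
        words += shift_times(j.get("words", []), offset)
        segments += shift_times(j.get("segments", []), offset)
    return words, segments
```

If timestamps are shifted up front like this, any later step that also accepts an offset should be given 0, so the shift is not applied twice.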

### Surprise: words[] has no punctuation, segments[] is inconsistent

Whisper's `words[]` array typically has **no punctuation** in `word["word"]`: each entry is a bare token like `"做"`, `"个"`, `"测"`, `"试"`. Punctuation, when present, lives only in the `text` field of `segments[]`.

Worse, `segments[]` text is **inconsistently punctuated** across chunks of the same file: chunk 0 of a 79-min podcast might emit 285 bare segments ("做个测试" "你在" "呵呵") at 1-2s each with no punctuation; chunk 7 might emit 34 segments at 14-30s each *with* punctuation. Both behaviors ship in the same API response.

So the right recipe combines both: use `segments[]` for natural pause boundaries (already aligned to breath), but treat them as raw input to your own cue assembler, which uses word timestamps to split anywhere the segments are too long.
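
Splitting an over-long segment with word timestamps can be sketched as a greedy pass over the words that fall inside it. This is an illustration under stated assumptions, not the skill's own implementation: `split_by_words`, `words_in`, and the `joiner` parameter are hypothetical, and the two limits repeat values used in the recipe that follows.

```python
MAX_CUE_DUR = 5.0   # never exceed this cue length (seconds)
MAX_GAP = 1.0       # silence that forces a cue boundary (seconds)


def words_in(segment, words):
    """Words whose start time falls inside the segment's time range."""
    return [w for w in words if segment["start"] <= w["start"] < segment["end"]]


def split_by_words(words, joiner=""):
    """Greedy split of a word list into (start, end, text) cues.

    A new cue starts when adding the next word would push the cue past
    MAX_CUE_DUR, or when the silence before it is at least MAX_GAP.
    joiner is "" for Chinese and " " for space-delimited languages.
    """
    cues, buf = [], []

    def flush():
        if buf:
            cues.append((buf[0]["start"], buf[-1]["end"],
                         joiner.join(w["word"].strip() for w in buf)))
            buf.clear()

    for w in words:
        if buf and (w["end"] - buf[0]["start"] > MAX_CUE_DUR
                    or w["start"] - buf[-1]["end"] >= MAX_GAP):
            flush()
        buf.append(w)
    flush()
    return cues
```

A single word longer than `MAX_CUE_DUR` still becomes its own cue, since there is no finer timestamp to split on.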

### Cue assembly recipe

```python
TARGET_DUR = 3.0   # try to make cues this long
MAX_CUE_DUR = 5.0  # never exceed
MAX_CHARS = 18     # ~one line at Fontsize 14 on 1080-wide vertical
MAX_GAP = 1.0      # silence threshold → force cue boundary
MIN_PIECE = 0.3    # below this, merge with neighbor
SPLIT_PUNCT = set("，。！？；,.;!?")

# Step A: merge short segments[] toward TARGET_DUR (use segments,
#         not words — Whisper's segment boundaries are already
#         pause-aligned).
def assemble(segments, offset, joiner=""):
    # joiner: "" suits Chinese; pass " " for space-delimited languages
    # so that merged segment texts keep their word breaks.
    cues, buf = [], []

    def flush():
        if buf:
            cues.append((buf[0]["start"] + offset, buf[-1]["end"] + offset,
                         joiner.join(s["text"].strip() for s in buf)))
            buf.clear()

    for s in segments:
        dur = s["end"] - s["start"]
        # Long single segment WITH internal punct → split standalone
        if dur > MAX_CUE_DUR and any(c in s["text"] for c in SPLIT_PUNCT):
            flush()
            cues.extend(split_long_segment(s, offset))
            continue
        if not buf:
            buf.append(s)
            continue
        # Close the current cue on a long silence, once it has reached
        # TARGET_DUR, or if adding this segment would exceed MAX_CUE_DUR.
        if (s["start"] - buf[-1]["end"] >= MAX_GAP
                or buf[-1]["end"] - buf[0]["start"] >= TARGET_DUR
                or s["end"] - buf[0]["start"] > MAX_CUE_DUR):
            flush()
        buf.append(s)
    flush()
    return cues

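# Step B: final pass: split every internal comma/period to its own cue
#         (proportional timestamps by char position). Coalesce pieces
#         shorter than MIN_PIECE forward.

# Sketch of the helper that assemble() calls. Its original definition is
# not shown here, so this version is an assumption that follows Step B.
def split_long_segment(s, offset):
    text = s["text"].strip()
    if not text:
        return []
    start, end = s["start"] + offset, s["end"] + offset

    def at(i):  # character index -> timestamp, proportional to position
        return start + (end - start) * i / len(text)

    # Cut just past each punctuation mark, and at the end of the text.
    cuts = [i + 1 for i, ch in enumerate(text) if ch in SPLIT_PUNCT]
    if not cuts or cuts[-1] != len(text):
        cuts.append(len(text))
    spans, a = [], 0
    for b in cuts:
        # Pieces shorter than MIN_PIECE are coalesced forward.
        if at(b) - at(a) >= MIN_PIECE or b == len(text):
            spans.append((a, b))
            a = b
    return [(at(a), at(b), text[a:b].strip()) for a, b in spans]

# Step C: any
```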
