video-subtitles-and-audio-insert-workflow
This Python-based workflow uses moviepy and system fonts to embed SRT subtitle files directly into videos, supporting multilingual text including Chinese characters. Use it when you need to burn hard subtitles into video files with customizable styling, positioning, and font support across platforms, particularly for content requiring CJK (Chinese, Japanese, Korean) text rendering.
git clone --depth 1 https://github.com/inclusionAI/AWorld /tmp/video-subtitles-and-audio-insert-workflow && cp -r /tmp/video-subtitles-and-audio-insert-workflow/aworld-skills/video_subtitles_audios_insert ~/.claude/skills/video-subtitles-and-audio-insert-workflow
## 1. Choosing a Technical Approach
### Recommended: Python moviepy + CJK fonts
- **Tools**: moviepy 2.x
- **Fonts**: System CJK fonts (e.g. STHeiti, Songti, PingFang)
- **Pros**: Cross-platform, supports Chinese, easy styling control
- **Cons**: Slower processing (~40s for an 80s video)
### Alternative: FFmpeg + libass (requires rebuild)
- **Tools**: FFmpeg with libass support
- **Pros**: Fast processing
- **Cons**: Requires rebuilding FFmpeg; complex setup
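
Before committing to the FFmpeg route, it can help to check whether the installed build already includes libass, since a rebuild is only needed when it does not. A minimal sketch, assuming `ffmpeg` is on the PATH and that its `-version` output lists the configure flags; the helper names `has_libass` and `ffmpeg_has_libass` are illustrative and not part of this skill:

```python
import shutil
import subprocess


def has_libass(version_output: str) -> bool:
    """Return True if the configure flags in `ffmpeg -version` output enable libass."""
    return "--enable-libass" in version_output


def ffmpeg_has_libass() -> bool:
    """Run `ffmpeg -version` and check for libass; False if ffmpeg is not installed."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(
        ["ffmpeg", "-version"], capture_output=True, text=True, check=False
    )
    return has_libass(result.stdout)
```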
---
## 2. Core Code Template
```python
#!/usr/bin/env python3
import re

from moviepy import VideoFileClip, TextClip, CompositeVideoClip


def parse_srt(srt_file):
    """Parse an SRT subtitle file."""
    with open(srt_file, 'r', encoding='utf-8') as f:
        content = f.read()
    blocks = content.strip().split('\n\n')
    subtitles = []
    for block in blocks:
        lines = block.strip().split('\n')
        if len(lines) >= 3:
            time_line = lines[1]
            match = re.match(r'(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})', time_line)
            if match:
                start_h, start_m, start_s, start_ms, end_h, end_m, end_s, end_ms = match.groups()
                start_time = int(start_h) * 3600 + int(start_m) * 60 + int(start_s) + int(start_ms) / 1000
                end_time = int(end_h) * 3600 + int(end_m) * 60 + int(end_s) + int(end_ms) / 1000
                text = '\n'.join(lines[2:])
                subtitles.append(((start_time, end_time), text))
    return subtitles


def make_textclip(txt, font_path, font_size=40):
    """Create a subtitle text clip."""
    return TextClip(
        text=txt,
        font_size=font_size,  # Tune for resolution
        color='white',
        font=font_path,  # CJK-capable font path
        stroke_color='black',
        stroke_width=2.5,
        method='caption',
        size=(1100, None),  # 1100px width, auto height
        text_align='center'
    )


def add_subtitles(video_path, srt_path, output_path, font_path, font_size=40, bottom_margin=100):
    """Burn hard subtitles into a video."""
    video = VideoFileClip(video_path)
    subtitles = parse_srt(srt_path)
    subtitle_clips = []
    for (start, end), text in subtitles:
        txt_clip = make_textclip(text, font_path, font_size)
        txt_clip = txt_clip.with_start(start).with_end(end)
        # Position: pixels from bottom (avoids wrapped lines past the lower edge)
        txt_clip = txt_clip.with_position(('center', video.h - bottom_margin))
        subtitle_clips.append(txt_clip)
    final_video = CompositeVideoClip([video] + subtitle_clips)
    # Important: cap bitrate to avoid huge files
    # Prefer checking source bitrate first, then ~1.2–1.5× that value
    final_video.write_videofile(
        output_path,
        codec='libx264',
        audio_codec='aac',
        fps=video.fps,
        preset='medium',
        bitrate='600k',  # Tune to source (often 400–800k)
        threads=4
    )
    video.close()


# Example usage
if __name__ == '__main__':
    add_subtitles(
        video_path='input_video.mp4',
        srt_path='subtitles.srt',
        output_path='output_video_with_subtitles.mp4',
        font_path='/System/Library/Fonts/STHeiti Medium.ttc',  # macOS
        font_size=40,  # e.g. 40px for 1280×720
        bottom_margin=100  # 100px from bottom
    )
```
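The timestamp arithmetic in `parse_srt` (hours × 3600 + minutes × 60 + seconds + milliseconds / 1000) can be checked in isolation; for example, `00:01:02,500` should come out as 62.5 seconds. The sketch below pulls that step into a hypothetical helper, `srt_time_to_seconds`, and adds a second hypothetical helper, `split_srt_blocks`, that tolerates Windows line endings, which a plain split on `'\n\n'` does not. Neither helper is part of the template; they are one possible way to harden it.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as '00:01:02,500' to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def split_srt_blocks(content: str) -> list[str]:
    """Split SRT text into cue blocks, normalizing CRLF and runs of blank lines."""
    normalized = content.replace("\r\n", "\n").strip()
    return [block for block in re.split(r"\n{2,}", normalized) if block.strip()]
```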
---
## 3. Key Parameter Settings
### 3.1 Font choice (critical)
```python
# macOS
font_path = '/System/Library/Fonts/STHeiti Medium.ttc' # STHeiti (recommended)
# or
font_path = '/System/Library/Fonts/Supplemental/Songti.ttc' # Songti
# Linux
font_path = '/usr/share/fonts/truetype/wqy/wqy-microhei.ttc' # WenQuanYi Micro Hei
# Windows
font_path = 'C:/Windows/Fonts/msyh.ttc' # Microsoft YaHei
```
**Note**: You must use a font that includes the glyphs you need (e.g. Chinese); otherwise subtitles show as boxes.
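
Because these paths differ by platform, one option is to probe a list of candidates and use the first that exists. A minimal sketch, assuming the paths listed above are where the fonts live on your system; `pick_font` and `CJK_FONT_CANDIDATES` are illustrative names, not part of this skill:

```python
import os

# Candidate paths taken from the list above; adjust for your system.
CJK_FONT_CANDIDATES = [
    "/System/Library/Fonts/STHeiti Medium.ttc",        # macOS
    "/System/Library/Fonts/Supplemental/Songti.ttc",   # macOS
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",  # Linux
    "C:/Windows/Fonts/msyh.ttc",                       # Windows
]


def pick_font(candidates=CJK_FONT_CANDIDATES):
    """Return the first candidate path that exists on disk, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

This only confirms that the file exists. It does not verify that the font covers the glyphs in your subtitles, so a quick visual check of the output is still worthwhile.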
### 3.2 Font size by resolution
| Resolution | Recommended size | Notes |
|------------|------------------|-------|
| 1280×720 | 40px | HD |
| 1920×1080 | 60px | Full HD |
| 3840×2160 | 120px | 4K |
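
The three rows follow a linear rule: each size equals the video height divided by 18. For resolutions not listed, a minimal sketch that extends that trend, assuming linear scaling remains a reasonable starting point (the function name is illustrative):

```python
def font_size_for_height(video_height: int, base_height: int = 720, base_size: int = 40) -> int:
    """Scale the 40px-at-720p baseline linearly with video height."""
    return round(video_height * base_size / base_height)
```

Treat the result as a starting value and adjust by eye for the font and content in use.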
### 3.3 Position
```python
# Pixels from bottom ≈ font_size × 2.5
bottom_margin = font_size * 2.5
# Example: 40px font
bottom_margin = 100 # 100px from bottom
# Y position
position_y = video.h - bottom_margin
```
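To see why a 2.5× margin leaves room for a wrapped second line, the arithmetic can be written out. This is a rough sketch under a stated assumption: each line is taken to be about `font_size` pixels tall with 4 pixels between lines, whereas real heights depend on the font's metrics. Both function names are hypothetical.

```python
def subtitle_top_y(video_height: int, font_size: int, factor: float = 2.5) -> int:
    """Y coordinate of the subtitle clip's top edge, using bottom_margin = font_size * factor."""
    bottom_margin = round(font_size * factor)
    return video_height - bottom_margin


def lines_that_fit(font_size: int, bottom_margin: int, interline: int = 4) -> int:
    """Rough count of text lines that fit between the clip's top edge and the frame bottom.

    Assumption: each line is about font_size pixels tall, with `interline`
    pixels between lines; actual rendered height depends on the font.
    """
    return max(0, (bottom_margin + interline) // (font_size + interline))
```

Under these assumptions, a 40px font with a 100px margin on a 720-pixel-high frame places the clip's top edge at y = 620 and fits two lines, while a 48px font with an 80px margin fits only one.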
### 3.4 Bitrate (avoid oversized files)
```python
# Step 1: inspect source bitrate
# ffprobe -v error -show_entries format=bit_rate input.mp4
# Step 2: set output bitrate (often 1.2–1.5× source)
# Example:
# source ≈ 444 kbps → output ≈ 600 kbps (~1.35×)
bitrate='600k'
```
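The two steps above can be scripted. A minimal sketch, assuming `ffprobe` is on the PATH and reports a numeric `bit_rate` for the container (some files report `N/A`, which this sketch does not handle); `probe_bitrate` and `target_bitrate` are illustrative names, and rounding to the nearest 10 kbps is a choice made here, not a requirement:

```python
import subprocess


def probe_bitrate(path: str) -> int:
    """Return the container-level bitrate in bits per second, as reported by ffprobe."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=bit_rate",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def target_bitrate(source_bps: int, factor: float = 1.35) -> str:
    """Scale the source bitrate and round to the nearest 10 kbps, e.g. '600k'."""
    kbps = round(source_bps * factor / 1000 / 10) * 10
    return f"{kbps}k"
```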
---
## 4. Common Issues and Fixes
### Issue 1: Subtitles show as boxes
**Cause**: Font lacks the needed glyphs (e.g. using Arial or Times New Roman for Chinese).
**Fix**: Use a CJK-capable font (STHeiti, Songti, Microsoft YaHei, etc.).
### Issue 2: Output file size explodes
**Cause**: Bitrate set too high (e.g. 5000 kbps).
**Fix**:
```python
# Check source bitrate (run in a shell):
# ffprobe -v error -show_entries format=bit_rate input.mp4
# Set a sensible bitrate (~1.2–1.5× source)
bitrate='600k' # if source was ~444 kbps
```
### Issue 3: Wrapped lines extend past the bottom
**Cause**: Font too large or position too low.
**Fix**:
- Reduce font size (e.g. 48px → 40px)
- Raise position (e.g. 80px → 100px from bottom)
- Use: `bottom_margin = font_size * 2.5`
### Issue 4: Subtitles look faint or unclear
**Cause**: Stroke too thin or poor contrast.
**Fix**:
```python
color='white',
stroke_color='black',
stroke_width=2.5 # often 2–3px works well
```
---
## 5. End-to-End Workflow
### Step 1: Prepare the subtitle file
```bash
# Ensure the SRT is UTF-8 (macOS syntax shown; on Linux use `file -i`)
file -I subtitles.srt
# Should include: charset=utf-8
# If wrong encoding, convert
iconv -f GBK -t UTF-8 subtitles_gbk.srt > subtitles_utf8.srt
```
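The same conversion can be done in Python when `iconv` is not available. A minimal sketch, assuming the file is either UTF-8 (with or without a BOM) or GBK; the function names and the fallback order are illustrative choices:

```python
def decode_srt_bytes(raw: bytes, fallbacks=("utf-8-sig", "gbk")) -> str:
    """Decode subtitle bytes, trying each encoding in order."""
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode with any of: {', '.join(fallbacks)}")


def convert_srt_to_utf8(src_path: str, dst_path: str) -> None:
    """Rewrite an SRT file as UTF-8 without a BOM, keeping its line endings."""
    with open(src_path, "rb") as f:
        text = decode_srt_bytes(f.read())
    with open(dst_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

UTF-8 is tried first because it is strict: GBK bytes almost always fail to decode as UTF-8, whereas GBK will accept many byte sequences from other encodings and produce garbled text. If the source encoding is neither of the two, inspect the output before burning it into a video.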
### Step 2: Inspect the source video
```bash
# Resolution
ffprobe -v error -show_entries stream=width,height input.m