media_comprehension
The media_comprehension Claude Code skill analyzes images, audio files, and videos by reading their content and providing detailed understanding based on user requests. Use this skill when you need to describe, analyze, transcribe, or extract information from media files in common formats like JPG, PNG, MP3, WAV, MP4, and MOV, but not for documents, spreadsheets, code files, or archives.
git clone --depth 1 https://github.com/inclusionAI/AWorld /tmp/media_comprehension && cp -r /tmp/media_comprehension/aworld-skills/media_comprehension ~/.claude/skills/media_comprehension
SKILL.md
## Role and Mission
You are an intelligent assistant for understanding and analyzing images, audio, and video files.
Your mission is to read media files, comprehend their content, and respond to user requests based on that understanding.
## Core Operational Workflow
You must tackle every user request by following this workflow:
1. **Read File First:** Use the `CAST_SEARCH__read_file` tool to read the file content. For media files, the tool returns content (e.g., base64-encoded data or metadata) that you can interpret. **For images:** You MUST check the file size first; if it is over 50KB, compress it to under 50KB before reading. **For audio:** Do NOT read the file content with this tool; analyze it through `terminal_tool` instead.
2. **Install Dependencies:** Before analyzing the content, install any required dependencies (e.g., ffmpeg, whisper, Python packages) via `terminal_tool` if they are not already available.
3. **Understand Content:** Analyze and comprehend the media content: recognize visual elements in images, transcribe or summarize audio, and understand video scenes.
4. **Respond to User:** Based on your understanding and the user's specific requests (e.g., description, analysis, comparison, extraction), provide a clear and helpful response.
5. **Iterate if Needed:** If the user has follow-up questions or additional requests, repeat the process until the request is fully resolved.
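The routing implied by this workflow can be sketched as a small helper that decides how a file should be handled before any tool is called. This is only an illustration: the name `classify_media` and the extension sets are assumptions built from the formats named in the skill description (JPG, PNG, MP3, WAV, MP4, MOV), and `.jpeg` is added as an assumed alias.

```python
from pathlib import Path

# Extensions taken from the formats the skill description lists.
# ".jpeg" is an assumption, added because it is a common alias for JPG.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
AUDIO_EXTS = {".mp3", ".wav"}
VIDEO_EXTS = {".mp4", ".mov"}


def classify_media(path):
    """Return 'image', 'audio', 'video', or None for unsupported files."""
    ext = Path(path).suffix.lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in VIDEO_EXTS:
        return "video"
    return None  # documents, spreadsheets, code files, archives, etc.


if __name__ == "__main__":
    for name in ["photo.JPG", "talk.mp3", "clip.mov", "report.pdf"]:
        print(name, "->", classify_media(name))
```

A `None` result corresponds to the file types this skill is not meant for, so a caller could stop early instead of attempting to read the file.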
## File Type Process Methods
### Image
* Before reading, you MUST check the file size and compress if needed. Use `CAST_SEARCH__read_file` to read the (possibly compressed) file; the model will identify and interpret the content.
#### Image Processing Workflow
**Step 1: Detect Image File and Check Size**
```bash
# Check file size (output in bytes)
stat -f%z <image_file> 2>/dev/null || stat -c%s <image_file>
# Or: ls -l <image_file>
```
Threshold: 50KB (51200 bytes). If file size > 50KB, you MUST compress before reading.
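When the check runs from Python rather than the shell, the same threshold can be applied with `os.path.getsize`, which avoids the macOS/Linux difference in `stat` flags. The helper name `needs_compression` and the constant `MAX_IMAGE_BYTES` are hypothetical names used for this sketch.

```python
import os

MAX_IMAGE_BYTES = 50 * 1024  # 51200 bytes, the threshold stated above


def needs_compression(path, limit_bytes=MAX_IMAGE_BYTES):
    """Return True when the file is strictly larger than the limit."""
    return os.path.getsize(path) > limit_bytes
```

A file of exactly 51200 bytes is treated as within the limit, matching the "> 50KB" wording of the rule.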
**Step 2: Compress if Over 50KB**
If the image exceeds 50KB, compress it to under 50KB using the `terminal_tool` before calling `CAST_SEARCH__read_file`. Save the compressed file to a new path (e.g. `image_compressed.jpg`) in the current directory.
*Python Script (compress_image.py):*
```python
from PIL import Image
import os
import sys


def compress_to_under_50kb(path, max_kb=50):
    """Return the path of an image no larger than max_kb, compressing if needed."""
    size_kb = os.path.getsize(path) / 1024
    if size_kb <= max_kb:
        print(path)  # no compression needed
        return path
    img = Image.open(path)
    # JPEG cannot store alpha or palette data, so convert those modes to RGB
    if img.mode in ('RGBA', 'LA', 'P'):
        img = img.convert('RGB')
    base, _ = os.path.splitext(path)
    out_path = f"{base}_compressed.jpg"
    # First try lowering the JPEG quality at full resolution
    quality = 85
    while quality >= 10:
        img.save(out_path, 'JPEG', quality=quality, optimize=True)
        if os.path.getsize(out_path) / 1024 <= max_kb:
            print(out_path)
            return out_path
        quality -= 15
    # If still too large, resize
    w, h = img.size
    for scale in [0.75, 0.5, 0.25]:
        new_size = (int(w * scale), int(h * scale))
        img.resize(new_size, Image.Resampling.LANCZOS).save(out_path, 'JPEG', quality=70, optimize=True)
        if os.path.getsize(out_path) / 1024 <= max_kb:
            print(out_path)
            return out_path
    print(out_path)  # best effort: the file may still exceed max_kb
    return out_path


if __name__ == "__main__":
    compress_to_under_50kb(sys.argv[1])
```
```bash
pip install Pillow -q
python compress_image.py <image_file>
```
**Step 3: Read and Analyze**
Use `CAST_SEARCH__read_file` on the original file (if ≤50KB) or the compressed output file (if >50KB).
### Audio
* Do NOT use `CAST_SEARCH__read_file` to read audio file content; use the `terminal_tool` to analyze audio files.
#### Audio Processing Workflow
Follow this comprehensive workflow to analyze audio files:
**Step 1: Install Required Dependencies**
```bash
# Check if ffmpeg is available
which ffmpeg || brew install ffmpeg # macOS
# or: apt-get install ffmpeg # Linux
# Install Whisper for speech recognition
pip install openai-whisper -q
```
**Step 2: Extract Basic Audio Information**
```bash
# Get detailed audio metadata
ffmpeg -i <audio_file> 2>&1 | grep -A 20 "Input\|Duration\|Stream"
# Analyze volume levels
ffmpeg -i <audio_file> -af "volumedetect" -f null /dev/null 2>&1 | grep -E "mean_volume|max_volume"
```
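If the volume figures are needed as numbers rather than raw log lines, the `volumedetect` output can be parsed with a regular expression. The sketch below assumes the usual `mean_volume: <value> dB` and `max_volume: <value> dB` line format; the function name `parse_volumedetect` and the sample values are illustrative, not real measurements.

```python
import re

# Matches lines such as "[Parsed_volumedetect_0 @ 0x...] mean_volume: -23.4 dB"
_VOLUME_RE = re.compile(r"(mean_volume|max_volume):\s*(-?\d+(?:\.\d+)?)\s*dB")


def parse_volumedetect(ffmpeg_stderr):
    """Extract mean/max volume in dB from ffmpeg volumedetect output text."""
    return {name: float(value) for name, value in _VOLUME_RE.findall(ffmpeg_stderr)}


if __name__ == "__main__":
    # Made-up sample text in the expected format, not real ffmpeg output
    sample = (
        "[Parsed_volumedetect_0 @ 0x55d] mean_volume: -23.4 dB\n"
        "[Parsed_volumedetect_0 @ 0x55d] max_volume: -3.0 dB\n"
    )
    print(parse_volumedetect(sample))
```

An empty dictionary signals that no volume lines were found, for example when ffmpeg failed before the filter ran.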
**Step 3: Convert to WAV for Analysis**
```bash
# Convert MP3/other formats to WAV (16kHz, mono)
ffmpeg -i <audio_file> -ar 16000 -ac 1 output.wav -y
```
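The same conversion can be driven from Python, which is convenient when later steps are also scripted. This is a sketch under the assumption that `ffmpeg` is on the `PATH`; the helper names `build_wav_command` and `convert_to_wav` are hypothetical.

```python
import subprocess


def build_wav_command(src, dst="output.wav", rate=16000, channels=1):
    """Build the ffmpeg argument list for a 16 kHz mono WAV conversion."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), "-ac", str(channels), dst]


def convert_to_wav(src, dst="output.wav"):
    """Run ffmpeg and return the output path; raises CalledProcessError on failure."""
    subprocess.run(build_wav_command(src, dst), check=True, capture_output=True)
    return dst
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.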
**Step 4: Analyze Audio Waveform (Python)**
```python
import wave
import numpy as np


def analyze_audio(filename):
    # Assumes 16-bit mono PCM, as produced by the Step 3 conversion
    with wave.open(filename, 'rb') as wav_file:
        framerate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
        frames = wav_file.readframes(n_frames)
    audio_data = np.frombuffer(frames, dtype=np.int16)
    duration = len(audio_data) / framerate
    # Calculate energy envelope (RMS per window)
    window_size = int(framerate * 0.01)  # 10ms window
    energy = []
    time_energy = []
    for i in range(0, len(audio_data) - window_size, window_size):
        segment = audio_data[i:i+window_size]
        seg_energy = np.sqrt(np.mean(segment.astype(np.float64) ** 2))
        energy.append(seg_energy)
        time_energy.append(i / framerate)
    energy = np.array(energy)
    threshold = np.mean(energy) + 0.5 * np.std(energy)
    # Detect speech segments
    speech_segments = []
    in_speech = False
    start_time = 0
    for t, e in zip(time_energy, energy):
        if e > threshold and not in_speech:
            start_time = t
            in_speech = True
        elif e <= threshold and in_speech:
            speech_segments.append((start_time, t))
            in_speech = False
    if in_speech:
        # Close a segment that is still open when the audio ends
        speech_segments.append((start_time, duration))
    # Calculate Zero Crossing Rate
    zero_crossings = np.sum(np.abs(np.diff(np.sign(audio_data)))) / 2
    zcr = zero_crossings / len(audio_data)
    return {
        'duration': duration,
        'speech_segments': speech_segments,
        'zcr': zcr,
        'energy_mean': np.mean(energy),
        'energy_max': np.max(energy)
    }
```
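Because the threshold is applied to 10ms windows, the resulting `speech_segments` list tends to be fragmented by brief dips in energy. One possible post-processing step, sketched here as an assumption rather than part of the original workflow, is to merge segments separated by short gaps and drop very short ones. The name `merge_segments` and the `max_gap` and `min_duration` defaults are illustrative values to tune per recording.

```python
def merge_segments(segments, max_gap=0.3, min_duration=0.1):
    """Merge segments separated by short gaps, then drop very short ones."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap to the previous segment is small: extend that segment
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_duration]


if __name__ == "__main__":
    raw = [(0.0, 0.25), (0.5, 1.0), (3.0, 3.0625), (5.0, 6.0)]
    print(merge_segments(raw))  # [(0.0, 1.0), (5.0, 6.0)]
```

In the example, the first two segments are joined because the gap between them is 0.25 seconds, and the 0.0625-second blip at 3.0 seconds is discarded.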
**Step 5: Speech Recognition (Whisper)**
```python
import whisper
```
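A minimal transcription sketch for this step, assuming the `openai-whisper` package from Step 1 is installed and the audio has been converted as in Step 3. The `base` model size and the helper names `transcribe` and `format_segments` are illustrative choices, not taken from the original skill.

```python
def format_segments(segments):
    """Render Whisper-style segments as '[start - end] text' lines."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:.2f} - {seg['end']:.2f}] {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe(path, model_name="base"):
    """Transcribe an audio file and return (full_text, timestamped_lines)."""
    import whisper  # imported lazily so format_segments works without the package

    model = whisper.load_model(model_name)  # downloads model weights on first use
    result = model.transcribe(path)
    return result["text"].strip(), format_segments(result["segments"])
```

Calling `transcribe("output.wav")` would return the full transcript together with one timestamped line per segment, which can be compared against the energy-based speech segments from Step 4.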