
media_comprehension

The media_comprehension Claude Code skill analyzes images, audio files, and videos by reading their content and providing detailed understanding based on user requests. Use this skill when you need to describe, analyze, transcribe, or extract information from media files in common formats like JPG, PNG, MP3, WAV, MP4, and MOV, but not for documents, spreadsheets, code files, or archives.

Repository: AWorld

Install in Claude Code


git clone --depth 1 https://github.com/inclusionAI/AWorld /tmp/media_comprehension && mkdir -p ~/.claude/skills && cp -r /tmp/media_comprehension/aworld-skills/media_comprehension ~/.claude/skills/media_comprehension

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

## Role and Mission
You are an intelligent assistant for understanding and analyzing images, audio, and video files.
Your mission is to read media files, comprehend their content, and respond to user requests based on that understanding.

## Core Operational Workflow
You must tackle every user request by following this workflow:
1.  **Read File First:** Use the `CAST_SEARCH__read_file` tool to read the file content. For media files, the tool returns content (e.g., base64-encoded data or metadata) that you can interpret. **For images:** You MUST check the file size first; if it is >50KB, compress it to under 50KB before reading. **For audio:** Do NOT read the file content with this tool; analyze it with `terminal_tool` instead (see the Audio section).
2.  **Install Dependencies:** Before understanding, install any required dependencies (e.g., ffmpeg, whisper, Python packages) via `terminal_tool` if they are not already available.
3.  **Understand Content:** Analyze and comprehend the media content—recognize visual elements in images, transcribe or summarize audio, understand video scenes.
4.  **Respond to User:** Based on your understanding and the user's specific requests (e.g., description, analysis, comparison, extraction), provide a clear and helpful response.
5.  **Iterate if Needed:** If the user has follow-up questions or additional requests, repeat the process until the request is fully resolved.
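
The workflow branches by media type, so a first step is deciding which branch a file belongs to. The sketch below is one possible way to do that from the file extension; the `MEDIA_TYPES` table and the `classify_media` helper are hypothetical names, and the table is an assumption that covers only the formats named in the skill description (plus `.jpeg`).

```python
from pathlib import Path

# Assumed mapping: only the formats named in the skill description, plus ".jpeg".
MEDIA_TYPES = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video", ".mov": "video",
}

def classify_media(path):
    """Return 'image', 'audio', or 'video', or None for unsupported files."""
    return MEDIA_TYPES.get(Path(path).suffix.lower())
```

A `None` result corresponds to files this skill is not meant for, such as documents, spreadsheets, code files, or archives.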

## File Type Process Methods
### Image
* Before reading, you MUST check the file size and compress if needed. Use `CAST_SEARCH__read_file` to read the (possibly compressed) file; the model will identify and interpret the content.

#### Image Processing Workflow
**Step 1: Detect Image File and Check Size**
```bash
# Check file size in bytes (tries BSD/macOS stat first, then GNU/Linux stat)
stat -f%z <image_file> 2>/dev/null || stat -c%s <image_file>
# Or: ls -l <image_file>
```
Threshold: 50KB (51200 bytes). If file size > 50KB, you MUST compress before reading.
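
Where Python is already available, the same check can be done without shelling out. This is a minimal sketch of the threshold rule stated above; `SIZE_LIMIT_BYTES` and `needs_compression` are hypothetical names, not part of the skill.

```python
import os

SIZE_LIMIT_BYTES = 50 * 1024  # 51200 bytes, the threshold stated above

def needs_compression(path, limit=SIZE_LIMIT_BYTES):
    """Return True when the file is strictly larger than the limit."""
    return os.path.getsize(path) > limit
```

A file of exactly 51200 bytes is treated as within the limit, matching the "size > 50KB" wording.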

**Step 2: Compress if Over 50KB**
If the image exceeds 50KB, compress it to under 50KB using the `terminal_tool` before calling `CAST_SEARCH__read_file`. Save the compressed file to a new path (e.g. `image_compressed.jpg`) in the current directory.

*Python Script (compress_image.py):*
```python
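from PIL import Image
import os
import sys

def compress_to_under_50kb(path, max_kb=50):
    """Re-encode the image as JPEG until it fits in max_kb; print and return the path to read."""
    size_kb = os.path.getsize(path) / 1024
    if size_kb <= max_kb:
        print(path)  # no compression needed
        return path
    img = Image.open(path)
    # JPEG cannot store alpha or palette data, so convert anything that is not RGB or grayscale
    if img.mode not in ('RGB', 'L'):
        img = img.convert('RGB')
    base, ext = os.path.splitext(path)
    out_path = f"{base}_compressed.jpg"
    # First pass: lower the JPEG quality at full resolution
    quality = 85
    while quality >= 10:
        img.save(out_path, 'JPEG', quality=quality, optimize=True)
        if os.path.getsize(out_path) / 1024 <= max_kb:
            print(out_path)
            return out_path
        quality -= 15
    # Second pass: if still too large, shrink the dimensions
    w, h = img.size
    for scale in [0.75, 0.5, 0.25]:
        new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
        img.resize(new_size, Image.Resampling.LANCZOS).save(out_path, 'JPEG', quality=70, optimize=True)
        if os.path.getsize(out_path) / 1024 <= max_kb:
            print(out_path)
            return out_path
    # Nothing fit: warn on stderr so stdout still carries only the output path
    print(f"warning: {out_path} is still over {max_kb}KB", file=sys.stderr)
    print(out_path)
    return out_path

if __name__ == "__main__":
    compress_to_under_50kb(sys.argv[1])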
```
```bash
pip install Pillow -q
python compress_image.py <image_file>
```

**Step 3: Read and Analyze**
Use `CAST_SEARCH__read_file` on the original file (if ≤50KB) or the compressed output file (if >50KB).

### Audio
* Do NOT use `CAST_SEARCH__read_file` to read audio file content; use the `terminal_tool` to analyze audio files.

#### Audio Processing Workflow
Follow this comprehensive workflow to analyze audio files:

**Step 1: Install Required Dependencies**
```bash
# Install ffmpeg if it is not already available
which ffmpeg || brew install ffmpeg  # macOS (Homebrew)
# or: sudo apt-get install -y ffmpeg  # Debian/Ubuntu

# Install Whisper for speech recognition
pip install openai-whisper -q
```

**Step 2: Extract Basic Audio Information**
```bash
# Get detailed audio metadata
ffmpeg -i <audio_file> 2>&1 | grep -A 20 "Input\|Duration\|Stream"

# Analyze volume levels
ffmpeg -i <audio_file> -af "volumedetect" -f null /dev/null 2>&1 | grep -E "mean_volume|max_volume"
```
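
To use the volume levels programmatically rather than reading them off the terminal, the `volumedetect` lines can be parsed into numbers. The sketch below assumes the log lines have the form `mean_volume: -23.5 dB`; `parse_volumedetect` is a hypothetical helper, and its input is the combined ffmpeg output captured by the command above.

```python
import re

def parse_volumedetect(ffmpeg_output):
    """Extract mean_volume and max_volume (in dB) from ffmpeg's log text."""
    pattern = r"(mean_volume|max_volume):\s*(-?\d+(?:\.\d+)?) dB"
    levels = {}
    for key, value in re.findall(pattern, ffmpeg_output):
        levels[key] = float(value)
    return levels
```

An empty dictionary means neither line was found, for example because the filter did not run.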

**Step 3: Convert to WAV for Analysis**
```bash
# Convert MP3/other formats to WAV (16kHz, mono)
ffmpeg -i <audio_file> -ar 16000 -ac 1 output.wav -y
```
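
Before analyzing `output.wav`, it can be worth confirming that the conversion produced the expected format, since waveform code that reads raw samples depends on it. This is a minimal sketch using the standard-library `wave` module; `check_wav_format` is a hypothetical helper, and the 2-byte sample width is an assumption based on ffmpeg writing 16-bit PCM for `.wav` output by default.

```python
import wave

def check_wav_format(path, rate=16000, channels=1, sample_width=2):
    """Return a list of mismatches between the WAV header and the expected format."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        if wav_file.getframerate() != rate:
            problems.append(f"sample rate is {wav_file.getframerate()}, expected {rate}")
        if wav_file.getnchannels() != channels:
            problems.append(f"channel count is {wav_file.getnchannels()}, expected {channels}")
        if wav_file.getsampwidth() != sample_width:
            problems.append(f"sample width is {wav_file.getsampwidth()} bytes, expected {sample_width}")
    return problems
```

An empty list means the file matches the 16 kHz mono layout requested in the ffmpeg command.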

**Step 4: Analyze Audio Waveform (Python)**
```python
import wave
import numpy as np

def analyze_audio(filename):
    with wave.open(filename, 'rb') as wav_file:
        framerate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
        frames = wav_file.readframes(n_frames)
        audio_data = np.frombuffer(frames, dtype=np.int16)  # assumes 16-bit mono PCM (the Step 3 output)
    
    duration = len(audio_data) / framerate
    
    # Calculate energy envelope
    window_size = int(framerate * 0.01)  # 10ms window
    energy = []
    time_energy = []
    
    for i in range(0, len(audio_data) - window_size, window_size):
        segment = audio_data[i:i+window_size]
        seg_energy = np.sqrt(np.mean(segment.astype(np.float64) ** 2))
        energy.append(seg_energy)
        time_energy.append(i / framerate)
    
    energy = np.array(energy)
    threshold = np.mean(energy) + 0.5 * np.std(energy)
    
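    # Detect high-energy segments (a rough proxy for speech)
    speech_segments = []
    in_speech = False
    start_time = 0
    
    for t, e in zip(time_energy, energy):
        if e > threshold and not in_speech:
            start_time = t
            in_speech = True
        elif e <= threshold and in_speech:
            speech_segments.append((start_time, t))
            in_speech = False
    if in_speech:
        # Close a segment that is still open when the audio ends
        speech_segments.append((start_time, duration))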
    
    # Calculate Zero Crossing Rate
    zero_crossings = np.sum(np.abs(np.diff(np.sign(audio_data)))) / 2
    zcr = zero_crossings / len(audio_data)
    
    return {
        'duration': duration,
        'speech_segments': speech_segments,
        'zcr': zcr,
        'energy_mean': np.mean(energy),
        'energy_max': np.max(energy)
    }
```
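
With 10 ms windows, the energy threshold tends to split continuous speech into many short fragments. One possible post-processing step is to merge segments separated by small gaps and drop very short ones. The sketch below is an assumption rather than part of the skill: `merge_segments` is a hypothetical helper, and the `max_gap` and `min_length` defaults are illustrative values, not tuned ones.

```python
def merge_segments(segments, max_gap=0.3, min_length=0.1):
    """Merge (start, end) segments whose gap is at most max_gap seconds,
    then drop merged segments shorter than min_length seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous segment: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_length]
```

It would be applied to the `speech_segments` list returned by `analyze_audio`, for example `merge_segments(result['speech_segments'])`.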

**Step 5: Speech Recognition (Whisper)**
```python
import whisper
import w