digital-archive
The digital-archive skill provides patterns for constructing production-quality digital archives with AI-powered analysis and knowledge graph construction. Use it when integrating multiple content sources, implementing automated categorization and entity extraction, building unified data schemas, or creating searchable archives with enriched metadata from OCR, web scraping, and social media content.
git clone --depth 1 https://github.com/jamditis/claude-skills-journalism /tmp/digital-archive && cp -r /tmp/digital-archive/research-toolkit/skills/digital-archive ~/.claude/skills/digital-archive
# Digital archive methodology
Patterns for building production-quality digital archives with AI-powered analysis and knowledge graph construction.
## Archive architecture
### Multi-source integration pattern
```
┌─────────────────┐    ┌──────────────────┐    ┌────────────────┐
│  OCR Pipeline   │    │   Web Scraping   │    │  Social Media  │
│  (newspapers)   │    │    (articles)    │    │ (transcripts)  │
└────────┬────────┘    └────────┬─────────┘    └───────┬────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                    ┌───────────▼───────────┐
                    │    Unified Schema     │
                    │     (35+ fields)      │
                    └───────────┬───────────┘
                                │
         ┌──────────────────────┼──────────────────────┐
         │                      │                      │
┌────────▼────────┐  ┌──────────▼──────────┐   ┌───────▼───────┐
│  AI Enrichment  │  │  Entity Extraction  │   │  PDF Archive  │
│    (Gemini)     │  │  (Knowledge Graph)  │   │  (WCAG 2.1)   │
└────────┬────────┘  └──────────┬──────────┘   └───────┬───────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                    ┌───────────▼───────────┐
                    │     Google Sheets     │
                    │  (primary database)   │
                    └───────────┬───────────┘
                                │
                    ┌───────────▼───────────┐
                    │    Frontend Export    │
                    │      (JSON/CSV)       │
                    └───────────────────────┘
```
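The top half of the diagram can be sketched as a set of per-source adapters that map raw payloads onto one shared record shape. This is a minimal sketch, not the skill's own implementation: the raw payload keys (`headline`, `ocr_text`, `scan_url`, `page_title`, `body`, `caption`, `permalink`) and the `normalize` helper are hypothetical names chosen for illustration.

```python
from typing import Callable

# Hypothetical raw payload shapes; real OCR, scraper, and social exports will differ.
def from_ocr(raw: dict) -> dict:
    return {"title": raw["headline"], "text": raw["ocr_text"], "url": raw.get("scan_url", "")}

def from_web(raw: dict) -> dict:
    return {"title": raw["page_title"], "text": raw["body"], "url": raw["url"]}

def from_social(raw: dict) -> dict:
    # Social posts rarely carry a title, so use the first 60 characters of the text.
    text = raw["caption"]
    return {"title": text[:60], "text": text, "url": raw["permalink"]}

# One adapter per source type shown in the diagram.
ADAPTERS: dict[str, Callable[[dict], dict]] = {
    "ocr": from_ocr,
    "web": from_web,
    "social": from_social,
}

def normalize(source_type: str, raw: dict) -> dict:
    """Map a source-specific payload onto the shared record shape."""
    try:
        adapter = ADAPTERS[source_type]
    except KeyError:
        raise ValueError(f"unknown source type: {source_type!r}") from None
    record = adapter(raw)
    record["source_type"] = source_type
    record["processing_status"] = "pending"  # enrichment happens downstream
    return record
```

Keeping the adapters in a lookup table means a new source only needs one new function and one new dictionary entry; everything downstream of `normalize` sees the same keys.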
### Unified schema design
```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional
from enum import Enum


class ContentType(Enum):
    ARTICLE = 'Article'
    VIDEO = 'Video'
    AUDIO = 'Audio'
    SOCIAL = 'Social Post'
    NEWSPAPER = 'Newspaper Article'


class ThematicCategory(Enum):
    PRESS_CRITICISM = 'Press & Media Criticism'
    JOURNALISM_THEORY = 'Journalism Theory'
    POLITICS = 'Politics & Democracy'
    TECHNOLOGY = 'Technology & Digital Media'
    EDUCATION = 'Journalism Education'
    AUDIENCE = 'Audience & Public Engagement'


class HistoricalEra(Enum):
    ERA_1990s = '1990-1999'
    ERA_2000_04 = '2000-2004'
    ERA_2005_09 = '2005-2009'
    ERA_2010_15 = '2010-2015'
    ERA_2016_20 = '2016-2020'
    ERA_2021_25 = '2021-2025'
    ERA_2026_PRESENT = '2026-present'


@dataclass
class ArchiveRecord:
    # Core identifiers
    id: str  # Format: SOURCE-00001
    url: str
    title: str

    # Content
    author: Optional[str] = None
    publication_date: Optional[date] = None
    publication: Optional[str] = None
    content_type: ContentType = ContentType.ARTICLE
    text: str = ''

    # AI-enriched fields
    summary: Optional[str] = None
    pull_quote: Optional[str] = None
    categories: list[ThematicCategory] = field(default_factory=list)
    key_concepts: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    era: Optional[HistoricalEra] = None
    scope: Optional[str] = None  # Theoretical, Commentary, Case Study, etc.

    # Entity references
    entities_mentioned: list[str] = field(default_factory=list)
    related_to: list[str] = field(default_factory=list)
    responds_to: list[str] = field(default_factory=list)

    # Archive metadata
    pdf_url: Optional[str] = None
    transcript_url: Optional[str] = None
    verified: bool = False
    processing_status: str = 'pending'
    last_updated: Optional[date] = None


def generate_record_id(source: str, sequence: int) -> str:
    """Generate unique ID with source prefix."""
    prefixes = {
        'nytimes': 'NYT',
        'columbia journalism review': 'CJR',
        'pressthink': 'PT',
        'twitter': 'TW',
        'youtube': 'YT',
        'newspaper': 'NEWS',
    }
    prefix = prefixes.get(source.lower(), 'MISC')
    return f"{prefix}-{sequence:05d}"
```
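Because the architecture treats Google Sheets as the primary database, each record eventually has to become a flat row of cell values. One possible way to flatten a dataclass record is sketched below. `MiniRecord` is a trimmed-down, hypothetical stand-in for `ArchiveRecord` so the example runs on its own, and the `to_cell`, `to_row`, and `header` helpers and the `'; '` list separator are illustrative assumptions rather than part of the skill.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from enum import Enum
from typing import Optional


class ContentType(Enum):
    ARTICLE = 'Article'
    VIDEO = 'Video'


@dataclass
class MiniRecord:
    """Trimmed-down stand-in for ArchiveRecord (illustration only)."""
    id: str
    title: str
    publication_date: Optional[date] = None
    content_type: ContentType = ContentType.ARTICLE
    tags: list[str] = field(default_factory=list)
    verified: bool = False


def to_cell(value) -> str:
    """Convert one field value to a spreadsheet-friendly string."""
    if value is None:
        return ''
    if isinstance(value, Enum):
        return value.value
    if isinstance(value, bool):
        return 'TRUE' if value else 'FALSE'
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, list):
        # Assumed convention: multi-value fields share one cell, separated by '; '.
        return '; '.join(to_cell(v) for v in value)
    return str(value)


def header(record_cls) -> list[str]:
    """Column names, in dataclass field order."""
    return [f.name for f in fields(record_cls)]


def to_row(record) -> list[str]:
    """Flatten a dataclass record into cells matching header()."""
    return [to_cell(getattr(record, f.name)) for f in fields(record)]


rec = MiniRecord(id='PT-00042', title='Example post',
                 publication_date=date(2010, 11, 10),
                 tags=['objectivity', 'press'])
row = to_row(rec)
# row == ['PT-00042', 'Example post', '2010-11-10', 'Article', 'objectivity; press', 'FALSE']
```

Deriving both the header and the row from `fields()` keeps column order tied to the dataclass definition, so adding a field to the schema extends the sheet layout without a separate column list to maintain.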
## AI-powered categorization
### Taxonomy-based classification
```python
# pip install google-genai
# (the legacy `google-generativeai` SDK is deprecated; the newer
# `google-genai` package is the supported path. Imports below
# use the new shape.)
import json
import os
from typing import Optional

from google import genai
from google.genai import types

# Default to the current Gemini 2.5 family. For 2026 production
# workloads, the Gemini 3 family (gemini-3-flash, gemini-3-pro) is
# also available; bump the model string once you've verified the
# response shape against your taxonomy prompts.
DEFAULT_GEMINI_MODEL = 'gemini-2.5-flash'

# Single shared client; reads GOOGLE_API_KEY (or pass api_key=...).
_client = genai.Client(api_key=os.environ.get('GOOGLE_API_KEY'))

TAXONOMY = {
    "thematic_categories": [
        "Press & Media Criticism",
        "Journalism Theory",
        "Politics & Democracy",
        "Technology & Digital Media",
        "Journalism Education",
        "Audience & Public Engagement"
    ],
    "key_concepts": [
        "The View from Nowhere",
        "Verification vs. Assertion",
        "Citizens vs. Consumers",
        "Public Journalism",
        "The Rosen Test",
        "Savvy vs. Naive",
        "Professional vs. Amateur",
        "Production vs. Distribution",
        "Trust vs. Transparency",
        "Horse Race Coverage",
        "Both Sides Journalism",
        "Audience Atomization",
        "The Church of the Savvy"
    ],
    "scope_types": [
        "Theoretical",
        "Commentary",
        "Historical",
        "Case Study",
        "Pedagogical",
        "Personal Reflection"
    ]
}

class ArchiveCategorizer:
    def __init__(
        self,
        model: str = DEFAULT_GEMINI_MODEL,
        client: Optional[genai.Client] = None,
    ):
        self.model = model
        # Fall back to the shared module-level client when none is injected.
        self.client = client or _client
```