heavy-file-ingestion-codex
Heavy File Ingestion Codex converts bulky documents like PDFs, Word files, and spreadsheets into lightweight markdown or CSV formats before analysis, then generates an index to guide further processing. Use this skill when users ask to read, analyze, or extract data from large structured files, so that model tokens are spent on compressed, pre-processed artifacts rather than on raw document ingestion.
git clone --depth 1 https://github.com/NateBJones-Projects/OB1 /tmp/heavy-file-ingestion-codex && cp -r /tmp/heavy-file-ingestion-codex/skills/heavy-file-ingestion/variants/codex ~/.claude/skills/heavy-file-ingestion-codex
SKILL.md
# Heavy File Ingestion For Codex

## Problem

Codex can run local commands and inspect files, so direct ingestion of bulky documents is usually the wrong move. Convert first, index second, reason last.

## Trigger Conditions

- The user asks to read or summarize a heavyweight document or spreadsheet
- The file is large, structured, or expensive enough that raw ingestion is wasteful
- The task would be better served by markdown, CSV, or a quick file map

## Process

1. Do not open the raw heavyweight file as your first move if a deterministic conversion path exists.
2. Run the bundled converter from this skill directory:

   ```bash
   python scripts/convert_heavy_file.py /absolute/path/to/file.ext
   ```

3. If the environment is clean and needs packages, prefer:

   ```bash
   uv run \
     --with pdfplumber \
     --with python-docx \
     --with python-pptx \
     --with openpyxl \
     python scripts/convert_heavy_file.py /absolute/path/to/file.ext
   ```

4. Read `index.md` first, not the original file.
5. Follow the index recommendation:
   - `read_extracted_artifact`: inspect the generated markdown or CSV
   - `cheap_model_or_stronger_converter`: retry with a better deterministic tool or use a cheaper model on the extracted artifact only
   - `manual_review`: tell the user the deterministic route failed and propose the next cheapest fallback
6. Use expensive model context only after the file has already been compressed into a smaller artifact.

## Client Rules

- Keep the main model out of raw PDFs, decks, and spreadsheets whenever possible.
- Use the generated `.ob1/` folder as the working directory for follow-up analysis.
- For spreadsheets, reason from the CSV per sheet plus the workbook manifest.
- For presentations, reason from the slide outline before asking for a deeper pass.

## Bundled References

- `references/open-source-stack.md` explains the tool choices and fallback tiers.
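
The "follow the index recommendation" step in the Process section can be sketched as a small dispatch table. This is a minimal illustration, not the skill's own code: the three recommendation names come from the skill text, but the layout of `index.md` is not documented here, so the sketch only assumes that the file mentions the recommendation by its literal name. The helper names `find_recommendation` and `next_step`, and the choice to fall back to `manual_review` when nothing is found, are hypothetical.

```python
import re
from typing import Optional

# Recommendation names and actions, taken from the skill's Process section.
NEXT_STEPS = {
    "read_extracted_artifact": "Inspect the generated markdown or CSV.",
    "cheap_model_or_stronger_converter": (
        "Retry with a better deterministic tool, or use a cheaper model "
        "on the extracted artifact only."
    ),
    "manual_review": (
        "Tell the user the deterministic route failed and propose the "
        "next cheapest fallback."
    ),
}


def find_recommendation(index_text: str) -> Optional[str]:
    """Return the earliest known recommendation name found in index_text.

    Assumption: index.md contains the recommendation as a literal token.
    The real file layout may differ.
    """
    pattern = "|".join(re.escape(name) for name in NEXT_STEPS)
    match = re.search(pattern, index_text)
    return match.group(0) if match else None


def next_step(index_text: str) -> str:
    """Map the index recommendation to a one-line instruction."""
    name = find_recommendation(index_text)
    if name is None:
        # Hypothetical design choice: treat a missing recommendation
        # as the most conservative option.
        name = "manual_review"
    return f"{name}: {NEXT_STEPS[name]}"


if __name__ == "__main__":
    # Hypothetical index.md content, for illustration only.
    sample = "# Index\n\nrecommendation: read_extracted_artifact\n"
    print(next_step(sample))
```

Searching for the known names, rather than parsing a specific field, keeps the sketch independent of whatever structure the converter actually writes into `index.md`.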
Use Nate Jones OB1 Agent Memory from OpenClaw with provenance, scope, review, and use-policy discipline.
Continuous learning system that extracts reusable knowledge from work sessions. Triggers: (1) /aiception command, (2) 'save this as a skill' or 'extract a skill from this', (3) 'what did we learn?', (4) after non-obvious debugging or trial-and-error discovery. Creates new skills when valuable reusable knowledge is identified. Integrates with Open Brain to prevent duplicates.
Morning digest of yesterday's Open Brain thoughts, drafted to Gmail
Generate infographic images from any research doc, Open Brain thoughts, or analysis. Auto-chunks content, writes prompts, generates images via Gemini API (free tier), and saves to media/. Use --premium for better text rendering.
Use when processing voice transcripts, brain dumps, stream-of-consciousness notes, or any raw multi-topic capture. Extracts every idea thread, then evaluates each one with deep brainstorming, then captures results to Open Brain. Trigger on transcripts, exports, "process this", "pan for gold", "brain dump", "what did I say", or multi-topic markdown files.