# OpenMythos
<p align="left">
<a href="https://pypi.org/project/open-mythos/" target="_blank">
<picture>
<source srcset="https://img.shields.io/pypi/v/open-mythos?style=for-the-badge&color=3670A0" media="(prefers-color-scheme: dark)">
<img alt="Version" src="https://img.shields.io/pypi/v/open-mythos?style=for-the-badge&color=3670A0">
</picture>
</a>
<a href="https://twitter.com/kyegomezb/">
<picture>
<source srcset="https://img.shields.io/badge/Twitter-Follow-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" media="(prefers-color-scheme: dark)">
<img src="https://img.shields.io/badge/Twitter-Follow-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Twitter">
</picture>
</a>
<a href="https://discord.gg/3keGBK9Pvr" target="_blank">
<picture>
<source srcset="https://img.shields.io/badge/Discord-Join-5865F2?style=for-the-badge&logo=discord&logoColor=white" media="(prefers-color-scheme: dark)">
<img alt="Discord" src="https://img.shields.io/badge/Discord-Join-5865F2?style=for-the-badge&logo=discord&logoColor=white">
</picture>
</a>
<a href="https://pytorch.org" target="_blank">
<picture>
<source srcset="https://img.shields.io/badge/PyTorch-Implemented-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white" media="(prefers-color-scheme: dark)">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-Implemented-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white">
</picture>
</a>
</p>
> **Disclaimer:** OpenMythos is an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic or any of their proprietary systems.
OpenMythos is an open-source, theoretical implementation of the Claude Mythos model. It implements a Recurrent-Depth Transformer (RDT) with three stages: **Prelude** (transformer blocks), a looped **Recurrent Block** (up to `max_loop_iters`), and a final **Coda**. Attention is switchable between MLA and GQA, and the feed-forward uses a sparse MoE with routed and shared experts, making it well suited to exploring compute-adaptive, depth-variable reasoning.
## Installation
```bash
pip install open-mythos
# or: uv pip install open-mythos
```
To enable Flash Attention 2 in `GQAttention` (requires CUDA and build tools):
```bash
pip install open-mythos[flash]
```
## Usage
```python
import torch

from open_mythos.main import OpenMythos, MythosConfig

attn_type = "mla"  # or "gqa"

base = {
    "vocab_size": 1000,
    "dim": 256,
    "n_heads": 8,
    "max_seq_len": 128,
    "max_loop_iters": 4,
    "prelude_layers": 1,
    "coda_layers": 1,
    "n_experts": 8,
    "n_shared_experts": 1,
    "n_experts_per_tok": 2,
    "expert_dim": 64,
    "lora_rank": 8,
    "attn_type": attn_type,
}

# GQA only needs n_kv_heads; MLA additionally takes low-rank KV-compression dims.
if attn_type == "gqa":
    cfg = MythosConfig(**base, n_kv_heads=2)
else:
    cfg = MythosConfig(
        **base,
        n_kv_heads=8,
        kv_lora_rank=32,
        q_lora_rank=64,
        qk_rope_head_dim=16,
        qk_nope_head_dim=16,
        v_head_dim=16,
    )

model = OpenMythos(cfg)
total = sum(p.numel() for p in model.parameters())
print(f"\n[{attn_type.upper()}] Parameters: {total:,}")

# Forward pass: n_loops controls the recurrence depth at inference time.
ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids, n_loops=4)
print(f"[{attn_type.upper()}] Logits shape: {logits.shape}")

out = model.generate(ids, max_new_tokens=8, n_loops=8)
print(f"[{attn_type.upper()}] Generated shape: {out.shape}")

# Stability check: the injection matrix A must be contractive.
A = model.recurrent.injection.get_A()
rho = torch.linalg.eigvals(A).abs().max().item()
print(
    f"[{attn_type.upper()}] Spectral radius ρ(A) = {rho:.4f} (must be < 1)"
)
```
## Model Variants
Pre-configured scales from 1B to 1T parameters:
```python
from open_mythos import (
    mythos_1b,
    mythos_3b,
    mythos_10b,
    mythos_50b,
    mythos_100b,
    mythos_500b,
    mythos_1t,
    OpenMythos,
)

cfg = mythos_3b()  # returns a MythosConfig
model = OpenMythos(cfg)
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}")
```
| Variant | `dim` | Experts | `expert_dim` | Loop iters | Context | Max output |
|---|---|---|---|---|---|---|
| `mythos_1b` | 2048 | 64 | 2048 | 16 | 4k | 4k |
| `mythos_3b` | 3072 | 64 | 4096 | 16 | 4k | 4k |
| `mythos_10b` | 4096 | 128 | 5632 | 24 | 8k | 4k |
| `mythos_50b` | 6144 | 256 | 9728 | 32 | 8k | 4k |
| `mythos_100b` | 8192 | 256 | 13568 | 32 | 1M | 128k |
| `mythos_500b` | 12288 | 512 | 23040 | 48 | 1M | 128k |
| `mythos_1t` | 16384 | 512 | 34560 | 64 | 1M | 128k |
---
## Training
The training script for the 3B model on FineWeb-Edu is at [`training/3b_fine_web_edu.py`](training/3b_fine_web_edu.py).
**Single GPU:**
```bash
python training/3b_fine_web_edu.py
```
**Multi-GPU (auto-detects GPU count):**
```bash
torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") training/3b_fine_web_edu.py
```
Key design choices:
| Feature | Detail |
|---|---|
| Optimizer | AdamW |
| Dataset | `HuggingFaceFW/fineweb-edu` (`sample-10BT` by default, swap to `sample-100BT` or `default` for full run) |
| Tokenizer | `openai/gpt-oss-20b` via `MythosTokenizer` |
| Parallelism | PyTorch DDP via `torchrun`, sharded streaming dataset |
| Precision | bfloat16 on H100/A100, float16 + GradScaler on older GPUs |
| Schedule | Linear warmup (2000 steps) → cosine decay |
| Target | 30B tokens (Chinchilla-style budget, adjusted for the looped architecture) |
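The warmup-then-cosine schedule from the table can be sketched as a pure function of the step index. This is an illustrative sketch only: the 2000-step warmup matches the table, but the peak LR and total-step count below are placeholder assumptions, not values from the training script.

```python
import math

def lr_at(step: int, max_lr: float = 3e-4, warmup: int = 2000, total: int = 100_000) -> float:
    """Linear warmup for `warmup` steps, then cosine decay to zero by `total`."""
    if step < warmup:
        # Ramp linearly up to max_lr over the warmup window.
        return max_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

print(lr_at(1999))  # end of warmup: 0.0003
```

In practice the same shape can be wired into `torch.optim.lr_scheduler.LambdaLR` by dividing `lr_at(step)` by `max_lr`.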
---
## Documentation
| Page | Description |
|---|---|
| [`docs/open_mythos.md`](docs/open_mythos.md) | Full API reference for the `OpenMythos` class — constructor, `forward`, `generate`, all sub-modules, configuration reference, and usage examples |
| [`docs/datasets.md`](docs/datasets.md) | Recommended training datasets with token budget guidance per model size |
---
## The Central Hypothesis
Claude Mythos is suspected to be a **Recurrent-Depth Transformer (RDT)** — also called a Looped Transformer (LT). Rather than stacking hundreds of unique layers, a subset of layers is recycled and run through multiple times per forward pass. Same weights. More loops. Deeper thinking.
This is not chain-of-thought. There is no intermediate token output. All of this reasoning happens **silently, inside a single forward pass**, in continuous latent space.
---
## Architecture
A looped transformer divides its layers into three functional blocks:
```
Input
↓
[Prelude P] — standard transformer layers, run once
↓
[Recurrent Block R] — looped T times
↑_______↓ (hidden state h updated each loop with input injection e)
↓
[Coda C] — standard transformer layers, run once
↓
Output
```
The recurrent block update rule at each loop step t:
```
h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
```
Where:
- `h_t` is the hidden state after loop t
- `e` is the encoded input (from the Prelude), injected at every loop
- `A` and `B` are learned injection parameters
- The Transformer blocks apply attention and MLP as usual
The injection of `e` at every step is what prevents the model from drifting — it keeps the original input signal alive throughout the entire recurrence depth.
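The update rule can be sketched in PyTorch. This is a toy illustration rather than the repository's implementation: `A` and `B` are modeled here as learned per-channel gates (the real injection may use full matrices), and a small MLP `f` stands in for the recurrent transformer block.

```python
import torch
import torch.nn as nn

class RecurrentSketch(nn.Module):
    """Toy version of h_{t+1} = A·h_t + B·e + f(h_t, e)."""

    def __init__(self, dim: int):
        super().__init__()
        # Initialize A inside the unit circle so the recurrence is contractive.
        self.A = nn.Parameter(torch.full((dim,), 0.9))
        self.B = nn.Parameter(torch.ones(dim))
        # Stand-in for the looped transformer block.
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, e: torch.Tensor, n_loops: int) -> torch.Tensor:
        h = torch.zeros_like(e)
        for _ in range(n_loops):
            # Re-injecting e every step keeps the input signal alive.
            h = self.A * h + self.B * e + self.f(h + e)
        return h

block = RecurrentSketch(dim=16)
e = torch.randn(2, 8, 16)        # (batch, seq, dim) output of the Prelude
h = block(e, n_loops=4)
print(h.shape)                   # torch.Size([2, 8, 16])
```

Note that `n_loops` is a runtime argument, not a weight: the same parameters support any recurrence depth.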
The full implementation is in [`open_mythos/main.py`](open_mythos/main.py). See the [`OpenMythos` class reference](docs/open_mythos.md) for a detailed API walkthrough, configuration options, and usage examples.
### Attention Implementations
The attention layer is switchable via `cfg.attn_type`:
| Option | Class | Description |
|---|---|---|
| `"gqa"` | `GQAttention` | Grouped Query Attention (Ainslie et al., 2023) — fewer KV heads than Q heads (`n_kv_heads < n_heads`), reducing KV-cache memory by `n_heads / n_kv_heads`. Uses **Flash Attention 2** (Dao et al., 2023) when `flash-attn>=2.8.3` is installed: GQA is handled natively (no KV head expansion), I/O-bound-optimal, with a transparent fallback to manual scaled dot-product attention when the package is absent. |
| `"mla"` | `MLAttention` | Multi-Latent Attention (DeepSeek-V2) — caches a compressed KV latent (`kv_lora_rank`) rather than full K/V, with split RoPE / no-RoPE head dims for position-aware compression. |
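The KV-cache saving quoted for GQA follows from simple arithmetic. The helper below is an illustrative formula, not an API from this repository:

```python
def kv_cache_bytes(seq_len: int, n_kv_heads: int, head_dim: int,
                   n_layers: int, bytes_per_elem: int = 2) -> int:
    """Bytes to cache K and V for one sequence (leading 2 = K plus V)."""
    return 2 * seq_len * n_kv_heads * head_dim * n_layers * bytes_per_elem

# 8 query heads but only 2 KV heads → cache shrinks by n_heads / n_kv_heads = 4x.
mha = kv_cache_bytes(4096, n_kv_heads=8, head_dim=64, n_layers=24)
gqa = kv_cache_bytes(4096, n_kv_heads=2, head_dim=64, n_layers=24)
print(mha // gqa)  # 4
```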
RoPE is applied to Q and K before caching, so cached values do not need to be re-rotated on retrieval.
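That property follows from the rotation being position-dependent but applied once, before K enters the cache. A minimal standard-RoPE sketch (independent of this repository's code) illustrates it; note that position 0 has rotation angle zero and is left unchanged:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary embedding on x of shape (batch, seq, heads, head_dim).

    head_dim must be even; channel pairs are rotated by position-dependent angles.
    """
    b, s, h, d = x.shape
    half = d // 2
    pos = torch.arange(s, dtype=x.dtype).unsqueeze(-1)           # (s, 1)
    freq = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,)
    angles = pos * freq                                          # (s, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied pairwise to (x1, x2).
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

x = torch.randn(1, 4, 2, 8)
y = apply_rope(x)
print(torch.allclose(y[:, 0], x[:, 0]))  # True: position 0 is unrotated
```

Because cached K already carries its rotation, attention against it at later decode steps needs no correction.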
---
## Why This Explains Mythos
### 1. Systematic Generalization
Vanilla transformers fail to combine knowledge in ways they have never seen during training. Looped transformers pass this test. The ability emerges through a **three-stage grokking process**:
1. Memorization — model fits training distribution
2. In-distribution generalization — model handles known compositions
3. Systematic generalization — model handles novel, out-of-distribution compositions; the ability appears abruptly
This is why Mythos feels qualitatively different from other models on novel questions — the capability phase-transitions in, rather than emerging gradually.
### 2. Depth Extrapolation
Train on 5-hop reasoning chains. Test on 10-hop. Vanilla transformer fails. Looped transformer succeeds — by running more inference-time loops. This maps directly to the observation that Mythos handles deeply compositional problems (multi-step math, long-horizon planning, layered arguments) without explicit chain-of-thought.
More loops at inference = deeper reasoning chains = harder problems solved.
### 3. Latent Thoughts as Implicit Chain-of-Thought
Each loop iteration is the functional equivalent of one step of chain-of-thought, but operating in continuous latent space rather than token space. A looped model running T loops implicitly simulates T steps of CoT reasoning. This has been formally proven (Saunshi et al., 2025).
Furthermore, continuous latent thoughts — unlike discrete token outputs — can encode **multiple alternative next steps simultaneously**. This allows something closer to breadth-first search over the reasoning space, rather than a single committed reasoning path. The model is effectively exploring many possible reasoning paths in parallel before committing to an answer.