High-performance open-source synthetic data engine. Uses LLMs for schema design and vectorized NumPy for deterministic, scalable generation.
MIT licensed. Clone with `git clone https://github.com/rasinmuhammed/misata`.
<div align="center">
<img src="public/logo.png" width="180" alt="Misata" />
# Misata
**Realistic multi-table synthetic data that conforms to the outcome you specify — exact revenue curves, fraud rates, referential integrity, and statistical structure — from a sentence, YAML, or your database. No ML model, no real data.**
[PyPI](https://pypi.org/project/misata/)
[CI](https://github.com/rasinmuhammed/misata/actions)
[License](LICENSE)
[Open in Colab](https://colab.research.google.com/github/rasinmuhammed/misata/blob/main/notebooks/quickstart.ipynb)
[arXiv](https://arxiv.org/abs/2606.08736v1)
[Smithery](https://smithery.ai/servers/misata/misata)
</div>
<!-- mcp-name: io.github.rasinmuhammed/misata -->
---
Most synthetic-data tools learn from a real dataset and imitate it. Misata works the other way: you **declare the outcome you want** ("monthly revenue rises from \$50k to \$200k," "fraud is 3% in Q1 rising to 8% by Q4," "every customer's `total_spent` equals the sum of their orders") and Misata generates individual rows whose aggregates hit those targets **exactly**, with full referential integrity, from no source data at all.
This is *outcome-conformant generation*. The mechanism is formalised in an arXiv preprint ([2606.08736](https://arxiv.org/abs/2606.08736v1)): a closed-form method that satisfies declared aggregates to \$0.00 error, where off-the-shelf imitation synthesisers trained on the same data miss by 74–86%. Every run can also emit an **Oracle report**, a proof bundle covering referential integrity, constraints, temporal consistency, and reproducibility.
It generates from a plain-English description, a YAML schema, or an existing database schema. No machine-learning model is required. No real data is needed.
Built for:
- **Database seeding** — fill dev and staging environments with production-like data
- **Integration tests** — relational fixtures with FK integrity across every table
- **Demos and prototypes** — realistic numbers, names, and distributions, no PII
- **BI and dashboard development** — data shaped like your real domain before launch
- **Statistical method validation** — longitudinal, grouped, and multi-site datasets that pass mixed-effects models, ICC tests, and autocorrelation checks
---
## Research
Misata's exact-aggregate engine is backed by an arXiv preprint:
> **Declarative Outcome-Conformant Synthesis: Exact, Closed-Form Specification Satisfaction and a Conformance Benchmark**
> Muhammed Rasin — arXiv:2606.08736 (2026)
> [https://arxiv.org/abs/2606.08736v1](https://arxiv.org/abs/2606.08736v1)
The paper formalises the core claim: when you declare `"SaaS MRR from $50k in January to $200k in December"`, Misata generates individual transactions whose monthly totals match the declared curve **to exactly $0.00 error** — not approximately, but provably, via a closed-form Gamma conditional-sum mechanism (Lukacs' characterisation). Off-the-shelf imitation synthesisers trained on the very same data miss the declared monthly aggregate by 74–86%; Misata reaches exactly 0.
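To make the idea concrete, here is a minimal sketch of one way to hit a declared total exactly. It is an illustration under stated assumptions, not the paper's algorithm or Misata's code: the function name `amounts_with_exact_total`, the Gamma shape of 2.0, the 400 transactions per month, and the largest-remainder rounding to whole cents are all choices made for this example. It relies on a standard fact: for independent Gamma draws with a common scale, the normalised shares are independent of their sum (Lukacs' characterisation), so the shares can be rescaled to any declared total.

```python
import numpy as np


def amounts_with_exact_total(total_cents: int, n: int, shape: float = 2.0, seed: int = 0) -> np.ndarray:
    """Return n non-negative integer cent amounts that sum to total_cents exactly."""
    rng = np.random.default_rng(seed)
    draws = rng.gamma(shape, 1.0, size=n)       # independent Gamma variates, common scale
    raw = draws / draws.sum() * total_cents     # shares rescaled to the declared total
    cents = np.floor(raw).astype(np.int64)      # round every amount down to whole cents
    shortfall = total_cents - int(cents.sum())  # cents lost by rounding down
    assert 0 <= shortfall <= n, "rounding correction out of range"
    # Hand the missing cents to the rows with the largest fractional parts.
    order = np.argsort(raw - cents)[::-1]
    cents[order[:shortfall]] += 1
    return cents


# Hypothetical declared curve: $50,000.00 rising linearly to $200,000.00, in cents.
targets = np.linspace(5_000_000, 20_000_000, 12).round().astype(np.int64)
months = [amounts_with_exact_total(int(t), n=400, seed=i) for i, t in enumerate(targets)]
errors = [int(m.sum()) - int(t) for m, t in zip(months, targets)]
print(errors)
```

Working in integer cents is the design choice that matters here: floating-point rescaling alone would leave sub-cent residue, while the integer correction step makes each monthly error zero by construction.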
The paper also introduces **SpecBench** — the first benchmark measuring conformance to analytical outcomes for cold-start relational synthesis. Misata is the reference implementation.
```bibtex
@article{rasin2026declarative,
  title  = {Declarative Outcome-Conformant Synthesis: Exact, Closed-Form
            Specification Satisfaction and a Conformance Benchmark},
  author = {Rasin, Muhammed},
  year   = {2026},
  url    = {https://arxiv.org/abs/2606.08736v1}
}
```
---
## Install
```bash
pip install misata
```
Optional extras:
```bash
pip install "misata[llm]" # multi-provider LLM schema generation
pip install "misata[documents]" # PDF output via weasyprint
pip install "misata[advanced]" # SDV/CTGAN statistical synthesis
pip install "misata[mcp]" # MCP server — expose Misata to Claude, Cursor, and other AI agents
```
---
## Use Misata from Claude / Cursor / Windsurf (MCP)
Misata ships a built-in [Model Context Protocol](https://modelcontextprotocol.io) server with a clear division of labour: **the AI agent designs the schema, Misata guarantees the math.** Agents are good at knowing that a veterinary clinic needs a `species` column; Misata is good at making 50 000 rows where every foreign key resolves, every roll-up reconciles to the cent, and the same seed reproduces byte-identical output. The primary tool, `generate_from_schema`, accepts the agent's schema dict and returns the data **plus an integrity proof** — per-relationship orphan counts the agent can show you.
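The two guarantees named above (every foreign key resolves, every roll-up reconciles) can be checked independently of Misata with plain pandas. The sketch below is illustrative only: the `customers` and `orders` tables, their column names, and the `orphan_count` helper are made up for the example and do not represent the MCP tool's output format.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in tables; a real run would get these from the generator.
customers = pd.DataFrame({"customer_id": np.arange(1, 101)})
orders = pd.DataFrame({
    "order_id": np.arange(1, 501),
    "customer_id": rng.choice(customers["customer_id"].to_numpy(), size=500),
    "amount_cents": rng.integers(100, 50_000, size=500),
})


def orphan_count(child: pd.DataFrame, fk: str, parent: pd.DataFrame, pk: str) -> int:
    """Number of child rows whose foreign key has no matching parent key."""
    return int((~child[fk].isin(parent[pk])).sum())


# Roll-up: each customer's total is the sum of that customer's orders.
totals = orders.groupby("customer_id")["amount_cents"].sum()
customers["total_spent_cents"] = (
    customers["customer_id"].map(totals).fillna(0).astype("int64")
)

orphans = orphan_count(orders, "customer_id", customers, "customer_id")
reconciles = int(customers["total_spent_cents"].sum()) == int(orders["amount_cents"].sum())
print(orphans, reconciles)
```

Amounts are kept as integer cents so the reconciliation is an exact equality rather than a floating-point tolerance check.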
**1. Install:**
```bash
pip install "misata[mcp]"
```
**2. Add to Claude Desktop** (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "misata": {
      "command": "misata-mcp"
    }
  }
}
Restart Claude Desktop. Then just ask:
> *"Generate a fintech dataset with 1 000 customers, payments, and a 2% fraud rate."*
> *"Design a clinical-trials database — sites, patients, visits, adverse events — and generate 100k rows."*
> *"I need SaaS data: MRR from $50k in January, doubled by December, with a Q3 slump."*
The agent designs whatever tables the request needs (any domain — it isn't limited to Misata's built-ins), calls Misata, writes CSVs to disk, and reports back with previews and the verified integrity summary. See the [MCP guide](docs/guides/mcp.md) for Cursor/Windsurf/Zed setup and all six available tools.
---
## Quick start
```bash
misata generate \
  --story "Brazilian fintech with R$ payments, CPF verification, and 3% fraud" \
  --rows 1000 \
  --output-dir ./demo_data

# Writes CSVs plus:
#   ./demo_data/oracle_report.json
```python
import misata
# One sentence → multi-table DataFrame dict
tables = misata.generate("A SaaS company with 5k users, monthly subscriptions, and 20% churn")
print(tables["users"].head())
print(tables["subscriptions"].head())
```
```bash
# Or from the CLI
misata generate --story "A SaaS company with 5k users and 20% churn" --rows 5000
```
## Misata Oracle
The Oracle report is Misata's proof layer. It separates hard guarantees from advisory realism checks so generated data can be trusted in CI, demos, notebooks, and research comparisons.
Guaranteed checks:
- referential integrity across configured relationships
- requested row-count fulfillment
- schema validation and configured constraints
- deterministic reproducibility when a seed is set
Advisory checks:
- quality score and plausibility warnings
- privacy heuristics
- schema-vs-output fidelity score
- locale/domain fit for countries, cities, phone prefixes, and national IDs
- data-card metadata
```python
import misata
schema = misata.parse("Brazilian fintech with CPF verification", rows=1000)
tables = misata.generate_from_schema(schema)
oracle = misata.build_oracle_report(tables, schema, seed=schema.seed)
print(oracle["passed"])
print(oracle["advisory"]["locale_domain_fit"]["locale"])
```
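The reproducibility guarantee can also be spot-checked from the outside by hashing the serialized output of two runs that use the same seed. This is a sketch, not part of the Oracle API: `make_tables` is a hypothetical stand-in generator built on NumPy so the example runs without Misata, and `fingerprint` is a helper defined here.

```python
import hashlib

import numpy as np
import pandas as pd


def make_tables(seed: int) -> dict[str, pd.DataFrame]:
    """Hypothetical stand-in for a seeded generator that returns named tables."""
    rng = np.random.default_rng(seed)
    users = pd.DataFrame({"user_id": np.arange(1, 51), "score": rng.normal(size=50).round(4)})
    events = pd.DataFrame({"user_id": rng.integers(1, 51, size=200)})
    return {"users": users, "events": events}


def fingerprint(tables: dict[str, pd.DataFrame]) -> str:
    """SHA-256 over every table's CSV bytes, visiting tables in name order."""
    digest = hashlib.sha256()
    for name in sorted(tables):
        digest.update(name.encode("utf-8"))
        digest.update(tables[name].to_csv(index=False).encode("utf-8"))
    return digest.hexdigest()


same = fingerprint(make_tables(7)) == fingerprint(make_tables(7))
different = fingerprint(make_tables(7)) != fingerprint(make_tables(8))
print(same, different)
```

To apply the same check to real output, swap `make_tables` for the actual generation call and compare fingerprints across two CI runs.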
---
## Mimic mode — clone any CSV in one call
Point `misata.mimic()` at a real dataset and get a synthetic twin that matches every column's distributions but contains none of the original rows. No schema authoring, no config.
```python
import pandas as pd
import misata
real = pd.read_csv("titanic.csv")
twin = misata.mimic(real, rows=2000, seed=42, table_name="passengers")["passengers"]
```
The profiler handles the columns that break other tools:
- **Alphanumeric code columns** (Ticket `"A/5 21171"`, Cabin `"C85"`, SKUs, reference numbers) are detected by their character-class shape and reproduced structurally — same shapes in the right proportions, entirely new values, zero verbatim leak from the source. They no longer fall through to prose text generation.
- **Floats keep their cents.** A Fare of `7.25` generates as `7.25`-shaped values. The profiler infers decimal places from the data; semantic quantization (charm pricing) never fires on mimicked columns.
- **Distributions are fit from the data.** Skewed-positive columns get lognormal; constant columns get a uniform stub; everything else gets normal. Categorical columns with fewer than 50 values carry their real frequencies.
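The "character-class shape" idea from the first bullet can be sketched as follows. This is an assumption about how such detection could work, not Misata's profiler: `shape_of` and `synthesize_codes` are hypothetical helpers, and mapping uppercase letters to `A`, lowercase letters to `a`, and digits to `9` is a convention chosen for the example.

```python
import random
import string
from collections import Counter


def shape_of(value: str) -> str:
    """Collapse a code to its character-class mask, e.g. 'C85' -> 'A99'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep separators such as '/', ' ', '-'
    return "".join(out)


def synthesize_codes(samples: list[str], n: int, seed: int = 0) -> list[str]:
    """Draw n new codes whose masks follow the observed mask frequencies."""
    rng = random.Random(seed)
    masks = Counter(shape_of(s) for s in samples)
    chosen = rng.choices(list(masks), weights=list(masks.values()), k=n)
    fill = {"9": string.digits, "A": string.ascii_uppercase, "a": string.ascii_lowercase}
    return ["".join(rng.choice(fill[c]) if c in fill else c for c in m) for m in chosen]


tickets = ["A/5 21171", "C85", "C123", "PC 17599"]
print([shape_of(t) for t in tickets])
print(synthesize_codes(tickets, n=3))
```

This sketch preserves shapes and their proportions but does not by itself rule out regenerating a source value by chance; a stricter version would also reject any output that appears in the original column.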
```python
# Verify: no verbatim rows can leak through
shared = [c for c in real.columns if c in twin.columns]
overlap = pd.merge(real[shared].astype(str), twin[shared].astype(str), how="inner")
assert len(overlap) == 0
```
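The distribution rule from the profiler list (lognormal for skewed-positive columns, a stub for constants, normal otherwise) and the decimal-place inference can be sketched as below. The skewness threshold of 1.0, the function names, the returned labels, and the string-based decimal count are assumptions for illustration; the source does not state Misata's actual thresholds.

```python
import numpy as np


def pick_distribution(values, skew_threshold: float = 1.0) -> str:
    """Choose a family label for a numeric column using sample skewness."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return "uniform_stub"  # constant column: nothing to fit
    skew = float(np.mean(((values - values.mean()) / std) ** 3))
    if values.min() > 0 and skew > skew_threshold:
        return "lognormal"
    return "normal"


def decimal_places(values, max_places: int = 6) -> int:
    """Largest number of decimals needed by any value, capped at max_places."""
    places = 0
    for v in values:
        text = f"{float(v):.{max_places}f}".rstrip("0")
        places = max(places, len(text.split(".")[1]))
    return places


rng = np.random.default_rng(0)
fares = np.round(rng.lognormal(mean=3.0, sigma=1.0, size=2000), 2)
fares = fares[fares > 0]  # keep the column strictly positive after rounding
print(pick_distribution(fares), decimal_places([7.25, 71.2833, 8.05]))
```

Generated values would then be rounded to the inferred number of places so that a column of prices keeps its cents.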
---
## Six ways to generate data
### 1. Plain English, no config required
```python
tables = misata.generate("A fintech startup with 10k customers, fraud rate 3%, and IBAN accounts")
```
Misata reads the story, infers domain (fintech), scale (10 000 rows), and column semantics (fraud flag, IBAN format) — no schema authoring needed.
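A declared rate such as "fraud rate 3%" can be met exactly, rather than only in expectation, by fixing the number of flagged rows before shuffling their positions. This is a sketch of that general idea with a hypothetical helper name, `exact_rate_flags`; it does not describe how Misata assigns flags internally.

```python
import numpy as np


def exact_rate_flags(n_rows: int, rate: float, seed: int = 0) -> np.ndarray:
    """Boolean flags with exactly round(n_rows * rate) True values."""
    rng = np.random.default_rng(seed)
    n_true = int(round(n_rows * rate))
    flags = np.zeros(n_rows, dtype=bool)
    flags[:n_true] = True
    rng.shuffle(flags)  # random positions, fixed count
    return flags


is_fraud = exact_rate_flags(10_000, 0.03, seed=42)
print(int(is_fraud.sum()), float(is_fraud.mean()))
```

Independent per-row coin flips at 3% would land near 300 flagged rows but rarely on it; fixing the count first removes that sampling noise.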
### 2. YAML schema-as-code, commit it to git
```bash
misata init # scaffolds misata.yaml in the current directory
misata generate # reads misata.yaml automatically
```
```yaml
# misata.yaml
name: my-app
seed: 42
tables:
  users:
    rows: 1000
    columns:
      user_id: { type: int, unique: true }
      email: { type: text, text_type: email }
      plan: { type: categorical, choices: [free, pro, enterprise] }
  orders:
    rows: 5000
    columns:
      order_id: { type: int, unique: true }
      user_id: { type: foreign_key }
      amount: {
```