matlab-use-duckdb

This skill generates MATLAB code for DuckDB database operations using Database Toolbox, enabling SQL-based analytics on CSV, Parquet, JSON, and Excel files without external server configuration. Use it when connecting to DuckDB databases, querying files with SQL, preprocessing out-of-memory data, or replacing MATLAB file I/O bottlenecks like readtable and parquetread with direct DuckDB reads.

Repository: matlab-agentic-toolkit

Install in Claude Code

git clone --depth 1 https://github.com/matlab/matlab-agentic-toolkit /tmp/matlab-use-duckdb && cp -r /tmp/matlab-use-duckdb/skills-catalog/reporting-and-database-access/matlab-use-duckdb ~/.claude/skills/matlab-use-duckdb

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# MATLAB Database Toolbox Interface to DuckDB

Use when working with DuckDB databases from MATLAB using Database Toolbox. DuckDB is an embedded analytical database engine that ships with Database Toolbox starting in R2026a. It enables SQL-based analytics on files, out-of-memory data preprocessing, and portable development databases — all without external database server configuration.

## When to Use This Skill

- Connecting to a DuckDB database (in-memory or file-based)
- Creating a new DuckDB database file for development workflows
- Querying CSV, Parquet, or JSON files directly with SQL
- Preprocessing large data that doesn't fit in memory before importing into MATLAB
- Using DuckDB as an analytical engine for filtering, aggregation, joins, or sorting
- Installing and using DuckDB extensions
- Converting existing MATLAB array/table operations into equivalent DuckDB SQL queries
- Optimizing a data pipeline that reads files into MATLAB memory before processing
- Replacing MATLAB file I/O bottlenecks (`readtable`, `readmatrix`, `parquetread`, `xlsread`, `csvread`) with direct DuckDB file reads
- User mentions keywords: DuckDB, duckdb, analytical engine, embedded database, parquet, CSV analytics, in-memory database, portable database, development database, out-of-memory preprocessing, optimize data pipeline, convert to SQL, replace readtable, replace parquetread, file bottleneck

## When NOT to Use

- Connecting to MySQL, PostgreSQL, SQLite, or other external databases — use their native interfaces or JDBC/ODBC
- Data fits in memory and only needs standard MATLAB operations — use `readtable`/`readmatrix` directly
- Object-relational mapping — use ORM (`ormread`/`ormwrite` with `Mappable` classes)
- MongoDB, Cassandra, or Neo4j — use their dedicated Database Toolbox interfaces

## On-Load Protocol

When this skill is loaded into a session where code already exists:

1. **Audit the data pipeline** — Identify how data enters the workflow:
   - Is data read via MATLAB I/O (`readtable`, `readmatrix`, `readcell`, `readtimetable`, `parquetread`, `xlsread`, `csvread`)?
   - Is data then written to DuckDB with `sqlwrite` before querying?
   - If both are true, this is the **load-then-query anti-pattern**. DuckDB can likely read the source file directly via `read_csv`, `read_parquet`, or `read_xlsx` (which requires the excel extension).

2. **Evaluate each data source** against the decision framework:
   - Can DuckDB read this file type directly? (CSV, Parquet, JSON, Excel via extension)
   - Can filtering/aggregation be pushed into the SQL read?
   - Is the MATLAB I/O step a performance bottleneck?

3. **Recommend architectural changes** — Do not limit review to API correctness. Propose replacing `readtable`/`xlsread` + `sqlwrite` + query chains with `fetch(conn, "SELECT ... FROM read_csv/read_parquet/read_xlsx(...)")`.

The highest-value patterns in this skill are architectural: pushing file analytics down into DuckDB eliminates entire pipeline stages (the MATLAB read and the `sqlwrite` load) and can yield speedups of 10x or more.
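
The difference between the load-then-query chain and a pushed-down file query can be sketched as follows. This is a minimal illustration, not a benchmark: the file `sales.csv` and its `region` and `amount` columns are hypothetical and are created here only so the sketch runs on its own. It uses only `duckdb`, `fetch`, and `close` as described in this skill.

```matlab
% Hypothetical sample file so the sketch is self-contained.
sales = table(["east"; "west"; "east"], [10.5; 20.25; 5.5], ...
    VariableNames=["region", "amount"]);
writetable(sales, "sales.csv");

conn = duckdb();   % in-memory connection; nothing is persisted

% Anti-pattern, shown for comparison only:
%   T = readtable("sales.csv");     % full file enters MATLAB memory
%   sqlwrite(conn, "sales", T);     % then gets copied into DuckDB
%   totals = fetch(conn, "SELECT region, SUM(amount) AS total FROM sales GROUP BY region");

% Pushdown: DuckDB reads the file and returns only the aggregated rows.
totals = fetch(conn, ...
    "SELECT region, SUM(amount) AS total " + ...
    "FROM read_csv('sales.csv') GROUP BY region ORDER BY region");

close(conn);
```

Only the grouped result crosses into the MATLAB workspace, which is where the savings come from when the source file is large.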

## What Is DuckDB and Why Does Database Toolbox Ship It?

DuckDB is an embedded, serverless analytical database engine. Unlike MySQL or PostgreSQL, it requires no server, no configuration, and runs in-process within MATLAB.

**Why it ships with Database Toolbox (R2026a+):**
- **Zero-config database** — `conn = duckdb()` gives you a full SQL engine instantly.
- **Analytical engine for files** — Query CSV, Parquet, and JSON files directly with SQL without loading them into memory.
- **Out-of-memory preprocessing** — Filter, aggregate, join, and sort datasets larger than memory, then bring only results into MATLAB.
- **Portable development databases** — `.duckdb` or `.db` files work on any machine with Database Toolbox. No database setup needed.
- **AI agent advantage** — An agent's SQL knowledge directly translates to powerful analytical queries.

DuckDB does **NOT replace** MATLAB's file I/O (`readtable`, etc.). It is a performant alternative when data exceeds memory or SQL operations are more natural than MATLAB table operations.

## Critical Rules

### Connection
- **ALWAYS** use `duckdb()` to connect — not `database()`, not JDBC, not ODBC.
- **ALWAYS** verify with `isopen(conn)` and close with `close(conn)`.
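
A minimal sketch of the connection lifecycle these rules describe, assuming an in-memory database. The error identifier `duckdbExample:connectionFailed` and the trivial query are illustrative choices, not part of the toolbox.

```matlab
conn = duckdb();   % in-memory DuckDB database

% Verify the connection before using it.
if ~isopen(conn)
    error("duckdbExample:connectionFailed", "DuckDB connection did not open.");
end

try
    % Trivial query that exercises the connection.
    answer = fetch(conn, "SELECT 42 AS answer");
catch err
    close(conn);   % release the connection before propagating the error
    rethrow(err);
end

close(conn);
```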

### API Surface
- All standard functions work: `sqlread`, `fetch`, `execute`, `sqlwrite`, `sqlfind`, `sqlinnerjoin`, `sqlouterjoin`, `commit`, `rollback`.
- DuckDB does **NOT** support `databasePreparedStatement`. Use `execute` or `sqlwrite` instead.
- Use `ExcludeDuplicates` via `databaseImportOptions` when reading from database tables (with `sqlread`). For direct file queries (`read_csv`/`read_parquet` via `fetch`), use `SELECT DISTINCT` in SQL.

### File Queries
- **ALWAYS** use `fetch` (not `sqlread`) for file queries — they require SQL syntax like `SELECT * FROM read_csv('file.csv')`.
- **ALWAYS** use single quotes for file paths inside SQL: `read_csv('data.csv')`.
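
The quoting and `fetch` rules above can be sketched like this. The file `readings.csv` and its `sensor` and `value` columns are hypothetical, created here only to keep the example self-contained. The query also uses `SELECT DISTINCT`, as the API Surface rules suggest for direct file queries.

```matlab
% Hypothetical sample file so the sketch is self-contained.
readings = table([1; 1; 2], [0.5; 0.5; 1.5], VariableNames=["sensor", "value"]);
csvPath = "readings.csv";
writetable(readings, csvPath);

conn = duckdb();

% MATLAB double quotes on the outside, SQL single quotes around the file path.
query = "SELECT DISTINCT sensor, value FROM read_csv('" + csvPath + "') " + ...
        "WHERE value > 0 ORDER BY sensor";

% fetch, not sqlread, because this is a SQL statement rather than a table name.
rows = fetch(conn, query);

close(conn);
```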

## Decision Framework

> Which connection mode should I use?

| Goal | Connection | Why |
|------|-----------|-----|
| Analytical queries on files | `duckdb()` | No persistence needed; query files directly |
| Temporary workspace | `duckdb()` | Fast, discarded on close |
| Portable development database | `duckdb("mydata.duckdb")` | Creates a `.duckdb` or `.db` file; works on any machine |
| Open existing database | `duckdb("existing.db")` | Read/write access to pre-existing `.db` or `.duckdb` file |
| Read-only shared database | `duckdb("shared.duckdb", ReadOnly=true)` | Prevents accidental writes |
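
As a rough sketch of the file-based and read-only modes in the table, assuming the hypothetical file name `mydata.duckdb` and a hypothetical `readings` table, and assuming that writes through `sqlwrite` are persisted by the time the connection is closed:

```matlab
dbFile = "mydata.duckdb";
if isfile(dbFile)
    delete(dbFile);   % start clean so reruns do not append duplicate rows
end

% Read/write connection: creates the database file if it does not exist.
conn = duckdb(dbFile);
readings = table([1; 2], [0.5; 1.5], VariableNames=["id", "value"]);
sqlwrite(conn, "readings", readings);   % write the MATLAB table to the database
close(conn);

% Read-only connection: suitable for shared files, prevents accidental writes.
ro = duckdb(dbFile, ReadOnly=true);
stored = sqlread(ro, "readings");
close(ro);
```

Here `sqlread` is appropriate because `readings` is a database table, not a file query.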

> When should I use DuckDB vs. MATLAB file I/O?

| Scenario | Recommendation |
|----------|---------------|
| Small data, simple operations | `readtable` / `readmatrix` |
| Data exceeds memory, needs filtering/aggregation | DuckDB (preprocess in SQL, analyze in MATLAB) |
| Query across multiple CSV/Parquet files | DuckDB with glob patterns |
| Portable development database | DuckDB file-based connection |
| MATLAB-specific analysis (signal processing, ML) | Preprocess in DuckDB, analyze in MATLAB |
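
The glob-pattern row in the table above can be illustrated with a short sketch. The folder `duckdb_glob_demo` and the `part_*.csv` files are hypothetical and are created here only so the example runs on its own.

```matlab
% Hypothetical sample files so the sketch is self-contained.
dataDir = "duckdb_glob_demo";
if ~isfolder(dataDir)
    mkdir(dataDir);
end
writetable(table([1; 2], [0.5; 1.5], VariableNames=["id", "value"]), ...
    fullfile(dataDir, "part_1.csv"));
writetable(table([3; 4], [2.5; 3.5], VariableNames=["id", "value"]), ...
    fullfile(dataDir, "part_2.csv"));

conn = duckdb();

% The glob matches both files; DuckDB scans them as a single relation
% and applies the filter before anything reaches MATLAB.
combined = fetch(conn, ...
    "SELECT id, value FROM read_csv('duckdb_glob_demo/part_*.csv') " + ...
    "WHERE value > 1 ORDER BY id");

close(conn);
```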
