Skill102 estrellas del repoactualizado today
developing-incremental-models
|
Instalar en Claude Code
Copiargit clone --depth 1 https://github.com/AltimateAI/data-engineering-skills /tmp/developing-incremental-models && cp -r /tmp/developing-incremental-models/skills/dbt/developing-incremental-models ~/.claude/skills/developing-incremental-modelsDespués abre una sesión nueva de Claude Code; el skill carga automáticamente.
Definición
SKILL.md
# dbt Incremental Model Development
**Choose the right strategy. Design the unique_key carefully. Handle edge cases.**
## When to Use Incremental
| Scenario | Recommendation |
|----------|----------------|
| Source data < 10M rows | Use `table` (simpler, full refresh is fast) |
| Source data > 10M rows | Consider `incremental` |
| Source data updated in place | Use `incremental` with `merge` strategy |
| Append-only source (logs, events) | Use `incremental` with `append` strategy |
| Partitioned warehouse data | Use `insert_overwrite` if supported |
**Default to `table` unless you have a clear performance reason for incremental.**
## Critical Rules
1. **ALWAYS test with `--full-refresh` first** before relying on incremental logic
2. **ALWAYS verify unique_key is truly unique** in both source and target
3. **If merge fails 3+ times**, check unique_key for duplicates
4. **Run full refresh periodically** to prevent data drift
## Workflow
### 1. Confirm Incremental is Needed
```bash
# Check source table size
dbt show --inline "select count(*) from {{ source('schema', 'table') }}"
```
If count < 10 million, consider using `table` instead. Incremental adds complexity.
### 2. Understand the Source Data Pattern
Before choosing a strategy, answer:
- **Is data append-only?** (new rows added, never updated)
- **Are existing rows updated?** (need merge/upsert)
- **Is there a reliable timestamp?** (for filtering new data)
- **What's the unique identifier?** (for merge matching)
```bash
# Check for timestamp column
dbt show --inline "
select
min(updated_at) as earliest,
max(updated_at) as latest,
count(distinct date(updated_at)) as days_of_data
from {{ source('schema', 'table') }}
"
```
### 3. Choose the Right Strategy
| Strategy | Use When | How It Works |
|----------|----------|--------------|
| `append` | Data is append-only, no updates | INSERT only, no deduplication |
| `merge` | Data can be updated | MERGE/UPSERT by unique_key |
| `delete+insert` | Data updated in batches | DELETE matching rows, then INSERT |
| `insert_overwrite` | Partitioned tables (BigQuery, Spark) | Replace entire partitions |
**Default:** `merge` is safest for most use cases.
**Note:** Strategy availability varies by adapter. Check the [dbt incremental strategy docs](https://docs.getdbt.com/docs/build/incremental-strategy) for your specific warehouse.
### 4. Design the Unique Key
**CRITICAL: unique_key must be truly unique in your data.**
```bash
# Verify uniqueness BEFORE creating model
dbt show --inline "
select {{ unique_key_column }}, count(*)
from {{ source('schema', 'table') }}
group by 1
having count(*) > 1
limit 10
"
```
If duplicates exist:
- Add more columns to make composite key
- Add deduplication logic in model
- Use `delete+insert` instead of `merge`
### 5. Write the Incremental Model
```sql
{{
config(
materialized='incremental',
incremental_strategy='merge', -- or append, delete+insert
unique_key='id', -- MUST be unique
on_schema_change='append_new_columns' -- handle new columns
)
}}
select
id,
column_a,
column_b,
updated_at
from {{ source('schema', 'table') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```
### 6. Build with Full Refresh First
**ALWAYS verify with full refresh before trusting incremental logic.**
```bash
# First run: full refresh to establish baseline
dbt build --select <model_name> --full-refresh
# Verify output
dbt show --select <model_name> --limit 10
dbt show --inline "select count(*) from {{ ref('model_name') }}"
```
### 7. Test Incremental Logic
```bash
# Run incrementally (no --full-refresh)
dbt build --select <model_name>
# Verify row count changed appropriately
dbt show --inline "select count(*) from {{ ref('model_name') }}"
```
### 8. Handle Schema Changes
Set `on_schema_change` based on your needs:
| Setting | Behavior |
|---------|----------|
| `ignore` (default) | New columns in source are ignored |
| `append_new_columns` | New columns added to target |
| `sync_all_columns` | Target schema matches source exactly |
| `fail` | Error if schema changes |
## Common Incremental Problems
### Problem: Merge Fails with Duplicate Key
**Symptom:** "Cannot MERGE with duplicate values"
**Cause:** Multiple rows with same unique_key in source or target.
**Fix:**
```sql
-- Add deduplication using a CTE (cross-database compatible)
with deduplicated as (
select *,
row_number() over (partition by id order by updated_at desc) as rn
from {{ source('schema', 'table') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
)
select * from deduplicated where rn = 1
```
### Problem: No Partition Pruning (Full Table Scan)
**Symptom:** Incremental runs take as long as full refresh.
**Cause:** Dynamic date filter prevents partition pruning.
**Fix:**
```sql
{% if is_incremental() %}
-- Use static date instead of subquery for partition pruning
where updated_at >= {{ dbt.dateadd('day', -3, dbt.current_timestamp()) }}
and updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```
### Problem: Late-Arriving Data is Missed
**Symptom:** Some records never appear in incremental model.
**Cause:** Filtering by max(updated_at) misses late arrivals.
**Fix:** Use a lookback window with a fixed offset from current date:
```sql
{% if is_incremental() %}
-- Lookback 3 days to catch late-arriving data
where updated_at >= {{ dbt.dateadd('day', -3, dbt.current_timestamp()) }}
{% endif %}
```
Alternatively, use a variable for the lookback period:
```sql
{% set lookback_days = 3 %}
{% if is_incremental() %}
where updated_at >= {{ dbt.dateadd('day', -lookback_days, dbt.current_timestamp()) }}
{% endif %}
```
### Problem: Schema Drift Causes Errors
**Symptom:** "Column X not found" after source adds column.
**Fix:** Set `on_schema_change='appDel mismo repositorio
New Skill ProposalSkill
altimate-codeSkill
Delegates data engineering tasks to altimate-code, a specialized CLI agent with 100+ purpose-built data tools — SQL analysis, column-level lineage, dbt build/test/run, warehouse profiling, FinOps, and connectivity to Snowflake, BigQuery, Redshift, Databricks, Postgres, MySQL, DuckDB. Use this skill when the task needs live warehouse access, column lineage, multi-step data exploration, dbt builds against a real warehouse, or when the user explicitly invokes "altimate", "altimate-code", or "the data agent".
creating-dbt-modelsSkill
|
debugging-dbt-errorsSkill
|
documenting-dbt-modelsSkill
|
migrating-sql-to-dbtSkill
|
refactoring-dbt-modelsSkill
|
testing-dbt-modelsSkill
|