data-wrangling
Data cleaning, transformation, reshaping, joins, missing data handling, and tidy data principles. Covers the full pipeline from raw ingestion to analysis-ready datasets -- type coercion, deduplication, outlier detection, normalization, melting/pivoting, regex extraction, and reproducible transformation chains. Use when preparing, cleaning, or transforming data for analysis.
git clone --depth 1 https://github.com/Tibsfox/gsd-skill-creator /tmp/data-wrangling && cp -r /tmp/data-wrangling/examples/skills/data-science/data-wrangling ~/.claude/skills/data-wrangling
SKILL.md
# Data Wrangling

Data wrangling is the work that sits between raw data and analysis -- the unglamorous, indispensable practice of making data trustworthy. Estimates vary, but practitioners consistently report that 60-80% of analysis time is spent wrangling. This skill covers the principles and techniques of data cleaning, transformation, reshaping, and integration, grounded in Hadley Wickham's tidy data framework and extended to the realities of messy real-world datasets.

**Agent affinity:** tukey (EDA-driven cleaning), nightingale (routing wrangling tasks)

**Concept IDs:** data-data-sources, data-data-quality, data-sampling-bias

## The Wrangling Pipeline

| Stage | Goal | Key operations |
|---|---|---|
| 1. Ingestion | Get data into a working environment | Read CSV/JSON/Parquet/SQL, handle encodings, parse dates |
| 2. Profiling | Understand what you have | Shape, dtypes, nulls, distributions, cardinality |
| 3. Cleaning | Fix structural problems | Dedup, type coercion, standardize categories, fix encodings |
| 4. Missing data | Handle gaps | Detect patterns (MCAR/MAR/MNAR), impute or flag |
| 5. Transformation | Derive analysis-ready features | Normalize, bin, log-transform, create indicators |
| 6. Reshaping | Match the analysis structure | Melt, pivot, tidy form, denormalize |
| 7. Integration | Combine sources | Joins (inner/left/right/full/cross), concatenation, dedup post-join |
| 8. Validation | Confirm readiness | Schema checks, assertion tests, row-count reconciliation |

## Tidy Data Principles

Hadley Wickham (2014) formalized "tidy data" as three rules:

1. **Each variable forms a column.** A single column should contain values of exactly one variable.
2. **Each observation forms a row.** A single row should contain all values for exactly one observational unit.
3. **Each type of observational unit forms a table.** Mixing patient demographics and lab results in one table violates this rule.

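
To make the third rule concrete, here is a minimal sketch of splitting a mixed table into one table per observational unit. The rows, field names, and the `split_units` helper are all hypothetical, chosen only for illustration; a real pipeline would more likely do this with a dataframe library or SQL.

```python
# Hypothetical mixed table: each row repeats patient demographics
# alongside one lab result (two observational units in one table).
mixed_rows = [
    {"patient_id": 1, "sex": "F", "birth_year": 1980, "test": "glucose", "value": 5.4},
    {"patient_id": 1, "sex": "F", "birth_year": 1980, "test": "hba1c", "value": 5.9},
    {"patient_id": 2, "sex": "M", "birth_year": 1975, "test": "glucose", "value": 6.1},
]

# Assumed column assignment: which fields describe the patient,
# and which describe a single lab result.
PATIENT_FIELDS = ("patient_id", "sex", "birth_year")
RESULT_FIELDS = ("patient_id", "test", "value")


def split_units(rows):
    """Split mixed rows into a patients table and a lab-results table."""
    patients = {}
    lab_results = []
    for row in rows:
        demographics = {field: row[field] for field in PATIENT_FIELDS}
        # Keep one row per patient; refuse to guess if demographics conflict.
        existing = patients.setdefault(row["patient_id"], demographics)
        if existing != demographics:
            raise ValueError(f"conflicting demographics for patient {row['patient_id']}")
        lab_results.append({field: row[field] for field in RESULT_FIELDS})
    return list(patients.values()), lab_results


patients, lab_results = split_units(mixed_rows)
```

After the split, `patients` holds one row per patient and `lab_results` holds one row per test, linked by `patient_id`. Raising on conflicting demographics, rather than silently keeping the first version, is a design choice in this sketch: a conflict usually signals a data-quality problem worth inspecting.
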
Most messy datasets violate one or more of these rules in predictable ways:

| Violation | Example | Fix |
|---|---|---|
| Column headers are values, not variable names | Columns: `income_2020`, `income_2021`, `income_2022` | Melt to columns: `year`, `income` |
| Multiple variables stored in one column | `"M-25"` encodes both sex and age | Split into `sex` and `age` columns |
| Variables stored in both rows and columns | Pivot table with row headers as categories | Melt and re-pivot to tidy form |
| Multiple types in one table | Patient info mixed with visit records | Normalize into two related tables |
| One type spread across multiple tables | Monthly CSV files with identical schema | Concatenate with a `month` column |

Tidy data is not the only valid structure -- wide formats are sometimes more efficient for computation or display. But tidy form is the canonical starting point for analysis, and most tools (ggplot2, pandas groupby, SQL aggregation) assume it.

## Cleaning Techniques

### Type Coercion

Raw data often arrives as strings. Coercion converts each field to the correct type:

- **Numeric:** Strip currency symbols, commas, whitespace. Handle locale-specific decimals (`,` vs `.`). Flag non-numeric values rather than silently converting to NaN.
- **Dates:** Parse with explicit format strings; never rely on automatic detection. Time zones matter -- store in UTC, display in local.
- **Categorical:** Standardize case, strip whitespace, map synonyms (`"USA"`, `"US"`, `"United States"` -> `"US"`). Use controlled vocabularies where possible.
- **Boolean:** Map common representations (`"yes"/"no"`, `"1"/"0"`, `"true"/"false"`, `"Y"/"N"`) to a single canonical form.

### Deduplication

Exact duplicates are trivial to detect. The hard cases are near-duplicates:

- **Record linkage:** When the same entity appears with slight variations (`"John Smith"` vs `"J. Smith"` vs `"SMITH, JOHN"`). Use fuzzy matching (Levenshtein distance, phonetic encoding) with a human-reviewed threshold.
- **Temporal duplicates:** The same event recorded at slightly different timestamps. Define a dedup window and keep the first/last/most-complete record.
- **Key discipline:** Always define what constitutes a unique observation before deduplication. A table of purchases has a different uniqueness key than a table of customers.

### Outlier Detection

Outliers are not automatically errors -- they are values that warrant investigation:

- **Statistical:** Values more than 1.5 * IQR below the first quartile or above the third quartile (Tukey's fences), or more than 3 standard deviations from the mean. These thresholds are guidelines, not laws.
- **Domain-based:** A human age of 150 is an error. A human age of 95 is unusual but valid. Domain knowledge trumps statistical rules.
- **Multivariate:** A value can be normal on each variable individually but extreme in combination (e.g., age 25 with 40 years of work experience). Mahalanobis distance or isolation forests detect these.

**Action on outliers:** Investigate first. If the value is a data entry error, correct it. If it is a measurement error, flag it. If it is a genuine extreme value, keep it and note its influence on summary statistics.

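
The statistical rule above (Tukey's fences) can be sketched with the Python standard library. The function names and the sample ages are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; other conventions give slightly different fences.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: "inclusive" quartiles; other quartile conventions
    # shift the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Pair each value with True if it falls outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [(v, v < lower or v > upper) for v in values]


# Hypothetical ages, including one implausible entry.
ages = [23, 25, 27, 29, 31, 33, 35, 150]
flagged = [v for v, is_outlier in flag_outliers(ages) if is_outlier]
```

For this sample the fences are 16.0 and 44.0, so only 150 is flagged. The sketch flags values rather than dropping them, in keeping with the "investigate first" guidance: whether to correct, flag, or keep a flagged value is still a domain decision.
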
## Missing Data

### Missing Data Mechanisms

Rubin (1976) classified three mechanisms:

| Mechanism | Definition | Example | Implication |
|---|---|---|---|
| **MCAR** | Missingness is unrelated to any variable | Lab sample randomly dropped | Safe to drop or impute; no bias |
| **MAR** | Missingness depends on observed variables | High-income respondents skip income question less often | Imputation using observed predictors is valid |
| **MNAR** | Missingness depends on the missing value itself | People with depression less likely to report depression severity | No imputation is fully valid; requires sensitivity analysis |

### Handling Strategies

| Strategy | When to use | Trade-off |
|---|---|---|
| **Listwise deletion** | MCAR, small fraction missing (<5%) | Simple but loses observations |
| **Pairwise deletion** | MCAR, different analyses need different subsets | Keeps more data but correlation matrices may not be posit
Major art movements and their historical context for art education. Covers 12 movements from the Renaissance to contemporary art, their defining characteristics, key artists, signature works, and the intellectual/social forces that produced them. Use when analyzing artworks in historical context, understanding stylistic lineages, identifying influences across periods, or connecting studio practice to art-historical precedent.
Color theory principles for art education. Covers the three color properties (hue, saturation, value), color mixing systems (subtractive and additive), color relationships (complementary, analogous, triadic, split-complementary), color temperature, simultaneous contrast and the relativity of color perception, and practical palette construction. Use when analyzing color in artworks, planning color schemes, understanding optical phenomena in painting, or investigating Albers's Interaction of Color experiments.
The creative process in art from idea to exhibition. Covers five phases of creative work (inspiration, incubation, exploration, execution, reflection), sketchbook practice, artist statements, critique methodology (formal and conceptual), portfolio development, and the studio as a working environment. Use when guiding students through project development, facilitating critique sessions, developing artist statements, curating portfolios, or understanding how professional artists structure their creative practice.
Digital art tools, techniques, and workflows for art education. Covers raster and vector workflows, digital painting, photo manipulation, generative and procedural art, 3D modeling and rendering, pixel art, the relationship between traditional skills and digital execution, and ethical considerations of AI-generated imagery. Use when working with digital tools, evaluating digital art, or bridging traditional art concepts into digital practice.
Observational drawing and visual perception techniques for art education. Covers contour drawing, gesture drawing, negative space, proportion and measurement, value mapping, spatial depth cues, and the cognitive shift from symbolic to perceptual seeing. Use when teaching drawing fundamentals, analyzing observational accuracy, or developing visual literacy in any medium.
Three-dimensional art and sculptural thinking for art education. Covers additive and subtractive sculptural processes, armature construction, modeling in clay, carving principles, casting and moldmaking, assemblage and found-object sculpture, installation art as expanded sculpture, and the conceptual transition from pictorial to spatial thinking. Use when working with three-dimensional media, analyzing sculptural form, understanding spatial composition, or investigating the relationship between sculpture and site.
Celestial coordinate systems and sky positioning. Covers horizon (altitude-azimuth), equatorial (right ascension-declination), ecliptic, and galactic systems; epoch and precession; coordinate transformations; planisphere use; and practical sky-locating from any latitude and date. Use when locating objects, planning observations, converting catalog coordinates, or teaching the geometry of the sky.
Observational cosmology from Hubble's law to the CMB. Covers redshift, Hubble expansion, the cosmological parameters, the cosmic microwave background, large-scale structure, galaxy rotation curves and dark matter, Type Ia SNe and dark energy, and the current state of Lambda-CDM. Use when reasoning about the large-scale universe, interpreting cosmological surveys, or teaching the Big Bang evidence chain.