Skill · 7.9k repo stars · updated 1mo ago

data-analysis

The data-analysis skill helps users explore, clean, analyze, and visualize datasets to answer specific business or research questions. Use it when someone requests pattern detection, statistical analysis, data visualization, data cleaning, or insights from structured data like CSV files, Excel spreadsheets, or JSON datasets. It's also appropriate for A/B test analysis, cohort studies, and data quality assessments. The skill begins by clarifying what question the data should answer before performing analysis.

Repository: Upsonic

Install in Claude Code

git clone --depth 1 https://github.com/Upsonic/Upsonic /tmp/data-analysis && cp -r /tmp/data-analysis/src/upsonic/skills/builtins/data-analysis ~/.claude/skills/data-analysis

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Data Analysis

Explore, clean, analyze, and communicate findings from data. The goal is always to answer a question — start with what the user wants to know and work backward to the analysis that answers it.

## Before You Analyze

### Understand the Question

Before touching the data, clarify:

1. **What question are we answering?** ("Is our conversion rate improving?" is an answerable question. "Analyze this data" is not — help the user sharpen it.)
2. **Who needs the answer?** (Engineer debugging an issue? Executive making a budget decision? Researcher testing a hypothesis?)
3. **What decisions will this inform?** (This determines how precise you need to be and what format the answer should take.)
4. **What's the timeline?** (A quick sanity check and a thorough statistical analysis require different approaches.)

If the user says "analyze this data" without a specific question, help them formulate one:
- "What would be most useful to know from this data?"
- "Are you looking for trends over time, comparisons between groups, or something else?"
- "Is there a specific business question this should answer?"

## Reference Materials and Scripts

- Execute `profile_data.py` with a data file path to get a quick profile of any CSV, Excel, or JSON dataset — it reports shape, types, missing values, stats, and value distributions. Run with `--help` for usage.
- Load `statistical-tests-guide.md` when choosing statistical tests — it has a decision matrix for test selection, effect size interpretation tables, and sample size guidelines.

### Understand the Data

Before analysis, get your bearings:

1. **Source and context**: Where did this data come from? How was it collected? What time period does it cover?
2. **Schema**: What are the columns/fields? What do they represent? What are the data types?
3. **Scale**: How many rows/records? What's the granularity? (Per user? Per day? Per transaction?)
4. **Known issues**: Is the data known to be incomplete, biased, or have quality problems?

```python
# First look at any dataset
import pandas as pd

df = pd.read_csv("data.csv")  # or read_excel, read_json, etc.
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst rows:\n{df.head()}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")
```

## Analysis Workflow

### Step 1: Clean and Validate

Data quality determines analysis quality. Don't skip this.

#### Handle Missing Values
- **Count them first**: What percentage of each column is missing?
- **Understand why**: Are they random? Systematic? (e.g., optional fields vs data collection failures)
- **Choose a strategy and document it**:
  - Drop rows: When missing data is rare and random (less than 5%)
  - Impute with median/mode: When missing data is moderate and the distribution is known
  - Flag as separate category: When missingness itself is informative
  - Leave as-is: When the analysis method handles nulls natively

```python
# Document your decisions
missing_pct = df.isnull().sum() / len(df) * 100
print("Missing data percentage per column:")
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```
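The strategies listed above could be applied roughly as in the following sketch. The column names (`age`, `plan`, `score`) and the small inline DataFrame are hypothetical, and which strategy fits which column is an assumption made for illustration; with real data the choice depends on why values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column names are illustrative only
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, 38, np.nan, 25, 33],
    "plan": ["pro", "free", None, "free", "pro", "free", None, "pro"],
    "score": [0.8, 0.6, 0.9, np.nan, 0.7, 0.5, 0.4, 0.95],
})

decisions = {}  # record what was done to each column

# Flag as separate category: missingness in `plan` may itself be informative
df["plan"] = df["plan"].fillna("unknown")
decisions["plan"] = "filled missing values with an 'unknown' category"

# Impute with median, keeping an indicator so imputed rows stay identifiable
age_median = df["age"].median()
df["age_was_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(age_median)
decisions["age"] = f"imputed median ({age_median})"

# Drop rows: shown here for illustration; with real data, reserve this
# for missingness that is rare and random
rows_before = len(df)
df = df.dropna(subset=["score"])
decisions["score"] = f"dropped {rows_before - len(df)} row(s)"

for col, note in decisions.items():
    print(f"{col}: {note}")
```

Keeping the `decisions` record next to the code makes it easy to include the cleaning choices in the final write-up.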

#### Handle Outliers
- **Detect**: Use IQR method, z-scores, or domain knowledge
- **Investigate**: Are they errors or legitimate extreme values?
- **Document your decision**: Keep, cap, or remove — and explain why

```python
# IQR method for outlier detection
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5 * IQR) | (df['value'] > Q3 + 1.5 * IQR)]
print(f"Found {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
```
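If the decision is to cap rather than remove, one option is to clip values to the IQR fences. This is a minimal sketch, assuming a numeric column named `value` as in the block above; the sample data is made up for illustration, and capping is only one of the three options listed.

```python
import pandas as pd

# Hypothetical data with one obvious extreme value
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 10, 14, 13, 250]})

q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep the raw column and mark the affected rows so the decision is auditable
df["is_outlier"] = ~df["value"].between(lower, upper)
df["value_capped"] = df["value"].clip(lower=lower, upper=upper)

print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print(f"Capped {int(df['is_outlier'].sum())} of {len(df)} rows")
```

Writing the capped values to a new column instead of overwriting `value` means the analysis can be rerun with and without the outliers to see how much they matter.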

#### Validate Data Types and Ranges
- Dates should be dates, numbers should be numbers
- Check for impossible values (negative ages, future dates, percentages over 100)
- Verify categorical values are consistent (watch for "USA", "US", "United States")
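
The checks above can be sketched with pandas as follows. The column names, the sample rows, and the `country_map` dictionary are all hypothetical; real data would need its own rules for what counts as an impossible value.

```python
import pandas as pd

# Hypothetical raw data; column names and the country mapping are illustrative
df = pd.DataFrame({
    "signup_date": ["2023-01-15", "2023-02-30", "2023-03-01", "not a date"],
    "age": ["34", "-2", "abc", "51"],
    "discount_pct": [10.0, 120.0, 0.0, 55.0],
    "country": ["USA", "US", "United States", "Canada"],
})

# Coerce types; values that cannot be parsed become NaT/NaN instead of raising
df["signup_date"] = pd.to_datetime(
    df["signup_date"], format="%Y-%m-%d", errors="coerce"
)
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Range checks for impossible values
problems = {
    "unparseable_date": df["signup_date"].isna(),
    "future_date": df["signup_date"] > pd.Timestamp.now(),
    "invalid_age": df["age"].isna() | (df["age"] < 0),
    "pct_out_of_range": ~df["discount_pct"].between(0, 100),
}
for name, mask in problems.items():
    print(f"{name}: {int(mask.sum())} row(s)")

# Normalize inconsistent category labels to one canonical value
country_map = {"US": "USA", "United States": "USA"}
df["country"] = df["country"].replace(country_map)
print(df["country"].value_counts())
```

Using `errors="coerce"` turns bad values into missing values, so they show up in the same missing-data counts used earlier rather than crashing the load.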

### Step 2: Explore

Start broad, then focus on what's interesting.

#### Descriptive Statistics
Always start here — understand the basics before going deeper.

```python
# Numerical columns
print(df.describe())

# Categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))
```

#### Distributions
Understanding shape matters for choosing the right tests.

```python
import matplotlib.pyplot as plt

# Distribution of key metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(['metric_a', 'metric_b', 'metric_c']):
    df[col].hist(ax=axes[i], bins=30)
    axes[i].set_title(col)
    axes[i].axvline(df[col].median(), color='red', linestyle='--', label='median')
    axes[i].legend()
plt.tight_layout()
plt.savefig("distributions.png")
```

#### Correlations and Relationships
Look for patterns between variables.

```python
# Correlation matrix for numerical columns
corr = df.select_dtypes(include='number').corr()
print("Strong correlations (|r| > 0.5):")
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if abs(corr.iloc[i, j]) > 0.5:
            print(f"  {corr.columns[i]} vs {corr.columns[j]}: {corr.iloc[i,j]:.3f}")
```

#### Trends Over Time
If the data has a time dimension, always look at trends.

```python
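# Time series analysis
df['date'] = pd.to_datetime(df['date'])
# Group by calendar day so rows with intraday timestamps roll up into one daily value
daily = df.groupby(df['date'].dt.normalize())['metric'].agg(['mean', 'count'])
daily['mean'].plot(figsize=(12, 4), title='Daily Average')
plt.savefig("trend.png")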
```
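
Daily values are often noisy, so a rolling average or a coarser aggregation can make the underlying trend easier to see. The sketch below uses synthetic data; the `metric` name follows the block above, and the 7-day window is an assumption that should be matched to the seasonality of the real data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a linear upward trend plus a weekly cycle
days = np.arange(56)
dates = pd.date_range("2024-01-01", periods=56, freq="D")
values = days * 0.5 + 10 * np.sin(2 * np.pi * days / 7)
daily = pd.Series(values, index=dates, name="metric")

# A 7-day rolling mean averages out the weekly cycle
rolling = daily.rolling(window=7).mean()

# Weekly aggregation is another way to view the same trend
weekly = daily.resample("W").mean()

print(rolling.dropna().head())
print(weekly.head())
```

Because the window length equals the period of the cycle, the rolling mean here recovers the linear trend; a window that does not match the seasonality would leave some of the cycle in the smoothed series.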

### Step 3: Analyze

Choose the right method for the question.

#### Comparison Questions ("Is A different from B?")

Use the right statistical test:
- **Two groups, continuous outcome**: t-test (if normal) or Mann-Whitney U (if not)
- **Multiple groups**: ANOVA (if normal) or Kruskal-Wallis (if not)
- **Two groups, categorical outcome**: Chi-squared test
- **Before/after with same subjects**: Paired t-test or Wilcoxon signed-rank
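
For the two-group, continuous-outcome case, the choice between a t-test and Mann-Whitney U could be sketched with SciPy as below. `compare_two_groups` is a hypothetical helper, not part of the skill's scripts, and using a Shapiro-Wilk check to decide on normality is one common heuristic rather than the only defensible approach. The sample values are made up.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Pick a two-sample test based on a Shapiro-Wilk normality check.

    Hypothetical helper for illustration only.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # shapiro returns (statistic, p-value); a small p-value suggests non-normality
    normal = all(stats.shapiro(x)[1] > alpha for x in (a, b))
    if normal:
        # Welch's t-test does not assume equal variances
        result = stats.ttest_ind(a, b, equal_var=False)
        name = "welch_t"
    else:
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        name = "mann_whitney"
    return {
        "test": name,
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "n_a": len(a),
        "n_b": len(b),
    }

# Made-up, strongly right-skewed samples, so the normality check fails
group_a = [1, 1, 1, 2, 2, 3, 4, 6, 9, 14, 22, 35, 60, 110, 200]
group_b = [2, 2, 3, 3, 4, 6, 8, 12, 20, 33, 55, 90, 150, 260, 480]
print(compare_two_groups(group_a, group_b))
```

With large samples a normality test will flag even trivial deviations, so a histogram or Q-Q plot is a useful cross-check before relying on the automatic choice.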

Always report:
- Sampl