data-analysis
The data-analysis skill helps users explore, clean, analyze, and visualize datasets to answer specific business or research questions. Use it when someone requests pattern detection, statistical analysis, data visualization, data cleaning, or insights from structured data like CSV files, Excel spreadsheets, or JSON datasets. It's also appropriate for A/B test analysis, cohort studies, and data quality assessments. The skill begins by clarifying what question the data should answer before performing analysis.
git clone --depth 1 https://github.com/Upsonic/Upsonic /tmp/data-analysis && cp -r /tmp/data-analysis/src/upsonic/skills/builtins/data-analysis ~/.claude/skills/data-analysis
# Data Analysis
Explore, clean, analyze, and communicate findings from data. The goal is always to answer a question — start with what the user wants to know and work backward to the analysis that answers it.
## Before You Analyze
### Understand the Question
Before touching the data, clarify:
1. **What question are we answering?** ("Is our conversion rate improving?" is an answerable question. "Analyze this data" is not — help the user sharpen it.)
2. **Who needs the answer?** (Engineer debugging an issue? Executive making a budget decision? Researcher testing a hypothesis?)
3. **What decisions will this inform?** (This determines how precise you need to be and what format the answer should take.)
4. **What's the timeline?** (A quick sanity check and a thorough statistical analysis require different approaches.)
If the user says "analyze this data" without a specific question, help them formulate one:
- "What would be most useful to know from this data?"
- "Are you looking for trends over time, comparisons between groups, or something else?"
- "Is there a specific business question this should answer?"
### Reference Materials and Scripts
- Execute `profile_data.py` with a data file path to get a quick profile of any CSV, Excel, or JSON dataset — it reports shape, types, missing values, stats, and value distributions. Run with `--help` for usage.
- Load `statistical-tests-guide.md` when choosing statistical tests — it has a decision matrix for test selection, effect size interpretation tables, and sample size guidelines.
### Understand the Data
Before analysis, get your bearings:
1. **Source and context**: Where did this data come from? How was it collected? What time period does it cover?
2. **Schema**: What are the columns/fields? What do they represent? What are the data types?
3. **Scale**: How many rows/records? What's the granularity? (Per user? Per day? Per transaction?)
4. **Known issues**: Is the data known to be incomplete, biased, or have quality problems?
```python
# First look at any dataset
import pandas as pd
df = pd.read_csv("data.csv") # or read_excel, read_json, etc.
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst rows:\n{df.head()}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")
```
## Analysis Workflow
### Step 1: Clean and Validate
Data quality determines analysis quality. Don't skip this.
#### Handle Missing Values
- **Count them first**: What percentage of each column is missing?
- **Understand why**: Are they random? Systematic? (e.g., optional fields vs data collection failures)
- **Choose a strategy and document it**:
  - Drop rows: When missing data is rare and random (less than 5%)
  - Impute with median/mode: When missing data is moderate and the distribution is known
  - Flag as separate category: When missingness itself is informative
  - Leave as-is: When the analysis method handles nulls natively
```python
# Document your decisions
missing_pct = df.isnull().sum() / len(df) * 100
print("Missing data percentage per column:")
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```
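The strategies listed above can be sketched roughly as follows. This is a minimal illustration on made-up data: the column names (`customer_id`, `amount`, `referrer`) and the `decisions` log are hypothetical, and which strategy fits which column depends on your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "amount": [20.0, np.nan, 35.0, 50.0, 41.0, 28.0],
    "referrer": ["ads", None, "email", None, "ads", "email"],
    "customer_id": [101, 102, None, 104, 105, 106],
})

decisions = {}  # running log of what was done to each column

# Drop rows: assume customer_id is required and only rarely missing
before = len(df)
cleaned = df.dropna(subset=["customer_id"]).copy()
decisions["customer_id"] = f"dropped {before - len(cleaned)} row(s) with no customer_id"

# Impute with median: keep a flag so imputed rows stay identifiable
median_amount = cleaned["amount"].median()
cleaned["amount_was_missing"] = cleaned["amount"].isna()
cleaned["amount"] = cleaned["amount"].fillna(median_amount)
decisions["amount"] = f"imputed median ({median_amount})"

# Flag as separate category: a missing referrer may itself be informative
cleaned["referrer"] = cleaned["referrer"].fillna("unknown")
decisions["referrer"] = "filled missing values with an 'unknown' category"

for column, note in decisions.items():
    print(f"{column}: {note}")
```

Keeping the `amount_was_missing` flag next to the imputed column makes it possible to check later whether the imputation changed any conclusions.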
#### Handle Outliers
- **Detect**: Use IQR method, z-scores, or domain knowledge
- **Investigate**: Are they errors or legitimate extreme values?
- **Document your decision**: Keep, cap, or remove — and explain why
```python
# IQR method for outlier detection
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5 * IQR) | (df['value'] > Q3 + 1.5 * IQR)]
print(f"Found {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
```
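Once outliers are detected, the keep, cap, and remove options might look like the sketch below. The data is invented and the 1.5 × IQR fences are the same convention as above; whether capping or removal is appropriate is a judgment call to document, not something the code decides.

```python
import pandas as pd

# Hypothetical values; 250 is the suspicious point
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 14, 11, 250]})

q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep: leave the values untouched but mark the rows for inspection
df["is_outlier"] = ~df["value"].between(lower, upper)

# Cap: pull extreme values back to the IQR fences
df["value_capped"] = df["value"].clip(lower=lower, upper=upper)

# Remove: work on a filtered copy so the original stays intact
df_removed = df[~df["is_outlier"]].copy()

print(f"Fences: [{lower}, {upper}]")
print(f"Flagged {df['is_outlier'].sum()} of {len(df)} rows")
print(f"Rows after removal: {len(df_removed)}")
```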
#### Validate Data Types and Ranges
- Dates should be dates, numbers should be numbers
- Check for impossible values (negative ages, future dates, percentages over 100)
- Verify categorical values are consistent (watch for "USA", "US", "United States")
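A minimal sketch of these checks, assuming a small invented table with hypothetical columns `signup_date`, `age`, `discount_pct`, and `country`. The `country_map` dictionary is an illustrative mapping you would build by reviewing the actual values.

```python
import pandas as pd

# Hypothetical raw data showing typical problems
raw = pd.DataFrame({
    "signup_date": ["2023-01-15", "not a date", "2023-03-02"],
    "age": ["34", "-5", "41"],
    "discount_pct": [10.0, 150.0, 25.0],
    "country": ["USA", "US", "United States"],
})
df = raw.copy()

# Coerce types; unparseable entries become NaT/NaN instead of raising
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Collect impossible values as boolean masks rather than silently fixing them
problems = {
    "unparseable_date": df["signup_date"].isna(),
    "future_date": df["signup_date"] > pd.Timestamp.now(),
    "negative_age": df["age"] < 0,
    "pct_out_of_range": ~df["discount_pct"].between(0, 100),
}
for name, mask in problems.items():
    print(f"{name}: {int(mask.sum())} row(s)")

# Normalize categorical spellings with an explicit, reviewable mapping
country_map = {"US": "USA", "United States": "USA"}
df["country"] = df["country"].str.strip().replace(country_map)
print(df["country"].value_counts())
```

Reporting the problem counts before changing anything keeps the cleaning decisions visible to whoever reads the analysis.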
### Step 2: Explore
Start broad, then focus on what's interesting.
#### Descriptive Statistics
Always start here — understand the basics before going deeper.
```python
# Numerical columns
print(df.describe())
# Categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))
```
#### Distributions
Understanding shape matters for choosing the right tests.
```python
import matplotlib.pyplot as plt
# Distribution of key metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(['metric_a', 'metric_b', 'metric_c']):
    df[col].hist(ax=axes[i], bins=30)
    axes[i].set_title(col)
    axes[i].axvline(df[col].median(), color='red', linestyle='--', label='median')
    axes[i].legend()
plt.tight_layout()
plt.savefig("distributions.png")
```
#### Correlations and Relationships
Look for patterns between variables.
```python
# Correlation matrix for numerical columns
corr = df.select_dtypes(include='number').corr()
print("Strong correlations (|r| > 0.5):")
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if abs(corr.iloc[i, j]) > 0.5:
            print(f"  {corr.columns[i]} vs {corr.columns[j]}: {corr.iloc[i,j]:.3f}")
```
#### Trends Over Time
If the data has a time dimension, always look at trends.
```python
# Time series analysis
df['date'] = pd.to_datetime(df['date'])
daily = df.groupby('date')['metric'].agg(['mean', 'count'])
daily['mean'].plot(figsize=(12, 4), title='Daily Average')
plt.savefig("trend.png")
```
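Daily values are often noisy, so a smoothed view can make the underlying trend easier to see. The sketch below uses synthetic data and assumes a 7-day rolling window and weekly resampling; both are illustrative choices, not recommendations from the original workflow.

```python
import numpy as np
import pandas as pd

# Synthetic daily data: an upward drift plus noise (illustration only)
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=60, freq="D")
df = pd.DataFrame({
    "date": dates,
    "metric": np.linspace(10, 20, 60) + rng.normal(0, 2, 60),
})

# One value per calendar day, indexed by date
daily = df.set_index("date")["metric"].resample("D").mean()

# 7-day rolling mean; the first 6 days are NaN until the window fills
rolling = daily.rolling(window=7, min_periods=7).mean()

# Coarser weekly averages for comparison
weekly = daily.resample("W").mean()

print(rolling.dropna().head())
print(weekly.head())
```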
### Step 3: Analyze
Choose the right method for the question.
#### Comparison Questions ("Is A different from B?")
Use the right statistical test:
- **Two groups, continuous outcome**: t-test (if normal) or Mann-Whitney U (if not)
- **Multiple groups**: ANOVA (if normal) or Kruskal-Wallis (if not)
- **Two groups, categorical outcome**: Chi-squared test
- **Before/after with same subjects**: Paired t-test or Wilcoxon signed-rank
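The two-group, continuous-outcome case might be sketched as follows using `scipy.stats`. Several choices here are assumptions: the helper `compare_two_groups` is hypothetical, normality is judged with a Shapiro-Wilk test at `alpha`, and the t-test is run as Welch's variant (`equal_var=False`) rather than the classic pooled-variance version. The data is synthetic.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only
rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=105, scale=15, size=200)


def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper: pick a test based on a Shapiro-Wilk normality check."""
    looks_normal = (
        stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    )
    if looks_normal:
        # Welch's t-test does not assume equal variances
        result = stats.ttest_ind(a, b, equal_var=False)
        test_name = "Welch's t-test"
    else:
        # Rank-based alternative when normality is doubtful
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        test_name = "Mann-Whitney U"
    return {
        "test": test_name,
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "n_a": len(a),
        "n_b": len(b),
    }


summary = compare_two_groups(group_a, group_b)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A formal normality test is only one way to make this choice; with large samples it flags trivial deviations, so a histogram or Q-Q plot is a reasonable complement.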
Always report:
- Sample size