Skill · 284 repo stars · updated 4d ago

statistical-analysis

This Claude Code skill provides guidance for selecting and executing statistical analyses across frequentist and Bayesian frameworks, including assumption verification, effect size calculation, and results reporting. Use it when designing research analyses, choosing between statistical approaches, interpreting p-values versus Bayesian credible intervals, understanding when effect sizes matter beyond statistical significance, or formatting academic research findings according to APA standards.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/statistical-analysis && cp -r /tmp/statistical-analysis/skills/biostatistics/statistical-analysis ~/.claude/skills/statistical-analysis

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Statistical Analysis

## Overview

Statistical analysis is the systematic process of selecting appropriate tests, verifying assumptions, quantifying effect magnitudes, and reporting results. This skill guides test selection, assumption diagnostics, and APA-style reporting for frequentist and Bayesian analyses in academic research.

## Key Concepts

### Frequentist vs Bayesian Framework

| Aspect | Frequentist | Bayesian |
|--------|-------------|----------|
| Core output | p-value, confidence interval | Posterior distribution, credible interval |
| Interpretation | "How likely is data at least this extreme if H0 is true?" | "How likely is H1 given the data?" |
| Null support | Cannot support H0 (only fail to reject) | Can quantify evidence for H0 via Bayes Factor |
| Prior info | Not used | Incorporated via prior distributions |
| Sample size | Requires adequate power | No fixed minimum, but small samples make results prior-sensitive |
| Best for | Standard analyses, large samples | Small samples, prior info, complex models |
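
To make the "prior info" and "credible interval" rows concrete, here is a minimal sketch of a conjugate Beta-binomial update using only the Python standard library. The function names, the flat Beta(1, 1) prior, and the data (7 successes in 10 trials) are all illustrative assumptions, and the interval is approximated by seeded sampling rather than an exact quantile function.

```python
import random

def beta_binomial_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Conjugate update: Beta(prior_a, prior_b) prior + binomial data -> Beta posterior."""
    return prior_a + successes, prior_b + (trials - successes)

def credible_interval(a, b, level=0.95, draws=20_000, seed=0):
    """Approximate equal-tailed credible interval by sampling from Beta(a, b)."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper

# Hypothetical data: 7 successes in 10 trials, flat Beta(1, 1) prior.
a_post, b_post = beta_binomial_posterior(7, 10)
posterior_mean = a_post / (a_post + b_post)
interval = credible_interval(a_post, b_post)
```

With so few trials the interval is wide, and a more informative prior would shift it noticeably, which is the prior sensitivity to keep in mind for small samples.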

### Statistical vs Practical Significance

A statistically significant result (p < .05) may be trivially small in practice. Always report:
- **Effect size**: Magnitude of the effect (Cohen's d, eta-squared, r, R-squared)
- **Confidence interval**: Precision of the estimate
- **Context**: Clinical/practical relevance in the domain
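
The gap between statistical and practical significance can be illustrated with a short sketch. It assumes two equal-sized groups and a large-sample normal approximation; the helper name `two_sample_z_pvalue` and the effect size of 0.02 standard deviations are hypothetical choices for illustration.

```python
import math

def two_sample_z_pvalue(d, n_per_group):
    """Two-sided p-value for a standardized mean difference d between two
    equal-sized groups, using a large-sample normal approximation."""
    z = d / math.sqrt(2.0 / n_per_group)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

d = 0.02  # a trivially small effect (Cohen's "small" benchmark is 0.20)
p_small_n = two_sample_z_pvalue(d, 100)       # not significant
p_large_n = two_sample_z_pvalue(d, 100_000)   # significant, same tiny effect
```

The effect size is identical in both calls; only the sample size changes the p-value, which is why the effect size and its interval belong in every report.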

### Common Effect Sizes

| Test | Effect Size | Small | Medium | Large |
|------|-------------|-------|--------|-------|
| t-test | Cohen's d | 0.20 | 0.50 | 0.80 |
| t-test (small n) | Hedges' g | 0.20 | 0.50 | 0.80 |
| ANOVA | Partial eta-squared | 0.01 | 0.06 | 0.14 |
| ANOVA | omega-squared | 0.01 | 0.06 | 0.14 |
| Correlation | r | 0.10 | 0.30 | 0.50 |
| Regression | R-squared | 0.02 | 0.13 | 0.26 |
| Regression | f-squared | 0.02 | 0.15 | 0.35 |
| Chi-square | Cramer's V (smaller table dimension = 3) | 0.07 | 0.21 | 0.35 |
| Chi-square 2x2 | phi coefficient | 0.10 | 0.30 | 0.50 |

Cohen's benchmarks are guidelines, not rigid thresholds -- domain context always matters.
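
As a sketch of how the first two rows of the table are computed, the following uses the pooled-standard-deviation form of Cohen's d and the common approximate small-sample correction for Hedges' g. The function names and sample data are illustrative assumptions.

```python
import math
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

def hedges_g(a, b):
    """Cohen's d shrunk by the approximate correction 1 - 3 / (4 * df - 1)."""
    df = len(a) + len(b) - 2
    return cohens_d(a, b) * (1.0 - 3.0 / (4.0 * df - 1.0))

# Hypothetical samples
group_a = [5, 6, 7, 8, 9]
group_b = [3, 4, 5, 6, 7]
d_value = cohens_d(group_a, group_b)
g_value = hedges_g(group_a, group_b)
```

The correction matters most when the combined sample is small; with large samples d and g are nearly identical.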

### Assumptions Overview

Most parametric tests require:
1. **Independence**: Observations are independent of each other
2. **Normality**: Data (or residuals) are approximately normally distributed
3. **Homogeneity of variance**: Groups have similar variances (for group comparisons)
4. **Linearity**: Relationship between variables is linear (for regression)

When assumptions are violated:
- **Normality violated, n >= 30**: Usually proceed -- parametric tests are robust with large samples, though heavy skew or extreme outliers still deserve a look
- **Normality violated, n < 30**: Use non-parametric alternative
- **Variance heterogeneity**: Use Welch's correction (t-test) or Welch's ANOVA
- **Linearity violated**: Add polynomial terms, transform variables, or use GAMs

### Test-Specific Assumption Workflows

**T-test assumptions**: (1) Check normality per group with Shapiro-Wilk + Q-Q plots. (2) Check homogeneity with Levene's test. (3) If normality violated: Mann-Whitney U (independent) or Wilcoxon signed-rank (paired). If variance heterogeneity: use Welch's t-test.
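
For the variance-heterogeneity branch, Welch's t-test replaces the pooled variance with per-group standard errors and uses the Welch-Satterthwaite degrees of freedom. The sketch below computes only the statistic and degrees of freedom with the standard library; the function name and data are illustrative, and a p-value would still need a t distribution from a statistics package.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n1, n2 = len(a), len(b)
    se1 = statistics.variance(a) / n1  # squared standard error, group 1
    se2 = statistics.variance(b) / n2  # squared standard error, group 2
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples with equal variances: df equals n1 + n2 - 2 here.
t_equal, df_equal = welch_t([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
# Unequal variances pull df below n1 + n2 - 2.
t_unequal, df_unequal = welch_t([5, 6, 7, 8, 9], [1, 4, 5, 6, 12])
```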

**ANOVA assumptions**: (1) Normality per group. (2) Homogeneity via Levene's test. (3) For repeated measures: check sphericity (Mauchly's test); if violated, apply Greenhouse-Geisser (epsilon < 0.75) or Huynh-Feldt (epsilon > 0.75) correction. (4) If normality violated: Kruskal-Wallis (independent) or Friedman (repeated).
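
The Kruskal-Wallis fallback in step (4) works on ranks rather than raw values. A minimal standard-library sketch of the H statistic follows; it assigns average ranks to ties but, as a simplifying assumption, omits the tie-correction factor, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(*groups):
    """H statistic for independent groups (no tie correction applied)."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)

# Hypothetical, fully separated groups
h_stat = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

Under the null hypothesis H is approximately chi-square distributed with (number of groups - 1) degrees of freedom, so a library routine is the practical way to get the p-value.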

**Linear regression assumptions**: (1) Linearity via residuals-vs-fitted plot. (2) Independence via Durbin-Watson test (1.5-2.5 acceptable). (3) Homoscedasticity via Breusch-Pagan test + scale-location plot. (4) Normality of residuals via Q-Q plot + Shapiro-Wilk. (5) Multicollinearity via VIF (>10 = severe, >5 = moderate).
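
Two of these diagnostics reduce to simple arithmetic, sketched below with the standard library. The function names are illustrative; `vif_from_r_squared` assumes you already have the R-squared from regressing one predictor on the remaining predictors.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals.
    Values near 2 suggest no first-order autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def vif_from_r_squared(r_squared):
    """Variance inflation factor for one predictor: 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r_squared)

# Hypothetical residual sequences
dw_alternating = durbin_watson([1, -1, 1, -1])  # strong negative autocorrelation
dw_constant = durbin_watson([1, 1, 1, 1])       # strong positive autocorrelation
vif_severe = vif_from_r_squared(0.9)            # at the "severe" threshold of 10
vif_moderate = vif_from_r_squared(0.8)          # at the "moderate" threshold of 5
```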

**Logistic regression assumptions**: (1) Independence. (2) Linearity of log-odds with continuous predictors (Box-Tidwell test). (3) No perfect multicollinearity (VIF). (4) Adequate sample size (10-20 events per predictor minimum).
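
The sample-size rule in step (4) is a quick ratio to check before fitting. The sketch below assumes the usual convention that "events" means the less frequent of the two outcome classes; the function name and counts are hypothetical.

```python
def events_per_predictor(n_events, n_non_events, n_predictors):
    """Events per predictor, counting the less frequent outcome class as the events."""
    limiting = min(n_events, n_non_events)
    return limiting / n_predictors

# Hypothetical study: 40 events, 160 non-events, 5 candidate predictors
epv = events_per_predictor(40, 160, 5)
meets_minimum = epv >= 10  # lower end of the 10-20 guideline
```

Here the ratio is 8, so the model has more predictors than the guideline supports; dropping to 4 predictors would reach 10.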

### Specialized Test Categories

Beyond the main decision flowchart, several specialized test families address specific data types:

**Survival / time-to-event analysis**:
- **Log-rank test**: Compares survival curves between groups (non-parametric)
- **Cox proportional hazards**: Models time-to-event with covariates; assumes proportional hazards
- **Parametric survival models**: Weibull, exponential, log-normal for known distributional forms
- Use when outcome is time until an event (death, relapse, failure) with possible censoring
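
To show how censoring enters a survival estimate, here is a minimal standard-library sketch of the Kaplan-Meier product-limit estimator, which underlies the survival curves that the log-rank test compares. The function name and data are illustrative; an event flag of 1 means the event was observed and 0 means the observation was censored.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct observed event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing this time; censored ones still count as at risk at t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve

# Hypothetical follow-up times with censoring at t=2 and t=5
km_curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
```

Censored subjects never trigger a drop in the curve, but they shrink the risk set for later event times, which is why the steps grow larger toward the end.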

**Count outcome models**:
- **Poisson regression**: For count data where mean approximately equals variance
- **Negative binomial regression**: For overdispersed counts (variance > mean)
- **Zero-inflated models**: For excess zeros beyond what Poisson/NB predicts
- Use when outcome is a count (number of events, incidents, occurrences)
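
A rough first screen for choosing between Poisson and negative binomial regression is the variance-to-mean ratio of the raw counts. This sketch is only a heuristic under the assumption of no covariates; a fitted model's residual dispersion is the more reliable check. The function name and data are illustrative.

```python
import statistics

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 is consistent with Poisson."""
    return statistics.variance(counts) / statistics.mean(counts)

# Hypothetical count outcomes
ratio_mild = dispersion_ratio([0, 1, 2, 3, 4])    # close to 1
ratio_over = dispersion_ratio([0, 0, 0, 1, 20])   # far above 1: overdispersed
```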

**Agreement and reliability**:
- **Cohen's kappa**: Inter-rater agreement for categorical ratings (2 raters)
- **Fleiss' kappa / Krippendorff's alpha**: Agreement for >2 raters
- **Intraclass correlation coefficient (ICC)**: Continuous ratings reliability
- **Cronbach's alpha**: Internal consistency of multi-item scales
- **Bland-Altman analysis**: Agreement between two measurement methods (continuous)
- Use when assessing measurement reliability or inter-rater consistency
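
Cohen's kappa corrects raw agreement for the agreement expected by chance from each rater's marginal frequencies. A minimal standard-library sketch, with an illustrative function name and made-up ratings:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on categorical labels.
    Undefined (division by zero) when expected agreement equals 1."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = counts1.keys() | counts2.keys()
    expected = sum(counts1[c] * counts2[c] for c in categories) / n ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical ratings: raters agree on 3 of 4 items
kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```

Raw agreement here is 0.75, but half of that is expected by chance, so kappa comes out at 0.5.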

**Categorical data extensions**:
- **McNemar's test**: Paired binary outcomes (2x2)
- **Cochran's Q test**: Paired binary outcomes (3+ conditions)
- **Cochran-Armitage trend test**: Trend in a binary outcome across ordered categories (2 x k contingency table)
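
McNemar's test uses only the discordant pairs of a paired 2x2 table. The sketch below computes the chi-square form with one degree of freedom using the standard library; the function name, the optional continuity correction flag, and the counts are illustrative assumptions, and an exact binomial version is usually preferred when discordant counts are small.

```python
import math

def mcnemar_test(b, c, continuity_correction=False):
    """Return (chi2, p_value) from discordant pair counts b and c.
    b: pairs positive only in condition 1; c: pairs positive only in condition 2."""
    diff = abs(b - c)
    if continuity_correction:
        diff = max(diff - 1, 0)
    chi2 = diff ** 2 / (b + c)
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical discordant counts
chi2_stat, p_val = mcnemar_test(15, 5)
chi2_corrected, p_corrected = mcnemar_test(15, 5, continuity_correction=True)
```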

## Decision Framework

### Test Selection Flowchart

```
What is your research question?
|
+-- Comparing GROUPS on a continuous outcome?
|   |
|   +-- How many groups?
|   |   +-- 2 groups
|   |   |   +-- Independent -> Independent t-test (or Mann-Whitney U)
|   |   |   +-- Paired/repeated -> Paired t-test (or Wilcoxon signed-rank)
|   |   +-- 3+ groups
|   |      +-- Independent -> One-way ANOVA (or Kruskal-Wallis)
|   |      +-- Repeated -> Repeated-measures ANOVA (or Friedman)
|   |
|   +-- Multiple factors? -> Factorial ANOVA / Mi