Skill · 199 repo stars · updated 16d ago

depmap-crispr-essentiality

DepMap CRISPR gene effect (Chronos) analysis: sign convention for essentiality, per-gene NaN-safe Spearman correlation, data loading/alignment. For general NaN-safe correlation see nan-safe-correlation; for quality filtering see degenerate-input-filtering.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/depmap-crispr-essentiality && cp -r /tmp/depmap-crispr-essentiality/skills/genomics-bioinformatics/databases/depmap-crispr-essentiality ~/.claude/skills/depmap-crispr-essentiality
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# DepMap CRISPR Gene Effect Analysis Guide

## Overview

This guide covers the correct interpretation and analysis of DepMap CRISPR gene effect (Chronos) data. The most common critical error in DepMap analyses is failing to negate the CRISPR scores when computing correlations with "essentiality." A second, equally damaging mistake is using bulk correlation shortcuts that mishandle per-gene NaN patterns. This guide provides the mandatory sign convention, the correct per-gene NaN-safe Spearman correlation implementation, and data loading/alignment procedures.

## Key Concepts

### DepMap CRISPR Score Convention

The CRISPR gene effect score (produced by the Chronos algorithm) quantifies how gene knockout affects cell viability:

- **Negative score**: gene knockout reduces cell viability -- the gene is **essential** for that cell line
- **Zero score**: no measurable effect on viability
- **Positive score**: gene knockout increases viability (rare, may indicate tumor-suppressive behavior)

The DepMap portal distributes these scores in the file `CRISPRGeneEffect.csv`. Each row is a cell line (DepMap ID, e.g., `ACH-000001`) and each column is a gene in the format `GENE_NAME (ENTREZ_ID)`, e.g., `A1BG (1)`.
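The column format described above can be split into a gene symbol and an Entrez ID with a small helper. This is a minimal sketch: `split_gene_label` is a hypothetical helper name, the regular expression assumes every label follows `GENE_NAME (ENTREZ_ID)` with a numeric ID, and the toy frame stands in for the real file with made-up values.

```python
import re

import pandas as pd

# Pattern for DepMap-style column labels such as "A1BG (1)".
# Assumption: the Entrez ID inside the parentheses is always numeric.
_GENE_LABEL = re.compile(r"^(?P<symbol>.+?) \((?P<entrez>\d+)\)$")


def split_gene_label(label):
    """Split 'GENE_NAME (ENTREZ_ID)' into (symbol, entrez_id)."""
    match = _GENE_LABEL.match(label)
    if match is None:
        raise ValueError(f"Unexpected gene label: {label!r}")
    return match.group("symbol"), int(match.group("entrez"))


# Toy stand-in for CRISPRGeneEffect.csv (values are made up for illustration).
# With the real file, something like
#   pd.read_csv("CRISPRGeneEffect.csv", index_col=0)
# would give the same layout: cell lines as rows, genes as columns.
toy = pd.DataFrame(
    {"A1BG (1)": [0.05, -0.10], "A2M (2)": [-1.20, float("nan")]},
    index=["ACH-000001", "ACH-000002"],
)

symbols = [split_gene_label(col)[0] for col in toy.columns]
```

Raising on an unexpected label, rather than skipping it silently, makes a format change in a new data release visible at load time.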

### Essentiality Sign Interpretation

Because negative raw scores indicate essentiality, any analysis that asks about "essentiality" or "dependency" requires negating the raw CRISPR scores:

- "Correlation with essentiality" = correlation with `-CRISPRGeneEffect` (negated)
- "Higher essentiality" = more negative raw score = more positive negated score
- "Most essential gene" = gene with the most negative raw score

If you correlate expression with **raw** CRISPR scores and find 3 genes with correlation <= -0.6 and 0 genes with correlation >= 0.6, then the correct answer for "genes with strong positive correlation with essentiality" is **3**, not 0. The negative correlations with raw scores ARE the positive correlations with essentiality.
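The sign flip can be checked numerically. The sketch below uses synthetic data (not DepMap values) to show that negating the gene effect scores flips the sign of the Spearman correlation, and then reproduces the counting example with made-up correlation values.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic example: higher expression goes with a more negative gene effect,
# i.e. with higher essentiality.
expression = rng.normal(size=50)
gene_effect = -0.8 * expression + rng.normal(scale=0.3, size=50)

rho_raw, _ = spearmanr(expression, gene_effect)            # vs. raw scores
rho_essentiality, _ = spearmanr(expression, -gene_effect)  # vs. essentiality

# Counting example: correlations computed against RAW scores (made-up values).
raw_rhos = np.array([-0.72, -0.65, -0.61, 0.10, 0.35])
essentiality_rhos = -raw_rhos
n_strong_positive = int((essentiality_rhos >= 0.6).sum())
```

Here `rho_essentiality` equals `-rho_raw` up to floating-point error, and `n_strong_positive` counts the three genes whose raw correlation is at or below -0.6.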

### Data Structure: CRISPRGeneEffect Format

The standard DepMap data files use a consistent structure:

- **Index**: DepMap cell line identifiers (`ACH-XXXXXX`)
- **Columns**: Gene identifiers in `GENE_NAME (ENTREZ_ID)` format
- **Values**: Floating-point scores (may contain NaN for genes not screened in a given cell line)
- **Companion files**: Expression data (`OmicsExpressionProteinCodingGenesTPMLogp1BatchCorrected.csv`) uses the same index/column format, enabling direct alignment

Different genes have different patterns of missing data across cell lines. This is because not all genes are screened in all cell lines, and quality control may remove specific gene-cell line combinations.
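Because the two files share an index and column format, alignment reduces to intersecting labels. The following is a minimal sketch under that assumption; `align_and_summarize` is a hypothetical helper name and the toy frames contain made-up values.

```python
import numpy as np
import pandas as pd


def align_and_summarize(expression, gene_effect):
    """Restrict both frames to shared cell lines and genes; report NaN counts."""
    common_lines = expression.index.intersection(gene_effect.index)
    common_genes = expression.columns.intersection(gene_effect.columns)
    expr = expression.loc[common_lines, common_genes]
    effect = gene_effect.loc[common_lines, common_genes]
    summary = {
        "common_cell_lines": len(common_lines),
        "common_genes": len(common_genes),
        "expression_nan": int(expr.isna().sum().sum()),
        "gene_effect_nan": int(effect.isna().sum().sum()),
    }
    return expr, effect, summary


# Toy frames that overlap only partially (values are made up).
expression = pd.DataFrame(
    {"A1BG (1)": [1.0, 2.0, 3.0], "A2M (2)": [0.5, 0.7, 0.9],
     "TP53 (7157)": [4.0, 4.5, 5.0]},
    index=["ACH-000001", "ACH-000002", "ACH-000003"],
)
gene_effect = pd.DataFrame(
    {"A2M (2)": [-0.1, 0.0, 0.2], "TP53 (7157)": [0.3, np.nan, 0.1],
     "KRAS (3845)": [-1.0, -0.8, -1.2]},
    index=["ACH-000002", "ACH-000003", "ACH-000004"],
)

expr, effect, summary = align_and_summarize(expression, gene_effect)
print(summary)
```

Selecting both frames with the same label objects means rows and columns line up position by position afterwards, and the NaN counts are kept per dataset so that missing values are handled later, gene by gene, rather than dropped globally here.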

## Decision Framework

```
Question: How should I compute correlations with DepMap CRISPR data?
├── Does the question mention "essentiality" or "dependency"?
│   ├── Yes → Negate CRISPR scores before correlating (see Best Practices #1)
│   └── No (raw gene effect) → Use raw scores directly
├── How should I compute correlations?
│   ├── Per-gene correlation → scipy.stats.spearmanr in a loop (see Best Practices #2)
│   └── Matrix-wide correlation → AVOID; use per-gene loop instead
└── How should I handle missing data?
    ├── Pairwise NaN removal → CORRECT (see Best Practices #3)
    └── Global row/column dropping → INCORRECT; loses too much data
```

| Scenario | Recommended Approach | Rationale |
|----------|---------------------|-----------|
| Correlating expression with "essentiality" | Negate CRISPR scores, then per-gene Spearman | Sign convention requires negation; per-gene handles NaN correctly |
| Correlating expression with raw gene effect | Per-gene Spearman on raw scores | No negation needed, but NaN-safe per-gene loop still required |
| Ranking genes by essentiality across cell lines | Rank by most negative mean raw score | More negative = more essential across the panel |
| Identifying selectively essential genes | Compare score distributions across subgroups | Use per-subgroup mean/median of raw scores, then compare |
| Filtering genes before correlation | Require minimum 10 valid cell line pairs | Genes with too few observations yield unreliable correlations |
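The first and last rows of the table can be combined into one loop. This is a sketch, not a reference implementation: `per_gene_spearman`, its parameters, and the gene labels in the toy data are hypothetical, both frames are assumed to be already aligned, and the constant-input guard is an extra precaution added here.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def per_gene_spearman(expression, gene_effect, negate=True, min_pairs=10):
    """Per-gene Spearman rho between expression and gene effect.

    Assumes both frames share the same index and columns.
    negate=True correlates with essentiality (-gene effect).
    """
    sign = -1.0 if negate else 1.0
    results = {}
    for gene in expression.columns:
        x = expression[gene].to_numpy(dtype=float)
        y = sign * gene_effect[gene].to_numpy(dtype=float)
        # Pairwise NaN removal: keep cell lines where both values are present.
        valid = ~(np.isnan(x) | np.isnan(y))
        n_valid = int(valid.sum())
        if n_valid < min_pairs:
            continue  # too few observations for a reliable estimate
        # Added guard: Spearman is undefined when either input is constant.
        if np.ptp(x[valid]) == 0 or np.ptp(y[valid]) == 0:
            continue
        rho, pval = spearmanr(x[valid], y[valid])
        results[gene] = {"rho": rho, "pval": pval, "n": n_valid}
    return pd.DataFrame.from_dict(results, orient="index")


# Synthetic data with hypothetical gene labels (not DepMap values).
rng = np.random.default_rng(1)
lines = [f"ACH-{i:06d}" for i in range(1, 31)]
genes = ["GENE_A (1)", "GENE_B (2)"]
expr = pd.DataFrame(rng.normal(size=(30, 2)), index=lines, columns=genes)
effect = -expr + rng.normal(scale=0.2, size=(30, 2))
effect.iloc[0, 0] = np.nan   # GENE_A: 29 valid pairs
effect.iloc[5:, 1] = np.nan  # GENE_B: only 5 valid pairs, will be skipped

result = per_gene_spearman(expr, effect, negate=True, min_pairs=10)
```

In this toy run only `GENE_A (1)` survives the threshold, and its correlation with essentiality is positive because expression was constructed to track a more negative raw score.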

## Best Practices

1. **Always negate CRISPR scores when the analysis asks about "essentiality"**: The raw DepMap convention is that negative = essential. When a question or hypothesis refers to "essentiality," "dependency," or "gene importance," negate the scores so that higher values mean more essential. Explicitly state the sign convention in your results.

2. **Use scipy.stats.spearmanr per gene in a loop**: Bulk matrix shortcuts (`DataFrame.corrwith`, `DataFrame.rank().corrwith()`) handle NaN inconsistently across columns. For example, ranking each column before pairwise NaN removal computes ranks on a different set of cell lines than the correlation actually uses. The reliable method is to compute Spearman correlation gene by gene using `scipy.stats.spearmanr` on pairwise-complete observations.

3. **Apply pairwise NaN removal, not global dropping**: Different genes have different missing-data patterns. Dropping rows globally (any NaN in any column) discards far too much data. Instead, for each gene, mask out only the cell lines where either the expression or CRISPR value is NaN.

4. **Set a minimum valid-pair threshold**: Genes with very few non-NaN cell line pairs produce unreliable correlation estimates. Require at least 10 (preferably 20+) valid pairs before computing a correlation. Skip genes below this threshold.

5. **Report NaN summary before analysis**: Before computing correlations, print the total NaN count per dataset, the number of common cell lines, and the number of common genes. This provides an audit trail and helps catch data loading errors early.

6. **Verify dataset alignment before computation**: Always intersect cell line IDs and gene columns between datasets before analysis. Misaligned indices produce silent errors -- correlations computed on mismatched rows are meaningless.

7. **State the sign convention explicitly in results**: When reporting correlation results, always include a statement like "