ClaudeWave
Skill · 199 repo stars · updated 16d ago

cellxgene-census

Query CELLxGENE Census (61M+ cells). Search by cell type/tissue/disease/organism; get AnnData, stream out-of-core, train PyTorch models. For your own data use scanpy; for annotated data use anndata.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/cellxgene-census && cp -r /tmp/cellxgene-census/skills/genomics-bioinformatics/single-cell/cellxgene-census ~/.claude/skills/cellxgene-census
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# CZ CELLxGENE Census

## Overview

CZ CELLxGENE Census provides programmatic access to 61+ million standardized single-cell RNA-seq observations from human and mouse. It enables population-scale queries by cell type, tissue, disease, and donor metadata, returning expression data as AnnData objects or PyTorch dataloaders for ML workflows.

## When to Use

- Querying single-cell expression data across tissues, diseases, or cell types from a curated atlas
- Building reference datasets for cell type classification or marker gene discovery
- Training ML models on large-scale single-cell data (PyTorch integration)
- Comparing gene expression across conditions (e.g., COVID-19 vs healthy) at population scale
- Exploring what single-cell datasets are available for a tissue or disease of interest
- For **analyzing your own scRNA-seq data**, use scanpy instead
- For **manipulating AnnData objects** (subsetting, concatenation), use anndata instead

## Prerequisites

```bash
pip install cellxgene-census
# For ML workflows (quotes keep shells such as zsh from expanding the brackets)
pip install "cellxgene-census[experimental]"
```

**API Rate Limits**: Census is served from a TileDB-SOMA cloud backend. There is no explicit rate limit, but large queries (>1M cells) should use out-of-core processing (Core API section 4) to avoid memory exhaustion. Always open the Census with a context manager so resources are released properly.
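A rough size estimate can help decide whether a query is likely to fit in memory before running it. The sketch below is illustrative and not part of the Census API: `estimate_sparse_bytes` and `fits_in_memory` are hypothetical helpers, and the defaults (2,000 expressed genes per cell, 4-byte values, 8-byte indices, a 2× safety factor) are assumptions that should be replaced with numbers measured on a small sample.

```python
def estimate_sparse_bytes(n_cells, mean_genes_per_cell=2000,
                          value_bytes=4, index_bytes=8):
    """Approximate bytes for a CSR matrix with n_cells rows (assumed layout)."""
    nnz = n_cells * mean_genes_per_cell
    # Stored values + one column index per value + the row-pointer array.
    return nnz * (value_bytes + index_bytes) + (n_cells + 1) * index_bytes


def fits_in_memory(n_cells, available_bytes, safety_factor=2.0, **kwargs):
    """True if the estimate, padded by safety_factor, fits in available_bytes."""
    return estimate_sparse_bytes(n_cells, **kwargs) * safety_factor <= available_bytes


if __name__ == "__main__":
    sixteen_gib = 16 * 1024**3
    for n in (100_000, 1_000_000):
        gb = estimate_sparse_bytes(n) / 1e9
        print(f"{n:>9,} cells: ~{gb:.1f} GB, fits in 16 GiB: {fits_in_memory(n, sixteen_gib)}")
```

Under these assumptions, one million cells come to roughly 24 GB before any copies are made, which is consistent with the advice to stream queries of that size.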

## Quick Start

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get B cells from lung
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
        obs_column_names=["cell_type", "disease", "donor_id"],
    )
    print(f"Retrieved {adata.n_obs} cells × {adata.n_vars} genes")
    # e.g. Retrieved ~15000 cells × 60664 genes (exact counts depend on the Census version)
```

## Core API

### 1. Opening and Exploring the Census

Connect to Census and discover available data.

```python
import cellxgene_census

# Open latest stable version (always use context manager)
with cellxgene_census.open_soma() as census:
    # Summary statistics: the table stores label/value pairs, one per row
    summary = census["census_info"]["summary"].read().concat().to_pandas()
    summary = summary.set_index("label")["value"]
    print(f"Total cells: {int(summary['total_cell_count']):,}")

    # List all datasets
    datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    print(f"Total datasets: {len(datasets)}")
    print(datasets[["dataset_title", "dataset_total_cell_count"]].head())
```

```python
# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    # Reproducible analysis code here
    pass
```

### 2. Cell Metadata Queries

Query cell-level metadata without downloading expression data.

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get unique cell types in brain
    cell_metadata = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["cell_type", "disease", "assay"]
    )
    print(f"Total brain cells: {len(cell_metadata):,}")
    print(cell_metadata["cell_type"].value_counts().head(10))
```

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Gene metadata query
    gene_metadata = cellxgene_census.get_var(
        census,
        "homo_sapiens",
        value_filter="feature_name in ['CD4', 'CD8A', 'FOXP3']",
        column_names=["feature_id", "feature_name", "feature_length"]
    )
    print(gene_metadata)
    # Returns DataFrame with Ensembl IDs, gene symbols, and lengths
```

### 3. Expression Data Queries (Small-Medium Scale)

Retrieve expression matrices as AnnData objects for queries returning <100k cells.

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Query by cell type + tissue + disease
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
        obs_column_names=["cell_type", "tissue_general", "donor_id"],
    )
    print(f"Shape: {adata.shape}")  # (n_cells, 4)
    print(f"Metadata columns: {list(adata.obs.columns)}")
```

**Filter syntax reference**:
- Combine conditions: `and`, `or`
- Multiple values: `feature_name in ['CD4', 'CD8A']`
- Comparison: `feature_length > 1000`
- Always include `is_primary_data == True` to avoid duplicate cells
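The rules above can be sketched as a small string builder. `build_value_filter` is a hypothetical helper, not part of `cellxgene_census`; it only assembles the filter text and does not validate column names or escape unusual values.

```python
def build_value_filter(primary_only=True, **equals):
    """Build a value_filter string from column=value or column=[values] pairs."""
    clauses = []
    for column, value in equals.items():
        if isinstance(value, (list, tuple)):
            # Multiple values become an `in [...]` clause.
            clauses.append(f"{column} in {list(value)!r}")
        else:
            # A single value becomes an equality clause.
            clauses.append(f"{column} == {value!r}")
    if primary_only:
        # Appended by default to avoid counting duplicate cells.
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


if __name__ == "__main__":
    print(build_value_filter(cell_type="B cell", tissue_general="lung"))
    # cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True
    print(build_value_filter(primary_only=False, feature_name=["CD4", "CD8A"]))
    # feature_name in ['CD4', 'CD8A']
```

The first string can be passed as `obs_value_filter`; the second, built with `primary_only=False`, suits `var_value_filter`, since `is_primary_data` is a cell-level column.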

### 4. Large-Scale Out-of-Core Queries

Stream expression data in chunks for queries exceeding available RAM.

```python
import cellxgene_census
import tiledbsoma as soma

with cellxgene_census.open_soma() as census:
    # Estimate query size first
    metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["soma_joinid"]
    )
    n_cells = len(metadata)
    print(f"Query will return {n_cells:,} cells")

    # If >100k cells, stream the results instead of materializing them
    with census["census_data"]["homo_sapiens"].axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'brain' and is_primary_data == True"
        ),
        var_query=soma.AxisQuery(
            value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
        ),
    ) as query:
        # Incremental statistics over the stored (non-zero) entries
        n_obs, total = 0, 0.0
        for batch in query.X("raw").tables():
            values = batch["soma_data"].to_numpy()
            n_obs += len(values)
            total += values.sum()

    mean = total / n_obs if n_obs else float("nan")
    print(f"Processed {n_obs:,} non-zero entries, mean={mean:.4f}")
```
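The loop above averages only the stored, non-zero entries. A mean over every cell-gene pair has to count the implicit zeros in the denominator as well. The sketch below illustrates the difference with a hypothetical `RunningStats` accumulator; plain Python lists stand in for the arrays returned by `batch["soma_data"].to_numpy()`, and the cell and gene counts are made up for the example.

```python
class RunningStats:
    """Accumulate the count and sum of non-zero values across batches."""

    def __init__(self):
        self.nnz = 0
        self.total = 0.0

    def update(self, values):
        # Each batch holds only the stored (non-zero) entries.
        for v in values:
            self.nnz += 1
            self.total += v

    def mean_nonzero(self):
        return self.total / self.nnz if self.nnz else float("nan")

    def mean_all(self, n_cells, n_genes):
        # Implicit zeros add nothing to the sum but do count in the denominator.
        n = n_cells * n_genes
        return self.total / n if n else float("nan")


if __name__ == "__main__":
    stats = RunningStats()
    for batch in ([1.0, 3.0], [2.0]):  # stand-in batches
        stats.update(batch)
    print(stats.mean_nonzero())                # 2.0
    print(stats.mean_all(n_cells=4, n_genes=3))  # 0.5
```

Which mean is appropriate depends on the question: the non-zero mean describes expression among cells that express a gene, while the all-entries mean describes the population average.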

### 5. Dataset Presence Matrix

Check which datasets measured specific genes (not all genes are in all datasets).

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    presence = cellxgene_census.get_presence_matrix(
        census,
        "homo_sapiens",
        var_value_filter="feature_