Skill286 estrellas del repoactualizado 5d ago

cellxgene-census

The cellxgene-census skill provides programmatic access to over 61 million standardized single-cell RNA-seq observations from human and mouse through the CZ CELLxGENE Census. It enables querying expression data by cell type, tissue, disease, and donor metadata, returning results as AnnData objects or PyTorch dataloaders for machine learning workflows, making it ideal for population-scale comparative analysis and reference dataset construction.

Ver fuente Repositorio: SciAgent-Skills

Instalar en Claude Code

Copiar

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/cellxgene-census && cp -r /tmp/cellxgene-census/skills/genomics-bioinformatics/single-cell/cellxgene-census ~/.claude/skills/cellxgene-census

Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

Definición

SKILL.md

# CZ CELLxGENE Census

## Overview

CZ CELLxGENE Census provides programmatic access to 61+ million standardized single-cell RNA-seq observations from human and mouse. It enables population-scale queries by cell type, tissue, disease, and donor metadata, returning expression data as AnnData objects or PyTorch dataloaders for ML workflows.

## When to Use

- Querying single-cell expression data across tissues, diseases, or cell types from a curated atlas
- Building reference datasets for cell type classification or marker gene discovery
- Training ML models on large-scale single-cell data (PyTorch integration)
- Comparing gene expression across conditions (e.g., COVID-19 vs healthy) at population scale
- Exploring what single-cell datasets are available for a tissue or disease of interest
- For **analyzing your own scRNA-seq data**, use scanpy instead
- For **manipulating AnnData objects** (subsetting, concatenation), use anndata instead

## Prerequisites

```bash
pip install cellxgene-census
# For ML workflows
pip install cellxgene-census[experimental]
```

**API Rate Limits**: Census uses TileDB-SOMA cloud backend. No explicit rate limit, but large queries (>1M cells) should use out-of-core processing (Module 4) to avoid memory exhaustion. Always use context managers for proper resource cleanup.

## Quick Start

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get B cells from lung
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
        obs_column_names=["cell_type", "disease", "donor_id"],
    )
    print(f"Retrieved {adata.n_obs} cells × {adata.n_vars} genes")
    # Retrieved ~15000 cells × 60664 genes
```

## Core API

### 1. Opening and Exploring the Census

Connect to Census and discover available data.

```python
import cellxgene_census

# Open latest stable version (always use context manager)
with cellxgene_census.open_soma() as census:
    # Summary statistics
    summary = census["census_info"]["summary"].read().concat().to_pandas()
    print(f"Total cells: {summary['total_cell_count'][0]:,}")

    # List all datasets
    datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    print(f"Total datasets: {len(datasets)}")
    print(datasets[["dataset_title", "cell_count"]].head())
```

```python
# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    # Reproducible analysis code here
    pass
```

### 2. Cell Metadata Queries

Query cell-level metadata without downloading expression data.

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get unique cell types in brain
    cell_metadata = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["cell_type", "disease", "assay"]
    )
    print(f"Total brain cells: {len(cell_metadata):,}")
    print(cell_metadata["cell_type"].value_counts().head(10))
```

```python
    # Gene metadata query
    gene_metadata = cellxgene_census.get_var(
        census,
        "homo_sapiens",
        value_filter="feature_name in ['CD4', 'CD8A', 'FOXP3']",
        column_names=["feature_id", "feature_name", "feature_length"]
    )
    print(gene_metadata)
    # Returns DataFrame with Ensembl IDs, gene symbols, and lengths
```

### 3. Expression Data Queries (Small-Medium Scale)

Retrieve expression matrices as AnnData objects for queries returning <100k cells.

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Query by cell type + tissue + disease
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
        obs_column_names=["cell_type", "tissue_general", "donor_id"],
    )
    print(f"Shape: {adata.shape}")  # (n_cells, 4)
    print(f"Metadata columns: {list(adata.obs.columns)}")
```

**Filter syntax reference**:
- Combine conditions: `and`, `or`
- Multiple values: `feature_name in ['CD4', 'CD8A']`
- Comparison: `cell_count > 1000`
- Always include `is_primary_data == True` to avoid duplicate cells

### 4. Large-Scale Out-of-Core Queries

Stream expression data in chunks for queries exceeding available RAM.

```python
import cellxgene_census
import tiledbsoma as soma

with cellxgene_census.open_soma() as census:
    # Estimate query size first
    metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["soma_joinid"]
    )
    n_cells = len(metadata)
    print(f"Query will return {n_cells:,} cells")

    # If >100k cells, use streaming
    query = census["census_data"]["homo_sapiens"].axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'brain' and is_primary_data == True"
        ),
        var_query=soma.AxisQuery(
            value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
        )
    )

    # Incremental statistics
    n_obs, total = 0, 0.0
    for batch in query.X("raw").tables():
        values = batch["soma_data"].to_numpy()
        n_obs += len(values)
        total += values.sum()

    print(f"Processed {n_obs:,} non-zero entries, mean={total/n_obs:.4f}")
```

### 5. Dataset Presence Matrix

Check which datasets measured specific genes (not all genes are in all datasets).

```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    presence = cellxgene_census.get_presence_matrix(
        census,
        "homo_sapiens",
        var_value_filter="feature_

Del mismo repositorio

sciagent-skill-creatorSkill

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill