cellxgene-census
Query CELLxGENE Census (61M+ cells). Search by cell type/tissue/disease/organism; get AnnData, stream out-of-core, train PyTorch models. For your own data use scanpy; for annotated data use anndata.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/cellxgene-census && cp -r /tmp/cellxgene-census/skills/genomics-bioinformatics/single-cell/cellxgene-census ~/.claude/skills/cellxgene-census
# CZ CELLxGENE Census
## Overview
CZ CELLxGENE Census provides programmatic access to 61+ million standardized single-cell RNA-seq observations from human and mouse. It enables population-scale queries by cell type, tissue, disease, and donor metadata, returning expression data as AnnData objects or PyTorch dataloaders for ML workflows.
## When to Use
- Querying single-cell expression data across tissues, diseases, or cell types from a curated atlas
- Building reference datasets for cell type classification or marker gene discovery
- Training ML models on large-scale single-cell data (PyTorch integration)
- Comparing gene expression across conditions (e.g., COVID-19 vs healthy) at population scale
- Exploring what single-cell datasets are available for a tissue or disease of interest
- For **analyzing your own scRNA-seq data**, use scanpy instead
- For **manipulating AnnData objects** (subsetting, concatenation), use anndata instead
## Prerequisites
```bash
pip install cellxgene-census
# For ML workflows
pip install "cellxgene-census[experimental]"
```
**API Rate Limits**: Census uses a TileDB-SOMA cloud backend. There is no explicit rate limit, but large queries (>1M cells) should use out-of-core processing (Section 4, Large-Scale Out-of-Core Queries) to avoid memory exhaustion. Always open the Census with a context manager so that resources are released properly.
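To judge whether a query is large enough to need out-of-core processing, a back-of-envelope memory estimate can help. The sketch below is illustrative only: `estimate_csr_gib` is a hypothetical helper, and the 4-byte values, 4-byte indices, and 5% density are assumptions rather than documented Census or TileDB-SOMA behavior. Real memory use also depends on dtypes and intermediate copies.

```python
def estimate_csr_gib(n_cells, n_genes, density, value_bytes=4, index_bytes=4):
    """Rough size in GiB of a CSR sparse matrix (hypothetical helper).

    Assumes one stored value and one column index per non-zero entry,
    plus a row-pointer array with n_cells + 1 entries.
    """
    nnz = n_cells * n_genes * density
    total_bytes = nnz * (value_bytes + index_bytes) + (n_cells + 1) * index_bytes
    return total_bytes / 2**30


# Assumed example: 1M cells x 60,000 genes with 5% non-zero entries.
size_gib = estimate_csr_gib(1_000_000, 60_000, density=0.05)
print(f"Estimated sparse matrix size: {size_gib:.1f} GiB")
```

If the estimate approaches the available RAM, streaming the data in chunks is the safer choice.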
## Quick Start
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get B cells from lung
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
        obs_column_names=["cell_type", "disease", "donor_id"],
    )

print(f"Retrieved {adata.n_obs} cells × {adata.n_vars} genes")
# Example output: Retrieved ~15000 cells × 60664 genes (counts vary by Census version)
```
## Core API
### 1. Opening and Exploring the Census
Connect to Census and discover available data.
```python
import cellxgene_census

# Open latest stable version (always use context manager)
with cellxgene_census.open_soma() as census:
    # Summary statistics: the table holds one row per statistic ("label", "value")
    summary = census["census_info"]["summary"].read().concat().to_pandas()
    stats = summary.set_index("label")["value"]
    print(f"Total cells: {int(stats['total_cell_count']):,}")

    # List all datasets
    datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    print(f"Total datasets: {len(datasets)}")
    print(datasets[["dataset_title", "dataset_total_cell_count"]].head())
```
```python
import cellxgene_census

# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    # Reproducible analysis code here
    pass
```
### 2. Cell Metadata Queries
Query cell-level metadata without downloading expression data.
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get unique cell types in brain
    cell_metadata = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["cell_type", "disease", "assay"],
    )

print(f"Total brain cells: {len(cell_metadata):,}")
print(cell_metadata["cell_type"].value_counts().head(10))
```
```python
import cellxgene_census

# Gene metadata query (needs an open Census handle, like the obs query)
with cellxgene_census.open_soma() as census:
    gene_metadata = cellxgene_census.get_var(
        census,
        "homo_sapiens",
        value_filter="feature_name in ['CD4', 'CD8A', 'FOXP3']",
        column_names=["feature_id", "feature_name", "feature_length"],
    )

print(gene_metadata)
# Returns DataFrame with Ensembl IDs, gene symbols, and lengths
```
### 3. Expression Data Queries (Small-Medium Scale)
Retrieve expression matrices as AnnData objects for queries returning <100k cells.
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Query by cell type + tissue + disease
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
        obs_column_names=["cell_type", "tissue_general", "donor_id"],
    )

print(f"Shape: {adata.shape}")  # (n_cells, 4)
print(f"Metadata columns: {list(adata.obs.columns)}")
```
**Filter syntax reference**:
- Combine conditions: `and`, `or`
- Multiple values: `feature_name in ['CD4', 'CD8A']`
- Comparison on numeric columns: `nnz > 1000` (also `>=`, `<`, `<=`)
- Always include `is_primary_data == True` to avoid duplicate cells
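
Filter strings built by hand are easy to get wrong once several conditions are combined. The sketch below composes one from keyword arguments using the syntax listed above. `build_value_filter` is a hypothetical helper, not part of the `cellxgene_census` API, and it covers only equality and `in` clauses.

```python
def build_value_filter(primary_only=True, **conditions):
    """Compose a value_filter string from column=value pairs (hypothetical helper)."""
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            # Several allowed values become an `in [...]` clause.
            items = ", ".join(repr(v) for v in value)
            clauses.append(f"{column} in [{items}]")
        else:
            clauses.append(f"{column} == {value!r}")
    if primary_only:
        # Appended by default to avoid counting duplicate cells.
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


obs_filter = build_value_filter(cell_type="B cell", tissue_general=["lung", "blood"])
print(obs_filter)
# cell_type == 'B cell' and tissue_general in ['lung', 'blood'] and is_primary_data == True
```

The result can be passed as `obs_value_filter` to `get_anndata` or as `value_filter` to `get_obs`. Values containing a single quote are rendered with double quotes by `repr`, so check such filters before use.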
### 4. Large-Scale Out-of-Core Queries
Stream expression data in chunks for queries exceeding available RAM.
```python
import cellxgene_census
import tiledbsoma as soma

with cellxgene_census.open_soma() as census:
    # Estimate query size first
    metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["soma_joinid"],
    )
    n_cells = len(metadata)
    print(f"Query will return {n_cells:,} cells")

    # If >100k cells, use streaming; the query is closed when the block exits
    with census["census_data"]["homo_sapiens"].axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'brain' and is_primary_data == True"
        ),
        var_query=soma.AxisQuery(
            value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
        ),
    ) as query:
        # Incremental statistics over the stored (non-zero) entries
        n_obs, total = 0, 0.0
        for batch in query.X("raw").tables():
            values = batch["soma_data"].to_numpy()
            n_obs += len(values)
            total += values.sum()

if n_obs:
    print(f"Processed {n_obs:,} non-zero entries, mean={total/n_obs:.4f}")
```
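
The loop above averages only the stored non-zero entries. If the mean and variance over all entries of the matrix are needed, the implicit zeros must be counted in the denominator. The following is a minimal sketch of that accumulation on toy data; `streaming_mean_var` is a hypothetical helper, and the nested lists stand in for the `soma_data` batches.

```python
def streaming_mean_var(chunks, n_total):
    """Mean and population variance over n_total entries, given only the
    non-zero values, supplied chunk by chunk (hypothetical helper)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    total, total_sq, n_nonzero = 0.0, 0.0, 0
    for chunk in chunks:
        for value in chunk:
            total += value
            total_sq += value * value
            n_nonzero += 1
    if n_nonzero > n_total:
        raise ValueError("more stored values than matrix entries")
    # Zeros add nothing to either sum, so only the denominator changes.
    mean = total / n_total
    variance = total_sq / n_total - mean * mean
    return mean, variance, n_nonzero


# Toy stand-in: 3 cells x 2 genes = 6 entries, 4 of them non-zero.
chunks = [[1.0, 3.0], [2.0, 2.0]]
mean, variance, n_nonzero = streaming_mean_var(chunks, n_total=6)
print(f"mean={mean:.4f}, variance={variance:.4f}, non-zero={n_nonzero}")
```

For a Census query, `n_total` would be the number of matching cells times the number of selected genes. The sum-of-squares formula can lose precision when values are large; Welford's algorithm is a more stable alternative.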
### 5. Dataset Presence Matrix
Check which datasets measured specific genes (not all genes are in all datasets).
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    presence = cellxgene_census.get_presence_matrix(
        census,
        "homo_sapiens",
        var_value_filter="feature_
```