# AnnData — Annotated Data Matrices for Single-Cell Genomics
## Overview
AnnData provides the standard data structure for single-cell genomics in the scverse ecosystem. It stores an observations-by-variables matrix (X) alongside cell metadata (obs), gene metadata (var), layers, embeddings (obsm/varm), graphs (obsp/varp), and unstructured metadata (uns). It supports sparse matrices, H5AD/Zarr storage, backed mode for large files, and integration with Scanpy, scvi-tools, and muon.
## When to Use
- Constructing annotated matrices from raw count data with cell/gene metadata
- Reading/writing `.h5ad` or `.zarr` files for single-cell experiments
- Subsetting cells by quality metrics, gene sets, or metadata conditions
- Concatenating multiple experimental batches with consistent metadata
- Storing multiple data layers (raw counts, normalized, scaled) in one object
- Working with large datasets exceeding RAM (backed mode, lazy concatenation)
- Preparing data for Scanpy or scvi-tools pipelines
- For single-cell **analysis** (clustering, DE, visualization), use `scanpy` instead
- For **probabilistic models**, use `scvi-tools` instead
## Prerequisites
- **Python packages**: `anndata`, `scipy`, `pandas`, `numpy`, `h5py` (HDF5 backend; installed automatically as an `anndata` dependency)
- **Optional**: `scanpy` (analysis, 10x Genomics readers), `zarr` (cloud storage)
- **Data requirements**: count matrices (dense or sparse), cell/gene metadata tables
```bash
pip install "anndata>=0.10"
# Full ecosystem
pip install anndata scanpy zarr
```
## Quick Start
```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
counts = csr_matrix(np.random.poisson(0.5, (500, 2000)).astype(np.float32))
obs = pd.DataFrame({"cell_type": np.random.choice(["T", "B", "NK"], 500)},
                   index=[f"cell_{i}" for i in range(500)])
var = pd.DataFrame(index=[f"ENSG{i:05d}" for i in range(2000)])
adata = ad.AnnData(X=counts, obs=obs, var=var)
adata.layers["raw_counts"] = counts.copy()
adata.write_h5ad("example.h5ad", compression="gzip")
print(f"Created: {adata.n_obs} cells x {adata.n_vars} genes")
# Created: 500 cells x 2000 genes
```
## Core API
### 1. Object Creation
Build AnnData objects from arrays, DataFrames, and sparse matrices.
```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
# Minimal: just a matrix
adata_min = ad.AnnData(X=np.random.rand(100, 50).astype(np.float32))
print(f"Minimal: {adata_min.shape}") # (100, 50)
# Full: sparse matrix + obs/var metadata
n_obs, n_vars = 300, 1000
X = csr_matrix(np.random.poisson(1, (n_obs, n_vars)).astype(np.float32))
obs = pd.DataFrame({"cell_type": np.random.choice(["T", "B", "Mono"], n_obs),
                    "batch": np.repeat(["ctrl", "stim"], n_obs // 2)},
                   index=[f"cell_{i}" for i in range(n_obs)])
var = pd.DataFrame({"gene_symbol": [f"Gene_{i}" for i in range(n_vars)],
                    "mt": [i < 13 for i in range(n_vars)]},
                   index=[f"ENSG{i:05d}" for i in range(n_vars)])
adata = ad.AnnData(X=X, obs=obs, var=var)
print(f"Full: {adata.shape}, obs cols: {list(adata.obs.columns)}")
# Full: (300, 1000), obs cols: ['cell_type', 'batch']
# From a pandas DataFrame (rows=obs, columns=vars)
df = pd.DataFrame(np.random.rand(50, 20),
                  index=[f"sample_{i}" for i in range(50)],
                  columns=[f"feature_{i}" for i in range(20)])
adata_df = ad.AnnData(df)
print(f"From DataFrame: {adata_df.shape}") # (50, 20)
```
### 2. I/O Operations
Read and write in multiple formats including backed mode for large files.
```python
import anndata as ad
import scanpy as sc  # the 10x Genomics readers live in scanpy, not anndata
# H5AD (native format, recommended for most use cases)
adata = ad.read_h5ad("data.h5ad")
adata.write_h5ad("output.h5ad", compression="gzip") # gzip: smaller files
# 10X Genomics formats (read via scanpy; both return AnnData objects)
adata_10x = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
# adata_mtx = sc.read_10x_mtx("filtered_feature_bc_matrix/")
# Zarr format (cloud-friendly, parallel I/O)
adata.write_zarr("output.zarr")
adata_zarr = ad.read_zarr("output.zarr")
# Other formats
# adata = ad.read_csv("expression.csv")
# adata = ad.read_loom("data.loom")
print(f"Loaded: {adata.n_obs} obs x {adata.n_vars} vars")
```
```python
import anndata as ad
# Backed mode: lazy loading for files larger than RAM
adata_backed = ad.read_h5ad("large_data.h5ad", backed="r") # read-only
print(f"Backed: {adata_backed.n_obs} obs, isbacked={adata_backed.isbacked}")
# Filter on metadata (no data loaded), then load subset into memory
subset = adata_backed[adata_backed.obs["tissue"] == "brain"].to_memory()
print(f"Loaded subset: {subset.n_obs} cells")
# Read-write backed mode: adata_rw = ad.read_h5ad("data.h5ad", backed="r+")
# Format conversion: ad.read_loom("data.loom").write_h5ad("out.h5ad", compression="gzip")
```
### 3. Subsetting and Views
Select cells and genes by indices, names, boolean masks, or metadata conditions.
```python
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
# Boolean mask (most common)
t_cells = adata[adata.obs["cell_type"] == "T_cell"]
print(f"T cells: {t_cells.n_obs}, is_view: {t_cells.is_view}") # is_view: True
# Integer index / name-based / combined axis
first_100 = adata[:100, :500]
selected = adata[["cell_0", "cell_1"], ["ENSG00000", "ENSG00001"]]
# Combined metadata conditions
# Column names and thresholds are illustrative; pct_mito is assumed to be a fraction (0-1)
high_quality = adata[
    (adata.obs["n_genes"] > 200) & (adata.obs["pct_mito"] < 0.2)
]
print(f"QC filter: {high_quality.n_obs} / {adata.n_obs} cells")
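# Views vs copies: subsetting returns a view (lightweight, shares data with the parent)
# .copy() creates an independent object; call it before modifying a subset,
# because writing to a view makes anndata copy it implicitly and emit a warning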
independent = adata[adata.obs["batch"] == "ctrl"].copy()
print(f"Is view: {independent.is_view}") # False
```
### 4. Layers, Embeddings, and Graphs
Store multiple data representations, dimensionality reductions, and cell-cell graphs.
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix
adata = ad.read_h5ad("data.h5ad")
# Layers: alter|
```