Skill, 199 repo stars, updated 16d ago

cbioportal-database

Cancer genomics (TCGA et al.) via cBioPortal REST API. Retrieve somatic mutations, CNAs, expression, clinical data (survival/stage/treatment) across thousands of studies. Use for TMB, oncoprints, survival analysis. For population frequencies use gnomad-database; for drug-gene interactions use opentargets-database.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/cbioportal-database && cp -r /tmp/cbioportal-database/skills/genomics-bioinformatics/databases/cbioportal-database ~/.claude/skills/cbioportal-database
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# cBioPortal Database

## Overview

cBioPortal for Cancer Genomics is a public repository of cancer genomics data including TCGA, ICGC, and hundreds of curated studies spanning 100+ cancer types. It provides somatic mutation profiles, copy number alterations (CNA), gene expression, clinical data (survival, stage, treatment history), and methylation data for tens of thousands of patient samples. Data is accessible via a REST API at `https://www.cbioportal.org/api/` with no authentication required.

## When to Use

- Retrieving somatic mutation profiles (variant type, amino acid change) for a gene across TCGA studies
- Querying copy number alteration data (amplification, deep deletion) for candidate cancer driver genes
- Accessing clinical data — overall survival, disease-free survival, tumor stage — for survival curve analysis
- Identifying which cancer studies have molecular profiling data for a specific cancer type (e.g., breast, lung)
- Downloading gene expression (RNA-seq FPKM/RSEM) data from specific TCGA cohorts for differential expression analysis
- Correlating genomic alterations with clinical outcomes in a specific study
- Use `gnomad-database` instead when you need population-level variant allele frequencies in healthy individuals
- For drug-gene interaction lookups use `opentargets-database`; cBioPortal provides the genomic alteration data, not drug interaction annotations

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`
- **Data requirements**: HUGO gene symbols (e.g., `TP53`) or numeric Entrez gene IDs (e.g., `7157`), cBioPortal study IDs (e.g., `brca_tcga_pan_can_atlas_2018`), molecular profile IDs
- **Environment**: internet connection; no API key required
- **Rate limits**: no strict rate limits; use `time.sleep(0.2)` between batch requests for polite access

```bash
pip install requests pandas matplotlib
```
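
The rate-limit note above suggests pausing between batch requests. A minimal sketch of one way to do that, assuming a hypothetical helper `fetch_in_batches` (not part of cBioPortal or this skill) and a caller-supplied `fetch_fn` that performs one request per chunk:

```python
import time
from typing import Callable, Iterable, List, Sequence


def chunked(items: Sequence, size: int) -> Iterable[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(items: Sequence,
                     fetch_fn: Callable[[Sequence], List],
                     batch_size: int = 200,
                     pause: float = 0.2) -> List:
    """Call `fetch_fn` once per chunk and concatenate the returned lists.

    `fetch_fn` is a placeholder for whatever performs the actual request.
    Sleeps `pause` seconds between calls, but not before the first one.
    """
    results: List = []
    for i, chunk in enumerate(chunked(items, batch_size)):
        if i > 0:
            time.sleep(pause)
        results.extend(fetch_fn(chunk))
    return results
```

For the mutation query in Query 2, `fetch_fn` could wrap the POST call so that every sample ID is covered rather than only the first 200. The batch size of 200 is an arbitrary default here, not a documented API limit.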

## Quick Start

```python
import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    """GET request to cBioPortal REST API, returns parsed JSON."""
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

# List available cancer types
cancer_types = cbio_get("cancer-types")
print(f"Total cancer types: {len(cancer_types)}")
# Prints the number of cancer types; the exact count changes as the portal is updated

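# Find the TCGA PanCancer Atlas breast cancer study.
# TCGA study IDs start with the cancer code (e.g. "brca_tcga..."), so match the exact ID.
studies = cbio_get("studies", params={"keyword": "breast"})
brca = [s for s in studies if s["studyId"] == "brca_tcga_pan_can_atlas_2018"]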
if brca:
    s = brca[0]
    print(f"Study: {s['name']}")
    print(f"  studyId: {s['studyId']}")
    print(f"  Samples: {s['allSampleCount']}")
# Study: Breast Invasive Carcinoma (TCGA, PanCancer Atlas)
#   studyId: brca_tcga_pan_can_atlas_2018
#   Samples: 1084
```

## Core API

### Query 1: Cancer Types and Studies

List available cancer types and find studies by cancer type or keyword.

```python
import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

# Get all cancer types
cancer_types = cbio_get("cancer-types")
ct_df = pd.DataFrame(cancer_types)[["cancerTypeId", "name", "dedicatedColor"]]
print(f"Cancer types: {len(ct_df)}")
print(ct_df.head(5).to_string(index=False))

# Find all studies for a cancer type
lung_studies = cbio_get("studies", params={"keyword": "lung adenocarcinoma"})
print(f"\nLung adenocarcinoma studies: {len(lung_studies)}")
for s in lung_studies[:3]:
    print(f"  {s['studyId']:40s}  n={s['allSampleCount']}")
```

```python
# Get detailed study metadata including available data types
study_id = "brca_tcga_pan_can_atlas_2018"
study = cbio_get(f"studies/{study_id}")
print(f"Study: {study['name']}")
print(f"  Reference genome: {study.get('referenceGenome', 'n/a')}")
print(f"  All sample count: {study['allSampleCount']}")

# List molecular profiles for the study (study-scoped endpoint)
profiles = cbio_get(f"studies/{study_id}/molecular-profiles")
print(f"\nMolecular profiles ({len(profiles)} total):")
for p in profiles:
    print(f"  {p['molecularProfileId']:55s}  [{p['molecularAlterationType']}]")
```
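
The listing above prints each profile's `molecularAlterationType`. As a sketch, a small hypothetical helper `profile_ids_by_type` can pick profile IDs out of that list instead of relying on a naming pattern. The records below are made up for illustration and only mimic the shape of the API response. Type strings such as `MUTATION_EXTENDED` are values cBioPortal uses, but check the printed list for your own study.

```python
from typing import Dict, List


def profile_ids_by_type(profiles: List[Dict], alteration_type: str) -> List[str]:
    """Return molecularProfileId values whose molecularAlterationType matches."""
    return [p["molecularProfileId"] for p in profiles
            if p.get("molecularAlterationType") == alteration_type]


# Illustrative records shaped like the API response (not real query output)
example_profiles = [
    {"molecularProfileId": "demo_study_mutations",
     "molecularAlterationType": "MUTATION_EXTENDED"},
    {"molecularProfileId": "demo_study_gistic",
     "molecularAlterationType": "COPY_NUMBER_ALTERATION"},
    {"molecularProfileId": "demo_study_rna_seq_v2_mrna",
     "molecularAlterationType": "MRNA_EXPRESSION"},
]

print(profile_ids_by_type(example_profiles, "MUTATION_EXTENDED"))
# ['demo_study_mutations']
```

Returning a list rather than a single ID is deliberate: a study can carry several profiles of the same alteration type (for example, multiple expression normalizations), and the caller should choose among them explicitly.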

### Query 2: Somatic Mutations

Retrieve mutation data for a gene or set of genes in a study's mutation profile.

```python
import requests
import pandas as pd

BASE_URL = "https://www.cbioportal.org/api"

def cbio_post(endpoint, body):
    """POST request to cBioPortal REST API."""
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()

def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()

# Get all samples for a study
study_id = "brca_tcga_pan_can_atlas_2018"
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples]
print(f"Total samples: {len(sample_ids)}")

# Mutation profile IDs conventionally follow the pattern {studyId}_mutations;
# confirm against the study's molecular profile listing from Query 1
profile_id = f"{study_id}_mutations"

# Fetch mutations for TP53 (Entrez gene ID = 7157)
body = {
    "sampleIds": sample_ids[:200],   # first 200 samples
    "entrezGeneIds": [7157]           # TP53
}
mutations = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch", body)
print(f"TP53 mutations in first 200 samples: {len(mutations)}")

# Summarize by mutation type
mut_df = pd.DataFrame(mutations)
print("\nMutation type distribution:")
print(mut_df["mutationType"].value_counts().head(8).to_string())
# Example output (illustrative; actual counts depend on the samples and data release):
# Missense_Mutation    102
# Nonsense_Mutation     28
# Splice_Site           14
# Frame_Shift_Del       12
```
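
The mutation records returned above can be condensed into simple summaries. A minimal sketch, assuming each record carries `sampleId` and `proteinChange` fields as cBioPortal mutation responses do. The helpers `mutated_fraction` and `top_protein_changes` are hypothetical names, and the records are synthetic:

```python
from collections import Counter
from typing import Dict, List, Tuple


def mutated_fraction(mutations: List[Dict], n_samples_queried: int) -> float:
    """Fraction of queried samples that have at least one mutation record."""
    if n_samples_queried <= 0:
        raise ValueError("n_samples_queried must be positive")
    mutated_samples = {m["sampleId"] for m in mutations}
    return len(mutated_samples) / n_samples_queried


def top_protein_changes(mutations: List[Dict], k: int = 3) -> List[Tuple[str, int]]:
    """Most frequent proteinChange values with their record counts."""
    return Counter(m["proteinChange"] for m in mutations).most_common(k)


# Synthetic records for illustration only (not real query output)
demo = [
    {"sampleId": "S1", "proteinChange": "R175H"},
    {"sampleId": "S1", "proteinChange": "R248Q"},
    {"sampleId": "S2", "proteinChange": "R213*"},
    {"sampleId": "S3", "proteinChange": "R175H"},
]

print(mutated_fraction(demo, n_samples_queried=6))  # 0.5
print(top_protein_changes(demo, k=1))               # [('R175H', 2)]
```

The denominator should be the number of samples actually sent in the request (200 in the query above), not the study total, since a sample with several mutations appears as several records.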
