bio-clip-seq-clip-preprocessing
This Claude Code skill preprocesses CLIP-seq data for RNA-protein interaction analysis by performing UMI extraction, adapter trimming, and PCR duplicate removal using umi_tools and cutadapt. Use this skill when preparing raw CLIP, iCLIP, or eCLIP sequencing reads before peak calling to remove technical artifacts and identify genuine cross-linked sites.
git clone --depth 1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills /tmp/bio-clip-seq-clip-preprocessing && cp -r /tmp/bio-clip-seq-clip-preprocessing/skills/bio-clip-seq-clip-preprocessing ~/.claude/skills/bio-clip-seq-clip-preprocessingSKILL.md
<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
---
name: bio-clip-seq-clip-preprocessing
description: Preprocess CLIP-seq data including adapter trimming, UMI extraction, and PCR duplicate removal. Use when preparing raw CLIP, iCLIP, or eCLIP reads for peak calling.
tool_type: cli
primary_tool: umi_tools
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
---
# CLIP-seq Preprocessing
## UMI Extraction (eCLIP/iCLIP)
```bash
# Extract UMI from read 1
umi_tools extract \
--stdin=reads_R1.fastq.gz \
--read2-in=reads_R2.fastq.gz \
--bc-pattern=NNNNNNNNNN \
--stdout=R1_umi.fastq.gz \
--read2-out=R2_umi.fastq.gz
# bc-pattern: UMI barcode pattern
# N = UMI base
# For eCLIP: typically 10-nt UMI in read 1
```
## Adapter Trimming
```bash
# Trim adapters after UMI extraction
cutadapt \
-a AGATCGGAAGAGCACACGTCT \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-m 18 \
-o trimmed_R1.fastq.gz \
-p trimmed_R2.fastq.gz \
R1_umi.fastq.gz R2_umi.fastq.gz
```
## Two-Pass Trimming (eCLIP)
```bash
# eCLIP protocol has inline adapters
# First pass: trim 3' adapter
cutadapt -a AGATCGGAAGAGC -m 18 -o pass1.fq.gz input.fq.gz
# Second pass: trim 5' adapter (read-through)
cutadapt -g AGATCGGAAGAGC -m 18 -o pass2.fq.gz pass1.fq.gz
```
## PCR Duplicate Removal
```bash
# After alignment, deduplicate using UMIs
umi_tools dedup \
--stdin=aligned.bam \
--stdout=deduped.bam \
--paired \
--method=unique
# Methods:
# unique: Exact UMI match
# cluster: Allow UMI mismatches (default)
# adjacency: Network-based clustering
```
## Python Preprocessing
```python
from umi_tools import UMIClusterer
import pysam
def count_umis_per_position(bam_path):
'''Count unique UMIs at each genomic position'''
from collections import defaultdict
position_umis = defaultdict(set)
with pysam.AlignmentFile(bam_path, 'rb') as bam:
for read in bam:
if read.is_unmapped:
continue
# Extract UMI from read name (added by umi_tools extract)
umi = read.query_name.split('_')[-1]
pos = (read.reference_name, read.reference_start)
position_umis[pos].add(umi)
return {pos: len(umis) for pos, umis in position_umis.items()}
```
## Quality Control
```python
def clip_qc(bam_path):
'''CLIP-seq specific QC metrics'''
import pysam
total = 0
unique_positions = set()
read_lengths = []
with pysam.AlignmentFile(bam_path, 'rb') as bam:
for read in bam:
if read.is_unmapped:
continue
total += 1
unique_positions.add((read.reference_name, read.reference_start))
read_lengths.append(read.query_length)
return {
'total_reads': total,
'unique_positions': len(unique_positions),
'mean_read_length': sum(read_lengths) / len(read_lengths),
'complexity': len(unique_positions) / total
}
```
## Related Skills
- clip-alignment - Align preprocessed reads
- read-qc/umi-processing - General UMI handling
- clip-peak-calling - Call peaks from aligned reads
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->Cloud laboratory platform for automated protein testing and validation. Use when designing proteins and needing experimental validation including binding assays, expression testing, thermostability measurements, enzyme activity assays, or protein sequence optimization. Also use for submitting experiments via API, tracking experiment status, downloading results, optimizing protein sequences for better expression using computational tools (NetSolP, SoluProt, SolubleMPNN, ESM), or managing protein design workflows with wet-lab validation.
Time-blind friendly planning, executive function support, and daily structure for ADHD brains. Specializes in realistic time estimation, dopamine-aware task design, and building systems that
This skill should be used for time series machine learning tasks including classification, regression, clustering, forecasting, anomaly detection, segmentation, and similarity search. Use when working with temporal data, sequential patterns, or time-indexed observations requiring specialized algorithms beyond standard ML approaches. Particularly suited for univariate and multivariate time series analysis with scikit-learn compatible APIs.
Browse the web for any task — research topics, read articles, interact with web apps, fill forms, take screenshots, extract data, and test web pages. Use whenever a browser would be useful, not just when the user explicitly asks.
AI驱动的综合健康分析系统,整合多维度健康数据、识别异常模式、预测健康风险、提供个性化建议。支持智能问答和AI健康报告生成。
Access AlphaFold's 200M+ AI-predicted protein structures. Retrieve structures by UniProt ID, download PDB/mmCIF files, analyze confidence metrics (pLDDT, PAE), for drug discovery and structural biology.