bio-comparative-genomics-synteny-analysis
This skill analyzes genome collinearity and syntenic blocks across species using MCScanX, SyRI, and JCVI tools. It detects conserved gene order, chromosomal rearrangements, and whole-genome duplications by processing GFF gene annotation files and BLASTP homology results. Use this skill when comparing genome structure between different species or identifying conserved genomic regions and syntenic relationships.
git clone --depth 1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills /tmp/bio-comparative-genomics-synteny-analysis && cp -r /tmp/bio-comparative-genomics-synteny-analysis/skills/bio-comparative-genomics-synteny-analysis ~/.claude/skills/bio-comparative-genomics-synteny-analysis
SKILL.md
<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
---
name: bio-comparative-genomics-synteny-analysis
description: Analyze genome collinearity and syntenic blocks using MCScanX, SyRI, and JCVI for comparative genomics. Detect conserved gene order, chromosomal rearrangements, and whole-genome duplications. Use when comparing genome structure between species or identifying conserved genomic regions.
tool_type: mixed
primary_tool: MCScanX
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
---
# Synteny Analysis
## MCScanX Workflow
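MCScanX takes its homology evidence from an all-vs-all BLASTP table, which the workflow below expects as `blast_file` but does not generate. The following is a minimal sketch of how that table could be produced with BLAST+. The helper names `build_blast_commands` and `run_all_vs_all_blast`, the file-naming scheme, and the defaults (E-value 1e-5, five hits per query, four threads) are assumptions for illustration, not part of the original skill.

```python
import subprocess


def build_blast_commands(protein_fasta, prefix, evalue=1e-5, max_hits=5, threads=4):
    """Return the makeblastdb and blastp argument lists (hypothetical helper)."""
    db_name = f'{prefix}_db'
    make_db = ['makeblastdb', '-in', protein_fasta, '-dbtype', 'prot', '-out', db_name]
    blastp = [
        'blastp',
        '-query', protein_fasta,
        '-db', db_name,
        '-outfmt', '6',                      # tabular output, one hit per line
        '-evalue', str(evalue),
        '-max_target_seqs', str(max_hits),   # assumed cap on hits per query
        '-num_threads', str(threads),
        '-out', f'{prefix}.blast',
    ]
    return make_db, blastp


def run_all_vs_all_blast(protein_fasta, prefix):
    """Run both commands in order; requires BLAST+ on PATH."""
    for cmd in build_blast_commands(protein_fasta, prefix):
        subprocess.run(cmd, check=True)
    return f'{prefix}.blast'
```

For a two-species comparison, `protein_fasta` would typically be the concatenated protein sets of both species, so that the table holds both within-species and between-species hits.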
```python
'''Synteny analysis with MCScanX and visualization'''
import subprocess
import pandas as pd
from collections import defaultdict
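def prepare_mcscanx_input(gff_file, fasta_file, species_prefix):
    '''Prepare input files for MCScanX

    MCScanX requires:
    1. .gff file: gene positions in four tab-separated columns
       (chromosome ID with species prefix, gene ID, start, end)
    2. .blast file: all-vs-all BLASTP results

    fasta_file is accepted but not used by this function.
    '''
    genes = []
    with open(gff_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.rstrip('\n').split('\t')
            # Skip blank or malformed lines (GFF3 has 9 columns)
            if len(parts) < 9 or parts[2] != 'gene' or 'ID=' not in parts[8]:
                continue
            chrom = parts[0]
            start, end = int(parts[3]), int(parts[4])
            gene_id = parts[8].split('ID=')[1].split(';')[0]
            # Prefix the chromosome with the species code so IDs stay unique
            genes.append(f'{species_prefix}{chrom}\t{gene_id}\t{start}\t{end}')
    with open(f'{species_prefix}.gff', 'w') as f:
        # End with a newline so the file can be concatenated safely
        f.write('\n'.join(genes) + '\n')
    return f'{species_prefix}.gff'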
def run_mcscanx(gff1, gff2, blast_file, output_prefix):
    '''Run MCScanX for synteny detection

    Key parameters:
    -k: Match score for collinear genes (default 50)
    -g: Gap penalty (default -1)
    -s: Minimum syntenic block size (default 5 genes)
    -e: E-value threshold for reporting an alignment (default 1e-5)
    -m: Maximum gaps allowed between collinear genes (default 25)
    '''
    # Combine GFF files, guaranteeing a newline between them
    with open(f'{output_prefix}.gff', 'w') as out:
        for path in (gff1, gff2):
            with open(path) as src:
                out.write(src.read().rstrip('\n') + '\n')
    # Copy BLAST file so it shares the prefix MCScanX reads
    if blast_file != f'{output_prefix}.blast':
        subprocess.run(['cp', blast_file, f'{output_prefix}.blast'], check=True)
    # Run MCScanX
    # -s 5: Minimum 5 genes per syntenic block (smaller = more noise)
    # -m 25: Maximum gaps allowed (larger = more relaxed blocks)
    cmd = ['MCScanX', '-s', '5', '-m', '25', output_prefix]
    subprocess.run(cmd, check=True)
    return f'{output_prefix}.collinearity'
def parse_collinearity(collinearity_file):
    '''Parse MCScanX collinearity output

    Output format:
    ## Alignment X: score=S e_value=E N=K chrA&chrB plus|minus
    X-Y: gene1 gene2 e_value
    '''
    blocks = []
    current_block = None
    with open(collinearity_file) as f:
        for line in f:
            if line.startswith('## Alignment'):
                if current_block:
                    blocks.append(current_block)
                parts = line.strip().split()
                # Scores are written as decimals (e.g. score=2058.0)
                score = float(parts[3].split('=')[1])
                e_value = float(parts[4].split('=')[1])
                # Accepts both "N=42" and a bare "42"
                n_genes = int(parts[5].split('=')[-1])
                current_block = {
                    'score': score,
                    'e_value': e_value,
                    'n_genes': n_genes,
                    'gene_pairs': []
                }
            elif current_block and not line.startswith('#') and ':' in line:
                # The "X-Y:" index may contain padding spaces, so split on
                # the first colon rather than on whitespace
                fields = line.split(':', 1)[1].split()
                if len(fields) >= 2:
                    gene1, gene2 = fields[0], fields[1]
                    current_block['gene_pairs'].append((gene1, gene2))
    if current_block:
        blocks.append(current_block)
    return blocks
def classify_synteny_type(blocks, species1_chroms, species2_chroms):
    '''Classify syntenic relationships for each species-1 chromosome

    Types:
    - 1:1: Direct orthology (conserved)
    - 1:many: Lineage-specific duplication
    - many:many: Ancient WGD or complex rearrangement

    species1_chroms and species2_chroms map gene ID -> chromosome.
    '''
    sp1_coverage = defaultdict(set)
    sp2_coverage = defaultdict(set)
    for block in blocks:
        for gene1, gene2 in block['gene_pairs']:
            chr1 = species1_chroms.get(gene1)
            chr2 = species2_chroms.get(gene2)
            if chr1 and chr2:
                sp1_coverage[chr1].add(chr2)
                sp2_coverage[chr2].add(chr1)
    classifications = []
    for chr1, partners in sp1_coverage.items():
        if len(partners) == 1:
            classifications.append(('1:1', chr1, next(iter(partners))))
        elif any(len(sp2_coverage[chr2]) > 1 for chr2 in partners):
            # A partner also maps back to several species-1 chromosomes
            classifications.append(('many:many', chr1, partners))
        else:
            classifications.append(('1:many', chr1, partners))
    return classifications
```
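`classify_synteny_type` expects two dictionaries mapping gene IDs to chromosomes, but the workflow above does not show how to build them. The following is a minimal sketch, assuming the same GFF3 conventions used in `prepare_mcscanx_input` (feature type `gene` in column 3, an `ID=` attribute in column 9). The helper names `gene_chromosomes_from_gff3` and `count_pairs_per_chromosome_pair` are hypothetical, and the toy records stand in for real annotation files and a parsed collinearity block.

```python
from collections import Counter


def gene_chromosomes_from_gff3(lines):
    """Map gene ID -> chromosome from GFF3 lines (any iterable of strings)."""
    gene_to_chrom = {}
    for line in lines:
        if line.startswith('#'):
            continue
        parts = line.rstrip('\n').split('\t')
        # Keep only well-formed gene records that carry an ID attribute
        if len(parts) < 9 or parts[2] != 'gene' or 'ID=' not in parts[8]:
            continue
        gene_id = parts[8].split('ID=')[1].split(';')[0]
        gene_to_chrom[gene_id] = parts[0]
    return gene_to_chrom


def count_pairs_per_chromosome_pair(blocks, species1_chroms, species2_chroms):
    """Count collinear gene pairs for each (chr1, chr2) combination."""
    counts = Counter()
    for block in blocks:
        for gene1, gene2 in block['gene_pairs']:
            chr1 = species1_chroms.get(gene1)
            chr2 = species2_chroms.get(gene2)
            if chr1 and chr2:
                counts[(chr1, chr2)] += 1
    return counts


# Toy data (made up for illustration)
sp1_gff = ['##gff-version 3\n',
           'Chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=A1;Name=a\n',
           'Chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=A1.t1;Parent=A1\n',
           'Chr1\tsrc\tgene\t1500\t2400\t.\t+\t.\tID=A2\n']
sp2_gff = ['chr3\tsrc\tgene\t50\t700\t.\t-\t.\tID=B1\n',
           'chr3\tsrc\tgene\t900\t1800\t.\t-\t.\tID=B2\n']
blocks = [{'score': 100.0, 'e_value': 1e-20, 'n_genes': 2,
           'gene_pairs': [('A1', 'B1'), ('A2', 'B2')]}]

species1_chroms = gene_chromosomes_from_gff3(sp1_gff)
species2_chroms = gene_chromosomes_from_gff3(sp2_gff)
pair_counts = count_pairs_per_chromosome_pair(blocks, species1_chroms, species2_chroms)
print(pair_counts)
```

With real data, an open file handle can be passed directly, for example `with open('sp1.gff3') as f: species1_chroms = gene_chromosomes_from_gff3(f)`. The per-chromosome-pair counts give a quick check on which chromosome pairs carry most of the collinear signal before classifying them.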
## SyRI for Structural Variants
```python
import subprocess

def run_syri(ref_genome, query_genome, alignment_file, output_prefix):
    '''Run SyRI for structural rearrangement identification

    SyRI detects:
    - Syntenic regions (SYN)
    - Inversions (INV)
    - Translocations (TRANS)
    - Duplications (DUP)
    - Insertions/Deletions (INS/DEL)

    Requires whole-genome alignment (minimap2 or MUMmer).
    alignment_file is accepted but unused: the alignment is generated here.
    '''
    # Align genomes with minimap2; --eqx writes =/X CIGAR operators,
    # which SyRI needs for SAM input
    sam_file = f'{output_prefix}.sam'
    with open(sam_file, 'w') as sam:
        subprocess.run(['minimap2', '-ax', 'asm5', '--eqx', ref_genome, query_genome],
                       stdout=sam, check=True)
    # Run SyRI (-F S: the alignment is in SAM format)
    syri_cmd = ['syri', '-c', sam_file, '-r', ref_genome, '-q', query_genome,
                '-F', 'S', '--prefix', output_prefix]
    subprocess.run(syri_cmd, check=True)
    return f'{output_prefix}syri.out'
def parse_syri_output(syri_file):
    '''Parse SyRI structural variant output'''
    def to_int(value):
        # SyRI writes "-" where a coordinate does not apply
        return int(value) if value != '-' else None

    variants = []
    with open(syri_file) as f:
        for line in f:
            parts = line.strip().split('\t')
            # Index 10 (annotation type) needs at least 11 columns
            if len(parts) >= 11:
                var_type = parts[10]
                ref_chr, ref_start, ref_end = parts[0], to_int(parts[1]), to_int(parts[2])
                # Assumption: the rest of this function is reconstructed; query
                # start/end are taken from the columns right after the query chromosome
                qry_chr, qry_start, qry_end = parts[5], to_int(parts[6]), to_int(parts[7])
                variants.append({'type': var_type,
                                 'ref': (ref_chr, ref_start, ref_end),
                                 'qry': (qry_chr, qry_start, qry_end)})
    return variants
```