bio-comparative-genomics-synteny-analysis

This skill analyzes genome collinearity and syntenic blocks across species using MCScanX, SyRI, and JCVI tools. It detects conserved gene order, chromosomal rearrangements, and whole-genome duplications by processing GFF gene annotation files and BLASTP homology results. Use this skill when comparing genome structure between different species or identifying conserved genomic regions and syntenic relationships.

Repository: OpenClaw-Medical-Skills

Install in Claude Code

git clone --depth 1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills /tmp/bio-comparative-genomics-synteny-analysis && cp -r /tmp/bio-comparative-genomics-synteny-analysis/skills/bio-comparative-genomics-synteny-analysis ~/.claude/skills/bio-comparative-genomics-synteny-analysis

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA

-->

---
name: bio-comparative-genomics-synteny-analysis
description: Analyze genome collinearity and syntenic blocks using MCScanX, SyRI, and JCVI for comparative genomics. Detect conserved gene order, chromosomal rearrangements, and whole-genome duplications. Use when comparing genome structure between species or identifying conserved genomic regions.
tool_type: mixed
primary_tool: MCScanX
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
  - read_file
  - run_shell_command
---

# Synteny Analysis

## MCScanX Workflow
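
The workflow in this section expects an all-vs-all BLASTP table as input. As a minimal sketch of how that table might be produced, the helper below builds (but does not run) NCBI BLAST+ commands. The function name `build_blast_commands`, the file names, and the e-value, hit-count, and thread settings are illustrative assumptions rather than part of this skill, and the protein FASTA is assumed to hold the proteomes of both species.

```python
import shlex


def build_blast_commands(protein_fasta, out_prefix, evalue=1e-10, max_hits=5, threads=4):
    """Build makeblastdb and blastp argument lists for an all-vs-all search.

    Hypothetical helper: defaults are illustrative, not prescribed by the skill.
    """
    db = f'{out_prefix}_db'
    makedb = ['makeblastdb', '-in', protein_fasta, '-dbtype', 'prot', '-out', db]
    blastp = [
        'blastp', '-query', protein_fasta, '-db', db,
        '-outfmt', '6',                      # tabular output
        '-evalue', str(evalue),
        '-max_target_seqs', str(max_hits),   # keep only the best few hits per gene
        '-num_threads', str(threads),
        '-out', f'{out_prefix}.blast',       # same prefix as the MCScanX .gff
    ]
    return [makedb, blastp]


if __name__ == '__main__':
    for cmd in build_blast_commands('combined.pep.fa', 'xyz'):
        print(shlex.join(cmd))
```

Returning argument lists keeps the sketch testable without BLAST installed; each list can be passed to `subprocess.run(cmd, check=True)` when the tools are available.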

```python
'''Synteny analysis with MCScanX and visualization'''

import subprocess
import pandas as pd
from collections import defaultdict

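def prepare_mcscanx_input(gff_file, fasta_file, species_prefix):
    '''Prepare the gene-position file for MCScanX

    MCScanX requires:
    1. .gff file: gene positions in four tab-separated columns
       (chromosome  gene  start  end), where the chromosome name
       carries a short species prefix (e.g. at1)
    2. .blast file: all-vs-all BLASTP results (tabular format)

    fasta_file is accepted for interface compatibility but is not read here.
    '''
    genes = []
    with open(gff_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.rstrip('\n').split('\t')
            # Skip blank or malformed lines (GFF3 records have 9 columns)
            if len(parts) < 9 or parts[2] != 'gene':
                continue
            # Skip gene features that carry no ID attribute
            if 'ID=' not in parts[8]:
                continue
            chrom = parts[0]
            start, end = int(parts[3]), int(parts[4])
            gene_id = parts[8].split('ID=')[1].split(';')[0]
            # Prefix the chromosome with the species code, as MCScanX expects
            genes.append(f'{species_prefix}{chrom}\t{gene_id}\t{start}\t{end}')

    # End with a newline so files can be concatenated safely later
    with open(f'{species_prefix}.gff', 'w') as f:
        f.write('\n'.join(genes) + '\n')

    return f'{species_prefix}.gff'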


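def run_mcscanx(gff1, gff2, blast_file, output_prefix):
    '''Run MCScanX for synteny detection

    Key parameters:
    -k: Match score for collinear genes (default 50)
    -g: Gap penalty (default -1)
    -s: Minimum syntenic block size (default 5 genes)
    -e: E-value threshold for alignment significance (default 1e-5)
    -m: Maximum gaps allowed between collinear genes (default 25)
    '''
    # Combine GFF files, making sure each one ends with a newline
    with open(f'{output_prefix}.gff', 'w') as out:
        for path in (gff1, gff2):
            with open(path) as src:
                text = src.read()
            out.write(text)
            if text and not text.endswith('\n'):
                out.write('\n')

    # Copy BLAST file (argument list avoids shell quoting problems)
    subprocess.run(['cp', blast_file, f'{output_prefix}.blast'], check=True)

    # Run MCScanX
    # -s 5: Minimum 5 genes per syntenic block (smaller = more noise)
    # -m 25: Maximum gaps allowed (larger = more relaxed blocks)
    cmd = ['MCScanX', '-s', '5', '-m', '25', output_prefix]
    # check=True raises CalledProcessError if MCScanX exits with an error
    subprocess.run(cmd, check=True)

    return f'{output_prefix}.collinearity'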


def parse_collinearity(collinearity_file):
    '''Parse MCScanX collinearity output

    Output format:
    ## Alignment X: score=S e_value=E N=K chrA&chrB plus|minus
    X-Y: gene1  gene2  e_value
    '''
    blocks = []
    current_block = None

    with open(collinearity_file) as f:
        for line in f:
            if line.startswith('## Alignment'):
                if current_block:
                    blocks.append(current_block)
                # Collect the key=value fields (score, e_value, N) from the header
                fields = dict(p.split('=', 1) for p in line.split() if '=' in p)
                current_block = {
                    'score': float(fields['score']),  # scores are written as e.g. 2248.0
                    'e_value': float(fields['e_value']),
                    'n_genes': int(fields['N']),
                    'gene_pairs': []
                }
            elif line.startswith('#'):
                # Parameter and statistics header lines
                continue
            elif current_block and ':' in line:
                # The pair index (e.g. "  0-  0:") may contain spaces, so split
                # on the first colon and read the gene IDs from the remainder
                parts = line.split(':', 1)[1].split()
                if len(parts) >= 2:
                    gene1, gene2 = parts[0], parts[1]
                    current_block['gene_pairs'].append((gene1, gene2))

    if current_block:
        blocks.append(current_block)

    return blocks


def classify_synteny_type(blocks, species1_chroms, species2_chroms):
    '''Classify syntenic relationships at the chromosome level

    Types:
    - 1:1: Direct orthology (conserved)
    - 1:many: Lineage-specific duplication
    - many:many: Ancient WGD or complex rearrangement

    species1_chroms and species2_chroms map gene IDs to chromosome names.
    '''
    sp1_coverage = defaultdict(set)
    sp2_coverage = defaultdict(set)

    for block in blocks:
        for gene1, gene2 in block['gene_pairs']:
            chr1 = species1_chroms.get(gene1)
            chr2 = species2_chroms.get(gene2)
            if chr1 and chr2:
                sp1_coverage[chr1].add(chr2)
                sp2_coverage[chr2].add(chr1)

    classifications = []
    for chr1, partners in sp1_coverage.items():
        if len(partners) == 1:
            classifications.append(('1:1', chr1, next(iter(partners))))
        elif any(len(sp2_coverage[p]) > 1 for p in partners):
            # A partner chromosome also maps back to several species-1 chromosomes
            classifications.append(('many:many', chr1, partners))
        else:
            classifications.append(('1:many', chr1, partners))

    return classifications
```
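
The parsed blocks can also be used to look for duplication signal. A minimal sketch, assuming `blocks` has the structure returned by `parse_collinearity` above (a list of dicts with a `gene_pairs` list) and that the first gene of each pair belongs to species 1: it counts how many retained blocks each species-1 gene falls in, often called syntenic depth. The function names and the sample blocks are made up for illustration.

```python
from collections import Counter


def syntenic_depth(blocks, min_genes=5):
    """Count, per first-column gene, how many retained blocks contain it."""
    depth = Counter()
    for block in blocks:
        if len(block['gene_pairs']) < min_genes:
            continue  # drop short blocks, which are more likely to be noise
        # Count a gene once per block even if it appears in several pairs
        for gene in {g1 for g1, _ in block['gene_pairs']}:
            depth[gene] += 1
    return depth


def depth_histogram(depth):
    """Map each depth value to the number of genes observed at that depth."""
    return dict(sorted(Counter(depth.values()).items()))


if __name__ == '__main__':
    # Made-up blocks for illustration only
    example_blocks = [
        {'gene_pairs': [('a1', 'b1'), ('a2', 'b2')]},
        {'gene_pairs': [('a1', 'c1'), ('a2', 'c2'), ('a3', 'c3')]},
        {'gene_pairs': [('a9', 'b9')]},
    ]
    depth = syntenic_depth(example_blocks, min_genes=2)
    print(depth_histogram(depth))
```

A histogram dominated by depth 2 would be consistent with a duplication in the second genome, but this is a heuristic only; tandem duplicates and fragmented assemblies can produce similar patterns.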

## SyRI for Structural Variants
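
SyRI writes its annotations to a tab-separated `syri.out` table. As a rough illustration of what can be done with that table, the sketch below tallies region-level annotations and the reference span each type covers. It assumes the 12-column layout (reference chromosome, start, end, two sequence columns, query chromosome, start, end, ID, parent ID, annotation type, copy status); the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict

# Region-level annotation types; alignment rows (e.g. SYNAL) and
# small variants (SNP, INS, DEL) are ignored in this sketch
STRUCTURAL_TYPES = {'SYN', 'INV', 'TRANS', 'INVTR', 'DUP', 'INVDP'}


def summarize_syri_rows(rows):
    """Return {type: {'count': n, 'ref_bp': total reference span}}."""
    summary = defaultdict(lambda: {'count': 0, 'ref_bp': 0})
    for row in rows:
        parts = row.rstrip('\n').split('\t')
        if len(parts) < 11 or parts[10] not in STRUCTURAL_TYPES:
            continue
        if parts[1] == '-' or parts[2] == '-':
            continue  # no reference coordinates on this row
        start, end = int(parts[1]), int(parts[2])
        entry = summary[parts[10]]
        entry['count'] += 1
        entry['ref_bp'] += abs(end - start) + 1  # coordinates are inclusive
    return dict(summary)


if __name__ == '__main__':
    # Made-up rows for illustration only
    sample = [
        'Chr1\t1\t1000\t-\t-\tChr1\t1\t1000\tSYN1\t-\tSYN\t-',
        'Chr1\t1001\t1500\t-\t-\tChr1\t1500\t1001\tINV1\t-\tINV\t-',
        'Chr1\t1\t500\t-\t-\tChr1\t1\t500\tSYNAL1\tSYN1\tSYNAL\t-',
        'Chr1\t2000\t2999\t-\t-\tChr1\t1600\t2599\tSYN2\t-\tSYN\t-',
    ]
    print(summarize_syri_rows(sample))
```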

```python
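def run_syri(ref_genome, query_genome, alignment_file, output_prefix):
    '''Run SyRI for structural rearrangement identification

    SyRI detects:
    - Syntenic regions (SYN)
    - Inversions (INV)
    - Translocations (TRANS)
    - Duplications (DUP)
    - Insertions/Deletions (INS/DEL)

    Requires whole-genome alignment (minimap2 or MUMmer).
    alignment_file is currently unused: the alignment is regenerated below.
    '''
    # Align genomes with minimap2
    # --eqx writes =/X CIGAR operations, as in SyRI's documented minimap2 pipeline
    align_cmd = ['minimap2', '-ax', 'asm5', '--eqx', ref_genome, query_genome]
    with open(f'{output_prefix}.sam', 'w') as sam:
        subprocess.run(align_cmd, stdout=sam, check=True)

    # Run SyRI (-F S: the alignment file is in SAM format)
    syri_cmd = ['syri', '-c', f'{output_prefix}.sam', '-r', ref_genome,
                '-q', query_genome, '-F', 'S', '--prefix', output_prefix]
    subprocess.run(syri_cmd, check=True)

    # --prefix is prepended directly to SyRI's default output file name
    return f'{output_prefix}syri.out'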


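def parse_syri_output(syri_file):
    '''Parse SyRI structural variant output'''
    variants = []

    with open(syri_file) as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            # At least 11 columns are needed to read the annotation type at index 10
            if len(parts) >= 11:
                var_type = parts[10]
                # Rows lacking coordinates on one side use '-' placeholders
                if '-' in (parts[1], parts[2], parts[6], parts[7]):
                    continue
                ref_chr, ref_start, ref_end = parts[0], int(parts[1]), int(parts[2])
                qry_chr, qry_start, qry_end = parts[5], int(parts[6]), int(parts[7])
                # The source listing is cut off after the line above; what follows
                # is a minimal completion, not the original author's code
                variants.append({
                    'type': var_type,
                    'ref_chr': ref_chr, 'ref_start': ref_start, 'ref_end': ref_end,
                    'qry_chr': qry_chr, 'qry_start': qry_start, 'qry_end': qry_end,
                })

    return variants
```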
