bio-comparative-genomics-hgt-detection

This skill detects horizontal gene transfer events in bacterial and archaeal genomes using HGTector, compositional analysis, and phylogenetic incongruence detection. Use it when identifying foreign genes that show anomalous nucleotide composition or unexpected phylogenetic placement within prokaryotic genomes.

Repository: OpenClaw-Medical-Skills

Install in Claude Code

git clone --depth 1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills /tmp/bio-comparative-genomics-hgt-detection && cp -r /tmp/bio-comparative-genomics-hgt-detection/skills/bio-comparative-genomics-hgt-detection ~/.claude/skills/bio-comparative-genomics-hgt-detection

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA

-->

---
name: bio-comparative-genomics-hgt-detection
description: Detect horizontal gene transfer events using HGTector, compositional analysis, and phylogenetic incongruence methods. Identify foreign genes in bacterial and archaeal genomes from anomalous composition or unexpected phylogenetic placement. Use when searching for horizontally transferred genes or analyzing genome evolution in prokaryotes.
tool_type: mixed
primary_tool: HGTector
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
  - read_file
  - run_shell_command
---

# Horizontal Gene Transfer Detection

## HGTector Workflow

```python
'''HGT detection with HGTector and compositional methods'''

import subprocess
import pandas as pd
import numpy as np
from Bio import SeqIO
from collections import Counter


def run_hgtector(proteome, taxonomy_db, output_dir, threads=4):
    '''Run HGTector for HGT detection

    HGTector uses BLAST-based phyletic distribution analysis:
    1. BLAST proteome against reference database
    2. Classify genes by taxonomic distribution
    3. Identify genes with unexpected phyletic patterns

    Requires:
    - NCBI taxonomy database
    - Reference protein database (e.g., RefSeq)
    '''
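    # Search the proteome against the reference database.
    # Arguments are passed as a list (no shell), so paths containing spaces
    # or shell metacharacters are safe; check=True raises on failure.
    # NOTE: flags follow HGTector 2 usage as understood here (-p = threads,
    # -t = taxdump directory); confirm with `hgtector search --help`.
    search_cmd = [
        'hgtector', 'search',
        '-i', str(proteome),
        '-o', f'{output_dir}/search',
        '-m', 'diamond',
        '-p', str(threads),
        '-d', 'refseq',  # reference protein database, as in the original
        '-t', str(taxonomy_db),
    ]
    subprocess.run(search_cmd, check=True)

    # Analyze the search results to score each gene
    analyze_cmd = [
        'hgtector', 'analyze',
        '-i', f'{output_dir}/search',
        '-o', f'{output_dir}/analyze',
        '-t', str(taxonomy_db),
    ]
    subprocess.run(analyze_cmd, check=True)

    return f'{output_dir}/analyze'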


def parse_hgtector_results(results_dir):
    '''Parse HGTector output for HGT candidates

    Output columns:
    - gene: Gene identifier
    - close: Score for close taxonomic matches
    - distal: Score for distal taxonomic matches
    - hgt: HGT prediction (1 = putative HGT)
    '''
    results_file = f'{results_dir}/scores.tsv'
    df = pd.read_csv(results_file, sep='\t')

    # Classify HGT candidates
    # distal > close suggests foreign origin
    df['hgt_score'] = df['distal'] - df['close']

    # Threshold: Higher positive score = stronger HGT signal
    # Score > 0.5: Moderate HGT evidence
    # Score > 1.0: Strong HGT evidence
    df['hgt_call'] = df['hgt_score'] > 0.5

    return df
```
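
The thresholds in the comments of `parse_hgtector_results` (score above 0.5 for moderate evidence, above 1.0 for strong evidence) can be illustrated on a toy table. This is a minimal sketch, not part of the skill itself: the gene names and score values are invented, and `tier_hgt_evidence` and its tier labels are hypothetical names introduced only for this example.

```python
import pandas as pd

# Hypothetical scores for three genes; the values are invented for
# illustration and are not real HGTector output.
toy_scores = pd.DataFrame({
    'gene': ['geneA', 'geneB', 'geneC'],
    'close': [2.0, 0.5, 0.25],
    'distal': [0.5, 1.25, 1.5],
})


def tier_hgt_evidence(df, moderate=0.5, strong=1.0):
    '''Label each gene by its distal - close score difference.

    The 0.5 and 1.0 cut-offs mirror the comments in
    parse_hgtector_results; the tier names are an assumption
    of this sketch.
    '''
    out = df.copy()
    out['hgt_score'] = out['distal'] - out['close']
    out['evidence'] = 'none'
    # Apply the weaker threshold first so the stronger one overrides it
    out.loc[out['hgt_score'] > moderate, 'evidence'] = 'moderate'
    out.loc[out['hgt_score'] > strong, 'evidence'] = 'strong'
    return out


tiered = tier_hgt_evidence(toy_scores)
print(tiered[['gene', 'hgt_score', 'evidence']])
```

Keeping the cut-offs as keyword arguments makes it easy to tighten or relax them for a given genome rather than treating 0.5 and 1.0 as fixed constants.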

## Compositional Analysis

```python
def calculate_gc_content(sequence):
    '''Calculate GC content of a sequence'''
    gc = sum(1 for nt in sequence.upper() if nt in 'GC')
    return gc / len(sequence) if sequence else 0


def calculate_codon_usage(cds_sequence):
    '''Calculate codon usage frequencies

    Foreign genes often have different codon usage
    reflecting their donor genome's bias
    '''
    if len(cds_sequence) % 3 != 0:
        return None

    codons = [cds_sequence[i:i+3] for i in range(0, len(cds_sequence) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())

    return {codon: count / total for codon, count in counts.items()}


def calculate_cai(gene_codons, reference_codons):
    '''Calculate Codon Adaptation Index

    CAI measures how well a gene matches the host codon usage
    Low CAI suggests foreign origin

    CAI < 0.5: Potentially foreign
    CAI 0.5-0.7: Intermediate
    CAI > 0.7: Native-like codon usage
    '''
    import math

    w_values = {}
    for aa_codons in group_synonymous_codons(reference_codons):
        max_freq = max(reference_codons.get(c, 0) for c in aa_codons)
        if max_freq > 0:
            for c in aa_codons:
                w_values[c] = reference_codons.get(c, 0) / max_freq

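    # CAI is the geometric mean of w over the gene's codons:
    # exp(sum(freq * ln(w)) / sum(freq)). Codons without a positive
    # reference weight (e.g., stop codons) are skipped.
    cai_sum = 0.0
    n = 0.0
    for codon, freq in gene_codons.items():
        if codon in w_values and w_values[codon] > 0:
            cai_sum += math.log(w_values[codon]) * freq
            n += freq

    # Normalize by the total frequency of the codons actually scored
    return math.exp(cai_sum / n) if n > 0 else 0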


def group_synonymous_codons(codon_usage):
    '''Group codons by amino acid'''
    genetic_code = {
        'F': ['TTT', 'TTC'], 'L': ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'],
        'I': ['ATT', 'ATC', 'ATA'], 'M': ['ATG'], 'V': ['GTT', 'GTC', 'GTA', 'GTG'],
        'S': ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'],
        'P': ['CCT', 'CCC', 'CCA', 'CCG'], 'T': ['ACT', 'ACC', 'ACA', 'ACG'],
        'A': ['GCT', 'GCC', 'GCA', 'GCG'], 'Y': ['TAT', 'TAC'],
        'H': ['CAT', 'CAC'], 'Q': ['CAA', 'CAG'], 'N': ['AAT', 'AAC'],
        'K': ['AAA', 'AAG'], 'D': ['GAT', 'GAC'], 'E': ['GAA', 'GAG'],
        'C': ['TGT', 'TGC'], 'W': ['TGG'], 'R': ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'],
        'G': ['GGT', 'GGC', 'GGA', 'GGG']
    }
    return list(genetic_code.values())


def detect_gc_anomalies(genome_fasta, cds_gff, window_size=5000):
    '''Detect regions with anomalous GC content

    Horizontally transferred regions often have
    different GC content than the host genome

    Threshold: >2 standard deviations from genome mean
    '''
    # Load genome
    genome = SeqIO.read(genome_fasta, 'fasta')
    genome_gc = calculate_gc_content(str(genome.seq))

    # Calculate windowed GC
    windows = []
    seq = str(genome.seq)
    # +1 so that a window ending exactly at the last base is included
    for i in range(0, len(seq) - window_size + 1, window_size // 2):
        window_seq = seq[i:i + window_size]
        gc = calculate_gc_content(window_seq)
        windows.append({
            'start': i,
            'end': i + window_size,
            'gc': gc
        })

    df = pd.DataFrame(windows)

    # Identify anomalies
    mean_gc = df['gc'].mean()
    std_gc = df['gc'].std()

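    # Z-score threshold: flag windows more than 2 standard deviations
    # from the mean windowed GC content, as stated in the docstring
    df['z_score'] = (df['gc'] - mean_gc) / std_gc
    df['anomalous'] = df['z_score'].abs() > 2

    return df
```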