bio-comparative-genomics-hgt-detection
This skill detects horizontal gene transfer events in bacterial and archaeal genomes using HGTector, compositional analysis, and phylogenetic incongruence detection. Use it when identifying foreign genes that show anomalous nucleotide composition or unexpected phylogenetic placement within prokaryotic genomes.
git clone --depth 1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills /tmp/bio-comparative-genomics-hgt-detection && cp -r /tmp/bio-comparative-genomics-hgt-detection/skills/bio-comparative-genomics-hgt-detection ~/.claude/skills/bio-comparative-genomics-hgt-detection
<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
---
name: bio-comparative-genomics-hgt-detection
description: Detect horizontal gene transfer events using HGTector, compositional analysis, and phylogenetic incongruence methods. Identify foreign genes in bacterial and archaeal genomes from anomalous composition or unexpected phylogenetic placement. Use when searching for horizontally transferred genes or analyzing genome evolution in prokaryotes.
tool_type: mixed
primary_tool: HGTector
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
---
# Horizontal Gene Transfer Detection
## HGTector Workflow
```python
'''HGT detection with HGTector and compositional methods'''
import subprocess
import pandas as pd
import numpy as np
from Bio import SeqIO
from collections import Counter


def run_hgtector(proteome, taxonomy_db, output_dir, threads=4):
    '''Run HGTector for HGT detection

    HGTector uses BLAST-based phyletic distribution analysis:
    1. BLAST proteome against reference database
    2. Classify genes by taxonomic distribution
    3. Identify genes with unexpected phyletic patterns

    Requires:
    - NCBI taxonomy database
    - Reference protein database (e.g., RefSeq)
    '''
    # Search against database
    # In HGTector 2, -p sets the thread count (-t is the taxdump directory);
    # check `hgtector search --help` for your installed version.
    search_cmd = f'''hgtector search \\
        -i {proteome} \\
        -o {output_dir}/search \\
        -m diamond \\
        -p {threads} \\
        -d refseq'''
    # check=True raises CalledProcessError if the command fails
    subprocess.run(search_cmd, shell=True, check=True)

    # Analyze results
    analyze_cmd = f'''hgtector analyze \\
        -i {output_dir}/search \\
        -o {output_dir}/analyze \\
        -t {taxonomy_db}'''
    subprocess.run(analyze_cmd, shell=True, check=True)

    return f'{output_dir}/analyze'


def parse_hgtector_results(results_dir):
    '''Parse HGTector output for HGT candidates

    Output columns:
    - gene: Gene identifier
    - close: Score for close taxonomic matches
    - distal: Score for distal taxonomic matches
    - hgt: HGT prediction (1 = putative HGT)
    '''
    results_file = f'{results_dir}/scores.tsv'
    df = pd.read_csv(results_file, sep='\t')

    # Classify HGT candidates
    # distal > close suggests foreign origin
    df['hgt_score'] = df['distal'] - df['close']

    # Threshold: Higher positive score = stronger HGT signal
    # Score > 0.5: Moderate HGT evidence
    # Score > 1.0: Strong HGT evidence
    df['hgt_call'] = df['hgt_score'] > 0.5

    return df
```
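
The scoring rule used in `parse_hgtector_results` (distal minus close, called at 0.5) can be tried without running HGTector. The following is a minimal sketch using only the standard library. The gene names and score values are invented for illustration, and `score_hgt_candidates` and `TOY_SCORES_TSV` are hypothetical names, not part of HGTector.

```python
import csv
import io

# Toy stand-in for a scores table; all values are invented for illustration.
TOY_SCORES_TSV = (
    "gene\tclose\tdistal\n"
    "geneA\t2.0\t0.5\n"
    "geneB\t0.25\t1.0\n"
    "geneC\t0.5\t2.0\n"
)


def score_hgt_candidates(tsv_text, threshold=0.5):
    """Return one dict per gene with hgt_score = distal - close and a boolean call."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter='\t'):
        score = float(row['distal']) - float(row['close'])
        rows.append({
            'gene': row['gene'],
            'hgt_score': score,
            'hgt_call': score > threshold,  # same 0.5 cutoff as above
        })
    return rows


for r in score_hgt_candidates(TOY_SCORES_TSV):
    print(r['gene'], r['hgt_score'], r['hgt_call'])
```

Here only `geneA`, whose close score exceeds its distal score, falls below the cutoff. The cutoff itself is a heuristic and is worth tuning against genes of known origin.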
## Compositional Analysis
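
Compositional methods look for sequence whose base composition departs from the rest of the genome. As orientation for the helpers in this section, here is a minimal, self-contained sketch of the windowed GC z-score idea using only the standard library. The toy sequence, the window size, and the helper names `gc_fraction` and `gc_zscores` are assumptions for illustration. Real genomes need much larger windows.

```python
from statistics import mean, stdev


def gc_fraction(seq):
    """Fraction of G/C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(nt in 'GC' for nt in seq) / len(seq) if seq else 0.0


def gc_zscores(seq, window_size, step):
    """Return (start, gc, z) for each full window along seq."""
    starts = range(0, len(seq) - window_size + 1, step)
    gcs = [gc_fraction(seq[s:s + window_size]) for s in starts]
    mu, sd = mean(gcs), stdev(gcs)
    # Guard against a zero standard deviation (perfectly uniform GC)
    return [(s, gc, (gc - mu) / sd if sd > 0 else 0.0)
            for s, gc in zip(starts, gcs)]


# Toy sequence: AT-rich background with one GC-rich insert (invented data)
toy = 'AT' * 50 + 'GC' * 10 + 'AT' * 50
windows = gc_zscores(toy, window_size=20, step=20)

# Flag windows more than 2 standard deviations from the mean window GC
flagged = [start for start, gc, z in windows if abs(z) > 2]
print(flagged)
```

Only the window covering the GC-rich insert is flagged. A GC anomaly is suggestive rather than conclusive, since ancient transfers ameliorate toward host composition and some native regions are compositionally atypical.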
```python
def calculate_gc_content(sequence):
    '''Calculate GC content of a sequence'''
    gc = sum(1 for nt in sequence.upper() if nt in 'GC')
    return gc / len(sequence) if sequence else 0


def calculate_codon_usage(cds_sequence):
    '''Calculate codon usage frequencies

    Foreign genes often have different codon usage
    reflecting their donor genome's bias
    '''
    if len(cds_sequence) % 3 != 0:
        return None

    codons = [cds_sequence[i:i+3] for i in range(0, len(cds_sequence) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())

    return {codon: count / total for codon, count in counts.items()}


def calculate_cai(gene_codons, reference_codons):
    '''Calculate Codon Adaptation Index

    CAI measures how well a gene matches the host codon usage
    Low CAI suggests foreign origin

    CAI < 0.5: Potentially foreign
    CAI 0.5-0.7: Intermediate
    CAI > 0.7: Native-like codon usage
    '''
    import math

    # Relative adaptiveness w: codon frequency divided by the frequency
    # of the most-used synonymous codon in the reference set
    w_values = {}
    for aa_codons in group_synonymous_codons(reference_codons):
        max_freq = max(reference_codons.get(c, 0) for c in aa_codons)
        if max_freq > 0:
            for c in aa_codons:
                w_values[c] = reference_codons.get(c, 0) / max_freq

    # Frequency-weighted sum of log(w) over the gene's scorable codons
    cai_sum = 0
    n = 0
    for codon, freq in gene_codons.items():
        if codon in w_values and w_values[codon] > 0:
            cai_sum += math.log(w_values[codon]) * freq
            n += freq

    # Normalize by n so skipped codons do not bias the geometric mean
    return math.exp(cai_sum / n) if n > 0 else 0


def group_synonymous_codons(codon_usage):
    '''Group codons by amino acid'''
    genetic_code = {
        'F': ['TTT', 'TTC'], 'L': ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'],
        'I': ['ATT', 'ATC', 'ATA'], 'M': ['ATG'], 'V': ['GTT', 'GTC', 'GTA', 'GTG'],
        'S': ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'],
        'P': ['CCT', 'CCC', 'CCA', 'CCG'], 'T': ['ACT', 'ACC', 'ACA', 'ACG'],
        'A': ['GCT', 'GCC', 'GCA', 'GCG'], 'Y': ['TAT', 'TAC'],
        'H': ['CAT', 'CAC'], 'Q': ['CAA', 'CAG'], 'N': ['AAT', 'AAC'],
        'K': ['AAA', 'AAG'], 'D': ['GAT', 'GAC'], 'E': ['GAA', 'GAG'],
        'C': ['TGT', 'TGC'], 'W': ['TGG'], 'R': ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'],
        'G': ['GGT', 'GGC', 'GGA', 'GGG']
    }
    return list(genetic_code.values())


def detect_gc_anomalies(genome_fasta, cds_gff, window_size=5000):
    '''Detect regions with anomalous GC content

    Horizontally transferred regions often have
    different GC content than the host genome

    Threshold: >2 standard deviations from genome mean
    '''
    # Load genome (SeqIO.read expects a single-record FASTA)
    genome = SeqIO.read(genome_fasta, 'fasta')
    genome_gc = calculate_gc_content(str(genome.seq))

    # Calculate windowed GC (windows overlap by half their length)
    windows = []
    seq = str(genome.seq)
    for i in range(0, len(seq) - window_size + 1, window_size // 2):
        window_seq = seq[i:i + window_size]
        gc = calculate_gc_content(window_seq)
        windows.append({
            'start': i,
            'end': i + window_size,
            'gc': gc
        })

    df = pd.DataFrame(windows)

    # Identify anomalies
    mean_gc = df['gc'].mean()
    std_gc = df['gc'].std()

    # Z-score threshold: the source breaks off here; the lines below are an
    # assumed completion based on the docstring (|z| > 2 flags a window)
    df['z_score'] = (df['gc'] - mean_gc) / std_gc
    df['anomalous'] = df['z_score'].abs() > 2

    return df
```