
bio-comparative-genomics-ancestral-reconstruction

This skill reconstructs ancestral protein and nucleotide sequences at internal nodes of phylogenetic trees using PAML and IQ-TREE marginal likelihood methods. It infers ancient sequences by analyzing evolutionary relationships and posterior probabilities across alignment positions. Use this when resurrecting extinct proteins, tracing molecular evolution across deep time, or inferring functional states at specific evolutionary divergence points.

View source Repository: OpenClaw-Medical-Skills

Install in Claude Code

git clone --depth 1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills /tmp/bio-comparative-genomics-ancestral-reconstruction && cp -r /tmp/bio-comparative-genomics-ancestral-reconstruction/skills/bio-comparative-genomics-ancestral-reconstruction ~/.claude/skills/bio-comparative-genomics-ancestral-reconstruction

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA

-->

---
name: bio-comparative-genomics-ancestral-reconstruction
description: Reconstruct ancestral sequences at phylogenetic nodes using PAML and IQ-TREE marginal likelihood methods. Infer ancient protein sequences and trace evolutionary trajectories through sequence history. Use when inferring ancestral states for protein resurrection or tracing evolutionary history.
tool_type: mixed
primary_tool: PAML
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
  - read_file
  - run_shell_command
---

# Ancestral Sequence Reconstruction

## PAML Ancestral Reconstruction

```python
'''Ancestral sequence reconstruction with PAML codeml/baseml'''

import subprocess
import re
from Bio import SeqIO
from Bio.Seq import Seq


def create_asr_control(alignment, tree, output_dir, seq_type='protein'):
    '''Create control file for ancestral reconstruction

    RateAncestor = 1: Enable ancestral reconstruction
    Generates RST file with ancestral sequences

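    For codons: Use codeml with seqtype = 1
    For amino acids: Use codeml with seqtype = 2
    For nucleotides: Use baseml (control file not generated by this function)
    '''
    if seq_type not in ('protein', 'codon'):
        # Anything else would silently get the codon control file below
        raise ValueError(f"Unsupported seq_type {seq_type!r}; use 'protein' or 'codon'")
    if seq_type == 'protein':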
        ctl = f'''
      seqfile = {alignment}
     treefile = {tree}
      outfile = {output_dir}/asr.mlc

      seqtype = 2
        model = 3
    aaRatefile = wag.dat

 RateAncestor = 1
    cleandata = 0
        '''
    else:  # codon
        ctl = f'''
      seqfile = {alignment}
     treefile = {tree}
      outfile = {output_dir}/asr.mlc

      seqtype = 1
    CodonFreq = 2
        model = 0
      NSsites = 0

 RateAncestor = 1
    cleandata = 0
        '''

    ctl_file = f'{output_dir}/asr.ctl'
    with open(ctl_file, 'w') as f:
        f.write(ctl)

    return ctl_file


def parse_rst_file(rst_file):
    '''Parse PAML RST file for ancestral sequences

    RST contains:
    - Tree with node numbers
    - Ancestral sequences at each node
    - Posterior probabilities for each site

    Node numbering: Extant sequences first, then internal nodes

    Assumes each reconstructed sequence is written on a single line of the
    form "node #<N>   <sequence in space-separated chunks>".
    '''
    ancestors = {}
    node_line = re.compile(r'^node\s*#\s*(\d+)\s+(\S.*)$')

    with open(rst_file) as f:
        in_reconstruction = False
        for line in f:
            # Only read sequences that follow the reconstruction header
            if 'Ancestral reconstruction by' in line:
                in_reconstruction = True
                continue
            if not in_reconstruction:
                continue
            match = node_line.match(line)
            if match:
                # The sequence follows the node label on the same line;
                # drop the spaces between codons or residue blocks.
                ancestors[f'Node_{match.group(1)}'] = ''.join(match.group(2).split())

    return ancestors


def extract_marginal_probabilities(rst_file):
    '''Extract site-wise posterior probabilities

    High confidence: P > 0.95 (commonly used threshold)
    Moderate confidence: 0.80 < P <= 0.95
    Low confidence: P <= 0.80 (consider alternatives)

    Report ambiguous sites for experimental validation
    '''
    site_probs = []

    with open(rst_file) as f:
        in_probs = False
        for line in f:
            if 'Prob of best state' in line:
                in_probs = True
                continue
            if in_probs and line.strip():
                parts = line.split()
                if len(parts) >= 3:
                    try:
                        site = int(parts[0])
                        state = parts[1]
                        prob = float(parts[2])
                        site_probs.append({
                            'site': site,
                            'state': state,
                            'probability': prob,
                            'confidence': 'high' if prob > 0.95 else 'moderate' if prob > 0.8 else 'low'
                        })
                    except ValueError:
                        in_probs = False

    return site_probs
```
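
The confidence thresholds used above can be rolled up into a short per-node summary, which makes it easier to decide which sites to flag for experimental validation. The sketch below assumes a list of dicts shaped like the return value of `extract_marginal_probabilities` (keys `site`, `state`, `probability`). The function name `summarize_confidence` and the sample values are illustrative and not part of PAML or this skill.

```python
def summarize_confidence(site_probs, high=0.95, moderate=0.80):
    '''Summarize posterior support for one reconstructed node.

    site_probs: list of dicts with 'site', 'state' and 'probability' keys.
    Sites with probability <= moderate are reported as ambiguous.
    '''
    probs = [entry['probability'] for entry in site_probs]
    ambiguous = [entry['site'] for entry in site_probs
                 if entry['probability'] <= moderate]
    return {
        'n_sites': len(probs),
        'mean_probability': sum(probs) / len(probs) if probs else None,
        'n_high': sum(p > high for p in probs),
        'ambiguous_sites': ambiguous,
    }


# Made-up values for illustration only
example = [
    {'site': 1, 'state': 'M', 'probability': 0.99},
    {'site': 2, 'state': 'K', 'probability': 0.85},
    {'site': 3, 'state': 'L', 'probability': 0.60},
]
summary = summarize_confidence(example)
print(summary['ambiguous_sites'])
```

The mean posterior probability gives a rough overall measure of how well a node is resolved, while the ambiguous-site list points at positions where an alternative state may be worth testing.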

## IQ-TREE Ancestral Reconstruction
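
IQ-TREE writes its marginal reconstruction to a tab-separated `.state` file: `#` comment lines, then a header with `Node`, `Site`, `State` and one `p_<state>` posterior column per state. A minimal sketch for listing plausible alternative states at ambiguous sites follows. The function name `alternative_states`, the 0.2 cutoff, and the sample rows are assumptions made for illustration, and the column layout should be checked against the file your IQ-TREE version produces.

```python
import csv
import io


def alternative_states(state_text, min_alt_prob=0.2):
    '''Return (node, site, best_state, alternatives) for ambiguous sites.

    alternatives maps each non-best state to its posterior probability
    when that probability is at least min_alt_prob.
    '''
    # Drop comment and blank lines so the first remaining line is the header
    rows = [ln for ln in state_text.splitlines()
            if ln.strip() and not ln.startswith('#')]
    reader = csv.DictReader(io.StringIO('\n'.join(rows)), delimiter='\t')

    results = []
    for row in reader:
        best = row['State']
        alts = {}
        for col, val in row.items():
            if col.startswith('p_') and col[2:] != best:
                prob = float(val)
                if prob >= min_alt_prob:
                    alts[col[2:]] = prob
        if alts:
            results.append((row['Node'], int(row['Site']), best, alts))
    return results


# Made-up nucleotide example in the assumed .state layout
sample = (
    "# Ancestral state reconstruction (illustrative values)\n"
    "Node\tSite\tState\tp_A\tp_C\tp_G\tp_T\n"
    "Node1\t1\tA\t0.97\t0.01\t0.01\t0.01\n"
    "Node1\t2\tC\t0.30\t0.60\t0.05\t0.05\n"
)
print(alternative_states(sample))
```

Sites returned here are candidates for building alternative ancestors that swap in the second-best state.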

```python
def run_iqtree_asr(alignment, tree=None, model='LG+G4', output_prefix='asr'):
    '''Run IQ-TREE for ancestral sequence reconstruction

    IQ-TREE provides:
    - Marginal reconstruction (default)
    - Joint reconstruction (-asr-joint)
    - State file (.state) with probabilities

    Advantages over PAML:
    - Automatic model selection
    - Better handling of gaps
    - Faster for large datasets
    '''
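    cmd = ['iqtree2', '-s', alignment, '-m', model, '--ancestral', '-pre', output_prefix]

    if tree:
        cmd += ['-te', tree]  # Reconstruct on a fixed user tree

    # Pass arguments as a list (no shell) and raise if IQ-TREE exits non-zero
    subprocess.run(cmd, check=True)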

    return f'{output_prefix}.state'


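def parse_iqtree_state(state_file):
    '''Parse IQ-TREE .state file

    Format: '#' comment lines, then a tab-separated header
    Node  Site  State  p_<state1>  p_<state2>  ...
    State is the most likely state, so its posterior is the largest p_* value.
    '''
    ancestors = {}

    with open(state_file) as f:
        header_seen = False
        for line in f:
            if line.startswith('#') or not line.strip():
                continue  # Skip comment and blank lines
            if not header_seen:
                header_seen = True  # First remaining line is the column header
                continue
            parts = line.rstrip('\n').split('\t')
            if len(parts) >= 4:
                node = parts[0]
                state = parts[2]
                prob = max(float(p) for p in parts[3:])

                if node not in ancestors:
                    ancestors[node] = {'sequence': [], 'probabilities': []}
                # The source listing is cut off at this point; the two appends
                # and the return below are an assumed completion.
                ancestors[node]['sequence'].append(state)
                ancestors[node]['probabilities'].append(prob)

    return ancestors
```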
