dspy-gepa-reflective
This skill should be used when the user asks to "optimize an agent with GEPA", "use reflective optimization", "optimize ReAct agents", "provide feedback metrics", mentions "GEPA optimizer", "LLM reflection", "execution trajectories", "agentic systems optimization", or needs to optimize complex multi-step agents using textual feedback on execution traces.
git clone --depth 1 https://github.com/OmidZamani/dspy-skills /tmp/dspy-gepa-reflective && cp -r /tmp/dspy-gepa-reflective/skills/dspy-gepa-reflective ~/.claude/skills/dspy-gepa-reflective
# DSPy GEPA Optimizer
## Goal
Optimize complex agentic systems using LLM reflection on full execution traces with Pareto-based evolutionary search.
## When to Use
- **Agentic systems** with tool use
- When you have **rich textual feedback** on failures
- Complex multi-step workflows
- When **instruction-only** optimization is needed
## Related Skills
- For non-agentic programs: [dspy-miprov2-optimizer](../dspy-miprov2-optimizer/SKILL.md), [dspy-bootstrap-fewshot](../dspy-bootstrap-fewshot/SKILL.md)
- Measure improvements: [dspy-evaluation-suite](../dspy-evaluation-suite/SKILL.md)
## Inputs
| Input | Type | Description |
|-------|------|-------------|
| `program` | `dspy.Module` | Agent or complex program |
| `trainset` | `list[dspy.Example]` | Training examples |
| `metric` | `callable` | Accepts five arguments and returns `dspy.Prediction(score=..., feedback=...)` |
| `reflection_lm` | `dspy.LM` | Strong LM used for reflection (e.g. GPT-4o) |
| `auto` | `str` | Optimization budget preset: `"light"`, `"medium"`, or `"heavy"` |
## Outputs
| Output | Type | Description |
|--------|------|-------------|
| `compiled_program` | `dspy.Module` | Reflectively optimized program |
## Workflow
### Phase 1: Define Feedback Metric
GEPA reflects on *textual feedback*, so the metric should return feedback alongside the score:
```python
import dspy

def gepa_metric(example, pred, trace=None, pred_name=None, pred_trace=None):
    """Return score and actionable feedback for GEPA reflection."""
    is_correct = example.answer.lower() in pred.answer.lower()
    if is_correct:
        feedback = "Correct. The answer accurately addresses the question."
    else:
        feedback = (
            f"Incorrect. Expected '{example.answer}' but got '{pred.answer}'. "
            "The model may have misunderstood the question or retrieved irrelevant information."
        )
    return dspy.Prediction(score=float(is_correct), feedback=feedback)
```
### Phase 2: Setup Agent
```python
import dspy

def search(query: str) -> list[str]:
    """Search knowledge base for relevant information."""
    rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
    results = rm(query, k=3)
    # Each hit is a dict-like passage; return only its text so the tool matches list[str].
    return [r["text"] for r in results]

def calculate(expression: str) -> float:
    """Safely evaluate mathematical expressions."""
    with dspy.PythonInterpreter() as interp:
        return interp(expression)

agent = dspy.ReAct("question -> answer", tools=[search, calculate])
```
### Phase 3: Optimize with GEPA
```python
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

optimizer = dspy.GEPA(
    metric=gepa_metric,
    reflection_lm=dspy.LM("openai/gpt-4o"),  # Strong model for reflection
    auto="medium",
)

# trainset: list[dspy.Example] with input fields marked, prepared beforehand
compiled_agent = optimizer.compile(agent, trainset=trainset)
```
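Phase 3 assumes a `trainset` already exists, and the production example additionally expects a held-out `devset`. The following is a minimal sketch of one way to produce such a split deterministically. `split_records`, the 20% dev fraction, and the toy records are illustrative assumptions, not part of DSPy or this skill.

```python
import random

def split_records(records, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `records` deterministically and split it into (train, dev)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Hypothetical question/answer records; real data would come from your own dataset.
records = [{"question": f"What is {i} + {i}?", "answer": str(2 * i)} for i in range(10)]
train_records, dev_records = split_records(records)

# With DSPy installed, each record would then be wrapped as an example, e.g.:
#   trainset = [dspy.Example(**r).with_inputs("question") for r in train_records]
print(len(train_records), len(dev_records))
```

Fixing the seed keeps the dev set stable across runs, so baseline and optimized scores are measured on the same examples.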
## Production Example
```python
import logging

import dspy
from dspy.evaluate import Evaluate

logger = logging.getLogger(__name__)


class ResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.react = dspy.ReAct(
            "question -> answer",
            tools=[self.search, self.summarize],
        )

    def search(self, query: str) -> list[str]:
        """Search for relevant documents."""
        rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
        results = rm(query, k=5)
        # Return passage text only so the tool output matches list[str].
        return [r["text"] for r in results]

    def summarize(self, text: str) -> str:
        """Summarize long text into key points."""
        summarizer = dspy.Predict("text -> summary")
        return summarizer(text=text).summary

    def forward(self, question):
        return self.react(question=question)


def detailed_feedback_metric(example, pred, trace=None, pred_name=None, pred_trace=None):
    """Rich feedback for GEPA reflection."""
    expected = example.answer.lower().strip()
    actual = pred.answer.lower().strip() if pred.answer else ""

    # Exact match
    if expected == actual:
        return dspy.Prediction(score=1.0, feedback="Perfect match. Answer is correct and concise.")

    # Partial match (skip empty predictions: "" is a substring of every string)
    if actual and (expected in actual or actual in expected):
        return dspy.Prediction(score=0.7, feedback=f"Partial match. Expected '{example.answer}', got '{pred.answer}'. Answer contains correct info but may be verbose or incomplete.")

    # Check for key terms
    expected_terms = set(expected.split())
    actual_terms = set(actual.split())
    overlap = len(expected_terms & actual_terms) / max(len(expected_terms), 1)
    if overlap > 0.5:
        return dspy.Prediction(score=0.5, feedback=f"Some overlap. Expected '{example.answer}', got '{pred.answer}'. Key terms present but answer structure differs.")

    return dspy.Prediction(score=0.0, feedback=f"Incorrect. Expected '{example.answer}', got '{pred.answer}'. The agent may need better search queries or reasoning.")


def optimize_research_agent(trainset, devset):
    """Full GEPA optimization pipeline."""
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
    agent = ResearchAgent()

    # Convert metric for evaluation (just score)
    def eval_metric(example, pred, trace=None):
        return detailed_feedback_metric(example, pred, trace).score

    evaluator = Evaluate(devset=devset, num_threads=8, metric=eval_metric)

    # In DSPy 3.x, Evaluate returns a result object whose .score is a percentage (0-100).
    baseline = evaluator(agent)
    logger.info("Baseline: %.2f%%", baseline.score)

    # GEPA optimization
    optimizer = dspy.GEPA(
        metric=detailed_feedback_metric,
        reflection_lm=dspy.LM("openai/gpt-4o"),
        auto="medium",
    )
    compiled = optimizer.compile(agent, trainset=trainset)

    optimized = evaluator(compiled)
    logger.info("Optimized: %.2f%%", optimized.score)

    compiled.save("research_agent_gepa.json")
    return compiled
```
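To see how the scoring tiers in `detailed_feedback_metric` behave on concrete strings, the tier logic can be pulled into a plain function and run without DSPy or an LM. This is only a sketch: `tiered_score` is a hypothetical helper, and it includes a guard for empty predictions, since an empty string counts as a substring of any expected answer.

```python
def tiered_score(expected: str, actual: str) -> float:
    """Score an answer in tiers: exact (1.0), substring (0.7), term overlap (0.5), else 0.0."""
    expected, actual = expected.lower().strip(), actual.lower().strip()
    if expected == actual:
        return 1.0
    # Guard against empty predictions: "" is a substring of every string.
    if actual and (expected in actual or actual in expected):
        return 0.7
    expected_terms, actual_terms = set(expected.split()), set(actual.split())
    overlap = len(expected_terms & actual_terms) / max(len(expected_terms), 1)
    return 0.5 if overlap > 0.5 else 0.0

# Illustrative (expected, actual) pairs, one per tier plus the empty-answer case.
cases = [
    ("Paris", " paris "),
    ("Paris", "The capital is Paris"),
    ("Marie Curie won", "Curie won it, Marie"),
    ("Paris", ""),
    ("Paris", "London"),
]
for expected, actual in cases:
    print(f"{expected!r} vs {actual!r}: {tiered_score(expected, actual)}")
```

Checking the tiers this way before a GEPA run is cheap, and it surfaces edge cases in the metric that would otherwise feed misleading feedback to the reflection LM.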
## Metric Contract
GEPA metrics must accept `(gold, pred, trace, pred_name, pred_trace)`. Return `dspy.Prediction(score=..., feedback=...)` when textual feedback is available. Do not pass `enable_tool_optimization`; it is not a DSPy 3.2.1 `GEPA` constructor argument.
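The same five-argument metric is often reused for plain evaluation, where only a number is wanted. Below is a minimal sketch of an adapter that turns a feedback metric into a score-only, three-argument metric. `score_only` and `demo_metric` are hypothetical names, and `SimpleNamespace` stands in for `dspy.Prediction` and `dspy.Example` so the sketch runs without DSPy installed.

```python
from types import SimpleNamespace

def score_only(feedback_metric):
    """Wrap a five-argument feedback metric as a three-argument metric returning a float."""
    def wrapped(example, pred, trace=None):
        result = feedback_metric(example, pred, trace, None, None)
        # Accept either an object with a .score attribute or a bare number.
        return float(getattr(result, "score", result))
    return wrapped

# Stand-in metric following the (gold, pred, trace, pred_name, pred_trace) contract.
def demo_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    correct = gold.answer == pred.answer
    return SimpleNamespace(
        score=float(correct),
        feedback="Correct." if correct else f"Expected '{gold.answer}', got '{pred.answer}'.",
    )

eval_metric = score_only(demo_metric)
gold = SimpleNamespace(answer="4")
print(eval_metric(gold, SimpleNamespace(answer="4")))
print(eval_metric(gold, SimpleNamespace(answer="5")))
```

Keeping one source of truth for scoring means the number reported at evaluation time matches the score GEPA optimizes against.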
## Best Practices
1. **Rich feedback** - More detailed feedback…