Skills vs. docs in Claude agents: 250 evals and no easy answer

The engineering team at Wix asked a question many teams working with Claude agents have been putting off for months: is it worth packaging operational knowledge into skills, or is good documentation fed into context enough? Rather than relying on intuition or generic benchmarks, they ran 250 evaluations of their own and published their findings on the Wix Engineering Blog. The article reached Hacker News on May 12 and has drawn few comments so far, but it carries enough technical depth to merit attention on its own.

What they tested and how

Wix's experiment focused on real tasks from their internal pipeline: code generation, queries about proprietary APIs, and multi-step action sequences. For each task they designed two variants of the agent's instructions: one using skills (structured packages of instructions and context that Claude invokes on demand) and another providing equivalent textual documentation injected directly into the prompt. The 250 evals were distributed across both conditions, with variations in task complexity and length.

There was no single success criterion. They measured correct completion rate, number of tool calls required, coherence across steps in chained tasks, and, less commonly for studies of this kind, how often the agent requested unnecessary clarification. This last indicator is especially relevant in production environments, where each interruption carries real cost.
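For teams that want to track the same four indicators in their own evals, the aggregation can be sketched roughly as follows. This is a minimal illustration, not Wix's harness: the `EvalResult` record, its field names, and the `"skills"`/`"docs"` labels are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    # Hypothetical record for one eval run; field names are assumptions.
    condition: str                 # e.g. "skills" or "docs"
    completed: bool                # task finished correctly
    tool_calls: int                # tool calls the agent needed
    coherent: bool                 # steps stayed consistent in a chained task
    needless_clarifications: int   # unnecessary clarification requests


def summarize(results):
    """Return per-condition averages for the four indicators."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.condition].append(r)

    summary = {}
    for condition, rows in grouped.items():
        n = len(rows)
        summary[condition] = {
            "n": n,
            "completion_rate": sum(r.completed for r in rows) / n,
            "mean_tool_calls": sum(r.tool_calls for r in rows) / n,
            "coherence_rate": sum(r.coherent for r in rows) / n,
            # Share of runs with at least one unnecessary clarification.
            "clarification_rate": sum(
                r.needless_clarifications > 0 for r in rows
            ) / n,
        }
    return summary
```

Keeping the four indicators side by side per condition makes it harder to declare a winner on completion rate alone when one variant quietly costs more tool calls or interruptions.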

The uncomfortable answer

The main finding, stated honestly in the article's own headline, is that the answer is more complicated than expected. Skills do not always win. Documentation does not always lose. Relative performance depends on at least three variables that the team identifies clearly:

Domain familiarity: on tasks involving APIs or conventions very specific to Wix, skills showed a clear advantage. The agent needed fewer context tokens to arrive at the correct action and made fewer hallucination errors about proprietary parameters.
Length and chaining: on short single-step tasks, the difference was statistically marginal. The gap widened in sequences of four or more steps, where skills acted as coherence anchors.
Knowledge maintenance: here documentation has a practical advantage that the authors do not downplay. Updating a text block is faster than versioning a skill, especially when changes are frequent or the team lacks a consolidated workflow for publishing skills.
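To judge whether a gap between the two conditions is more than noise, for example on short tasks versus chains of four or more steps, one common option is a two-proportion z-test on completion counts. The sketch below is an assumption about how such a check could be done; the article does not say which statistical test Wix used, and the function name and inputs are hypothetical.

```python
import math


def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Two-sided z-test for a difference between two pass rates.

    pass_a / n_a and pass_b / n_b are completion counts and totals
    for the two conditions (e.g. skills vs. docs) in one task bucket.
    Returns (z, p_value).
    """
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both conditions passed (or failed) every run: no detectable gap.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts invented for illustration, not Wix's data.
z, p = two_proportion_z(pass_a=45, n_a=60, pass_b=36, n_b=60)
```

Running the test separately per bucket (single-step tasks, chained tasks) mirrors the way the findings above are broken down. With small buckets the normal approximation is rough, so an exact test may be a better choice.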

Why this matters beyond Wix

The skills-vs-docs debate is not trivial. Since Claude Code added stable support for skills and sub-agents, engineering teams face an architectural decision with medium-term implications: invest in building a maintained library of skills, or bet on a well-structured documentation strategy that any team member can edit without touching agent configuration.

Wix's work contributes something in short supply: proprietary data, in production, with explicit methodology. It is not an academic paper and does not pretend to be. It is the kind of empirical engineering that teams who have spent months working with Claude agents need in order to make informed decisions rather than decisions based on ecosystem promises.

It also highlights an aspect that Anthropic's official documentation treats relatively abstractly: the operational cost of maintaining skills. Building the initial package is one thing; ensuring it reflects the current state of a constantly changing internal API is another.

Who this is useful for

This analysis is directly relevant to teams that already have Claude agents deployed or in serious evaluation phases, especially if they work with proprietary codebases or internal APIs the model has no training knowledge of. It is also useful for those designing knowledge onboarding strategies in Claude Code: before deciding whether to build skills or maintain a documentation corpus, it helps to understand under what conditions each approach fails.

For teams still in early phases, the article functions as a map of questions they will need to ask themselves later.

---

Our takeaway: skills are a bet with clear returns in stable domains and complex tasks, but they are not a silver bullet. The rigor with which Wix documented their methodology is, in itself, a standard the ecosystem should replicate more often.
