ClaudeWave
Skill · 188 repo stars · updated today

llm-integration

LLM integration patterns for function calling, streaming responses, local inference with Ollama, and fine-tuning customization. Use when implementing tool use, SSE streaming, local model deployment, LoRA/QLoRA fine-tuning, or multi-provider LLM APIs.

Install in Claude Code
git clone --depth 1 https://github.com/yonatangross/orchestkit /tmp/llm-integration && cp -r /tmp/llm-integration/plugins/ork/skills/llm-integration ~/.claude/skills/llm-integration
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# LLM Integration

Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has its own rule files in `rules/`, loaded on demand.

## Quick Reference

| Category | Rules | Impact | When to Use |
|----------|-------|--------|-------------|
| [Function Calling](#function-calling) | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| [Streaming](#streaming) | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| [Local Inference](#local-inference) | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| [Fine-Tuning](#fine-tuning) | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| [Context Optimization](#context-optimization) | 2 | HIGH | Window management, compression, caching, budget scaling |
| [Evaluation](#evaluation) | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| [Prompt Engineering](#prompt-engineering) | 4 | HIGH | CoT, few-shot, versioning, DSPy optimization, ReAct, cost optimization |

**Total: 20 rules across 7 categories**

## Quick Start

```python
# Function calling: strict mode tool definition
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search knowledge base",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"}
            },
            "required": ["query", "limit"],
            "additionalProperties": False
        }
    }
}]
```
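A strict-mode schema has to list every property in `required` and set `additionalProperties` to `False`, as the definition above does. The check below is a minimal sketch of that rule for the top-level object only; `strict_schema_problems` is a hypothetical helper, not part of any SDK, and it does not recurse into nested objects.

```python
# Sketch: lint a tool's top-level "parameters" object against the two
# strict-mode rules shown above. Helper name is hypothetical.
def strict_schema_problems(parameters: dict) -> list[str]:
    problems = []
    props = set(parameters.get("properties", {}))
    required = set(parameters.get("required", []))
    if parameters.get("additionalProperties") is not False:
        problems.append("additionalProperties must be False")
    if props - required:
        problems.append(f"not in required: {sorted(props - required)}")
    return problems


parameters = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "description": "Search query"},
        "limit": {"type": "integer", "description": "Max results"},
    },
    "required": ["query", "limit"],
    "additionalProperties": False,
}
print(strict_schema_problems(parameters))
```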

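```python
# Streaming: SSE endpoint with FastAPI
from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

@app.get("/chat/stream")
async def stream_chat(prompt: str):
    async def generate():
        # async_stream is a placeholder for your provider's async token iterator
        async for token in async_stream(prompt):
            yield {"event": "token", "data": token}
        yield {"event": "done", "data": ""}
    return EventSourceResponse(generate())
```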

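```python
# Local inference: Ollama with LangChain
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",  # default local Ollama endpoint
    temperature=0.0,
    num_ctx=32768,  # context window size in tokens
)
```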

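```python
# Fine-tuning: QLoRA with Unsloth
from unsloth import FastLanguageModel

# Load the base model in 4-bit (the "Q" in QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048, load_in_4bit=True,
)
# Attach LoRA adapters: rank 16, scaling alpha 32
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
```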

## Function Calling

Enable LLMs to use external tools and return structured data. Use strict mode schemas (2026 best practice) for reliability. Limit to 5-15 tools per request, validate all inputs with Pydantic/Zod, and return errors as tool results.

- `calling-tool-definition.md` -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding
- `calling-parallel.md` -- Parallel tool execution, asyncio.gather, strict mode constraints
- `calling-validation.md` -- Input validation, error handling, tool execution loops
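The validation and parallel-execution advice above can be sketched as follows. This is a stdlib-only illustration: the hand-written `validate_args` stands in for a Pydantic model, and the `TOOLS` registry, the call dict shape (`id`, `name`, `arguments`), and the result dict shape are assumptions for the example, not a provider's wire format.

```python
import asyncio
import json


async def search_documents(query: str, limit: int) -> list[str]:
    # Stand-in for a real knowledge-base lookup.
    return [f"result {i} for {query!r}" for i in range(limit)]


# Hypothetical registry: tool name -> (argument types, async handler).
TOOLS = {"search_documents": ({"query": str, "limit": int}, search_documents)}


def validate_args(spec: dict, args: object) -> None:
    """Reject non-object, missing, extra, or wrongly typed arguments."""
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    missing, extra = spec.keys() - args.keys(), args.keys() - spec.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected in spec.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(args[name], bool) or not isinstance(args[name], expected):
            raise ValueError(f"{name} must be {expected.__name__}")


async def run_tool_call(call: dict) -> dict:
    """Execute one call; failures come back as tool results, not exceptions."""
    try:
        spec, handler = TOOLS[call["name"]]
        args = json.loads(call["arguments"])  # JSONDecodeError is a ValueError
        validate_args(spec, args)
        content = await handler(**args)
        return {"tool_call_id": call["id"], "ok": True, "content": content}
    except (KeyError, ValueError) as exc:
        return {"tool_call_id": call["id"], "ok": False, "content": f"error: {exc}"}


async def run_tool_calls(calls: list[dict]) -> list[dict]:
    # Independent calls run concurrently; gather preserves input order.
    return await asyncio.gather(*(run_tool_call(c) for c in calls))


calls = [
    {"id": "a", "name": "search_documents", "arguments": '{"query": "sse", "limit": 2}'},
    {"id": "b", "name": "search_documents", "arguments": '{"query": "sse"}'},
]
results = asyncio.run(run_tool_calls(calls))
for result in results:
    print(result)
```

Returning the error text as the tool result lets the model see what went wrong and retry with corrected arguments.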

## Streaming

Deliver LLM responses in real time for better UX. Use SSE for one-way web streaming and WebSocket when the client also needs to send messages mid-stream. Handle backpressure with bounded queues.

- `streaming-sse.md` -- FastAPI SSE endpoints, frontend consumers, async iterators
- `streaming-structured.md` -- Streaming with tool calls, partial JSON parsing, chunk accumulation
- `streaming-backpressure.md` -- Backpressure handling, bounded buffers, cancellation
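A minimal sketch of backpressure with a bounded queue, using only `asyncio`. The `produce`/`consume`/`stream` names and the `None` end-of-stream sentinel are assumptions for illustration; the buffer size of 50 is the low end of the 50-200 token range from Key Decisions.

```python
import asyncio


async def produce(queue: asyncio.Queue, tokens: list[str]) -> None:
    for token in tokens:
        # put() suspends while the queue is full, which throttles the producer.
        await queue.put(token)
    await queue.put(None)  # sentinel: end of stream


async def consume(queue: asyncio.Queue) -> list[str]:
    received = []
    while (token := await queue.get()) is not None:
        await asyncio.sleep(0)  # stand-in for a slow client write
        received.append(token)
    return received


async def stream(tokens: list[str], buffer_size: int = 50) -> list[str]:
    queue = asyncio.Queue(maxsize=buffer_size)
    producer = asyncio.create_task(produce(queue, tokens))
    try:
        return await consume(queue)
    finally:
        # If the consumer stops early (e.g. client disconnect), stop producing.
        producer.cancel()
        await asyncio.gather(producer, return_exceptions=True)


tokens = [f"tok{i}" for i in range(500)]
received = asyncio.run(stream(tokens, buffer_size=50))
print(len(received))
```

Because the queue is bounded, a slow consumer caps memory use at `buffer_size` tokens instead of letting the producer run ahead without limit.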

## Local Inference

Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models so the first request does not pay the model load time, and use a provider factory to switch between cloud and local backends.

- `local-ollama-setup.md` -- Installation, model pulling, environment configuration
- `local-model-selection.md` -- Model comparison by task, hardware profiles, quantization
- `local-gpu-optimization.md` -- Apple Silicon tuning, keep-alive, CI integration
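One way to sketch the provider factory and pre-warm step, assuming configuration comes from environment variables. The variable names (`LLM_PROVIDER`, `LLM_MODEL`, `OLLAMA_BASE_URL`, `CLOUD_BASE_URL`) and the `ProviderConfig` type are hypothetical. The pre-warm body assumes Ollama's `/api/generate` loads a model when sent only `model` and `keep_alive`; confirm against the Ollama API docs for your version. The code only builds the request and does not send it.

```python
import json
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderConfig:
    name: str
    base_url: str
    model: str


def get_provider(env=None) -> ProviderConfig:
    """Pick a local or cloud backend from (hypothetical) environment variables."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "ollama")
    if provider == "ollama":
        return ProviderConfig(
            "ollama",
            env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
            env.get("LLM_MODEL", "deepseek-r1:70b"),
        )
    if provider == "cloud":
        # No defaults here: the cloud URL and model must be set explicitly.
        return ProviderConfig("cloud", env["CLOUD_BASE_URL"], env["LLM_MODEL"])
    raise ValueError(f"unknown provider: {provider}")


def prewarm_request(cfg: ProviderConfig, keep_alive: str = "30m") -> tuple[str, str]:
    """Build (url, JSON body) for a load-only request; sending is left to the caller."""
    body = json.dumps({"model": cfg.model, "keep_alive": keep_alive})
    return f"{cfg.base_url}/api/generate", body


cfg = get_provider({})  # empty mapping -> local defaults
url, body = prewarm_request(cfg)
print(url, body)
```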

## Fine-Tuning

Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples.

- `tuning-lora.md` -- LoRA/QLoRA configuration, Unsloth training, adapter merging
- `tuning-dataset-prep.md` -- Synthetic data generation, quality validation, deduplication
- `tuning-evaluation.md` -- DPO alignment, evaluation metrics, anti-patterns
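Before training, the dataset can be checked against the 1000-example floor after removing duplicates. This is a minimal sketch: the `prompt`/`response` field names are assumptions, and exact-match hashing after normalization only catches trivial duplicates, not paraphrases.

```python
import hashlib
import re

MIN_EXAMPLES = 1000  # floor stated above


def normalize(text: str) -> str:
    # Collapse whitespace and case so trivially different copies hash the same.
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(examples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized (prompt, response) pair."""
    seen, unique = set(), []
    for ex in examples:
        prompt, response = normalize(ex["prompt"]), normalize(ex["response"])
        if not prompt or not response:
            continue  # drop empty examples as a basic quality filter
        key = hashlib.sha256(f"{prompt}\x00{response}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def ready_for_training(examples: list[dict]) -> bool:
    return len(deduplicate(examples)) >= MIN_EXAMPLES


data = [
    {"prompt": "Hi  there", "response": "Hello"},
    {"prompt": "hi there", "response": "hello "},
    {"prompt": "Bye", "response": "See you"},
]
print(len(deduplicate(data)), ready_for_training(data))
```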

## Context Optimization

Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.

- `context-window-management.md` -- Five-layer architecture, anchored summarization, compression triggers
- `context-caching.md` -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+

## Evaluation

Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.

- `evaluation-metrics.md` -- LLM-as-judge, RAGAS metrics, hallucination detection
- `evaluation-benchmarks.md` -- Quality gates, batch evaluation, pairwise comparison
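A quality gate over multi-dimension scores might look like the sketch below. The dimension names and minimum values in `THRESHOLDS` are illustrative assumptions, not values from the rule files; scores are assumed to be floats in [0, 1] produced by a judge model or a metrics library.

```python
from statistics import mean

# Hypothetical dimensions and minimums; tune per application.
THRESHOLDS = {"relevance": 0.7, "faithfulness": 0.8, "completeness": 0.6}


def gate(scores: dict, thresholds: dict = THRESHOLDS) -> tuple[bool, list[str]]:
    """Pass only if every dimension meets its minimum; a missing score counts as 0."""
    failures = sorted(
        dim for dim, minimum in thresholds.items() if scores.get(dim, 0.0) < minimum
    )
    return (not failures, failures)


def batch_pass_rate(batch: list[dict]) -> float:
    """Fraction of outputs in a batch that clear the gate."""
    return mean(1.0 if gate(scores)[0] else 0.0 for scores in batch)


good = {"relevance": 0.9, "faithfulness": 0.85, "completeness": 0.7}
bad = {"relevance": 0.9, "faithfulness": 0.5}
print(gate(good), gate(bad), batch_pass_rate([good, bad]))
```

Gating on each dimension separately, rather than on an average, keeps a high relevance score from masking a faithfulness failure.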

## Prompt Engineering

Design, version, and optimize prompts for production LLM applications.

- `prompt-design.md` -- Chain-of-Thought, few-shot learning, pattern selection guide
- `prompt-testing.md` -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency
- `prompt-react-pattern.md` -- ReAct loop for tool-using agents, thought-action-observation format
- `prompt-optimization.md` -- Token reduction, cost optimization, model tiering, prompt spec format
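The thought-action-observation loop can be sketched with a scripted stand-in for the model. The `Action: tool[input]` and `Final Answer:` text formats, the `react` function, and `scripted_llm` are assumptions for illustration; a real agent would call an LLM and would likely use native tool calls instead of regex parsing.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react(question, llm, tools, max_steps=5):
    """Run thought -> action -> observation until a final answer or step limit."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(reply)
        if action is None:
            transcript += "Observation: no action found\n"
            continue
        name, arg = action.groups()
        tool = tools.get(name)
        # Unknown tools are reported back as observations, not raised.
        observation = tool(arg) if tool else f"unknown tool {name}"
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


def scripted_llm(transcript: str) -> str:
    # Stand-in model: act once, then answer after seeing an observation.
    if "Observation:" not in transcript:
        return "Thought: I need the sum.\nAction: add[2,3]"
    return "Thought: I have the result.\nFinal Answer: 5"


tools = {"add": lambda arg: str(sum(int(x) for x in arg.split(",")))}
answer = react("What is 2 + 3?", scripted_llm, tools)
print(answer)
```

The `max_steps` bound keeps a model that never emits a final answer from looping indefinitely.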

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Tool schema mode | `strict: true` (2026 best practice) |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | `deepseek-r1:70b` |
| Local model (coding) | `qwen2.5-coder:32b` |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | `claude-haiku-4-5