evaluating-code-models
This Claude Code skill provides the BigCode Evaluation Harness, a benchmarking framework that assesses code generation models across 15+ standardized datasets including HumanEval, MBPP, and MultiPL-E spanning 18 programming languages. Use this skill when comparing coding capabilities of language models, measuring code generation quality through pass@k metrics, testing multi-language code support, or reproducing evaluations from HuggingFace leaderboards.
git clone --depth 1 https://github.com/foryourhealth111-pixel/Vibe-Skills /tmp/evaluating-code-models && cp -r /tmp/evaluating-code-models/bundled/skills/evaluating-code-models ~/.claude/skills/evaluating-code-models
SKILL.md
# BigCode Evaluation Harness - Code Model Benchmarking
## Quick Start
BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).
**Installation**:
```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
```
**Evaluate on HumanEval**:
```bash
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution \
--save_generations
```
**View available tasks**:
```bash
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
```
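The task list is long, so it can help to filter it by prefix, for example to see only the MultiPL-E tasks. The following is a minimal sketch, assuming `ALL_TASKS` is a flat collection of task-name strings (as the one-liner above suggests). The helper `filter_tasks` and the sample names are hypothetical and not part of the harness.

```python
def filter_tasks(all_tasks, prefix):
    """Return the sorted task names that start with `prefix`."""
    return sorted(name for name in all_tasks if name.startswith(prefix))


# Hypothetical sample list; in practice pass the real one:
#   from bigcode_eval.tasks import ALL_TASKS
sample_tasks = ["humaneval", "mbpp", "multiple-py", "multiple-js", "multiple-java"]

print(filter_tasks(sample_tasks, "multiple-"))
```

The filtered names can then be joined with commas and passed to `--tasks`.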
## Common Workflows
### Workflow 1: Standard Code Benchmark Evaluation
Evaluate a model on the core code benchmarks (HumanEval, MBPP, HumanEval+).
**Checklist**:
```
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
```
**Step 1: Choose benchmark suite**
**Python code generation** (most common):
- **HumanEval**: 164 handwritten problems, function completion
- **HumanEval+**: Same 164 problems with 80× more tests (stricter)
- **MBPP**: 500 crowd-sourced problems, entry-level difficulty
- **MBPP+**: 399 curated problems with 35× more tests
**Multi-language** (18 languages):
- **MultiPL-E**: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.
**Advanced**:
- **APPS**: 10,000 problems (introductory/interview/competition)
- **DS-1000**: 1,000 data science problems across 7 libraries
**Step 2: Configure model and generation**
```bash
# Standard HuggingFace model
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--do_sample True \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution
# Quantized model (4-bit)
accelerate launch main.py \
--model codellama/CodeLlama-34b-hf \
--tasks humaneval \
--load_in_4bit \
--max_length_generation 512 \
--allow_code_execution
# Custom/private model
accelerate launch main.py \
--model /path/to/my-code-model \
--tasks humaneval \
--trust_remote_code \
--use_auth_token \
--allow_code_execution
```
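When the same configuration is reused across runs, it may be easier to keep the options in one place and generate the command line from them. The sketch below assumes only the flags shown above. `build_command` is a hypothetical helper, not part of the harness, and it treats `True` as a bare flag, so an option that takes an explicit value (such as `--do_sample True`) should be passed as the string `"True"`.

```python
import shlex


def build_command(options):
    """Turn an options dict into an `accelerate launch main.py` argument list.

    True emits a bare flag, False/None skips the option,
    and any other value is emitted as `--name value`.
    """
    argv = ["accelerate", "launch", "main.py"]
    for name, value in options.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


options = {
    "model": "bigcode/starcoder2-7b",
    "tasks": "humaneval",
    "max_length_generation": 512,
    "temperature": 0.2,
    "n_samples": 200,
    "batch_size": 50,
    "allow_code_execution": True,
}

# Print a shell-ready command; pass the list to subprocess.run to execute it.
print(shlex.join(build_command(options)))
```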
**Step 3: Run evaluation**
```bash
# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--temperature 0.8 \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution \
--save_generations \
--metric_output_path results/starcoder2-humaneval.json
```
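To see why estimating pass@100 needs well over 100 samples per problem, it helps to look at the standard unbiased pass@k estimator introduced with HumanEval: for `n` samples of which `c` pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula. That the harness computes the metric in exactly this form is an assumption, and `pass_at_k` is an illustrative name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: sampling budget.
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 200 samples and 20 passing, pass@1 equals the raw pass rate (about 0.1).
print(round(pass_at_k(200, 20, 1), 3))
```

The benchmark score is the mean of this value over all problems. Larger `n` reduces the variance of the estimate, which is why the run above uses 200 samples.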
**Step 4: Analyze results**
Results are written to `results/starcoder2-humaneval.json`:
```json
{
"humaneval": {
"pass@1": 0.354,
"pass@10": 0.521,
"pass@100": 0.689
},
"config": {
"model": "bigcode/starcoder2-7b",
"temperature": 0.8,
"n_samples": 200
}
}
```
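A short script can turn this file into a readable table. The sketch below assumes the layout shown above (one top-level key per task plus a `config` entry) and embeds the same example values inline so it runs on its own. `pass_at_k_rows` is a hypothetical helper name.

```python
import json


def pass_at_k_rows(results):
    """Return (task, metric, value) rows for every pass@k entry, ordered by k."""
    rows = []
    for task, metrics in results.items():
        if task == "config":
            continue  # run settings, not scores
        pass_metrics = [m for m in metrics if m.startswith("pass@")]
        for metric in sorted(pass_metrics, key=lambda m: int(m.split("@")[1])):
            rows.append((task, metric, metrics[metric]))
    return rows


# Same structure as the example file; in practice use
# json.load(open("results/starcoder2-humaneval.json")).
results = json.loads("""
{
  "humaneval": {"pass@1": 0.354, "pass@10": 0.521, "pass@100": 0.689},
  "config": {"model": "bigcode/starcoder2-7b", "temperature": 0.8, "n_samples": 200}
}
""")

for task, metric, value in pass_at_k_rows(results):
    print(f"{task:<12}{metric:<10}{value:.1%}")
```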
### Workflow 2: Multi-Language Evaluation (MultiPL-E)
Evaluate code generation across 18 programming languages.
**Checklist**:
```
Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages
```
**Step 1: Generate solutions on host**
```bash
# Generate without execution (safe)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--max_length_generation 650 \
--temperature 0.8 \
--n_samples 50 \
--batch_size 50 \
--generation_only \
--save_generations \
--save_generations_path generations_multi.json
```
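Before moving the generations into Docker, a quick sanity check can catch truncated or empty outputs. The sketch below assumes the saved file is a JSON list with one entry per problem, each entry being a list of `n_samples` completion strings. That layout is an assumption about the harness's output, and `find_incomplete` is a hypothetical helper.

```python
def find_incomplete(generations, n_samples):
    """Return indices of problems that lack `n_samples` non-empty completions."""
    bad = []
    for index, completions in enumerate(generations):
        wrong_count = len(completions) != n_samples
        has_empty = any(not text.strip() for text in completions)
        if wrong_count or has_empty:
            bad.append(index)
    return bad


# Tiny hypothetical stand-in for the loaded generations file.
sample = [
    ["def f(): return 1", "def f(): return 2"],  # complete
    ["def g(): pass"],                           # one sample missing
    ["", "x = 1"],                               # contains an empty completion
]

print(find_incomplete(sample, n_samples=2))
```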
**Step 2: Evaluate in Docker container**
```bash
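# Pull the MultiPL-E Docker image and tag it with the short name used below
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
docker tag ghcr.io/bigcode-project/evaluation-harness-multiple evaluation-harness-multiple
# Run evaluation inside container
docker run -v "$(pwd)/generations_multi.json:/app/generations.json:ro" \
-it evaluation-harness-multiple python3 main.py \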
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--load_generations_path /app/generations.json \
--allow_code_execution \
--n_samples 50
```
**Supported languages**: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
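For the third checklist step, comparing across languages, the per-task scores can be ranked side by side. The sketch below assumes the metrics output has one top-level key per `multiple-*` task containing a `pass@1` entry, mirroring the JSON layout in Workflow 1. The numbers are made-up placeholders, not measured results, and `rank_languages` is a hypothetical helper.

```python
def rank_languages(results, metric="pass@1"):
    """Return (language, score) pairs for multiple-* tasks, best first."""
    scores = {
        task.split("-", 1)[1]: metrics[metric]
        for task, metrics in results.items()
        if task.startswith("multiple-") and metric in metrics
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder values for illustration only.
placeholder_results = {
    "multiple-py": {"pass@1": 0.30},
    "multiple-js": {"pass@1": 0.25},
    "multiple-java": {"pass@1": 0.20},
    "multiple-cpp": {"pass@1": 0.22},
    "config": {"model": "bigcode/starcoder2-7b"},
}

for language, score in rank_languages(placeholder_results):
    print(f"{language:<6}{score:.1%}")
```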
### Workflow 3: Instruction-Tuned Model Evaluation
Evaluate chat/instruction models with proper formatting.
**Checklist**:
```
Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation
```
**Step 1: Choose instruction tasks**
- **instruct-humaneval**: HumanEval with instruction prompts
- **humanevalsynthesize-{lang}**: HumanEvalPack synthesis tasks
**Step 2: Configure instruction tokens**
```bash
# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks instruct-humaneval \
--instruction_tokens "<s>[INST],</s>,[/INST]" \
--max_length_generation 512 \
--allow_code_execution
```
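The `--instruction_tokens` value is a comma-separated triple. The sketch below shows one plausible reading of it, assuming the order is user token, end token, assistant token and that the tokens are simply concatenated around the instruction. The harness's exact prompt template may differ, and `wrap_prompt` is a hypothetical helper.

```python
def wrap_prompt(instruction, instruction_tokens):
    """Wrap an instruction using a 'user,end,assistant' token triple."""
    user_token, end_token, assistant_token = instruction_tokens.split(",")
    return f"{user_token}{instruction}{end_token}{assistant_token}"


prompt = wrap_prompt(
    "Write a function that reverses a string.",
    "<s>[INST],</s>,[/INST]",
)
print(prompt)
```

Printing the wrapped prompt before a full run is a cheap way to confirm that the tokens match the format the model was tuned on.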
**Step 3: HumanEvalPack for instruction models**
```bash
# HumanEvalPack covers 6 languages; this run tests synthesis in Python and JavaScript
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks humanevalsynthesize-python,humanevalsynthesize-js \
--prompt instruct \
--max_length_generation 512 \
--allow_code_execution
```
### Workflow 4: Compare Multiple Models
Run the same benchmark suite over several models so their scores are directly comparable.
**Step 1: Create evaluation script**
```bash
#!/bin/bash
# eval_models.sh
MODELS=(
"bigcode/starcoder2-7b"
"codellama/CodeLlama-7b-hf"
"deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"
for model in "${MODELS[@]}"; do
model_name=$(echo "$model" | tr '/' '-')
echo "Evaluating $model"
accelerate launch main.py \
--model "$model" \
--tasks "$TASKS" \
--temperature 0.2 \
--n_samples 20 \
--batch_size 20 \