Skill11.2k repo starsupdated 1mo ago

nemo-evaluator-sdk

The nemo-evaluator-sdk is NVIDIA's containerized benchmarking platform that evaluates language models across 100+ standardized tasks including MMLU, HumanEval, GSM8K, and safety evaluations. Use it when requiring reproducible, scalable LLM assessment across multiple execution backends such as local Docker, Slurm HPC clusters, or cloud platforms with minimal configuration.

View source Repository: AI-Research-SKILLs

Install in Claude Code

Copy

git clone --depth 1 https://github.com/Orchestra-Research/AI-Research-SKILLs /tmp/nemo-evaluator-sdk && cp -r /tmp/nemo-evaluator-sdk/11-evaluation/nemo-evaluator ~/.claude/skills/nemo-evaluator-sdk

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# NeMo Evaluator SDK - Enterprise LLM Benchmarking

## Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

**Installation**:
```bash
pip install nemo-evaluator-launcher
```

**Set API key and run evaluation**:
```bash
export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
```

**View available tasks**:
```bash
nemo-evaluator-launcher ls tasks
```

## Common Workflows

### Workflow 1: Evaluate Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

**Checklist**:
```
Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results
```

**Step 1: Configure API endpoint**

```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted endpoints (vLLM, TRT-LLM):
```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
```

**Step 2: Select benchmarks**

Add tasks to your config:
```yaml
evaluation:
  tasks:
    - name: ifeval           # Instruction following
    - name: gpqa_diamond     # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN   # Some tasks need HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval        # Code generation
```

**Step 3: Run evaluation**

```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```

**Step 4: Check results**

```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```

### Workflow 2: Run Evaluation on Slurm HPC Cluster

Execute large-scale evaluation on HPC infrastructure.

**Checklist**:
```
Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status
```

**Step 1: Configure Slurm settings**

```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```

**Step 2: Set up model deployment**

```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
```

**Step 3: Launch evaluation**

```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```

**Step 4: Monitor job status**

```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```

### Workflow 3: Compare Multiple Models

Benchmark multiple models on the same tasks for comparison.

**Checklist**:
```
Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results
```

**Step 1: Create base config**

```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```

**Step 2: Run evaluations with model overrides**

```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```

**Step 3: Export and compare**

```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```

### Workflow 4: Safety and Vision-Language Evaluation

Evaluate models on safety benchmarks and VLM tasks.

**Checklist**:
```
Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation
```

**Step 1: Configure safety tasks**

```yaml
evaluation:
  tasks:
    - name: aegis

More from this repository

autoresearchSkill

Orchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with clear optimization targets. The outer loop synthesizes results, identifies patterns, and steers research direction. Routes to domain-specific skills for execution, supports continuous agent operation via Claude Code /loop and OpenClaw heartbeat, and produces research presentations and papers. Use when starting a research project, running autonomous experiments, or managing a multi-hypothesis research effort.

implementing-llms-litgptSkill

Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when need clean model implementations, educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.

mamba-architectureSkill

State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.

nanogptSkill

Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).

rwkv-architectureSkill

RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.

distributed-llm-pretraining-torchtitanSkill

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

huggingface-tokenizersSkill

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

sentencepieceSkill

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.