Skill1.2k repo starsupdated 1mo ago

gpu-monitor

The gpu-monitor skill quickly displays GPU availability and status by running nvidia-smi and parsing results into a summary table showing memory usage, utilization, temperature, and running processes per GPU. Use it to identify which GPUs are free for new experiments, monitor active training jobs, and assess resource availability on local or remote servers via SSH.

View source Repository: auto-deep-researcher-24x7

Install in Claude Code

Copy

git clone --depth 1 https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7 /tmp/gpu-monitor && cp -r /tmp/gpu-monitor/skills/gpu-monitor ~/.claude/skills/gpu-monitor

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# gpu-monitor

Quick GPU status check for experiment management.

## Usage

```
Claude Code: /gpu-monitor
Claude Code: /gpu-monitor --server user@remote-host
Codex: $gpu-monitor
```

## Behavior

1. Run `nvidia-smi` to get current GPU status
2. Display a clean summary table:
   - GPU ID, Name, Memory (used/total), Utilization %, Temperature
   - Running processes on each GPU
3. Identify which GPUs are free (< 1GB memory used)
4. Identify which GPUs are running experiments (check for python/torchrun processes)
5. If `--server` is provided, SSH to remote server first

## Output Format

```
GPU Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 GPU  Name          Memory         Util  Temp
  0   L20X 144GB    45123/147456   98%   72°C  ← training (PID 12345)
  1   L20X 144GB      234/147456    0%   35°C  ← FREE
  2   L20X 144GB    43210/147456   95%   70°C  ← training (PID 12346)
  3   L20X 144GB     1024/147456   12%   40°C  ← keeper

Free GPUs: [1]
Training: GPU 0 (PID 12345), GPU 2 (PID 12346)
```