system-performance-remediation

System Performance Remediation restores machine responsiveness by systematically diagnosing and eliminating runaway processes, cache bloat, and memory pressure. Use this skill when the system exhibits high CPU load, memory exhaustion, swap thrashing, or zombie processes that prevent normal operation. The skill prioritizes killing confused parent agents and obviously useless processes before touching potentially valuable work, includes VM tuning recommendations, and provides copy-paste diagnostic commands for rapid triage.

Repository: agentops

Install in Claude Code

git clone --depth 1 https://github.com/boshu2/agentops /tmp/system-performance-remediation && cp -r /tmp/system-performance-remediation/images/gemini/skills/system-performance-remediation ~/.claude/skills/system-performance-remediation

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

<!-- TOC: Quick Reference | VM Tuning & Cache Bloat | systemd-oomd Protection | Kill Hierarchy | Diagnosis | Swap & zram | Disk Cleanup | Zellij/Tmux Cleanup | Orphans | Agent Swarm Fix | Fleet Triage | Emergency | References -->

# System Performance Remediation

> **Core Principle:** First, do no harm. Kill OBVIOUSLY useless processes before touching anything potentially useful.

> **The Whack-a-Mole Anti-Pattern:**
> Killing child processes (cargo builds, tests) is POINTLESS if confused parent agents respawn them.
> **Kill the confused agents, not their children.**
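
One way to find the parent that keeps respawning a build or test is to walk the process tree upward from the child. The sketch below assumes Linux `/proc`; the function name `ancestry` is illustrative and not part of the skill.

```bash
# Print the ancestor chain of a PID, child first, one "PID COMMAND" per line.
# Reads Linux /proc directly; `ancestry` is a hypothetical helper name.
ancestry() {  # usage: ancestry PID
  local pid=$1
  while [ -n "$pid" ] && [ "$pid" -ge 1 ] && [ -r "/proc/$pid/status" ]; do
    # cmdline is NUL-separated; turn the separators into spaces for display
    printf '%s %s\n' "$pid" "$(tr '\000' ' ' < "/proc/$pid/cmdline" 2>/dev/null)"
    # PPid of PID 1 is 0, which ends the loop
    pid=$(awk '/^PPid:/ {print $2}' "/proc/$pid/status")
  done
}
```

For example, `ancestry "$(pgrep -n cc1plus)"` lists the newest compiler process followed by its parents; the first agent-looking line in that chain is the candidate to inspect before killing any children.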

---

## Quick Reference — Copy-Paste Commands

```bash
# === INSTANT DIAGNOSIS ===
uptime && nproc && cat /proc/pressure/cpu | head -1

# === ONE-LINER STATUS (includes swap + memory pressure) ===
echo "Load: $(uptime | awk -F'load average:' '{print $2}') / $(nproc) cores | Mem: $(free -h | awk '/Mem:/{print $3"/"$2}') | Swap: $(free -h | awk '/Swap:/{print $3"/"$2}') | Zombies: $(ps -eo stat | grep -c '^Z') | MemP: $(awk -F= '/some/{print $2}' /proc/pressure/memory | cut -d' ' -f1)%"

# === VM TUNING CHECK (catches cache bloat before it kills sessions) ===
sysctl vm.vfs_cache_pressure vm.min_free_kbytes && cat /proc/pressure/memory

# === FIND STUCK PROCESSES ===
ps -eo pid,etimes,pcpu,args --sort=-etimes | grep -E 'bun test|cargo test|vercel|git add' | awk '$2 > 3600'

# === FIND STALE GEMINI AGENTS (24+ hours) ===
ps -eo pid,etimes,pcpu,rss,args | grep 'bun.*gemini' | grep -v grep | awk '$2 > 86400 {print $1, int($2/3600)"h", $3"%", int($4/1024)"MB"}'

# === COUNT MCP SERVER BLOAT ===
ps aux | grep -E 'playwright|morphmcp' | grep -v grep | wc -l

# === FIND COMPETING BUILDS ===
ps aux | grep cc1plus | grep -oP 'target[^/]*/' | sort | uniq -c

# === FIND OLD AGENTS (16+ hours) ===
ps -eo pid,etimes,pcpu,args | grep -E 'claude --dangerously|codex --dangerously' | awk '$2 > 57600 {print $1, int($2/3600)"h", $3"%"}'

# === KILL OLD AGENTS (16+ hours; same pattern as FIND OLD AGENTS above) ===
ps -eo pid,etimes,args | grep -E 'claude --dangerously|codex --dangerously' | awk '$2 > 57600 {print $1}' | xargs -r kill

# === RENICE ALL COMPILATION ===
for pid in $(pgrep -f '/bin/cargo') $(pgrep cc1plus); do renice 19 -p $pid; ionice -c 3 -p $pid; done 2>/dev/null

# === ZELLIJ DEAD SESSION COUNT ===
zellij list-sessions 2>&1 | grep -c EXITED
```
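
Because the kill one-liner acts immediately, it can help to review the matches first. A minimal sketch, assuming input in the `pid etimes args` column layout used above; `older_than` is a hypothetical helper, not part of the skill.

```bash
# Filter "PID ETIMES ARGS..." lines read from stdin.
# Prints "PID AGEh ARGS..." for lines older than MIN_SECONDS that match REGEX.
older_than() {  # usage: older_than MIN_SECONDS REGEX
  awk -v min="$1" -v pat="$2" '
    $2 > min && $0 ~ pat {
      pid = $1; age = int($2 / 3600)
      $1 = ""; $2 = ""; sub(/^ +/, "")   # keep only the command line
      print pid, age "h", $0
    }'
}

# Review first; kill only once the list looks right:
#   ps -eo pid=,etimes=,args= | older_than 57600 'claude --dangerously|codex --dangerously'
#   ps -eo pid=,etimes=,args= | older_than 57600 'claude --dangerously|codex --dangerously' \
#     | awk '{print $1}' | xargs -r kill
```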

---

## Kill Hierarchy (Safest First)

| Priority | Category | Examples | Risk |
|----------|----------|----------|------|
| 1 | **Zombies** | Defunct processes (Z state) | Zero — already dead |
| 2 | **Exited zellij/tmux sessions** | `zellij delete-all-sessions` | Zero — already exited |
| 3 | **Stuck tests** | `bun test`, `cargo test` 12+ hours | Low — idempotent |
| 4 | **Orphaned poll loops** | zsh shells waiting on files that never appear | Low — wasted CPU |
| 5 | **Stuck CLI** | `vercel inspect`, `git add .` 5+ min | Low — restart-safe |
| 6 | **Duplicate builds** | Multiple `cargo check` same project | Low — keep newest |
| 7 | **Old dev servers** | `next dev`, `bun --hot` idle 24+ hours | Low — restart-safe |
| 8 | **Stale gemini agents** | `bun gemini` running 24+ hours | Medium — likely stuck |
| 9 | **Old tmux sessions** | `ntm-*` no activity | Medium — check first |
| 10 | **Old agents** | `claude`, `codex` 16+ hours | Medium — likely stuck |
| 11 | **Active agents** | `claude`, `codex` <16 hours | High — doing work |
| 12 | **System processes** | NEVER TOUCH | Forbidden |

### Protected Patterns (NEVER KILL)

```
systemd, sshd, dbus, cron, docker, containerd
postgres, mysql, redis, elasticsearch, nginx, caddy
wezterm-mux-server  ← ABSOLUTELY NEVER TOUCH — holds ALL agent sessions
```
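
A guard like the following can sit in front of any scripted kill. It is a sketch, not part of the skill: `is_protected` and `safe_kill` are hypothetical names, and matching the names above as substrings of the full command line is an assumption chosen to err on the side of protecting too much.

```bash
# Regex built from the protected list above (substring match, deliberately broad).
PROTECTED_RE='systemd|sshd|dbus|cron|docker|containerd|postgres|mysql|redis|elasticsearch|nginx|caddy|wezterm-mux-server'

# Succeeds (exit 0) if the given command line mentions a protected name.
is_protected() {  # usage: is_protected "COMMAND LINE"
  printf '%s\n' "$1" | grep -qE "$PROTECTED_RE"
}

# Send SIGTERM to a PID only if its command line is not protected.
safe_kill() {  # usage: safe_kill PID
  local cmd
  cmd=$(ps -o args= -p "$1") || return 1   # no such process
  if is_protected "$cmd"; then
    echo "refusing to kill protected process $1: $cmd" >&2
    return 1
  fi
  kill "$1"
}
```

The broad match means an agent whose arguments merely mention, say, `redis` is also skipped; that false positive is the safe direction for a kill script.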

### SIGTERM vs SIGKILL

Some processes ignore SIGTERM. Always try SIGTERM first, wait 3s, escalate:

```bash
kill $PID; sleep 3; kill -0 $PID 2>/dev/null && kill -9 $PID
```

**Known SIGTERM-ignorers:** `bun test` — always needs SIGKILL after SIGTERM fails.
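
The escalation can be wrapped in a small function that polls once per second instead of always sleeping the full 3 seconds. A sketch; `term_then_kill` is a hypothetical name and the default grace period simply mirrors the 3s above.

```bash
# Send SIGTERM, wait up to GRACE seconds for exit, then escalate to SIGKILL.
term_then_kill() {  # usage: term_then_kill PID [GRACE_SECONDS]
  local pid=$1 grace=${2:-3}
  kill "$pid" 2>/dev/null || return 0          # nothing to signal (gone or not permitted)
  while [ "$grace" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0     # exited after SIGTERM
    sleep 1
    grace=$((grace - 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"                             # still alive: escalate
  fi
}
```

Note that `kill -0` also succeeds for a zombie whose parent has not reaped it yet, so a SIGKILL may occasionally be sent to a process that is already dead; that is harmless.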

---

## VM Tuning & Filesystem Cache Bloat (The Silent Killer)

> **Real-world incident (2026-02-23):** On trj (499GB RAM, btrfs), `vfs_cache_pressure=50` let btrfs
> inode/dentry caches balloon to 388GB page cache + 40GB slab. Memory pressure hit 18%.
> `systemd-oomd` killed `user@1000.service`, destroying the mux server and **all 382 agent sessions**
> instantly. The fix: `vfs_cache_pressure=200` + `min_free_kbytes=2GB` + drop caches.
> Pressure dropped from 18% to 2.4% in minutes.

### The Cache Bloat Pattern

High-RAM machines with many agents accumulate massive filesystem caches. The kernel hoards dentries, inodes, and page cache (especially on btrfs). This creates **memory pressure even with "free" RAM**: allocations stall while the kernel reclaims and compacts cached memory, and `/proc/pressure/memory` counts those stalls.

**Symptoms:**
- System feels sluggish despite `free -h` showing lots of "available" RAM
- `/proc/pressure/memory` shows sustained avg10 > 5% (the key metric!)
- `kcompactd0` running at 2-5% CPU continuously
- Slab cache (`grep Slab /proc/meminfo`) is 20-40+ GB
- `vmstat 1 3` shows high `si`/`so` or `bi`/`bo` in the second and third samples (the first line reports averages since boot)
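
The pressure symptom can be checked mechanically. The sketch below reads the `some avg10` figure and compares it with the 5% threshold from the list above; `psi_avg10` and `psi_level` are hypothetical helper names, and a single reading does not show that pressure is sustained.

```bash
# Extract the "some avg10" value from a PSI file (default: memory pressure).
psi_avg10() {  # usage: psi_avg10 [FILE]
  awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' "${1:-/proc/pressure/memory}"
}

# Classify a reading against the 5% threshold named in the symptom list.
psi_level() {  # usage: psi_level VALUE
  awk -v v="$1" 'BEGIN { s = (v > 5) ? "HIGH" : "ok"; print s }'
}
```

Since `avg10` covers a 10-second window, repeating `psi_level "$(psi_avg10)"` a few times about 10 seconds apart gives a rough sense of whether the pressure persists.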

### Diagnose Cache Bloat

```bash
# 1. Check memory pressure (THE critical metric)
cat /proc/pressure/memory
# some avg10=18.78 → 18.78% of time tasks stalled on memory = BAD

# 2. Check VM tuning
sysctl vm.vfs_cache_pressure vm.min_free_kbytes

# 3. Check slab breakdown
sudo slabtop -o -s c | head -15
# Look for: btrfs_inode (GB), radix_tree_node (GB), dentry (GB), ext4_inode_cache (GB)

# 4. Check page cache vs actual usage
grep -E "Cached|Slab|SReclaimable|SUnreclaim|Dirty|MemAvail" /proc/meminfo

# 5. Check kcompactd (memory compaction daemon — should be ~0% CPU)
ps -o pid,pcpu,etime,cmd -p "$(pgrep -d, kcompactd)" 2>/dev/null
```
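
The `/proc/meminfo` figures in step 4 are reported in kB, which is hard to read on machines with hundreds of GB. A small sketch that converts the relevant fields to GiB; `meminfo_gib` is a hypothetical helper and the field selection is an assumption based on the grep above.

```bash
# Print selected /proc/meminfo fields in GiB (the file reports kB).
meminfo_gib() {  # usage: meminfo_gib [FILE]
  awk '$1 ~ /^(MemTotal|MemAvailable|Cached|Slab|SReclaimable|SUnreclaim):$/ {
         printf "%-14s %8.1f GiB\n", $1, $2 / 1048576
       }' "${1:-/proc/meminfo}"
}
```

A large `Cached` plus `Slab` total alongside high memory pressure matches the cache bloat pattern described above; `SReclaimable` indicates how much of the slab the kernel could give back.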

### Fix: Tune VM Parameters

**Settings by filesystem and RAM size:**

| Machine Type | FS | vfs_cache_pressure | min_free_kbytes | Notes |
|-------------|-----|-------------------|-----------------|-------|
| 499GB btrfs | btrfs | **200** | **2GB** (2097152) | btrfs caches are aggressive |
| 251GB ext4 | ext4 | **150** | **1-2GB** (1048576-2097152) | ext4 is less cache-heavy |
|