Skip to main content
ClaudeWave
Skill374 repo starsupdated 6mo ago

operating-kubernetes

This skill provides operational frameworks for managing production Kubernetes clusters, covering resource management, advanced scheduling, networking, storage, security hardening, and autoscaling strategies. Use it when deploying applications to Kubernetes, configuring resource requests and limits, implementing NetworkPolicies, setting up autoscaling mechanisms, managing persistent storage, or troubleshooting operational failures like pod eviction or resource exhaustion.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/ancoleman/ai-design-components /tmp/operating-kubernetes && cp -r /tmp/operating-kubernetes/skills/operating-kubernetes ~/.claude/skills/operating-kubernetes
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Kubernetes Operations

## Purpose

Operating Kubernetes clusters in production requires mastery of resource management, scheduling patterns, networking architecture, storage strategies, security hardening, and autoscaling. This skill provides operations-first frameworks for right-sizing workloads, implementing high-availability patterns, securing clusters with RBAC and Pod Security Standards, and systematically troubleshooting common failures.

Use this skill when deploying applications to Kubernetes, configuring cluster resources, implementing NetworkPolicies for zero-trust security, setting up autoscaling (HPA, VPA, KEDA), managing persistent storage, or diagnosing operational issues like CrashLoopBackOff or resource exhaustion.

## When to Use This Skill

**Common Triggers:**
- "Deploy my application to Kubernetes"
- "Configure resource requests and limits"
- "Set up autoscaling for my pods"
- "Implement NetworkPolicies for security"
- "My pod is stuck in Pending/CrashLoopBackOff"
- "Configure RBAC with least privilege"
- "Set up persistent storage for my database"
- "Spread pods across availability zones"

**Operations Covered:**
- Resource management (CPU/memory, QoS classes, quotas)
- Advanced scheduling (affinity, taints, topology spread)
- Networking (NetworkPolicies, Ingress, Gateway API)
- Storage operations (StorageClasses, PVCs, CSI)
- Security hardening (RBAC, Pod Security Standards, policies)
- Autoscaling (HPA, VPA, KEDA, cluster autoscaler)
- Troubleshooting (systematic debugging playbooks)

## Resource Management

### Quality of Service (QoS) Classes

Kubernetes assigns QoS classes based on resource requests and limits:

**Guaranteed (Highest Priority):**
- Requests equal limits for CPU and memory
- Never evicted unless exceeding limits
- Use for critical production services

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"  # Same as request
    cpu: "500m"
```

**Burstable (Medium Priority):**
- Requests less than limits (or only requests set)
- Can burst above requests
- Evicted under node pressure
- Use for web servers, most applications

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # 2x request
    cpu: "500m"
```

**BestEffort (Lowest Priority):**
- No requests or limits set
- First to be evicted under pressure
- Use only for development/testing

### Decision Framework: Which QoS Class?

| Workload Type | QoS Class | Configuration |
|---------------|-----------|---------------|
| Critical API/Database | Guaranteed | requests == limits |
| Web servers, services | Burstable | limits 1.5-2x requests |
| Batch jobs | Burstable | Low requests, high limits |
| Dev/test environments | BestEffort | No limits |

### Resource Quotas and LimitRanges

Enforce multi-tenancy with ResourceQuotas (namespace limits) and LimitRanges (per-container defaults):

```yaml
# ResourceQuota: Namespace-level limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
```

For detailed resource management patterns including Vertical Pod Autoscaler (VPA), see `references/resource-management.md`.

## Advanced Scheduling

### Node Affinity

Control which nodes pods schedule on with required (hard) or preferred (soft) constraints:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - g4dn.xlarge  # GPU instance
```

### Taints and Tolerations

Reserve nodes for specific workloads (inverse of affinity):

```bash
# Taint GPU nodes to prevent non-GPU workloads
kubectl taint nodes gpu-node-1 workload=gpu:NoSchedule
```

```yaml
# Pod tolerates GPU taint
tolerations:
- key: "workload"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
```

### Topology Spread Constraints

Distribute pods evenly across failure domains (zones, nodes):

```yaml
topologySpreadConstraints:
- maxSkew: 1  # Max difference in pod count
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: critical-app
```

For advanced scheduling patterns including pod priority and preemption, see `references/scheduling-patterns.md`.

## Networking

### NetworkPolicies (Zero-Trust Security)

Implement default-deny security with NetworkPolicies:

```yaml
# Default deny all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```

```yaml
# Allow specific ingress (frontend → backend)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
```

### Ingress vs. Gateway API

**Ingress (Legacy):**
- Widely supported, mature ecosystem
- Limited expressiveness
- Use for existing applications

**Gateway API (Modern):**
- Role-oriented design (cluster ops vs. app devs)
- More expressive (HTTPRoute, TCPRoute, TLSRoute)
- Recommended for new applications (GA in Kubernetes 1.29+)

```yaml
# Gateway API example
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-routes
spec:
  parentRefs:
  - name: production-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: backend
      port: 8080
```

For detailed networking patterns including service mesh integration, see `references/networking.md`.

## Storage

### StorageClasses (Define Performance Tiers)

StorageClasses define storage tiers for different wor
administering-linuxSkill

Manage Linux systems covering systemd services, process management, filesystems, networking, performance tuning, and troubleshooting. Use when deploying applications, optimizing server performance, diagnosing production issues, or managing users and security on Linux servers.

ai-data-engineeringSkill

Data pipelines, feature stores, and embedding generation for AI/ML systems. Use when building RAG pipelines, ML feature serving, or data transformations. Covers feature stores (Feast, Tecton), embedding pipelines, chunking strategies, orchestration (Dagster, Prefect, Airflow), dbt transformations, data versioning (LakeFS), and experiment tracking (MLflow, W&B).

architecting-dataSkill

Strategic guidance for designing modern data platforms, covering storage paradigms (data lake, warehouse, lakehouse), modeling approaches (dimensional, normalized, data vault, wide tables), data mesh principles, and medallion architecture patterns. Use when architecting data platforms, choosing between centralized vs decentralized patterns, selecting table formats (Iceberg, Delta Lake), or designing data governance frameworks.

architecting-networksSkill

Design cloud network architectures with VPC patterns, subnet strategies, zero trust principles, and hybrid connectivity. Use when planning VPC topology, implementing multi-cloud networking, or establishing secure network segmentation for cloud workloads.

architecting-securitySkill

Design comprehensive security architectures using defense-in-depth, zero trust principles, threat modeling (STRIDE, PASTA), and control frameworks (NIST CSF, CIS Controls, ISO 27001). Use when designing security for new systems, auditing existing architectures, or establishing security governance programs.

assembling-componentsSkill

Assembles component outputs from AI Design Components skills into unified, production-ready component systems with validated token integration, proper import chains, and framework-specific scaffolding. Use as the capstone skill after running theming, layout, dashboard, data-viz, or feedback skills to wire components into working React/Next.js, Python, or Rust projects.

building-ai-chatSkill

Builds AI chat interfaces and conversational UI with streaming responses, context management, and multi-modal support. Use when creating ChatGPT-style interfaces, AI assistants, code copilots, or conversational agents. Handles streaming text, token limits, regeneration, feedback loops, tool usage visualization, and AI-specific error patterns. Provides battle-tested components from leading AI products with accessibility and performance built in.

building-ci-pipelinesSkill

Constructs secure, efficient CI/CD pipelines with supply chain security (SLSA), monorepo optimization, caching strategies, and parallelization patterns for GitHub Actions, GitLab CI, and Argo Workflows. Use when setting up automated testing, building, or deployment workflows.