Skill340 repo starsupdated 8d ago

kubernetes-skill

This skill diagnoses and resolves six critical Kubernetes failure modes: insecure workload defaults, resource starvation, network exposure, privilege sprawl, fragile rollouts, and API drift. Use it when authoring, reviewing, refactoring, or migrating Kubernetes manifests across multiple deployment methods (raw YAML, Helm, Kustomize) and cloud distributions (EKS, GKE, AKS, OpenShift, vanilla) to prevent misconfigurations that cause security breaches, scheduling failures, or unstable rollouts.

View source Repository: kubernetes-skill

Install in Claude Code

Copy

git clone https://github.com/LukasNiessen/kubernetes-skill ~/.claude/skills/kubernetes-skill

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# KubeShark: Failure-Mode Workflow for Kubernetes

Run this workflow top to bottom.

## 1) Capture execution context

Record before writing manifests:
- cluster version (e.g. 1.30, 1.31) and distribution (EKS, GKE, AKS, k3s, vanilla)
- target namespace and environment criticality (dev/staging/prod)
- workload type (Deployment, StatefulSet, Job, CronJob, DaemonSet)
- deployment method (raw YAML, Helm, Kustomize, operator-managed)
- policy enforcement (Pod Security Admission level, Kyverno, OPA/Gatekeeper)
- cloud provider and CNI (affects networking, storage classes, load balancers)
- platform controllers/add-ons (GitOps, observability, ingress, service mesh, autoscaling)

If unknown, state assumptions explicitly.

## 2) Diagnose likely failure mode(s)

Select one or more based on user intent and risk:
- insecure workload defaults: missing security contexts, PSS violations, host access
- resource starvation: missing requests/limits, no PDB, scheduling chaos
- network exposure: flat networking, missing policies, wrong Service types, DNS issues
- privilege sprawl: overly permissive RBAC, leaked secrets, excess ServiceAccount rights
- fragile rollouts: misconfigured probes, mutable tags, unsafe update strategies
- API drift: wrong apiVersion, deprecated APIs, schema violations, tool-specific errors

## 3) Load only the relevant reference file(s)

Primary failure-mode references:
- `references/insecure-workload-defaults.md`
- `references/resource-starvation.md`
- `references/network-exposure.md`
- `references/privilege-sprawl.md`
- `references/fragile-rollouts.md`
- `references/api-drift.md`

Supplemental references (only when needed):
- `references/deployment-patterns.md`
- `references/stateful-patterns.md`
- `references/job-patterns.md`
- `references/daemonset-operator-patterns.md`
- `references/security-hardening.md`
- `references/observability.md`
- `references/multi-tenancy.md`
- `references/storage-and-state.md`
- `references/helm-patterns.md`
- `references/kustomize-patterns.md`
- `references/validation-and-policy.md`
- `references/examples-good.md`
- `references/examples-bad.md`
- `references/do-dont-patterns.md`

Conditional Reference Retrieval (CRR) references (load only when the signal is detected):
- `references/conditional/eks-patterns.md` for EKS, AWS, IRSA, EKS Pod Identity, AWS Load Balancer Controller, EBS/EFS CSI, Karpenter
- `references/conditional/gke-patterns.md` for GKE, Autopilot, Workload Identity Federation for GKE, Dataplane V2, GCE Ingress, Config Sync
- `references/conditional/aks-patterns.md` for AKS, Microsoft Entra Workload ID, Azure CNI, AGIC, Azure Disk/File/Blob CSI
- `references/conditional/openshift-patterns.md` for OpenShift, OKD, ROSA, ARO, Routes, SCCs, OLM, `oc`
- `references/conditional/gitops-controllers.md` for Argo CD, ApplicationSet, Flux, GitOps reconciliation, sync waves
- `references/conditional/observability-stacks.md` for Prometheus Operator, ServiceMonitor, PodMonitor, OpenTelemetry, Loki, Grafana

Do not load multiple CRR files unless the task spans multiple detected platforms/tools.

## 4) Propose fix path with explicit risk controls

For each fix, include:
- why this addresses the failure mode
- what could still go wrong at deploy time or runtime
- guardrails (validation commands, policy checks, rollback path)

## 5) Generate implementation artifacts

When applicable, output:
- Kubernetes manifests (YAML with security contexts, resource limits, labels)
- Helm values/templates or Kustomize overlays
- NetworkPolicies, RBAC resources, PodDisruptionBudgets
- Policy rules (Kyverno/OPA) and admission controls

## 6) Validate before finalize

Always provide validation steps tailored to deployment method and risk tier:
- `kubectl apply --dry-run=server` or `kubectl diff`
- `kubeconform` for schema validation against target cluster version
- cross-resource consistency check (label/selector/port alignment)
- policy scan (PSS profile check, Kyverno/OPA audit)
Never recommend direct production apply without reviewed diff and approval.

## 7) Output contract

Return:
- assumptions and cluster version floor
- selected failure mode(s)
- chosen remediation and tradeoffs
- validation/test plan
- rollback/recovery notes (rollout undo, revision history, data safety)