devops-infrastructure
The devops-infrastructure skill provides guidance on containerization, CI/CD pipeline configuration, deployment strategies, infrastructure as code tools, and observability setup. Use this skill when writing Dockerfiles, configuring GitHub Actions workflows, planning deployment architectures, setting up monitoring systems, or answering questions about containers, Terraform, Kubernetes, production infrastructure, or related DevOps concerns.
git clone --depth 1 https://github.com/CloudAI-X/claude-workflow-v2 /tmp/devops-infrastructure && cp -r /tmp/devops-infrastructure/skills/devops-infrastructure ~/.claude/skills/devops-infrastructure
SKILL.md
# DevOps & Infrastructure
### When to Load
- **Trigger**: Docker, CI/CD pipelines, deployment configuration, monitoring, infrastructure as code
- **Skip**: Application logic only with no infrastructure or deployment concerns
## DevOps Workflow
Copy this checklist and track progress:
```
DevOps Setup Progress:
- [ ] Step 1: Containerize application (Dockerfile)
- [ ] Step 2: Set up CI/CD pipeline
- [ ] Step 3: Define deployment strategy
- [ ] Step 4: Configure monitoring & alerting
- [ ] Step 5: Set up environment management
- [ ] Step 6: Document runbooks
- [ ] Step 7: Validate against anti-patterns checklist
```
## Docker Best Practices
### Multi-Stage Build
```dockerfile
# WRONG: Single stage, bloated image
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
CMD ["node", "dist/index.js"]
# Result: 1.2GB image with devDependencies and source code

# CORRECT: Multi-stage build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
# Build, then prune devDependencies so the node_modules copied below holds runtime packages only
RUN npm run build && npm prune --omit=dev

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]
# Result: ~150MB image, no devDependencies, non-root user
```
### Python Multi-Stage
```dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock ./
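# Install locked runtime dependencies only. --no-install-project skips building the
# project itself, whose source is not copied yet (it runs from ./src in the final stage).
RUN uv sync --frozen --no-dev --no-install-project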
COPY . .
FROM python:3.12-slim AS runner
WORKDIR /app
RUN useradd -r -s /bin/false appuser
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/src ./src
ENV PATH="/app/.venv/bin:$PATH"
USER appuser
CMD ["python", "-m", "src.main"]
```
### Layer Caching
```dockerfile
# WRONG: Cache busted on every code change
COPY . .
RUN npm ci
# CORRECT: Dependencies cached separately
COPY package*.json ./
# Cached unless package.json or package-lock.json changes
RUN npm ci
# Only source code changes bust this layer
COPY . .
```
### .dockerignore
```
node_modules
.git
.env
*.md
.vscode
coverage
dist
__pycache__
.pytest_cache
*.pyc
```
### Security
```dockerfile
# Always pin versions (NOT node:latest)
FROM node:20.11.0-alpine
# Don't run as root
USER appuser
# Read-only filesystem where possible
# docker run --read-only --tmpfs /tmp myapp
# Scan images
# docker scout cves myimage:latest
# trivy image myimage:latest
```
## CI/CD Pipeline Design
### GitHub Actions Structure
```yaml
name: CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          # The postgres image will not start without a password (test-only value)
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm test

  build:
    runs-on: ubuntu-latest
    needs: test
    # GITHUB_TOKEN needs package write access to push to ghcr.io
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # Log in to GHCR only when the image will actually be pushed
      - uses: docker/login-action@v3
        if: github.event_name == 'push'
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: ${{ github.event_name == 'push' }}
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - run: echo "Deploy to production"
```
### Caching Strategies
```yaml
# Node modules
- uses: actions/setup-node@v4
  with:
    cache: "npm"

# Python with uv
- name: Cache uv
  uses: actions/cache@v4
  with:
    path: ~/.cache/uv
    key: uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}

# Docker layer caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max
```
## Deployment Strategies
### Blue-Green Deployment
```
1. Run two identical environments: Blue (live) and Green (idle)
2. Deploy new version to Green
3. Run smoke tests on Green
4. Switch load balancer to Green
5. Green is now live, Blue is idle
6. Rollback: switch back to Blue
Pros: Instant rollback, zero downtime
Cons: 2x infrastructure cost during deploy
```
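
The switch-and-rollback steps above can be sketched as a small state model. This is a minimal illustration, not a load balancer integration: the `BlueGreenState` type and the `deployToIdle`, `switchTraffic`, and `rollback` functions are hypothetical names introduced here, and a real setup would call your load balancer or DNS API where the state is flipped.

```typescript
type Color = "blue" | "green";

interface BlueGreenState {
  live: Color; // environment currently receiving traffic
  versions: Record<Color, string>; // version deployed to each environment
}

// The idle environment is whichever one is not live.
function idleColor(state: BlueGreenState): Color {
  return state.live === "blue" ? "green" : "blue";
}

// Step 2: deploy the new version to the idle environment; live traffic is untouched.
function deployToIdle(state: BlueGreenState, version: string): BlueGreenState {
  const versions = { ...state.versions };
  versions[idleColor(state)] = version;
  return { live: state.live, versions };
}

// Steps 3-5: flip traffic only if smoke tests on the idle environment passed.
function switchTraffic(state: BlueGreenState, smokeTestsPassed: boolean): BlueGreenState {
  if (!smokeTestsPassed) return state;
  return { ...state, live: idleColor(state) };
}

// Step 6: rollback is the same flip in the other direction.
function rollback(state: BlueGreenState): BlueGreenState {
  return { ...state, live: idleColor(state) };
}

const initial: BlueGreenState = { live: "blue", versions: { blue: "v1", green: "v1" } };
const staged = deployToIdle(initial, "v2");
const switched = switchTraffic(staged, true);
console.log(switched.live, switched.versions[switched.live]); // green v2
```

Because the previous version stays deployed on the now-idle side, `rollback` needs no redeploy, which is where the "instant rollback" benefit comes from.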
### Canary Deployment
```
1. Deploy new version to small subset (5% of traffic)
2. Monitor error rates and latency
3. Gradually increase: 5% -> 25% -> 50% -> 100%
4. Rollback: route all traffic back to old version
Pros: Limited blast radius, real-world testing
Cons: More complex routing, longer rollout
```
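
A rough sketch of the canary steps above, assuming traffic is split per user with a deterministic hash so that each user consistently sees the same version. The stage list mirrors the 5% -> 25% -> 50% -> 100% progression; `bucketOf`, `routeToCanary`, and `nextCanaryPercent` are hypothetical helpers, and the FNV-1a hash is just one simple choice of stable hash.

```typescript
// Rollout stages from the steps above, as percentages of traffic.
const CANARY_STAGES = [5, 25, 50, 100] as const;

// Map a user ID to a stable bucket in [0, 100) using a 32-bit FNV-1a hash.
function bucketOf(userId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// A user is routed to the canary when their bucket falls below the current percentage.
// Users admitted at 5% therefore stay on the canary at 25%, 50%, and 100%.
function routeToCanary(userId: string, canaryPercent: number): boolean {
  return bucketOf(userId) < canaryPercent;
}

// Advance to the next stage when metrics look healthy; otherwise roll back to 0%.
function nextCanaryPercent(current: number, healthy: boolean): number {
  if (!healthy) return 0;
  const next = CANARY_STAGES.find((stage) => stage > current);
  return next ?? 100;
}

console.log(nextCanaryPercent(5, true)); // 25
console.log(nextCanaryPercent(25, false)); // 0
```

In practice the `healthy` input would come from comparing the canary's error rate and latency against the stable version over a fixed observation window.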
### Rolling Deployment
```
1. Replace instances one at a time
2. Each new instance passes health checks before next starts
3. Continue until all instances updated
Pros: No extra infrastructure, gradual rollout
Cons: Mixed versions during deploy, slower rollback
```
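
The rolling steps above can be sketched as a loop that replaces one instance at a time and halts on the first failed health check. This is an illustrative model only: `Instance`, `RolloutResult`, and `rollingUpdate` are hypothetical names, and the synchronous `isHealthy` callback stands in for whatever readiness probe your platform provides.

```typescript
interface Instance {
  id: string;
  version: string;
}

interface RolloutResult {
  instances: Instance[];
  completed: boolean; // false when the rollout halted on a failed health check
}

// Replace instances one at a time. A replacement goes live only after it passes
// its health check; on the first failure the rollout halts and the remaining
// instances keep the old version.
function rollingUpdate(
  instances: readonly Instance[],
  newVersion: string,
  isHealthy: (candidate: Instance) => boolean,
): RolloutResult {
  const result: Instance[] = [];
  let halted = false;
  for (const current of instances) {
    if (halted) {
      result.push({ ...current });
      continue;
    }
    const candidate: Instance = { id: current.id, version: newVersion };
    if (isHealthy(candidate)) {
      result.push(candidate);
    } else {
      halted = true;
      result.push({ ...current }); // keep the old instance serving traffic
    }
  }
  return { instances: result, completed: !halted };
}

const fleet: Instance[] = [
  { id: "a", version: "v1" },
  { id: "b", version: "v1" },
  { id: "c", version: "v1" },
];
const outcome = rollingUpdate(fleet, "v2", (candidate) => candidate.id !== "b");
console.log(outcome.completed, outcome.instances.map((i) => i.version).join(","));
// false v2,v1,v1
```

The halted result shows the mixed-version state listed under Cons: instance `a` already runs `v2` while `b` and `c` remain on `v1`, so a rollback has to replace `a` again.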
### Feature Flags
```typescript
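// Simple feature flag implementation: flags are read once from environment variables
const features = {
  NEW_CHECKOUT: process.env.FF_NEW_CHECKOUT === "true",
  DARK_MODE: process.env.FF_DARK_MODE === "true",
};

// Minimal shape assumed for this example; the real User type lives in your application
interface User {
  betaGroup: boolean;
}

// newCheckoutFlow and legacyCheckoutFlow are the application's own implementations
function getCheckoutFlow(user: User) {
  if (features.NEW_CHECKOUT && user.betaGroup) {
    return newCheckoutFlow(user);
  }
  return legacyCheckoutFlow(user);
}

// Use a proper service for production: LaunchDarkly, Unleash, Flagsmith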
```
## Infrastructure as Code
### Terraform Basics
```hcl
# main.tf
terraform {
  required_version = ">= 1.5"

  backend "s3" {
    bucket = "myapp-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}
```