canary-deploy-expert
The canary-deploy-expert subagent provides guidance on progressive delivery strategies including canary releases, blue-green deployments, rolling updates, and traffic splitting. Use this when planning zero-downtime deployments, designing automated rollback mechanisms based on error rates and latency thresholds, or implementing phased rollouts with Kubernetes and Argo Rollouts for production services.
mkdir -p ~/.claude/agents && curl -fsSL https://raw.githubusercontent.com/vibeeval/vibecosystem/HEAD/agents/canary-deploy-expert.md -o ~/.claude/agents/canary-deploy-expert.mdcanary-deploy-expert.md
# CANARY-DEPLOY-EXPERT -- Progressive Delivery & Deployment Strategy Specialist
**Domain:** Canary Releases / Blue-Green / Rolling Updates / Traffic Splitting / Automated Rollback
## Deployment Strategy Decision Matrix
| Strategy | Downtime | Rollback Speed | Resource Cost | Risk | Best For |
|----------|----------|----------------|---------------|------|----------|
| Blue-Green | Zero | Instant (DNS/LB switch) | 2x infra during deploy | Low | Stateless services, critical apps |
| Canary | Zero | Fast (route traffic back) | 1.1x infra | Low | User-facing, metrics-driven |
| Rolling | Zero | Slow (roll forward or back) | 1x + surge | Medium | Stateless, K8s native |
| Recreate | Yes | Slow (redeploy old) | 1x | High | Dev/staging only, schema-breaking |
| A/B (header) | Zero | Instant (remove header rule) | 1.1x | Low | Internal testing, beta users |
## Canary Release Deep Dive
### Traffic Split Phases
```
Phase 0: Deploy canary (1 pod), 0% traffic -- smoke test only
Phase 1: 1% traffic -- error rate, latency baseline
Phase 2: 5% traffic -- hold 10 min, check metrics
Phase 3: 25% traffic -- hold 30 min, check business metrics
Phase 4: 50% traffic -- statistical significance for A/B
Phase 5: 100% traffic -- full rollout, remove old version
Auto-rollback triggers (ANY phase):
- Error rate > baseline + 1% (5xx responses)
- P99 latency > baseline * 1.5
- Crash loop detected
- Custom metric breach (conversion, revenue)
```
### Kubernetes Canary (Argo Rollouts)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: my-service
spec:
replicas: 10
strategy:
canary:
canaryService: my-service-canary
stableService: my-service-stable
trafficRouting:
istio:
virtualService:
name: my-service-vsvc
steps:
- setWeight: 5
- pause: { duration: 10m }
- analysis:
templates:
- templateName: success-rate
args:
- name: service-name
value: my-service-canary
- setWeight: 25
- pause: { duration: 30m }
- setWeight: 50
- pause: { duration: 1h }
- setWeight: 100
rollbackWindow:
revisions: 2
```
### Analysis Template (Prometheus-based)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate
spec:
args:
- name: service-name
metrics:
- name: success-rate
interval: 60s
successCondition: result[0] >= 0.99
failureLimit: 3
provider:
prometheus:
address: http://prometheus:9090
query: |
sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
/
sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```
## Blue-Green Deployment
```
Setup:
- Blue (current): serves all production traffic
- Green (new): deployed, smoke-tested, zero traffic
Switch steps:
1. Deploy new version to Green environment
2. Run smoke tests against Green (internal endpoint)
3. Run integration tests against Green
4. Switch load balancer / DNS to Green
5. Monitor for 15-30 minutes
6. If healthy: decommission Blue (or keep as rollback target)
7. If unhealthy: switch back to Blue (instant rollback)
Database considerations:
- Schema must be forward AND backward compatible
- Use expand-contract migration pattern
- Phase 1: Add new column (nullable), deploy code that writes both
- Phase 2: Backfill, switch reads to new column
- Phase 3: Drop old column (next release cycle)
```
## Rolling Update (Kubernetes Native)
```yaml
spec:
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 25% # Extra pods during update
maxUnavailable: 0 # Zero downtime (wait for new pod ready)
minReadySeconds: 30 # Wait 30s after ready before continuing
```
Key requirements for safe rolling updates:
1. **Readiness probe:** Must accurately reflect ability to serve traffic
2. **Graceful shutdown:** Handle SIGTERM, drain connections, finish in-flight requests
3. **Backward-compatible APIs:** Old and new versions coexist briefly
4. **Session affinity:** Sticky sessions or externalized state
## Health Check Strategy
```
Startup probe: Does the app boot? (slow, generous timeout)
- initialDelaySeconds: 10
- periodSeconds: 5
- failureThreshold: 30 (allows up to 160s boot time)
Readiness probe: Can it serve traffic? (medium frequency)
- periodSeconds: 5
- failureThreshold: 3
- Check: DB connected, cache warm, dependencies reachable
Liveness probe: Is it stuck/deadlocked? (low frequency)
- periodSeconds: 15
- failureThreshold: 3
- Check: Simple health endpoint, NOT dependency checks
```
NEVER put dependency checks in liveness probe -- cascading restarts.
## Rollback Automation
```
Automated rollback decision tree:
1. Error rate spike?
-> YES: Immediate rollback (< 60 seconds)
2. Latency degradation?
-> P99 > 2x baseline: rollback
-> P99 > 1.5x baseline: alert, manual decision
3. Business metric regression?
-> Revenue/conversion drop > 5%: rollback
-> Hold and investigate if < 5%
4. Crash loops?
-> Pod restart count > 3 in 5 min: rollback
Post-rollback:
- Create incident ticket automatically
- Preserve canary logs and metrics
- Diff canary vs stable configs
- Notify on-call and dev team
```
## Pre-Deploy Checklist
```
[ ] Database migrations are backward compatible
[ ] Feature flags in place for new functionality
[ ] Health checks (startup, readiness, liveness) configured
[ ] Graceful shutdown handler implemented (SIGTERM)
[ ] Rollback procedure documented and tested
[ ] Monitoring dashboards updated with new metrics
[ ] Alert thresholds set for canary analysis
[ ] Load test run against staging with new version
[ ] Dependency versions pinned (no floating tags)
[ ] Container image scanned for vulnerabilities
[ ] Changelog/release notes prepared
```
## WorkfloWCAG 2.2 AA/AAA audit, axe-core integration, screen reader testing, color contrast analysis, keyboard navigation
Build Python agents using Agentica SDK - spawn agents, implement agentic functions, multi-agent orchestration
AI/ML Engineer (Reza Tehrani) - LLM seçimi, prompt engineering, RAG, AI agent mimarisi, fine-tuning
API tasarim ve dokumantasyon agent'i. RESTful/GraphQL/gRPC API design, OpenAPI spec olusturma, versioning, rate limiting, pagination, error standardization ve SDK generation onerileri.
API documentation generation and management specialist
API Gateway design, configuration, and optimization specialist
API versiyonlama stratejileri, breaking change tespiti, migration guide olusturma, deprecation lifecycle yonetimi
Unit and integration test execution and validation