
reliability-engineering-cloud

This skill provides a systematic framework for managing cloud service reliability through Service Level Indicators (SLIs), Objectives (SLOs), error budgets, incident response procedures, and NASA Systems Engineering phase gates adapted for cloud operations. Use it when establishing availability targets for a new service, responding to production incidents, designing runbooks, preparing launch readiness reviews, or implementing formal reliability discipline at scale with chaos engineering and blameless postmortem practices.

Repository: gsd-skill-creator

Install in Claude Code

git clone --depth 1 https://github.com/Tibsfox/gsd-skill-creator /tmp/reliability-engineering-cloud && cp -r /tmp/reliability-engineering-cloud/examples/skills/cloud-systems/reliability-engineering-cloud ~/.claude/skills/reliability-engineering-cloud

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Reliability Engineering for Cloud

Reliability is a property of a system, not of any individual component. A cloud service that meets its availability target is not necessarily built from components that individually meet their target — it is built from a system that absorbs component failures without exposing them to users. This skill covers the SRE toolkit (SLIs, SLOs, error budgets, runbooks, postmortems, chaos engineering) alongside NASA's Systems Engineering methodology for design reviews and verification, because cloud operations at serious scale have to borrow discipline from somewhere, and aerospace is the field that has been figuring this out the longest.

**Agent affinity:** gray (transaction processing reliability, ACID foundations), hamilton-cloud (SRE economics at AWS scale), lamport (formal safety arguments)

**Concept IDs:** cloud-se-phase-reviews, cloud-taid-verification, cloud-runbook-structure, cloud-procedure-execution, cloud-communication-loops

## Service Level Indicators, Objectives, and Agreements

**SLI (Service Level Indicator).** A quantitative measure of some aspect of service behavior. Examples: fraction of requests that succeed, fraction of requests served in under 200 ms, bytes delivered, freshness of returned data.

**SLO (Service Level Objective).** A target for an SLI over a window. "99.9% of requests succeed over a 30-day window." SLOs are set internally by the team that operates the service.

**SLA (Service Level Agreement).** A contractual commitment, typically looser than the SLO, with financial consequences for violation. SLAs are customer-facing; SLOs are the internal discipline that prevents you from ever needing to pay out on an SLA.

### Choosing SLIs

Good SLIs are:

- **User-centric.** Measure what users experience, not internal plumbing.
- **Ratio-based.** Expressed as "good events / total events" so they are interpretable across traffic levels.
- **Implementable.** Can be measured from existing data without instrumenting every code path.

Typical SLIs for a request/response service:

- Availability: fraction of non-5xx responses.
- Latency: fraction of responses under X ms.
- Quality: fraction of responses returning full results (not degraded).
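The three typical SLIs above can each be computed as a "good events / total events" ratio over a batch of request records. The following is a minimal sketch, not part of the skill itself: the `Request` record, its field names, and the 200 ms default threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical per-request record; field names are illustrative."""
    status: int          # HTTP status code returned to the user
    latency_ms: float    # time taken to serve the response
    degraded: bool       # True if only partial results were returned


def ratio(good: int, total: int) -> float:
    """Good events / total events; an empty window is treated as fully good."""
    return 1.0 if total == 0 else good / total


def compute_slis(requests: list[Request], latency_threshold_ms: float = 200.0) -> dict[str, float]:
    """Compute availability, latency, and quality SLIs for one batch of requests."""
    total = len(requests)
    return {
        "availability": ratio(sum(r.status < 500 for r in requests), total),
        "latency": ratio(sum(r.latency_ms < latency_threshold_ms for r in requests), total),
        "quality": ratio(sum(not r.degraded for r in requests), total),
    }


sample = [
    Request(200, 120.0, False),
    Request(200, 350.0, False),
    Request(503, 900.0, False),
    Request(200, 80.0, True),
]
slis = compute_slis(sample)
```

Because every SLI is a ratio, the same function reads the same way at 4 requests or 4 million, which is the point of the "ratio-based" criterion.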

## Error Budgets

The error budget is the complement of the SLO: 100% minus the target. If the SLO is 99.9% availability, the error budget is 0.1%, or about 43 minutes of downtime over a 30-day window.
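The downtime figure is plain arithmetic: one minus the SLO, multiplied by the length of the window. A small sketch, assuming a 30-day window and a hypothetical helper name `error_budget_minutes`:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


# Compare how quickly the budget shrinks as nines are added.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.1f} min per 30 days")
```

For a 30-day window this gives 432 minutes at 99%, 43.2 minutes at 99.9%, and about 4.3 minutes at 99.99%.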

Error budgets are the unit of negotiation between "ship features" and "improve reliability." When the error budget is being spent faster than planned, the response is to slow feature velocity and address reliability. When the error budget is being preserved, new features and risk-taking are encouraged. This removes the "SRE says no, developers say yes" political argument and replaces it with a shared metric.

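**Burn rate alerting.** Alert when the error budget is being consumed faster than the window allows. Burn rate is the observed error rate divided by the error rate the SLO permits, so a burn rate of 1 spends the budget exactly over the window. Consuming 2% of a 30-day budget in 1 hour (a burn rate of 14.4, which would exhaust the budget in about 50 hours) is worth paging; consuming 10% over 3 days (a burn rate of 1) is worth a ticket.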
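Burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows, so a rate of 1 spends the budget exactly over the window. The sketch below converts between "fraction of budget consumed in N hours" and burn rate. The function names are hypothetical and a 30-day window is assumed.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo)


def burn_rate_for_budget_fraction(budget_fraction: float, alert_hours: float,
                                  window_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `alert_hours`."""
    return budget_fraction * (window_days * 24) / alert_hours


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if `rate` is sustained."""
    return window_days * 24 / rate


page_rate = burn_rate_for_budget_fraction(0.02, alert_hours=1)     # 2% in 1 hour
ticket_rate = burn_rate_for_budget_fraction(0.10, alert_hours=72)  # 10% in 3 days
```

Under these assumptions, 2% of the budget in one hour corresponds to a burn rate of 14.4, which would exhaust a 30-day budget in about 50 hours if sustained. With a 99.9% SLO, that is an observed error rate of roughly 1.44%.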

## Runbooks

A runbook is an executable procedure for a specific operational task: deploying a service, recovering from a known failure mode, running a database migration, rotating a credential. Good runbooks share structure:

1. **Title and scope.** What this runbook does and does not cover.
2. **Prerequisites.** State required before execution (access, data, approvals).
3. **Steps.** Each step has an action, an expected output, a verification check, and a timeout.
4. **Rollback.** What to do if the procedure fails partway.
5. **Escalation.** Who to contact when the runbook doesn't match the situation.
6. **Last reviewed.** Date and reviewer.

Runbooks that are out of date are worse than missing runbooks — they give false confidence. Schedule reviews.
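The six-part structure can be captured as a typed record, which makes the "last reviewed" field checkable by a script instead of by memory. This is one possible shape, not a prescribed schema: the class names, field names, and the 90-day review interval are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Step:
    action: str            # what the operator does
    expected_output: str   # what the operator should see
    verification: str      # how to confirm the step took effect
    timeout_seconds: int   # how long to wait before treating it as failed


@dataclass
class Runbook:
    title: str
    scope: str
    prerequisites: list[str]
    steps: list[Step]
    rollback: str
    escalation: str
    last_reviewed: date
    reviewer: str

    def is_stale(self, today: date, max_age_days: int = 90) -> bool:
        """True when the last review is older than the (assumed) review interval."""
        return today - self.last_reviewed > timedelta(days=max_age_days)


runbook = Runbook(
    title="Rotate service credential",
    scope="Covers the API token only; database passwords are out of scope.",
    prerequisites=["admin access to the secret store", "change approval"],
    steps=[Step("Issue new token", "token id printed", "new token authenticates", 60)],
    rollback="Re-enable the previous token.",
    escalation="Page the platform on-call.",
    last_reviewed=date(2024, 1, 15),
    reviewer="example-reviewer",
)
```

A scheduled job that calls `is_stale` on every runbook and files a ticket for each hit is one way to make "schedule reviews" enforceable.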

### Procedure Step Discipline

Every step is a contract: "if I do this, I expect to see that, within this time." When the observed result differs, the operator stops and escalates rather than continuing on autopilot. This is borrowed from aviation crew resource management and nuclear plant operations — fields where continuing on a procedure mismatch is what causes accidents.
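The step contract can be sketched as an executor that halts on the first mismatch and never proceeds to the next step. Everything here is hypothetical (`ContractStep`, `ProcedureMismatch`, the example step names). The timeout is checked after the action returns, so this sketch detects an overrun but does not interrupt a hung action.

```python
import time
from dataclasses import dataclass
from typing import Callable


class ProcedureMismatch(Exception):
    """Raised when a step's observed result breaks its contract."""


@dataclass
class ContractStep:
    name: str
    action: Callable[[], str]   # performs the step and returns what was observed
    expected: str               # what the operator should see
    timeout_seconds: float      # how long the step may take


def run_procedure(steps: list[ContractStep]) -> list[str]:
    """Run steps in order; stop and raise on the first broken contract."""
    completed = []
    for step in steps:
        started = time.monotonic()
        observed = step.action()
        elapsed = time.monotonic() - started
        if elapsed > step.timeout_seconds:
            raise ProcedureMismatch(f"{step.name}: took {elapsed:.1f}s, limit {step.timeout_seconds}s")
        if observed != step.expected:
            raise ProcedureMismatch(f"{step.name}: expected {step.expected!r}, observed {observed!r}")
        completed.append(step.name)
    return completed


log = []  # records which actions actually ran


def act(name: str, result: str) -> Callable[[], str]:
    def _run() -> str:
        log.append(name)
        return result
    return _run


steps = [
    ContractStep("drain node", act("drain node", "drained"), "drained", 5.0),
    ContractStep("check replicas", act("check replicas", "2/3 ready"), "3/3 ready", 5.0),
    ContractStep("restart node", act("restart node", "restarted"), "restarted", 5.0),
]

halted_on = None
try:
    run_procedure(steps)
except ProcedureMismatch as exc:
    halted_on = str(exc)  # this is the point where the operator escalates
```

In the example, the replica check reports `2/3 ready` instead of `3/3 ready`, so the restart step is never attempted.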

## Incident Response

An incident is an unplanned disruption that requires coordinated response. Structure:

**Detection.** Monitoring fires an alert. Ideal: alert fires before users notice. Acceptable: alert fires within seconds of the first user complaint.

**Triage.** Incident commander (IC) is designated. Severity is assessed. Initial scope is determined (which users, which regions, which services).

**Mitigation.** Stop the bleeding. Roll back the last change. Drain traffic from the affected region. Fail over to a backup. Mitigation is not a fix — it is a restoration of service so that the fix can be done calmly.

**Resolution.** The underlying cause is addressed and the system is returned to normal operation.

**Postmortem.** A blameless write-up of what happened, why, and what to do differently.
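The five phases can be modelled as a small state machine that also records the timeline a postmortem later needs. This sketch assumes a strictly linear progression, which real incidents do not always follow, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Phase(Enum):
    DETECTION = 1
    TRIAGE = 2
    MITIGATION = 3
    RESOLUTION = 4
    POSTMORTEM = 5


@dataclass
class Incident:
    summary: str
    phase: Phase = Phase.DETECTION
    commander: Optional[str] = None
    timeline: list = field(default_factory=list)

    def advance(self, note: str) -> Phase:
        """Record a note for the current phase and move to the next one."""
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("incident is already in its final phase")
        next_phase = Phase(self.phase.value + 1)
        # Triage designates the IC, so mitigation cannot begin without one.
        if next_phase is Phase.MITIGATION and self.commander is None:
            raise ValueError("designate an incident commander before mitigation")
        self.timeline.append((datetime.now(timezone.utc), self.phase.name, note))
        self.phase = next_phase
        return self.phase


incident = Incident("checkout 5xx spike")
incident.advance("availability alert fired")    # detection -> triage
incident.commander = "on-call lead"
incident.advance("severity assessed, one region affected")  # triage -> mitigation
```

Timestamps are captured as the incident moves between phases, so the "what happened" timeline does not have to be reconstructed from chat logs afterwards.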

### The Incident Commander Role

The IC does not fix the incident. The IC coordinates: assigns tasks, tracks who is doing what, communicates with stakeholders, decides when to declare the incident resolved. Without an IC, everyone tries to fix things simultaneously, communication breaks down, and mitigation takes hours longer than it should.

## Blameless Postmortems

A postmortem is a report on an incident. It should answer:

- What happened? (Timeline of events.)
- What was the impact?
- What went well?
- What went poorly?
- Where did we get lucky?
- What are the action items, with owners and due dates?

"Blameless" means the narrative avoids assigning fault to individuals. The goal is not "Alice shouldn't have deployed on Friday" — the goal is "our deploy system should have caught this regression." People who fear blame hide their mistakes, and hidden mistakes are the ones that kill systems.

## Chaos Engineering

Chaos engineering is the discipline of testing the system by deliberate
