Service catalog SLO handbook

How we wire SLOs into the catalog so ownership, alerts, and error budgets stay consistent.

A catalog entry without SLOs is just a name. We treat the catalog as the source of truth for reliability targets and let tooling sync the rest.

Model SLOs as code

Each service owns an SLO manifest stored next to the code. The catalog ingests it and stamps labels for alerts.

service: billing-api
owner: team-invoices
slos:
  availability:
    objective: 99.9
    window: 30d
    threshold:
      good_statuses: [200, 201, 204]
      latency_ms_p95: 350
  latency:
    objective: 99.0
    window: 7d
    threshold:
      latency_ms_p99: 800

# publish to the catalog
catalogctl push slo ./manifests/billing-api.slo.yaml
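
To make the objective and window concrete, the error budget they imply can be worked out directly. The following is a minimal sketch, assuming windows always use the `<n>d` form shown in the manifest; `parse_window_days` and `error_budget_minutes` are hypothetical helper names, not part of catalogctl.

```python
def parse_window_days(window: str) -> int:
    """Parse a window such as '30d' into days (only the 'd' suffix is assumed)."""
    if not window.endswith("d"):
        raise ValueError(f"unsupported window: {window!r}")
    return int(window[:-1])


def error_budget_minutes(objective_pct: float, window: str) -> float:
    """Time-equivalent budget: minutes in the window during which every request may be bad."""
    total_minutes = parse_window_days(window) * 24 * 60
    return total_minutes * (100.0 - objective_pct) / 100.0


print(round(error_budget_minutes(99.9, "30d"), 1))  # 43.2
print(round(error_budget_minutes(99.0, "7d"), 1))   # 100.8
```

Read this way, the 99.9 availability objective leaves roughly 43 minutes of total failure per 30 days, which is a useful sanity check before a manifest is pushed.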

Derive alerts from SLO math

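We avoid custom dashboards per service. Instead, every alert follows the same burn-rate pattern: a rule fires when the error rate over a window exceeds the error budget (1 minus the objective) multiplied by a burn-rate factor. A short window with a high factor pages; a longer window with a lower factor opens a ticket.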

rule "billing-availability-burn" {
  expr  = "sli_error_rate:ratio_rate5m > (1-0.999) * 14"
  for   = "2m"
  labels = { severity = "page", service = "billing-api" }
  annotations = {
    summary = "Billing availability SLO burn (fast)",
    runbook = "https://runbooks.internal/billing-api/slo-burn"
  }
}
rule "billing-availability-burn-slow" {
  expr  = "sli_error_rate:ratio_rate1h > (1-0.999) * 3"
  for   = "10m"
  labels = { severity = "ticket", service = "billing-api" }
}
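
The thresholds in these rules come from simple arithmetic, sketched here. It assumes the 30d window from the availability SLO; `burn_threshold` and `hours_to_exhaust` are hypothetical names used only for illustration.

```python
def burn_threshold(objective: float, burn_rate: float) -> float:
    """Error-rate threshold that corresponds to a given burn rate."""
    return (1.0 - objective) * burn_rate


def hours_to_exhaust(window_days: int, burn_rate: float) -> float:
    """Hours until the whole budget is spent if the burn rate is sustained."""
    return window_days * 24 / burn_rate


print(round(burn_threshold(0.999, 14), 4))  # 0.014
print(round(burn_threshold(0.999, 3), 4))   # 0.003
print(round(hours_to_exhaust(30, 14), 1))   # 51.4
print(round(hours_to_exhaust(30, 3), 1))    # 240.0
```

A sustained 14x burn would spend the 30-day budget in about two days, which is why it pages; a 3x burn takes ten days, so a ticket is enough.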

Wire ownership to rotation data

Every SLO alert resolves to a rotation maintained in the catalog. No more guessing who is on point.

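// Resolve the current on-call responder for a service from the catalog.
async function buildPage(catalog, service = "billing-api") {
  const { team, oncall } = await catalog.lookup(service);
  // Guard against an empty rotation so the page never reads "undefined".
  const responder = oncall?.current?.rotation?.[0];
  if (!responder) throw new Error(`No on-call responder for ${service}`);

  return {
    summary: `Page ${responder} for SLO breach`,
    labels: { team },
  };
}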

-- report services missing SLO metadata
select service_name
from catalog_entries
where slos is null or jsonb_array_length(slos) = 0;
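
The same report can be run against catalog entries exported as plain records, for example in a CI check. This is a sketch under assumptions: the `service_name` and `slos` field names mirror the query above, and the entries other than `billing-api` are made-up examples. It treats a missing, null, or empty `slos` value as "no SLO metadata", whether the field is stored as an object or an array.

```python
def services_missing_slos(entries):
    """Return names of entries whose 'slos' field is absent, null, or empty."""
    return [e["service_name"] for e in entries if not e.get("slos")]


# Hypothetical catalog export; only billing-api appears in this handbook.
entries = [
    {"service_name": "billing-api", "slos": {"availability": {"objective": 99.9}}},
    {"service_name": "ledger", "slos": {}},
    {"service_name": "notifier", "slos": None},
    {"service_name": "search"},
]
print(services_missing_slos(entries))  # ['ledger', 'notifier', 'search']
```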

Keep budgets visible

We chart error budget burn in the same place teams find deployments and logs.

# generate a budget summary panel
catalogctl render slo-dashboard --service billing-api --window 30d > panels/billing.json

{
  "title": "Billing SLO burn",
  "panels": [
    { "type": "timeseries", "expr": "sli_error_budget_remaining" },
    { "type": "stat", "expr": "sli_availability:ratio_rate30d" }
  ]
}
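
One plausible definition of the remaining budget charted in the first panel is sketched below. How `sli_error_budget_remaining` is actually recorded is not specified here, so treat the formula and the `budget_remaining` name as assumptions.

```python
def budget_remaining(good: int, total: int, objective: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, below 0 is overspent."""
    if total == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_bad = (1.0 - objective) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# 500 bad requests out of 1,000,000 against a 99.9% objective
print(round(budget_remaining(999_500, 1_000_000, 0.999), 3))  # 0.5
```

A negative value means the service has already spent more than its budget for the window.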

The catalog becomes more than an index: it is the contract for how a service should behave and how quickly the team must respond when it drifts.