# AI SRE — investigate and fix prod issues from Slack

Wire up logs, recent deploys, and the codebase. The agent investigates alerts from Slack, finds the root cause, and can open the fix as a PR.


<video src="/docs/ai-sre.mp4" preload="metadata" aria-label="Nairi AI SRE agent investigating a 5xx incident in Slack and opening a fix PR" className="rounded-xl overflow-hidden w-full" />

## Outcome [#outcome]

An on-call agent in a Slack channel that can take an alert (or a teammate's ping), pull logs and traces, correlate them with recent deploys, identify the root cause, and submit a fix as a pull request for human review. The human stays in Slack the whole time and only acts as the approver.

## Prerequisites [#prerequisites]

* An observability source: Sentry, Datadog, Render, or anything with an MCP server.
* A Slack channel for the agent. `#incidents` works well.
* GitHub access to the affected repos via Nairi's GitHub App.
* An LLM provider connected. Claude Sonnet via the Anthropic integration is the recommended pick for this playbook. See [Connecting your LLM provider](/help/integrations/llm-provider).

## Step 1 — Connect GitHub [#step-1--connect-github]

The agent needs the codebase to investigate the failing code path and to push the fix.

Install the Nairi GitHub App at [Settings → Integrations → GitHub](https://app.nairi.ai/settings/integrations) and grant access to the repos the agent should be able to read and open PRs against. Scope this tightly. If the agent is on-call for `payments-service` and `order-service`, give it those two repos and nothing else.

See [Connecting GitHub](/help/integrations/github) for the full walkthrough.

## Step 2 — Add the observability MCPs [#step-2--add-the-observability-mcps]

### Sentry and Datadog (one-click from the Marketplace) [#sentry-and-datadog-one-click-from-the-marketplace]

Both Sentry and Datadog are available as one-click OAuth toolkits on the [MCP Marketplace](/help/mcp#from-the-marketplace-managed-by-composio). Open [Settings → MCP Marketplace](https://app.nairi.ai/settings/mcp-marketplace), pick **Sentry** and **Datadog**, and connect each. OAuth handles the credentials, so there's nothing to vault or paste as JSON. The connections live at the org level, so any agent can use them.

### Deploy platform (Render shown) [#deploy-platform-render-shown]

If you want the agent to correlate spikes with recent deploys, give it read access to your deploy platform. Render ships an [official hosted MCP](https://render.com/docs/mcp-server). Point Nairi at its URL and authenticate with an API key from [Render → Account Settings → API Keys](https://dashboard.render.com/u/settings).

Open [Settings → Artifacts → MCP Configs](https://app.nairi.ai/settings/artifacts) → **New**, mark the config as **sensitive**, and paste:

```json
{
  "mcpServers": {
    "render": {
      "url": "https://mcp-server.render.com",
      "env": {
        "RENDER_API_KEY": "YOUR_RENDER_API_KEY"
      }
    }
  }
}
```
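
Before pasting a config like the one above, it can help to check it locally. The following is a minimal sketch, not part of Nairi: the function name `check_mcp_config` and the rule that unfilled values start with `YOUR_` are assumptions made for illustration. It only checks the JSON shape shown above.

```python
import json
from urllib.parse import urlparse

# Assumption: unfilled secrets look like the placeholder in the example config.
PLACEHOLDER_PREFIX = "YOUR_"


def check_mcp_config(raw: str) -> list[str]:
    """Return a list of problems found in an MCP config; empty means it looks sane."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]

    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict) or not servers:
        return ["missing or empty 'mcpServers' object"]

    problems = []
    for name, server in servers.items():
        if not isinstance(server, dict):
            problems.append(f"{name}: server entry should be an object")
            continue
        # The example config uses an https endpoint; flag anything else.
        if urlparse(str(server.get("url", ""))).scheme != "https":
            problems.append(f"{name}: 'url' should be an https URL")
        # Catch keys that were never swapped for a real value.
        for key, value in server.get("env", {}).items():
            if not value or str(value).startswith(PLACEHOLDER_PREFIX):
                problems.append(f"{name}: env var {key} still holds a placeholder")
    return problems
```

Run it on the JSON text before saving the config. An empty list means no obvious mistakes were found; it does not prove the key is valid.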

Vercel, Heroku, and Fly all ship comparable MCP servers; check each vendor's docs for the canonical config.

You can attach any combination. Start with the one observability tool your team already uses, and add more only when the agent's investigations are missing context.

## Step 3 — Create the agent [#step-3--create-the-agent]

Open [Fleet → New agent](https://app.nairi.ai/agents/fleet) and fill in:

* **Name** — `sre` works. The name appears in the agent picker, not in the Slack mention: teammates still mention the Nairi bot (`@Nairi`) in a channel bound to this agent.
* **LLM integration** — Anthropic (Claude Sonnet). Multi-step investigations benefit from Sonnet over smaller models.
* **Repository** — the primary service repo. The agent clones this into a fresh worktree per task.
* **MCPs** — attach the observability and deploy-platform configs from Step 2.
* **Channel** — bind to `#incidents`, or leave it unset and let the channel self-bind on the first `@`-mention.

See [How to deploy an agent](/help/building-agents/how-to-deploy) for the full editor walkthrough.

## Step 4 — Base prompt [#step-4--base-prompt]

Keep it short. The prompt sets identity and the response shape; the rules in Step 5 carry policy.

```
You are an on-call SRE.

When asked to investigate an issue:
- Pull errors and traces from the last 30 minutes.
- Correlate with recent deploys and config changes.
- Reply with: a 1-sentence root cause, 2-3 lines of supporting
  evidence (errors, deploy times, code references), and a
  suggested fix.

When asked to fix it: open a PR with the change and link it in
the Slack thread. Do not merge. Do not push to main.

Be calm and precise. No speculation without data. If you can't
find evidence for a cause, say so and propose the next check.
```

## Step 5 — Rules to direct the agent [#step-5--rules-to-direct-the-agent]

The base prompt sets identity. The rules carry local policy: your severities, your branch names, and which services the agent may change versus only read. Put each topic in its own rule so you can iterate on them independently.

Open [Settings → Artifacts → Rules](https://app.nairi.ai/settings/artifacts) and create a few rules, then attach them to the agent. Suggested starting set:

### Service catalogue [#service-catalogue]

The most important rule. Tells the agent which repos map to which services, which dashboards belong to each, and what the agent is allowed to change.

```
# Service catalogue

## payments-service
- Repo: github.com/acme/payments-service
- Sentry project: payments-prod
- Datadog service: payments-service
- Deploy: Render service `srv_payments_prod`
- Owner team: payments
- Write access: ALLOWED (agent may open PRs)

## order-service
- Repo: github.com/acme/order-service
- Sentry project: order-prod
- Datadog service: order-service
- Deploy: Render service `srv_order_prod`
- Owner team: orders
- Write access: ALLOWED (agent may open PRs)

## auth-service
- Repo: github.com/acme/auth-service
- Sentry project: auth-prod
- Write access: READ-ONLY (security-sensitive, page the owner instead)
```

### Severity definitions [#severity-definitions]

Stops the agent from over- or under-reacting.

```
# Severity

- SEV1: customer-facing outage or data loss. Investigate immediately,
  even if a teammate paged you and no alert fired.
- SEV2: degraded experience for a measurable subset of users
  (error rate >2%, p95 latency >2x baseline). Investigate within
  10 minutes; loop in the on-call human via @on-call.
- SEV3: spiky errors, slow tails, or noise. Investigate when asked;
  do not auto-page anyone.

When the user pings you about an issue, classify it in the first
reply and base your urgency on that classification.
```
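
To sanity-check the thresholds before committing them to a rule, the definitions above can be written out as a small function. This is an illustrative sketch only: the `Signal` fields and the `classify` name are assumptions, and the agent applies the rule from the text, not from this code.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    outage_or_data_loss: bool  # customer-facing outage or data loss observed
    error_rate: float          # fraction of failing requests, 0.02 means 2%
    p95_latency_ms: float      # current p95 latency
    p95_baseline_ms: float     # normal p95 latency for the same endpoint


def classify(signal: Signal) -> str:
    """Apply the SEV1/SEV2/SEV3 thresholds from the severity rule."""
    if signal.outage_or_data_loss:
        return "SEV1"
    degraded = (
        signal.error_rate > 0.02
        or signal.p95_latency_ms > 2 * signal.p95_baseline_ms
    )
    return "SEV2" if degraded else "SEV3"
```

Replaying a few past incidents through a function like this shows whether the 2% and 2x cut-offs match how your team classified them at the time.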

### Branch and PR conventions [#branch-and-pr-conventions]

So the fix PR matches what humans would have opened.

```
# Branch and PR conventions

- Branch name: `fix/incident-YYYYMMDD-<short-slug>`.
- PR title: `[incident] <one-line summary>` (e.g.
  `[incident] order-service: read transaction id from data.transaction.id`).
- PR body must include:
  1. Symptom (what users / dashboards saw).
  2. Root cause (one sentence).
  3. Fix summary (what changed and why).
  4. How verified (logs / a local repro / a test).
  5. Rollback plan if the fix itself misbehaves.
- Always request review from CODEOWNERS. Never merge.
- Add label `incident` to the PR.
```
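
The conventions above are mechanical enough to check in CI. The sketch below shows one way to build the branch name and PR title and to detect missing PR body sections. The helper names, the 40-character slug limit, and the substring-based section check are assumptions for illustration.

```python
import re
from datetime import date

# Section names taken from the PR body list in the rule above.
REQUIRED_SECTIONS = ("Symptom", "Root cause", "Fix summary", "How verified", "Rollback plan")


def branch_name(summary: str, day: date, max_slug_len: int = 40) -> str:
    """Build fix/incident-YYYYMMDD-<short-slug> from a one-line summary."""
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    slug = slug[:max_slug_len].rstrip("-")
    return f"fix/incident-{day:%Y%m%d}-{slug}"


def pr_title(summary: str) -> str:
    """Prefix the one-line summary with the [incident] tag."""
    return f"[incident] {summary}"


def missing_sections(body: str) -> list[str]:
    """Return required section names that do not appear in the PR body."""
    lowered = body.lower()
    return [name for name in REQUIRED_SECTIONS if name.lower() not in lowered]
```

For example, `branch_name("order-service: read transaction id", date(2025, 1, 7))` gives `fix/incident-20250107-order-service-read-transaction-id`.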

### Investigation playbook [#investigation-playbook]

Gives the agent a fixed triage order, so it doesn't spend the first steps of every incident deciding where to look. Tune it to how your team actually triages.

```
# Investigation playbook

When pulled into an incident, in order:

1. Get the symptom: what error, which endpoint, how many users.
2. Pull last 30 min from Sentry: top issues, new issues, error rate.
3. Check Datadog for matching latency / throughput anomaly.
4. List recent deploys (last 2 hours) across affected services.
5. If a deploy correlates: read the diff in that deploy's PR.
6. Form a hypothesis. State it before running more queries.
7. Confirm or rule out with one more targeted check.
8. Report: 1-sentence root cause, evidence, proposed fix.

Never skip step 6. Hypothesis-free debugging burns budget fast.
```

You don't need all four on day one. Start with the service catalogue, then add the others as you notice the agent doing things you wish it wouldn't.

See [Adding rules](/help/building-agents/how-to-deploy#adding-rules) for the full editor reference.

## Step 6 — First run [#step-6--first-run]

In your incidents channel:

```
@Nairi we are seeing some 5xx errors on order-service, can u see whats happening?
```

The agent should:

1. React with `⏳` then `👀` (queued, then running).
2. Reply with the 1-sentence root cause + supporting evidence (errors, deploy times).
3. Wait for your follow-up ("ok pls fix") before touching the codebase.
4. On confirmation, push a branch and open a PR matching your conventions, then link it in the thread.

If the investigation is off, the service catalogue is almost always the lever to fix it. Update the rule and try again.

## Customisation [#customisation]

* **Customer status updates.** Add a rule for drafting a public status post alongside the internal RCA. The agent produces both and the on-call human picks which goes out.
* **Auto-rollback.** If your deploy MCP supports it, you can authorise the agent to roll back a clearly-bad deploy instead of fixing forward. Pair this with a rule defining "clearly bad" (e.g. error rate >5x baseline for >5 minutes after deploy).
* **Scheduled health checks.** Wire a [scheduled job](/help/scheduled-jobs) that runs every morning, pulls overnight errors + deploy activity, and posts a short health digest to a channel.
* **Pre-merge regression hint.** Give the agent the GitHub MCP from the marketplace as well and trigger it from CI to look at the diff plus recent incidents in the affected paths — it can flag risky changes before they ship.
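
For the auto-rollback option, "clearly bad" needs a precise definition before the agent is allowed to act on it. One possible reading of "error rate >5x baseline for >5 minutes after deploy" is sketched below. The function name and the sample format are assumptions; adapt them to whatever your metrics source returns.

```python
from datetime import datetime, timedelta


def is_clearly_bad(
    samples: list[tuple[datetime, float]],
    deployed_at: datetime,
    baseline_rate: float,
    factor: float = 5.0,
    sustained: timedelta = timedelta(minutes=5),
) -> bool:
    """True if the error rate stays above factor x baseline for longer than `sustained`.

    `samples` is a time-ordered list of (timestamp, error_rate) pairs.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive to define a multiple of it")
    threshold = baseline_rate * factor
    breach_start = None
    for timestamp, rate in samples:
        if timestamp < deployed_at:
            continue  # only samples after the deploy count
        if rate > threshold:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start > sustained:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the clock
    return False
```

Resetting on any dip makes this check conservative: a flapping error rate never triggers a rollback, which leaves those cases to the on-call human.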

## Related [#related]

* [Vaults & Secrets](/help/vaults)
* [MCP Tools](/help/mcp)
* [Connecting GitHub](/help/integrations/github)
* [How to deploy an agent](/help/building-agents/how-to-deploy)
* [Scheduled Jobs & Automations](/help/scheduled-jobs)
