AI SRE — investigate and fix prod issues from Slack

Wire up logs, recent deploys, and the codebase. The agent investigates alerts from Slack, finds the root cause, and can open the fix as a PR.

Outcome

An on-call agent in a Slack channel that can take an alert (or a teammate's ping), pull logs and traces, correlate them with recent deploys, identify the root cause, and submit a fix as a pull request for human review. The human stays in Slack the whole time and only acts as the approver.

Prerequisites

An observability source: Sentry, Datadog, Render, or anything with an MCP server.
A Slack channel for the agent. #incidents works well.
GitHub access to the affected repos via Nairi's GitHub App.
An LLM provider connected. Claude Sonnet via the Anthropic integration is the recommended pick for this playbook. See Connecting your LLM provider.

Step 1 — Connect GitHub

The agent needs the codebase to investigate the failing code path and to push the fix.

Install the Nairi GitHub App at Settings → Integrations → GitHub and grant access to the repos the agent should be able to read and open PRs against. Scope this tightly. If the agent is on-call for payments-service and order-service, give it those two repos and nothing else.

See Connecting GitHub for the full walkthrough.

Step 2 — Add the observability MCPs

Sentry and Datadog (one-click from the Marketplace)

Both Sentry and Datadog are available as one-click OAuth toolkits on the MCP Marketplace. Open Settings → MCP Marketplace, pick Sentry and Datadog, and connect each. OAuth handles the credentials, so there's nothing to vault or paste as JSON. The connections live at the org level, so any agent can use them.

Deploy platform (Render shown)

If you want the agent to correlate spikes with recent deploys, give it read access to your deploy platform. Render ships an official hosted MCP — point Nairi at the URL with an API key from Render → Account Settings → API Keys.

Open Settings → Artifacts → MCP Configs → New, mark the config as sensitive, and paste:

{
  "mcpServers": {
    "render": {
      "url": "https://mcp-server.render.com",
      "env": {
        "RENDER_API_KEY": "YOUR_RENDER_API_KEY"
      }
    }
  }
}
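
Before pasting, it can help to lint the config locally. The sketch below is not part of Nairi or Render; `check_mcp_config` is a hypothetical helper that assumes the `mcpServers` / `url` / `env` shape shown above and only checks that the JSON parses, the URL is HTTPS, and no placeholder key is left in place.

```python
import json


def check_mcp_config(raw: str) -> list[str]:
    """Return a list of problems found in an MCP config string (empty if none)."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]

    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict) or not servers:
        return ["missing or empty 'mcpServers' object"]

    problems = []
    for name, server in servers.items():
        if not isinstance(server, dict):
            problems.append(f"{name}: server entry should be an object")
            continue
        url = server.get("url", "")
        if not isinstance(url, str) or not url.startswith("https://"):
            problems.append(f"{name}: 'url' should be an https:// URL")
        env = server.get("env", {})
        if not isinstance(env, dict):
            problems.append(f"{name}: 'env' should be an object")
            continue
        for key, value in env.items():
            # Assumption: placeholders follow the YOUR_... pattern used above.
            if not isinstance(value, str) or not value or value.startswith("YOUR_"):
                problems.append(f"{name}: env var {key} still holds a placeholder")
    return problems
```

An empty list means the config passed these basic checks. It says nothing about whether the key itself is valid.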

Vercel, Heroku, and Fly all ship comparable MCP servers; check each vendor's docs for the canonical config.

You can attach any combination — start with the one observability tool your team already uses and add more only when you find the agent's investigations missing context.

Step 3 — Create the agent

Open Fleet → New agent and fill in:

Name — sre works. The team mentions the Nairi bot (@Nairi) in a channel bound to this agent — the agent name shows up in the agent picker, not in the Slack mention.
LLM integration — Anthropic (Claude Sonnet). Multi-step investigations benefit from Sonnet over smaller models.
Repository — the primary service repo. The agent clones this into a fresh worktree per task.
MCPs — attach the observability and deploy-platform configs from Step 2.
Channel — bind to #incidents, or leave it and let the channel self-bind on the first @-mention.

See How to deploy an agent for the full editor walkthrough.

Step 4 — Base prompt

Keep it short. The prompt sets identity and the response shape; the rules in Step 5 carry policy.

You are an on-call SRE.

When asked to investigate an issue:
- Pull errors and traces from the last 30 minutes.
- Correlate with recent deploys and config changes.
- Reply with: a 1-sentence root cause, 2-3 lines of supporting
  evidence (errors, deploy times, code references), and a
  suggested fix.

When asked to fix it: open a PR with the change and link it in
the Slack thread. Do not merge. Do not push to main.

Be calm and precise. No speculation without data. If you can't
find evidence for a cause, say so and propose the next check.

Step 5 — Rules to direct the agent

The base prompt is identity. The rules are the local policy: your severities, your branch names, and which services the agent may change versus only read. Put each topic in its own rule so you can iterate on them independently.

Open Settings → Artifacts → Rules and create a few rules, then attach them to the agent. Suggested starting set:

Service catalogue

The most important rule. Tells the agent which repos map to which services, which dashboards belong to each, and what the agent is allowed to change.

# Service catalogue

## payments-service
- Repo: github.com/acme/payments-service
- Sentry project: payments-prod
- Datadog service: payments-service
- Deploy: Render service `srv_payments_prod`
- Owner team: payments
- Write access: ALLOWED (agent may open PRs)

## order-service
- Repo: github.com/acme/order-service
- Sentry project: order-prod
- Datadog service: order-service
- Deploy: Render service `srv_order_prod`
- Owner team: orders
- Write access: ALLOWED (agent may open PRs)

## auth-service
- Repo: github.com/acme/auth-service
- Sentry project: auth-prod
- Write access: READ-ONLY (security-sensitive, page the owner instead)

Severity definitions

Stops the agent from over- or under-reacting.

# Severity

- SEV1: customer-facing outage or data loss. Investigate immediately,
  even if a teammate pinged you and no alert fired.
- SEV2: degraded experience for a measurable subset of users
  (error rate >2%, p95 latency >2x baseline). Investigate within
  10 minutes; loop in the on-call human via @on-call.
- SEV3: spiky errors, slow tails, or noise. Investigate when asked;
  do not auto-page anyone.

When the user pings you about an issue, classify it in the first
reply and base your urgency on that classification.
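
To sanity-check the thresholds before handing them to the agent, the definitions above can be written as a small function. This is only a sketch: the `Signal` fields are hypothetical, and it assumes either SEV2 threshold on its own is enough, which the rule text leaves open.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    customer_facing_outage: bool
    data_loss: bool
    error_rate: float        # fraction of requests failing, 0.02 == 2%
    p95_latency_ms: float
    baseline_p95_ms: float


def classify(signal: Signal) -> str:
    """Map a signal to SEV1/SEV2/SEV3 using the thresholds in the rule."""
    if signal.customer_facing_outage or signal.data_loss:
        return "SEV1"
    # Assumption: exceeding either threshold counts as degraded.
    if signal.error_rate > 0.02 or signal.p95_latency_ms > 2 * signal.baseline_p95_ms:
        return "SEV2"
    return "SEV3"
```

Running a few past incidents through it is a quick way to see whether the numbers match how your team actually labelled them.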

Branch and PR conventions

So the fix PR matches what humans would have opened.

# Branch and PR conventions

- Branch name: `fix/incident-YYYYMMDD-<short-slug>`.
- PR title: `[incident] <one-line summary>` (e.g.
  `[incident] order-service: read transaction id from data.transaction.id`).
- PR body must include:
  1. Symptom (what users / dashboards saw).
  2. Root cause (one sentence).
  3. Fix summary (what changed and why).
  4. How verified (logs / a local repro / a test).
  5. Rollback plan if the fix itself misbehaves.
- Always request review from CODEOWNERS. Never merge.
- Add label `incident` to the PR.
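
If you also want to enforce these conventions in CI, they are simple to check mechanically. The sketch below is illustrative only; the helper names and the five-word slug cap are assumptions, and the section names are taken from the rule above.

```python
import re
from datetime import date

REQUIRED_SECTIONS = ["Symptom", "Root cause", "Fix summary", "How verified", "Rollback plan"]


def branch_name(summary: str, day: date) -> str:
    """Build fix/incident-YYYYMMDD-<short-slug> from a one-line summary."""
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    # Assumption: keep the slug short by capping it at five words.
    slug = "-".join(slug.split("-")[:5]) or "unspecified"
    return f"fix/incident-{day:%Y%m%d}-{slug}"


def pr_title(summary: str) -> str:
    """Prefix the summary with the [incident] tag."""
    return f"[incident] {summary}"


def missing_sections(body: str) -> list[str]:
    """List required PR body sections that do not appear in the body."""
    lowered = body.lower()
    return [section for section in REQUIRED_SECTIONS if section.lower() not in lowered]
```

A CI job could fail the check when `missing_sections` returns a non-empty list, which keeps agent-opened and human-opened incident PRs in the same shape.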

Investigation playbook

Saves a step every time. Tune it to how your team actually triages.

# Investigation playbook

When pulled into an incident, in order:

1. Get the symptom: what error, which endpoint, how many users.
2. Pull last 30 min from Sentry: top issues, new issues, error rate.
3. Check Datadog for matching latency / throughput anomaly.
4. List recent deploys (last 2 hours) across affected services.
5. If a deploy correlates: read the diff in that deploy's PR.
6. Form a hypothesis. State it before running more queries.
7. Confirm or rule out with one more targeted check.
8. Report: 1-sentence root cause, evidence, proposed fix.

Never skip step 6. Hypothesis-free debugging burns budget fast.
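
Steps 4 and 5 come down to a time-window filter: which deploys finished shortly before the errors started? A minimal sketch of that correlation follows, assuming you already have deploy timestamps and an error onset time; the `Deploy` type and `suspect_deploys` name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    service: str
    commit: str
    finished_at: datetime


def suspect_deploys(
    deploys: list[Deploy],
    error_onset: datetime,
    lookback: timedelta = timedelta(hours=2),
) -> list[Deploy]:
    """Deploys that finished within `lookback` before the onset, closest first."""
    window_start = error_onset - lookback
    candidates = [d for d in deploys if window_start <= d.finished_at <= error_onset]
    return sorted(candidates, key=lambda d: error_onset - d.finished_at)
```

The first entry is the most likely suspect, not a verdict. Step 6 still applies: state the hypothesis, then confirm it against the diff.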

You don't need all four on day one. Start with the service catalogue, then add the others as you notice the agent doing things you wish it wouldn't.

See Adding rules for the full editor reference.

Step 6 — First run

In your incidents channel:

@Nairi we are seeing some 5xx errors on order-service, can u see whats happening?

The agent should:

React with ⏳ then 👀 (queued, then running).
Reply with the 1-sentence root cause + supporting evidence (errors, deploy times).
Wait for your follow-up ("ok pls fix") before touching the codebase.
On confirmation, push a branch and open a PR matching your conventions, then link it in the thread.

If the investigation is off, the service catalogue is almost always the lever to fix it. Update the rule and try again.

Customisation

Customer status updates. Add a rule for drafting a public status post alongside the internal RCA. The agent produces both and the on-call human picks which goes out.
Auto-rollback. If your deploy MCP supports it, you can authorise the agent to roll back a clearly-bad deploy instead of fixing forward. Pair this with a rule defining "clearly bad" (e.g. error rate >5x baseline for >5 minutes after deploy).
Scheduled health checks. Wire up a scheduled job that runs every morning, pulls overnight errors and deploy activity, and posts a short health digest to a channel of your choice.
Pre-merge regression hint. Also give the agent the GitHub MCP from the Marketplace and trigger it from CI to review the diff alongside recent incidents in the affected paths, so it can flag risky changes before they ship.
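
For the auto-rollback option, it helps to pin down "clearly bad" precisely before writing the rule. One possible reading of the example threshold (error rate >5x baseline for >5 minutes after deploy) is sketched below; the function name, the sample format, and the requirement that the breach be continuous are all assumptions.

```python
from datetime import datetime, timedelta


def clearly_bad(
    samples: list[tuple[datetime, float]],
    deployed_at: datetime,
    baseline_rate: float,
    factor: float = 5.0,
    sustain: timedelta = timedelta(minutes=5),
) -> bool:
    """True if the error rate stays above factor * baseline for longer than
    `sustain` after the deploy. `samples` is (timestamp, error_rate), sorted by time."""
    streak_start = None
    for timestamp, rate in samples:
        if timestamp < deployed_at:
            continue  # only post-deploy samples count
        if rate > factor * baseline_rate:
            if streak_start is None:
                streak_start = timestamp
            if timestamp - streak_start > sustain:
                return True
        else:
            streak_start = None  # a dip below the threshold resets the streak
    return False
```

Requiring an unbroken streak makes the check conservative: a single healthy sample resets the clock, so brief spikes do not trigger a rollback.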

Can't find what you're looking for? Email support@nairi.ai.
