Use case

An on-call agent in your incident channel

Mention Nairi in the incident channel. The agent queries Datadog, tails logs, correlates against recent deploys, and posts a structured summary while your on-call stays focused on the fix.

What this looks like

Most of the first 20 minutes of an incident is the on-call hunting through dashboards, CLIs, and Slack history to figure out what is actually happening. Most of those queries are the same every time. Most of them are reads.

Nairi sits in the incident channel and does the reads for you. Mention it, ask a specific question (error rate on checkout, last deploy to payments, tail logs from the postgres replica), get back a structured answer with the actual numbers and the relevant URL. Multi-person, multi-thread, shared org context. The agent has the same observability tools the on-call already uses, set up once at the org level via MCP and the vault.

It also drafts the customer status update and the postmortem timeline. Production write actions like restarting pods or rolling a deploy are opt-in per tool; most teams start fully read-only.

What it does during an incident

Starting points, not the full set. The agent uses whatever tools you give it.

Pull metrics, logs, and traces on demand

Mention @Nairi in the incident channel and ask "what does payments error rate look like in the last 30 minutes?" The agent queries the observability stack you have connected (Datadog, Sentry, Grafana, Honeycomb, etc.) and posts back a structured answer.

Correlate against recent deploys and config changes

Ask "any deploys to the payments service today?" or "what changed in the last hour?" The agent reads from GitHub, your CD system, and your feature-flag tool, and lines up the timeline against the error spike.
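As a rough illustration of what lining up the timeline against the error spike can mean, here is a minimal Python sketch. It assumes the changes have already been fetched from each source; the `Change` shape, the one-hour lookback, and the sample data are all made up for the example and are not Nairi's internal model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One deploy, config change, or flag flip (hypothetical shape)."""
    source: str       # e.g. "github", "cd", "feature-flags"
    service: str
    description: str
    at: datetime


def changes_before_spike(changes, spike_start, lookback=timedelta(hours=1)):
    """Return changes that landed in the lookback window before the spike, newest first."""
    window_start = spike_start - lookback
    candidates = [c for c in changes if window_start <= c.at <= spike_start]
    return sorted(candidates, key=lambda c: c.at, reverse=True)


# Made-up sample data for illustration only.
spike = datetime(2024, 1, 1, 14, 30)
changes = [
    Change("cd", "payments", "deploy build 101", datetime(2024, 1, 1, 14, 22)),
    Change("feature-flags", "payments", "enable new-retry-policy", datetime(2024, 1, 1, 13, 45)),
    Change("github", "checkout", "merge config tweak", datetime(2024, 1, 1, 9, 5)),
]

suspects = changes_before_spike(changes, spike)
for c in suspects:
    minutes = int((spike - c.at).total_seconds() // 60)
    print(f"{minutes:>3} min before spike: [{c.source}] {c.service}: {c.description}")
```

Sorting newest first puts the change closest to the spike at the top, which is usually the first thing a responder wants to rule in or out.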

Read runbooks and prior postmortems

Point the agent at your Notion or Confluence runbook library and your prior postmortems. When a similar incident hits, the agent surfaces the matching runbook and the relevant past write-up before anyone goes hunting through links.

Run safe diagnostic commands

Give the agent read-only kubectl, gcloud, or aws CLI access via its sandbox, then ask it to "tail the last 200 lines from the payment-service pods" or "list the last 5 CloudWatch alarms in us-east-1". The agent runs the command and posts the output, with no copy-paste through three tabs.
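One way to picture a read-only guard for kubectl is a verb allowlist checked before anything runs. The sketch below is an assumption about how such a check could look, not how Nairi enforces it; the function name `read_only_argv` and the verb set are hypothetical.

```python
import shlex
from typing import List, Optional

# Verbs this sketch treats as read-only. The set is an assumption, not a Nairi default.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}


def read_only_argv(command: str) -> Optional[List[str]]:
    """Return the argv list for an allowlisted `kubectl <verb> ...` command, else None."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return None  # unbalanced quotes: refuse rather than guess
    if len(argv) < 2 or argv[0] != "kubectl":
        return None
    if argv[1] not in READ_ONLY_VERBS:
        return None  # also rejects global flags placed before the verb
    return argv


for cmd in (
    "kubectl logs deploy/payment-service --tail=200",
    "kubectl delete pod payment-service-abc",
):
    verdict = "allow" if read_only_argv(cmd) else "deny"
    print(f"{verdict}: {cmd}")
```

The returned list is meant to be executed as an argument vector without a shell, so characters like `;` or `|` are passed through as plain text rather than interpreted. A check like this complements read-only credentials; it does not replace them.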

Draft the status update

Ask "draft a status update for the customer banner". The agent reads the channel transcript and the metrics it already pulled, and writes a non-jargon update you can paste into Statuspage or the customer Slack.

Co-author the postmortem after

When the incident closes, ping the agent to compile a timeline from the channel transcript, the deploy timeline, and the metric correlations. First-draft postmortem in five minutes instead of an hour.
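Compiling a timeline from several sources is essentially a merge of time-ordered event streams. The following Python sketch shows that idea under the assumption that each source is already sorted by time; the `Event` type, the source labels, and the sample events are invented for illustration.

```python
import heapq
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str   # "channel", "deploy", or "metric" in this sketch
    text: str


def build_timeline(*streams):
    """Merge per-source event lists (each already sorted by time) into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.at))


def render(timeline):
    """Format each event as one plain-text line for a postmortem draft."""
    return [f"{e.at:%H:%M} [{e.source}] {e.text}" for e in timeline]


# Made-up sample events for illustration only.
day = (2024, 1, 1)
channel = [
    Event(datetime(*day, 14, 31), "channel", "on-call acknowledges page"),
    Event(datetime(*day, 14, 52), "channel", "decision: roll back"),
]
deploys = [
    Event(datetime(*day, 14, 22), "deploy", "payments deploy goes out"),
    Event(datetime(*day, 14, 55), "deploy", "rollback completes"),
]
metrics = [
    Event(datetime(*day, 14, 30), "metric", "error rate spike begins"),
    Event(datetime(*day, 15, 2), "metric", "error rate back to baseline"),
]

timeline = build_timeline(channel, deploys, metrics)
print("\n".join(render(timeline)))
```

Keeping the source label on every line is what lets a draft cite where each entry came from.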

Three scenarios teams use it for

From the first page through the postmortem.

01

3am page, you want context before you open a laptop

You acknowledge the page and ping Nairi from your phone: "what is going on with checkout?" The agent queries Datadog for the latest error rate, fetches recent Sentry events, pulls the deploy timeline for checkout-service, and posts a one-paragraph hypothesis back in the channel. You read the summary in the back of an Uber, not after twenty minutes of dashboard hunting.

02

Incident commander coordinating multiple responders

Create #inc-payments-down, invite the on-call, mention the agent: "summarize what we know so far, every 10 minutes". The agent keeps a running summary at the top of the channel that pulls together every metric reading, every command output, and every decision the team has made. A new responder joining the channel gets caught up in 30 seconds without scrolling through 400 messages.

03

Postmortem co-author

After the page clears, in the same channel: "compile a postmortem timeline from this incident, include the metric correlations, the deploy that triggered it, the rollback decision, and the customer-facing duration". The agent writes a first-draft postmortem with citations into the channel transcript and the monitoring data. You edit, you publish, you go to bed.

How it works

One agent, set up once, scoped to the on-call context. Production-safe defaults.

  1. Connect your observability and infra tools via MCP

    Add Datadog, Sentry, Grafana, Honeycomb, PagerDuty, Statuspage, kubectl, AWS/GCP/Azure, or anything else through the MCP server marketplace. The credentials live in the org vault, not in every engineer's config file. Per-agent: pick which tools each agent can reach.

  2. Define the on-call agent's rules and skills

    Reusable skills like "always check error rate, then deploy timeline, then recent config changes" or "draft updates in plain English, no internal jargon". Custom rules per agent. The on-call agent has its own playbook; it isn't the same agent as the one that opens PRs.

  3. Mention it in the incident channel

    Same Slack or Discord channel your team already pages in. The agent reads the conversation context, runs the queries, and posts back in the same thread. Multi-person collaboration is the default. Everyone in the channel can ask the agent things, and the agent shares context across the conversation.

  4. It runs in its sandbox, your prod stays protected

    Each invocation spins up an isolated container with shell access and your toolchain. The agent runs commands inside that container. Production write access is opt-in, per-tool: read-only kubectl by default, write only if you explicitly grant it. Self-host the runtime via the open-source nairid daemon if your compliance posture requires it.
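The per-agent tool scoping and opt-in write access described in the steps above can be sketched as a small deny-by-default policy. This is a hypothetical model written for illustration; `ToolGrant`, `AgentPolicy`, and the tool names are assumptions and do not reflect Nairi's actual configuration format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToolGrant:
    """What one agent may do with one tool. Write access defaults to off."""
    read: bool = True
    write: bool = False


@dataclass
class AgentPolicy:
    name: str
    grants: Dict[str, ToolGrant] = field(default_factory=dict)  # tool name -> grant

    def allows(self, tool: str, action: str) -> bool:
        grant = self.grants.get(tool)
        if grant is None:
            return False  # tool is not connected to this agent
        if action == "read":
            return grant.read
        if action == "write":
            return grant.write
        return False  # unknown actions are denied


on_call = AgentPolicy(
    name="on-call",
    grants={
        "datadog": ToolGrant(),
        "kubectl": ToolGrant(),               # read-only unless write is granted
        "statuspage": ToolGrant(write=True),  # explicit opt-in for one tool
    },
)

for tool, action in [("kubectl", "read"), ("kubectl", "write"), ("statuspage", "write")]:
    print(f"{tool} {action}: {'allowed' if on_call.allows(tool, action) else 'denied'}")
```

Denying anything that is not explicitly granted means a newly connected tool starts read-only, and a tool the agent was never given stays unreachable.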

Put an on-call agent in your incident channel

Connect your observability tools, define the playbook, mention the agent in your next incident. Read-only by default, multi-person by default, self-host the runtime if you need to.

Questions about on-call use

What SRE and platform teams ask before they try it.

It saves you 15-20 minutes of dashboard-and-CLI hunting at the start of every incident, and it keeps a running summary of what's going on so people joining late get caught up immediately. Practically: you mention the agent in the incident channel, ask it specific questions ("what is the checkout error rate doing", "what deployed in the last hour", "tail logs from payment-service"), it runs the queries and posts the output back. Plus it can draft status updates and a postmortem timeline at the end.
Anything reachable via MCP, plus anything you can hand it as a shell tool. Common ones we see in practice: Datadog, Sentry, Grafana, Honeycomb, New Relic, PagerDuty, Statuspage, kubectl, gcloud, aws CLI, terraform, GitHub, LaunchDarkly. If a tool has an MCP server (we list common ones in the platform's MCP marketplace) the agent can call it directly. If it doesn't, the agent can shell out to its CLI inside the sandbox.
Via the org vault. You add Datadog API keys, AWS credentials, kubectl configs once, in one place. The vault injects them into the agent's sandbox at runtime via a secret proxy, so the agent never sees the plaintext value in its context or its logs. The credentials live at the org level, not on individual engineers' laptops.
Only if you let it. By default the on-call agent is read-only: it can query Datadog, tail logs, read GitHub, list pods. Write actions like restarting pods, rolling back a deploy, or paging additional responders are opt-in per tool. Most teams start fully read-only and add specific write permissions after a few weeks of read-only operation.
Yes. The agent is owned by the org, not by an individual engineer. Anyone in the incident channel can mention it. The agent reads the full thread context for each request, so the SRE asking about Datadog metrics and the eng lead asking about the deploy timeline are talking to the same agent with shared context. Compare against per-user agent products where each responder has their own session and the agent has no idea what the other responders just discovered.
Yes. Reusable skills capture procedures like "checkout incident playbook: check error rate, then payment provider status, then deploy timeline, then card-vault credentials". Custom rules at the agent level capture how the agent should communicate (no jargon in customer-facing updates, always cite the runbook URL, escalate to a human after N minutes without resolution). The playbook is config, not prose buried in Notion.
Runbooks tell a human what to do. The agent does it. The two stack: keep the runbooks for the human responder, give the agent the runbook URL so it can follow the same procedure for the read-only diagnostic steps. The agent gets you to the same place a senior engineer with five years of runbook knowledge would, just in 30 seconds instead of 15 minutes, and it does it the same way every time without skipping a step at 3am.
Yes. The nairid daemon is open-source Go and runs the full agent loop, including the sandbox where commands execute, on your hardware. Slack/Discord messages route through the Nairi backend for delivery, but the actual agent reasoning, tool calls, and any production data the agent reads stay inside the daemon you control. Useful when the incident data itself is sensitive (PII, payment data, regulated workloads).
Set up one agent with read-only access to your primary observability tool (Datadog or Sentry or whichever you live in). Add it to your incident channel as an experiment for a week. Ask it questions you'd normally ask a dashboard. The first time it catches a correlation a human missed, the team will adopt it on their own. Then expand: add kubectl read-only, add the runbook library, add postmortem drafting. No need to rewrite the process up front.