Use case
An on-call agent in your incident channel
Mention Nairi in the incident channel. The agent queries Datadog, tails logs, correlates against recent deploys, and posts a structured summary while your on-call stays focused on the fix.
What this looks like
Most of the first 20 minutes of an incident is the on-call hunting through dashboards, CLIs, and Slack history to figure out what is actually happening. Most of those queries are the same every time. Most of them are reads.
Nairi sits in the incident channel and does the reads for you. Mention it, ask a specific question (error rate on checkout, last deploy to payments, tail logs from the postgres replica), and get back a structured answer with the actual numbers and the relevant URL. It works across multiple people and threads, with shared org context. The agent has the same observability tools the on-call already uses, set up once at the org level via MCP and the vault.
It also drafts the customer status update and the postmortem timeline. Production write actions like restarting pods or rolling a deploy are opt-in per tool; most teams start fully read-only.
What it does during an incident
Starting points, not the full set. The agent uses whatever tools you give it.
Pull metrics, logs, and traces on demand
Mention @Nairi in the incident channel and ask "what does payments error rate look like in the last 30 minutes?" The agent queries the observability stack you have connected (Datadog, Sentry, Grafana, Honeycomb, etc.) and posts back a structured answer.
Correlate against recent deploys and config changes
Ask "any deploys to the payments service today?" or "what changed in the last hour?" The agent reads from GitHub, your CD system, and your feature-flag tool, and lines up the timeline against the error spike.
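As a rough illustration of what "lining up the timeline" can mean, the sketch below filters a list of changes down to the ones that landed shortly before an error spike. It is not Nairi's implementation: the `Change` record, the `changes_before_spike` helper, and the one-hour lookback are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A deploy, config change, or flag flip (hypothetical record shape)."""
    source: str        # e.g. "github", "cd", "feature-flags"
    description: str
    at: datetime


def changes_before_spike(changes, spike_start, lookback=timedelta(hours=1)):
    """Return changes inside the lookback window before the spike, newest first."""
    window_start = spike_start - lookback
    candidates = [c for c in changes if window_start <= c.at <= spike_start]
    return sorted(candidates, key=lambda c: c.at, reverse=True)


# Illustrative data only.
spike = datetime(2024, 1, 1, 12, 0)
history = [
    Change("cd", "payments deploy", datetime(2024, 1, 1, 11, 50)),
    Change("feature-flags", "flag flipped", datetime(2024, 1, 1, 9, 0)),
    Change("github", "config merged", datetime(2024, 1, 1, 11, 20)),
]
suspects = changes_before_spike(history, spike)
for change in suspects:
    minutes = int((spike - change.at).total_seconds() // 60)
    print(f"{minutes} min before spike: [{change.source}] {change.description}")
```

Sorting newest first puts the change closest to the spike at the top, which is usually the first one a responder wants to rule out.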
Read runbooks and prior postmortems
Point the agent at your Notion or Confluence runbook library and your prior postmortems. When a similar incident hits, the agent surfaces the matching runbook and the relevant past write-up before anyone goes hunting through links.
Run safe diagnostic commands
Give the agent read-only kubectl, gcloud, or aws CLI access via its sandbox, then ask it to "tail the last 200 lines from the payment-service pods" or "list the last 5 CloudWatch alarms in us-east-1". The agent runs the command and posts the output, so nobody copy-pastes across three tabs.
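One way to think about "safe diagnostic commands" is an allowlist of subcommands that only read state. The sketch below is a hypothetical guard, not how Nairi enforces access: the `READ_ONLY_KUBECTL` set and the `is_read_only_kubectl` helper are assumptions for illustration, and the check is deliberately conservative (the subcommand must come directly after `kubectl`, and anything containing shell metacharacters is rejected).

```python
import shlex

# Assumed allowlist of kubectl subcommands that only read cluster state.
READ_ONLY_KUBECTL = {"get", "describe", "logs", "top", "version"}

# Characters that could chain or redirect commands if a shell ever saw them.
SHELL_METACHARACTERS = set(";|&`$<>\n")


def is_read_only_kubectl(command: str) -> bool:
    """True only for a plain kubectl call whose subcommand is allowlisted."""
    if SHELL_METACHARACTERS & set(command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    # Conservative: the subcommand must be the token right after "kubectl".
    return tokens[1] in READ_ONLY_KUBECTL


print(is_read_only_kubectl("kubectl logs -l app=payment-service --tail=200"))
print(is_read_only_kubectl("kubectl delete pod payment-service-abc"))
```

A string check like this is only a first filter. Credentials that are themselves read-only (for example a Kubernetes role limited to read verbs) remain the stronger control.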
Draft the status update
Ask "draft a status update for the customer banner". The agent reads the channel transcript and the metrics it already pulled, and writes a non-jargon update you can paste into Statuspage or the customer Slack.
Co-author the postmortem after
When the incident closes, ping the agent to compile a timeline from the channel transcript, the deploy timeline, and the metric correlations. First-draft postmortem in five minutes instead of an hour.
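Compiling a timeline from several sources is essentially a merge of time-ordered event streams. A minimal sketch, assuming each source (channel transcript, deploy history, metric readings) is already sorted by time; the `Event` shape and the `build_timeline` and `render` helpers are hypothetical names chosen for the example.

```python
import heapq
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str   # "channel", "deploy", "metrics", ...
    text: str


def build_timeline(*sources):
    """Merge already time-sorted event lists into one chronological list."""
    return list(heapq.merge(*sources, key=lambda e: e.at))


def render(timeline):
    """Format the timeline as one line per event for a postmortem draft."""
    return "\n".join(f"{e.at:%H:%M} [{e.source}] {e.text}" for e in timeline)


# Illustrative data only.
channel = [
    Event(datetime(2024, 1, 1, 3, 2), "channel", "page acknowledged"),
    Event(datetime(2024, 1, 1, 3, 20), "channel", "rollback decided"),
]
deploys = [Event(datetime(2024, 1, 1, 2, 55), "deploy", "checkout-service deployed")]
metrics = [Event(datetime(2024, 1, 1, 3, 0), "metrics", "error rate spike")]

timeline = build_timeline(channel, deploys, metrics)
print(render(timeline))
```

Keeping the source label on every line makes it easy to trace each timeline entry back to where it came from when editing the draft.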
Three scenarios teams use it for
From the first page through the postmortem.
01
3am page, you want context before you open a laptop
You acknowledge the page and ping Nairi from your phone: "what is going on with checkout?" The agent queries Datadog for the latest error rate, fetches recent Sentry events, pulls the deploy timeline for checkout-service, and posts a one-paragraph hypothesis back in the channel. You read the summary in the back of an Uber, not after twenty minutes of dashboard hunting.
02
Incident commander coordinating multiple responders
Create #inc-payments-down, invite the on-call, mention the agent: "summarize what we know so far, every 10 minutes". The agent keeps a running summary at the top of the channel that pulls together every metric reading, every command output, and every decision the team has made. A new responder joining the channel gets caught up in 30 seconds without scrolling through 400 messages.
03
Postmortem co-author
After the page clears, in the same channel: "compile a postmortem timeline from this incident, include the metric correlations, the deploy that triggered it, the rollback decision, and the customer-facing duration". The agent writes a first-draft postmortem that cites the channel transcript and the monitoring data. You edit, you publish, you go to bed.
How it works
One agent, set up once, scoped to the on-call context. Production-safe defaults.
- 1
Connect your observability and infra tools via MCP
Add Datadog, Sentry, Grafana, Honeycomb, PagerDuty, Statuspage, kubectl, AWS/GCP/Azure, or anything else through the MCP server marketplace. The credentials live in the org vault, not in every engineer's config file. Per-agent: pick which tools each agent can reach.
- 2
Define the on-call agent's rules and skills
Reusable skills like "always check error rate, then deploy timeline, then recent config changes" or "draft updates in plain English, no internal jargon". Custom rules per agent. The on-call agent has its own playbook; it isn't the same agent as the one that opens PRs.
- 3
Mention it in the incident channel
Same Slack or Discord channel your team already pages in. The agent reads the conversation context, runs the queries, and posts back in the same thread. Multi-person collaboration is the default. Everyone in the channel can ask the agent things, and the agent shares context across the conversation.
- 4
It runs in its sandbox; your prod stays protected
Each invocation spins up an isolated container with shell access and your toolchain. The agent runs commands inside that container. Production write access is opt-in, per-tool: read-only kubectl by default, write only if you explicitly grant it. Self-host the runtime via the open-source nairid daemon if your compliance posture requires it.
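The access model described in the steps above (tools chosen per agent, read-only by default, write only when explicitly granted) can be sketched as a small policy check. This is an illustration under assumptions, not Nairi's configuration format or API: `ToolPolicy`, `Access`, and the field names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"


@dataclass
class ToolPolicy:
    """Hypothetical per-agent policy: which tools are reachable, and which may write."""
    connected: set = field(default_factory=set)       # tools this agent can reach
    write_granted: set = field(default_factory=set)   # explicit opt-in for writes

    def allows(self, tool: str, access: Access) -> bool:
        if tool not in self.connected:
            return False  # tool was never given to this agent
        if access is Access.READ:
            return True   # reads are the default for any connected tool
        return tool in self.write_granted  # writes need an explicit grant


# An on-call agent that starts fully read-only.
on_call = ToolPolicy(connected={"kubectl", "datadog"})
print(on_call.allows("kubectl", Access.READ))
print(on_call.allows("kubectl", Access.WRITE))
```

Modeling write access as a separate, initially empty set means a newly connected tool cannot modify production until someone adds it on purpose.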
Put an on-call agent in your incident channel
Connect your observability tools, define the playbook, mention the agent in your next incident. Read-only by default, multi-person by default, self-host the runtime if you need to.
Questions about on-call use
What SRE and platform teams ask before they try it.