Automating the First 15 Minutes of an Outage

Pull device state, last changes, and topology context automatically so the team spends time fixing, not searching.

Tags: Automation, Incident Response, ISP Operations, NOC

Why the first 15 minutes matter

Most outages aren’t solved in the first 15 minutes, but most are either stabilized or made worse in that window.

The difference is usually context: what changed, what’s impacted, and where the blast radius stops.

What we’re optimizing
Faster context → faster hypothesis → fewer wrong escalations → fewer customers affected.

What good triage should produce

Impact summary: who is down, where, and how badly.
Probable scope: single device vs node vs upstream provider.
Recent change context: configs, deployments, maintenance windows.
First hypothesis + next action: what we’re testing first and why.

Inputs you can auto-collect

Before anyone starts guessing in chat, your system should attach the same basic context to every incident. Start simple: one structured incident note containing the essentials.

Affected services (PPPoE, DHCP, DNS, transit, core, last-mile nodes)
Top alarms in the last 5–15 minutes (deduped by device + trigger)
Device status snapshot (CPU, memory, interfaces, optical levels where available)
Last change (config diff, last commit, last maintenance action, last deploy)
Topology context (upstream/downstream neighbors, site/node mapping)
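As one illustration of the "last change" input, here is a minimal sketch that turns two saved config snapshots into a short diff to attach to the incident note. It assumes you already keep the previous and current configs as text; the function name `config_diff`, the device name, and the sample config lines are hypothetical.

```python
import difflib


def config_diff(before: str, after: str, device: str) -> str:
    """Return a unified diff between two config snapshots as one string."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{device} (previous)",
        tofile=f"{device} (current)",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical snapshots; in practice these would come from your config backups.
previous = "interface ge-0/0/1\n mtu 1500\n"
current = "interface ge-0/0/1\n mtu 9000\n"

diff_text = config_diff(previous, current, "core-1")
print(diff_text)
```

An empty result means the two snapshots are identical, which is itself useful to record: it lets the team rule out a config change on that device early.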

Keep it actionable
Avoid dumping raw logs. Summarize + link. Your goal is speed and clarity, not completeness.
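A rough sketch of what "summarize" can mean for alarms: collapse raw events into counts per device + trigger within the recent window, and keep only the top few. The alarm dictionary shape (`device`, `trigger`, `time`), the function name `top_alarms`, and the default limit of five are assumptions for illustration, not a specific monitoring system's format.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_alarms(alarms, now, window_minutes=15, limit=5):
    """Count alarms per (device, trigger) inside the window, most frequent first."""
    cutoff = now - timedelta(minutes=window_minutes)
    counts = Counter(
        (a["device"], a["trigger"])
        for a in alarms
        if cutoff <= a["time"] <= now
    )
    return [
        {"device": device, "trigger": trigger, "count": count}
        for (device, trigger), count in counts.most_common(limit)
    ]


# Hypothetical alarm events.
now = datetime(2024, 1, 1, 12, 0)
alarms = [
    {"device": "olt-1", "trigger": "link_down", "time": now - timedelta(minutes=2)},
    {"device": "olt-1", "trigger": "link_down", "time": now - timedelta(minutes=3)},
    {"device": "core-1", "trigger": "bgp_down", "time": now - timedelta(minutes=40)},
    {"device": "bras-2", "trigger": "high_cpu", "time": now - timedelta(minutes=10)},
]

summary = top_alarms(alarms, now)
for row in summary:
    print(f'{row["device"]} {row["trigger"]} x{row["count"]}')
```

Here the repeated `olt-1` alarm becomes a single line with a count, and the 40-minute-old `core-1` alarm falls outside the window. The raw events stay in the monitoring system; the note only carries the summary and a link.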

A simple first-15 workflow

Declare incident + owner (one person drives the checklist).
Capture impact in one sentence (region/service/customer count if known).
Attach auto-collected context (status snapshot, top alarms, last change, topology).
Pick first hypothesis (based on evidence, not preference).
Run 1–2 fast checks to confirm/deny (ping/trace, interface errors, BGP state).
If unclear: narrow scope (single site? upstream? node?).
Update stakeholders with impact + next update time (every 10–15 mins).

Copy-paste incident note template

INCIDENT: [Short name]
OWNER: [Name]
SEVERITY: [S1 / S2 / S3]
START: [Time]
STATUS: Investigating / Mitigating / Monitoring

IMPACT (1 sentence):

[Who / where / what is affected]

CURRENT SIGNAL:

Top alarms:
Affected nodes/sites:
Customer reports:

AUTO-CONTEXT LINKS:

Device snapshot:
Last change / config diff:
Topology view:

FIRST HYPOTHESIS:

[What we think is happening + why]

NEXT ACTIONS (next 15 mins):

Check …
Verify …
Escalate to … if …
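If the note is created by automation rather than by hand, the template above could be filled from structured fields. The sketch below is one possible shape, not a prescribed schema: the `IncidentNote` class, its field names, and the example URL are hypothetical, and it renders only a subset of the template's lines to stay short.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentNote:
    name: str
    owner: str
    severity: str  # "S1", "S2", or "S3"
    start: str
    impact: str  # one sentence: who / where / what is affected
    status: str = "Investigating"
    top_alarms: list[str] = field(default_factory=list)
    links: dict[str, str] = field(default_factory=dict)  # label -> URL
    hypothesis: str = ""
    next_actions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the note as plain text in the template's field order."""
        lines = [
            f"INCIDENT: {self.name}",
            f"OWNER: {self.owner}",
            f"SEVERITY: {self.severity}",
            f"START: {self.start}",
            f"STATUS: {self.status}",
            "",
            f"IMPACT: {self.impact}",
            "",
            "CURRENT SIGNAL:",
            "Top alarms: " + ("; ".join(self.top_alarms) or "none collected"),
            "",
            "AUTO-CONTEXT LINKS:",
        ]
        lines += [f"{label}: {url}" for label, url in self.links.items()]
        lines += ["", "FIRST HYPOTHESIS:", self.hypothesis or "[not set yet]"]
        lines += ["", "NEXT ACTIONS (next 15 mins):", *self.next_actions]
        return "\n".join(lines)


# Hypothetical incident values.
note = IncidentNote(
    name="PPPoE sessions dropping at site A",
    owner="on-call engineer",
    severity="S2",
    start="12:00",
    impact="Customers at site A cannot establish PPPoE sessions.",
    top_alarms=["olt-1 link_down x2"],
    links={"Device snapshot": "https://example.invalid/snapshot/olt-1"},
    next_actions=["Check interface errors on olt-1"],
)
text = note.render()
print(text)
```

Leaving the hypothesis as an explicit "[not set yet]" placeholder keeps the gap visible, so the owner is prompted to fill it rather than skip it.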

Common pitfalls

Too many people “driving” — assign one owner early.
No timeboxes — always state the next update time.
Log dumping — summarize, then link to details.
No change context — most incidents correlate with a recent change.

If you can’t explain the impact clearly, you’re not ready to troubleshoot yet. — NOC principle

Next steps

Start with one automation: create an incident note and auto-attach device snapshot + last change + topology.

Once that’s reliable, add deduped alarm summaries and a lightweight “suspected scope” label (single device / node / upstream).
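A "suspected scope" label can start as a very crude rule over the topology mapping. The sketch below assumes a hypothetical `device_to_node` dictionary and treats "alarms across several nodes" as a hint toward upstream; that rule is an assumption for illustration, and the label is a starting hypothesis to test, not a diagnosis.

```python
def suspected_scope(affected_devices, device_to_node):
    """Return a rough scope label: single device, node, upstream, or unknown."""
    devices = set(affected_devices)
    if not devices:
        return "unknown"
    if len(devices) == 1:
        return "single device"

    nodes = {device_to_node.get(d) for d in devices}
    if None in nodes:
        # At least one device is missing from the topology map; do not guess.
        return "unknown"
    if len(nodes) == 1:
        return "node"
    # Several nodes affected at once: suspect something they share upstream.
    return "upstream"


# Hypothetical site/node mapping.
device_to_node = {"olt-1": "node-a", "olt-2": "node-a", "olt-3": "node-b"}

print(suspected_scope(["olt-1"], device_to_node))
print(suspected_scope(["olt-1", "olt-2"], device_to_node))
print(suspected_scope(["olt-1", "olt-3"], device_to_node))
```

Returning "unknown" when a device is unmapped is deliberate: a wrong scope label can send the escalation in the wrong direction, so a visible gap in the topology data is safer than a guess.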
