Incident Cheatsheet - Read And Download In Case Of Emergency

July 1, 2025
By Alla B
Keep Calm And Follow A Proven Approach
Incident Management Cheatsheet (PDF): Innovective Incident Cheatsheet.pdf (1.2 MB)

If your team already has an incident process, follow that instead. Don't debate processes during an active incident - use what you have and improve it later.

STOP, THINK, then ACT

Before You Touch Anything

Write down exactly what you're seeing and when it started.
Screenshot any errors or dashboards that show the problem.
Check whether customers are actually affected (don't assume).
Start a shared document or chat thread and paste the incident link at the top.
Write down everything you do, with times: commands, changes, rollbacks - so you can undo each step.
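The timestamped action log can be kept in something as small as the sketch below. This is an illustration only: `IncidentLog`, `note`, `undo_steps` and the example entries are hypothetical names, not part of any incident tool, and a shared document works just as well.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Running timeline of observations and actions (hypothetical helper)."""

    entries: list = field(default_factory=list)

    def note(self, text, undo=None, at=None):
        # Record what happened, when, and how to reverse it if that is known.
        at = at or datetime.now(timezone.utc)
        self.entries.append({"at": at, "text": text, "undo": undo})

    def undo_steps(self):
        # Undo instructions, most recent change first.
        return [e["undo"] for e in reversed(self.entries) if e["undo"]]

    def render(self):
        # One line per entry, ready to paste into the shared document.
        return "\n".join(
            f"{e['at']:%H:%M:%S} UTC  {e['text']}" for e in self.entries
        )


log = IncidentLog()
log.note("5xx spike on checkout dashboard")  # observation, nothing to undo
log.note("Set flag new_cart=off", undo="Set flag new_cart=on")  # example change
print(log.render())
```

Storing the undo instruction next to each change means the rollback list is already written when a fix does not work.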

Things NOT To Do

Don't panic-deploy. Review changes even under pressure.
Don't act before thinking. If you are still thinking - hands off the keyboard.
Don't work in isolation. Communicate even if you're confident.
Don't guess or assume. Use information, metrics and expertise.
Don't forget rollback. If your fix doesn't work, undo it.
Don't assume resolution. Verify that the incident has been resolved.

RESPOND

Assess & Communicate

Severity check: Is this breaking core user flows? Is there revenue impact? Data loss?
Post in the incident channel: "Investigating [brief description] - will update in 10 min".
Check recent deployments - anything in the last 2 hours?
Look at monitoring dashboards for obvious spikes or drops.
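If deployment history can be exported as a list of records, the "last 2 hours" check is a simple time-window filter. The sketch below assumes each record is a dict with `service` and `deployed_at` keys; those field names and the example services are made up for illustration, so adapt them to whatever your deploy tool actually returns.

```python
from datetime import datetime, timedelta, timezone


def recent_deployments(deployments, now, window=timedelta(hours=2)):
    """Return deployments made within `window` before `now`, newest first."""
    cutoff = now - window
    recent = [d for d in deployments if cutoff <= d["deployed_at"] <= now]
    return sorted(recent, key=lambda d: d["deployed_at"], reverse=True)


# Hypothetical deployment history; in practice this comes from your CI/CD tool.
now = datetime(2025, 7, 1, 12, 0, tzinfo=timezone.utc)
history = [
    {"service": "checkout", "deployed_at": datetime(2025, 7, 1, 11, 30, tzinfo=timezone.utc)},
    {"service": "search", "deployed_at": datetime(2025, 7, 1, 8, 15, tzinfo=timezone.utc)},
]
for deploy in recent_deployments(history, now):
    print(deploy["service"], deploy["deployed_at"].isoformat())
```

Newest-first ordering puts the most likely rollback candidate at the top of the list.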

Quick Wins

Check if it's a known issue with an existing workaround.
Can you roll back recent changes? (If yes, do it - investigate later.)
Is this a third-party issue outside of your control? Reach out to their support and focus on keeping stakeholders informed.
Restart services if they have clearly crashed (and note what you restarted).
Scale up if it's obviously a capacity issue. You can scale down after the issue is resolved.

ESCALATE

Escalate Immediately If

Customer data might be compromised.
Revenue-generating features have been down for more than 30 minutes.
You don't understand the system well enough to investigate safely.
Multiple capabilities are failing.
You've been stuck for 30+ minutes with no progress.
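The escalation triggers above can be written down as an explicit check, which removes the temptation to argue with yourself mid-incident. This is a minimal sketch: the `IncidentState` fields and the `escalation_reasons` function are hypothetical names, and only the thresholds come from the list above.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    """Snapshot of the incident as the responder currently sees it."""

    data_possibly_compromised: bool = False
    revenue_feature_down_minutes: int = 0
    responder_understands_system: bool = True
    failing_capabilities: int = 0
    minutes_without_progress: int = 0


def escalation_reasons(state):
    """Return every trigger that currently applies; a non-empty list means escalate."""
    reasons = []
    if state.data_possibly_compromised:
        reasons.append("Customer data might be compromised")
    if state.revenue_feature_down_minutes > 30:
        reasons.append("Revenue-generating feature down for more than 30 min")
    if not state.responder_understands_system:
        reasons.append("Responder cannot investigate this system safely")
    if state.failing_capabilities > 1:
        reasons.append("Multiple capabilities are failing")
    if state.minutes_without_progress >= 30:
        reasons.append("Stuck for 30+ min with no progress")
    return reasons


state = IncidentState(failing_capabilities=2, minutes_without_progress=35)
for reason in escalation_reasons(state):
    print("ESCALATE:", reason)
```

Returning the reasons rather than a bare yes/no gives you the first lines of the handoff message for free.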

How To Escalate

Tag relevant team leads in the incident channel.
Provide a clear handoff: what you've tried, what you learned, current status.
Don't disappear - offer to help with testing and monitoring.

INVESTIGATE

Gather Information

Check logs for errors around the start time.
Look at recent alerts/monitoring data.
Ask: "What changed?" - deployments, config, infrastructure, dependencies.
Check external services/APIs - are they having issues?
Test the happy path - does basic functionality work?
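When logs are only available as raw text, narrowing them to errors around the start time can be sketched as below. The log format is an assumption: each line is taken to begin with an ISO 8601 timestamp (no timezone suffix) followed by a space, and errors are taken to contain the word `ERROR`. The 10-minute and 30-minute margins are also illustrative defaults, not recommendations from this cheatsheet.

```python
from datetime import datetime, timedelta


def errors_near(lines, start, before=timedelta(minutes=10), after=timedelta(minutes=30)):
    """Return ERROR lines stamped between `start - before` and `start + after`."""
    hits = []
    for line in lines:
        stamp, _, rest = line.partition(" ")
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # no leading timestamp (stack trace, blank line): skip it
        if start - before <= ts <= start + after and "ERROR" in rest:
            hits.append(line)
    return hits


# Hypothetical log excerpt for illustration.
sample = [
    "2025-07-01T08:55:12 ERROR db connection timeout",
    "2025-07-01T09:01:00 INFO request ok",
    "Traceback (most recent call last):",
]
for hit in errors_near(sample, start=datetime(2025, 7, 1, 9, 0)):
    print(hit)
```

Looking a few minutes before the reported start time is deliberate, since the first error often predates the moment someone noticed.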

Document Everything

Update the incident channel every 15 minutes, even if the update is "still investigating".
Keep a running log of what you've checked and the results.
Screenshot important error messages and metrics.
Note any temporary workarounds you've tried.

RECOVER

Before Declaring Victory

Test the customer journey end-to-end.
Watch metrics for at least 15 minutes - are they back to normal?
Ask team members to spot-check their areas.
Verify the fix doesn't break anything else.
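The 15-minute metrics check can be made explicit instead of eyeballed. The sketch below assumes you can pull `(timestamp, value)` samples for one metric collected since the fix was applied, and that you know a normal range for it. The function name `stable_for`, the range and the sample values are all hypothetical.

```python
from datetime import datetime, timedelta


def stable_for(samples, now, low, high, window=timedelta(minutes=15)):
    """True if we have watched for at least `window` and every sample
    inside the window lies within the normal range [low, high]."""
    cutoff = now - window
    # At least one sample at or before the cutoff proves we watched long enough.
    observed_long_enough = any(ts <= cutoff for ts, _ in samples)
    recent = [value for ts, value in samples if cutoff <= ts <= now]
    return (
        observed_long_enough
        and bool(recent)
        and all(low <= value <= high for value in recent)
    )


# Hypothetical latency samples (ms), one every 5 minutes after the fix.
now = datetime(2025, 7, 1, 10, 0)
samples = [(now - timedelta(minutes=m), 120) for m in (20, 15, 10, 5, 0)]
print("Safe to declare resolved:", stable_for(samples, now, low=50, high=200))
```

Requiring a sample older than the window prevents declaring victory after only a few minutes of good data.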

After Resolution

Update the status page if you have one.
Thank everyone who helped.
Schedule a post-mortem within the next 48 hours.
Document lessons learned.

Strengthen Your Incident Management & Resilience

Don't wait for the next outage to find your gaps. We help teams build robust on-call, incident response, and disaster recovery processes, drawing on experience that ranges from early-stage startups to high-volume banks.

Communication Templates

🚨 INCIDENT: [Brief description]
Impact: [Customer-facing? Internal only?]
Started: [Time noticed]
Investigating: [Your name]
Updates: Every 15 min

ℹ️ UPDATE: [What you've found/tried]
Status: [Still investigating/Fix in progress/Resolved]
ETA: [Best guess or "Unknown"]
Next: [What you're doing next]

✅ RESOLVED: [What was broken]
Cause: [Root cause if known]
Fix: [What solved it]
Time to resolution: [Duration]
Post-mortem: [Will schedule/Not needed]
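If you post these messages from a script or bot, the templates map directly onto format strings. The sketch below fills in the RESOLVED template and computes the time to resolution from two timestamps. The function names and the example values are hypothetical; only the template wording comes from above.

```python
from datetime import datetime

RESOLVED_TEMPLATE = (
    "✅ RESOLVED: {what}\n"
    "Cause: {cause}\n"
    "Fix: {fix}\n"
    "Time to resolution: {duration}\n"
    "Post-mortem: {post_mortem}"
)


def format_duration(started, resolved):
    """Render the elapsed time as e.g. '45m' or '1h 35m'."""
    minutes = int((resolved - started).total_seconds() // 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m" if hours else f"{minutes}m"


def resolved_message(what, fix, started, resolved,
                     cause="Unknown", post_mortem="Will schedule"):
    """Fill in the RESOLVED template; the cause defaults to 'Unknown'."""
    return RESOLVED_TEMPLATE.format(
        what=what,
        cause=cause,
        fix=fix,
        duration=format_duration(started, resolved),
        post_mortem=post_mortem,
    )


print(resolved_message(
    what="Checkout returning 500s",
    fix="Rolled back latest checkout deploy",
    started=datetime(2025, 7, 1, 9, 0),
    resolved=datetime(2025, 7, 1, 10, 35),
))
```

Defaulting the cause to "Unknown" keeps the message honest when the root cause has not been confirmed yet.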