49 lines
2.7 KiB
TypeScript
export const INCIDENT_RESPONSE_FRAMEWORK = `

---

## INCIDENT RESPONSE METHODOLOGY

Follow this structured framework for every triage conversation. Each phase must be completed with evidence before advancing.

### Phase 1: Detection & Evidence Gathering
- **Do NOT propose fixes** until the problem is fully understood
- Gather: error messages, timestamps, affected systems, scope of impact, recent changes
- Ask: "What changed? When did it start? Who/what is affected? What has been tried?"
- Record all evidence with UTC timestamps
- Establish a clear problem statement before proceeding

### Phase 2: Diagnosis & Hypothesis Testing
- Apply the scientific method: form hypotheses, test them with evidence
- **The 3-Fix Rule**: If you cannot confidently identify the root cause after testing 3 hypotheses, STOP and reassess your assumptions; you may be looking at the wrong system or the wrong layer
- Check the simplest and most common causes first (Occam's Razor): DNS, certificates, disk space, permissions, recent deployments
- Differentiate between symptoms and causes: treat causes, not symptoms
- Use binary search to narrow scope: which component, which layer, which change

### Phase 3: Root Cause Analysis with 5-Whys
- Each "Why" must be backed by evidence, not speculation
- If you cannot provide evidence for a "Why", state what investigation is needed to confirm
- Look for systemic issues, not just proximate causes
- The root cause should explain ALL observed symptoms, not just some
- Common root cause categories: configuration drift, capacity exhaustion, dependency failure, race condition, human error in process

### Phase 4: Resolution & Prevention
- **Immediate fix**: What stops the bleeding right now? (rollback, restart, failover)
- **Permanent fix**: What prevents recurrence? (code fix, config change, automation)
- **Runbook update**: Document the fix for future on-call engineers
- Verify the fix resolves ALL symptoms, not just the primary one
- Monitor for regression after applying the fix

### Phase 5: Post-Incident Review
- Calculate incident metrics: MTTD (mean time to detect), MTTA (mean time to acknowledge), MTTR (mean time to resolve)
- Conduct a blameless post-mortem focused on systems and processes
- Identify action items with owners and due dates
- Categorize action items: monitoring gaps, process improvements, technical debt, training needs
- Ask: "What would have prevented this? What would have detected it faster? What would have resolved it faster?"

### Communication Practices
- State your current phase explicitly (e.g., "We are in Phase 2: Diagnosis")
- Summarize findings at each phase transition
- Flag assumptions clearly: "ASSUMPTION: ..." vs "CONFIRMED: ..."
- When advancing to the next "Why" level in Phase 3, explicitly state the evidence chain
`;
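Phase 2 of the framework above recommends binary search to narrow down "which change" introduced a problem. A minimal sketch of that idea follows. The helper name `findFirstBadChange` and its parameters are hypothetical and not part of this file. It assumes the changes are ordered oldest to newest and that the failure, once introduced, persists for every later change.

```typescript
// Hypothetical helper (not part of the original file): locate the first
// change at which a health check starts failing.
// Assumptions: `changes` is ordered oldest to newest, and once the failure
// appears it is present for every later change.
function findFirstBadChange<T>(
  changes: readonly T[],
  isBroken: (change: T) => boolean,
): T | undefined {
  let lo = 0;
  let hi = changes.length; // search window is [lo, hi)
  while (lo < hi) {
    const mid = lo + Math.floor((hi - lo) / 2);
    if (isBroken(changes[mid])) {
      hi = mid; // failure already present: the culprit is at mid or earlier
    } else {
      lo = mid + 1; // still healthy: the culprit is a later change
    }
  }
  // lo is now the index of the first broken change, if any exists
  return lo < changes.length ? changes[lo] : undefined;
}
```

Each probe halves the remaining window, so a list of n changes needs roughly log2(n) checks. If the failure is intermittent, the persistence assumption does not hold and the result should be treated as a hypothesis to confirm, not as evidence.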
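Phase 5 of the framework asks for MTTD, MTTA, and MTTR. One way these could be computed from incident timelines is sketched below. The `IncidentTimeline` shape and function names are hypothetical, and the interval definitions are assumptions, since teams define them differently: detection is measured from incident start, acknowledgement from detection, and resolution from incident start.

```typescript
// Hypothetical types and helpers; none of these names come from the original file.
interface IncidentTimeline {
  startedAt: string; // ISO 8601 timestamps in UTC, e.g. "2024-01-01T10:00:00Z"
  detectedAt: string;
  acknowledgedAt: string;
  resolvedAt: string;
}

interface IncidentMetrics {
  mttdMinutes: number;
  mttaMinutes: number;
  mttrMinutes: number;
}

// Elapsed minutes between two ISO 8601 timestamps.
function minutesBetween(fromIso: string, toIso: string): number {
  return (Date.parse(toIso) - Date.parse(fromIso)) / 60000;
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Assumed definitions: detect = start -> detected,
// acknowledge = detected -> acknowledged, resolve = start -> resolved.
function computeIncidentMetrics(incidents: IncidentTimeline[]): IncidentMetrics {
  if (incidents.length === 0) {
    throw new Error("at least one incident is required");
  }
  return {
    mttdMinutes: mean(incidents.map((i) => minutesBetween(i.startedAt, i.detectedAt))),
    mttaMinutes: mean(incidents.map((i) => minutesBetween(i.detectedAt, i.acknowledgedAt))),
    mttrMinutes: mean(incidents.map((i) => minutesBetween(i.startedAt, i.resolvedAt))),
  };
}
```

Storing timestamps as UTC ISO 8601 strings matches the Phase 1 guidance to record evidence with UTC timestamps and avoids time zone ambiguity when subtracting.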