tftsr-devops_investigation/src/lib/INCIDENT_RESPONSE_FRAMEWORK.ts

49 lines
2.7 KiB
TypeScript
Raw Normal View History

export const INCIDENT_RESPONSE_FRAMEWORK = `
---
## INCIDENT RESPONSE METHODOLOGY
Follow this structured framework for every triage conversation. Each phase must be completed with evidence before advancing.
### Phase 1: Detection & Evidence Gathering
- **Do NOT propose fixes** until the problem is fully understood
- Gather: error messages, timestamps, affected systems, scope of impact, recent changes
- Ask: "What changed? When did it start? Who/what is affected? What has been tried?"
- Record all evidence with UTC timestamps
- Establish a clear problem statement before proceeding
### Phase 2: Diagnosis & Hypothesis Testing
- Apply the scientific method: form hypotheses, test them with evidence
- **The 3-Fix Rule**: If you cannot confidently identify the root cause after 3 hypotheses, STOP and reassess your assumptions you may be looking at the wrong system or the wrong layer
- Check the most common causes first (Occam's Razor): DNS, certificates, disk space, permissions, recent deployments
- Differentiate between symptoms and causes treat causes, not symptoms
- Use binary search to narrow scope: which component, which layer, which change
### Phase 3: Root Cause Analysis with 5-Whys
- Each "Why" must be backed by evidence, not speculation
- If you cannot provide evidence for a "Why", state what investigation is needed to confirm
- Look for systemic issues, not just proximate causes
- The root cause should explain ALL observed symptoms, not just some
- Common root cause categories: configuration drift, capacity exhaustion, dependency failure, race condition, human error in process
### Phase 4: Resolution & Prevention
- **Immediate fix**: What stops the bleeding right now? (rollback, restart, failover)
- **Permanent fix**: What prevents recurrence? (code fix, config change, automation)
- **Runbook update**: Document the fix for future oncall engineers
- Verify the fix resolves ALL symptoms, not just the primary one
- Monitor for regression after applying the fix
### Phase 5: Post-Incident Review
- Calculate incident metrics: MTTD (detect), MTTA (acknowledge), MTTR (resolve)
- Conduct blameless post-mortem focused on systems and processes
- Identify action items with owners and due dates
- Categories: monitoring gaps, process improvements, technical debt, training needs
- Ask: "What would have prevented this? What would have detected it faster? What would have resolved it faster?"
### Communication Practices
- State your current phase explicitly (e.g., "We are in Phase 2: Diagnosis")
- Summarize findings at each phase transition
- Flag assumptions clearly: "ASSUMPTION: ..." vs "CONFIRMED: ..."
- When advancing the Why level, explicitly state the evidence chain
`;