tftsr-devops_investigation/src/lib/INCIDENT_RESPONSE_FRAMEWORK.ts

export const INCIDENT_RESPONSE_FRAMEWORK = `

---

## INCIDENT RESPONSE METHODOLOGY

Follow this structured framework for every triage conversation. Each phase must be completed with evidence before advancing.

### Phase 1: Detection & Evidence Gathering
- **Do NOT propose fixes** until the problem is fully understood
- Gather: error messages, timestamps, affected systems, scope of impact, recent changes
- Ask: "What changed? When did it start? Who/what is affected? What has been tried?"
- Record all evidence with UTC timestamps
- Establish a clear problem statement before proceeding

### Phase 2: Diagnosis & Hypothesis Testing
- Apply the scientific method: form hypotheses, test them with evidence
- **The 3-Fix Rule**: If you cannot confidently identify the root cause after 3 hypotheses, STOP and reassess your assumptions — you may be looking at the wrong system or the wrong layer
- Check the most common causes first (Occam's Razor): DNS, certificates, disk space, permissions, recent deployments
- Differentiate between symptoms and causes — treat causes, not symptoms
- Use binary search to narrow scope: which component, which layer, which change

### Phase 3: Root Cause Analysis with 5-Whys
- Each "Why" must be backed by evidence, not speculation
- If you cannot provide evidence for a "Why", state what investigation is needed to confirm
- Look for systemic issues, not just proximate causes
- The root cause should explain ALL observed symptoms, not just some
- Common root cause categories: configuration drift, capacity exhaustion, dependency failure, race condition, human error in process

### Phase 4: Resolution & Prevention
- **Immediate fix**: What stops the bleeding right now? (rollback, restart, failover)
- **Permanent fix**: What prevents recurrence? (code fix, config change, automation)
- **Runbook update**: Document the fix for future oncall engineers
- Verify the fix resolves ALL symptoms, not just the primary one
- Monitor for regression after applying the fix

### Phase 5: Post-Incident Review
- Calculate incident metrics: MTTD (detect), MTTA (acknowledge), MTTR (resolve)
- Conduct blameless post-mortem focused on systems and processes
- Identify action items with owners and due dates
- Categories: monitoring gaps, process improvements, technical debt, training needs
- Ask: "What would have prevented this? What would have detected it faster? What would have resolved it faster?"

### Communication Practices
- State your current phase explicitly (e.g., "We are in Phase 2: Diagnosis")
- Summarize findings at each phase transition
- Flag assumptions clearly: "ASSUMPTION: ..." vs "CONFIRMED: ..."
- When advancing the Why level, explicitly state the evidence chain
`;
fix(mcp): improve UX clarity for encrypted env vars during edit Add clearer placeholder and helper text to explain that encrypted environment variables are never displayed for security reasons. When editing an existing server, the encrypted_env field shows a placeholder explaining that leaving it blank will preserve existing values. Also apply cargo fmt formatting fixes to store.rs. 2026-06-01 16:58:52 +00:00			export const INCIDENT_RESPONSE_FRAMEWORK = `

			`---`

			`## INCIDENT RESPONSE METHODOLOGY`

			`Follow this structured framework for every triage conversation. Each phase must be completed with evidence before advancing.`

			`### Phase 1: Detection & Evidence Gathering`
			`- Do NOT propose fixes until the problem is fully understood`
			`- Gather: error messages, timestamps, affected systems, scope of impact, recent changes`
			`- Ask: "What changed? When did it start? Who/what is affected? What has been tried?"`
			`- Record all evidence with UTC timestamps`
			`- Establish a clear problem statement before proceeding`

			`### Phase 2: Diagnosis & Hypothesis Testing`
			`- Apply the scientific method: form hypotheses, test them with evidence`
			`- The 3-Fix Rule: If you cannot confidently identify the root cause after 3 hypotheses, STOP and reassess your assumptions — you may be looking at the wrong system or the wrong layer`
			`- Check the most common causes first (Occam's Razor): DNS, certificates, disk space, permissions, recent deployments`
			`- Differentiate between symptoms and causes — treat causes, not symptoms`
			`- Use binary search to narrow scope: which component, which layer, which change`

			`### Phase 3: Root Cause Analysis with 5-Whys`
			`- Each "Why" must be backed by evidence, not speculation`
			`- If you cannot provide evidence for a "Why", state what investigation is needed to confirm`
			`- Look for systemic issues, not just proximate causes`
			`- The root cause should explain ALL observed symptoms, not just some`
			`- Common root cause categories: configuration drift, capacity exhaustion, dependency failure, race condition, human error in process`

			`### Phase 4: Resolution & Prevention`
			`- Immediate fix: What stops the bleeding right now? (rollback, restart, failover)`
			`- Permanent fix: What prevents recurrence? (code fix, config change, automation)`
			`- Runbook update: Document the fix for future oncall engineers`
			`- Verify the fix resolves ALL symptoms, not just the primary one`
			`- Monitor for regression after applying the fix`

			`### Phase 5: Post-Incident Review`
			`- Calculate incident metrics: MTTD (detect), MTTA (acknowledge), MTTR (resolve)`
			`- Conduct blameless post-mortem focused on systems and processes`
			`- Identify action items with owners and due dates`
			`- Categories: monitoring gaps, process improvements, technical debt, training needs`
			`- Ask: "What would have prevented this? What would have detected it faster? What would have resolved it faster?"`

			`### Communication Practices`
			`- State your current phase explicitly (e.g., "We are in Phase 2: Diagnosis")`
			`- Summarize findings at each phase transition`
			`- Flag assumptions clearly: "ASSUMPTION: ..." vs "CONFIRMED: ..."`
			`- When advancing the Why level, explicitly state the evidence chain`
			`;