tftsr-devops_investigation/README.md
Shaun Arman d5e180740e
docs: remove all Gitea/Gogs/172.0.0.29 references; update to GitHub
Replace every remaining reference to the old Gitea infrastructure with the
new GitHub-hosted equivalents across all documentation, wiki pages, test
files, and historical ticket summaries.

- README.md: CI badge, clone URL, releases link, CI/CD section, project structure
- docs/wiki/CICD-Pipeline.md: full rewrite for GitHub Actions + ghcr.io
- docs/wiki/Home.md: CI badge, releases link, phase status, tech stack
- docs/wiki/Troubleshooting.md: rewrite CI troubleshooting for GitHub Actions
- docs/architecture/README.md: update CI/CD pipeline diagram
- AGENTS.md: CI/CD section, environment references
- PLAN.md: directory structure, pipeline table
- SECURITY_AUDIT.md: mark C3 and L4 findings as resolved
- ticket-git-cliff-changelog.md: workflow path updated
- tickets/ci-runner-speed-optimization.md: image registry updated
- 2026-hackathon_AgenticFeature.md: workflow path updated
- tests: workflow path assertions updated in all three test files
2026-06-01 16:18:34 -05:00

# Troubleshooting and RCA Assistant
A structured, AI-backed desktop tool for IT incident triage, 5-Whys root cause analysis, RCA document generation, and blameless post-mortems. Runs fully offline via Ollama local models, or connects to cloud AI providers.
Built with **Tauri 2** (Rust + WebView), **React 18**, **TypeScript**, and **SQLCipher AES-256** encrypted storage.
**CI status:** ![CI](https://github.com/msicie/apollo_nxt-trcaa/actions/workflows/test.yml/badge.svg) — all checks green (rustfmt · clippy · 64 Rust tests · tsc · vitest)
---
## Features
- **5-Whys AI Triage** — Guided root cause analysis via AI chat, with auto-detection of why levels 1–5
- **PII Sanitization** — Automatic detection and redaction of IPv4/IPv6, emails, tokens, passwords, SSNs, and more before any data leaves the machine
- **Multi-Provider AI** — OpenAI, Anthropic Claude, Google Gemini, Mistral, and local [Ollama](https://ollama.com) (offline)
- **Encrypted Database** — SQLCipher AES-256 encrypted SQLite; all issue history stays local
- **RCA + Post-Mortem Generation** — Auto-populated Markdown templates, exportable to `.md` and `.pdf`
- **Ollama Management** — Hardware detection, model recommendations, pull/delete models in-app
- **Audit Trail** — Every outbound data transfer is logged with a SHA-256 hash
- **Domain System Prompts** — Pre-built expert context for 8 IT domains (Linux, Windows, Network, Kubernetes, Databases, Virtualization, Hardware, Observability)
- **Image Attachments** — Upload and manage image files with PII detection and mandatory user approval
- **Integrations** *(v0.2, coming soon)* — Confluence, ServiceNow, Azure DevOps
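
To make the PII Sanitization feature listed above more concrete, here is a minimal sketch of regex-based redaction. It is not the app's Rust engine: the `redact` function, the two patterns, and the `[KIND]` placeholder format are illustrative assumptions, and only emails and IPv4 addresses are covered.

```typescript
// Hypothetical sketch of regex-based redaction; not the project's actual engine.
type PiiKind = "EMAIL" | "IPV4";

interface Redaction {
  kind: PiiKind;
  original: string;
}

// Deliberately simple patterns. The IPv4 pattern does not check that each
// octet is <= 255, so it can over-match strings such as version numbers.
const PII_PATTERNS: Array<[PiiKind, RegExp]> = [
  ["EMAIL", /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g],
  ["IPV4", /\b(?:\d{1,3}\.){3}\d{1,3}\b/g],
];

function redact(text: string): { sanitized: string; redactions: Redaction[] } {
  const redactions: Redaction[] = [];
  let sanitized = text;
  for (const [kind, pattern] of PII_PATTERNS) {
    // Replace every match with a placeholder and remember what was removed,
    // so a UI could show the user a before/after diff for approval.
    sanitized = sanitized.replace(pattern, (match: string) => {
      redactions.push({ kind, original: match });
      return `[${kind}]`;
    });
  }
  return { sanitized, redactions };
}

const demoRedaction = redact("Login failed for admin@example.com from 10.0.0.5");
console.log(demoRedaction.sanitized); // "Login failed for [EMAIL] from [IPV4]"
```

Returning the list of redactions alongside the sanitized text is one way to support a review step before anything is sent to a provider.
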
---
## Supported Domains
| Domain | Coverage |
|---|---|
| Linux | RHEL/OEL, systemd, journald, SELinux, kernel panics |
| Windows | Event IDs, WinRM, BSOD codes, Server 2019/2022 |
| Network | Fortigate, Cisco IOS, Aruba AOS-CX, Nokia SR-OS, VoIP SIP/RTP |
| Kubernetes | k3s, OpenShift, CrashLoopBackOff, OOMKill, etcd, Rancher |
| Databases | PostgreSQL WAL, Redis AOF/RDB, RabbitMQ, MSSQL |
| Virtualization | Proxmox VE/PBS, VDI sessions |
| Hardware | HPE Synergy 12000, DL-20/320/360/380, iLO event logs |
| Observability | Kibana/ECK, Elasticsearch shard failures |
---
## Architecture
| Component | Technology |
|---|---|
| App framework | Tauri 2.x (Rust + WebView) |
| Frontend | React 18 + TypeScript + Vite |
| UI | Tailwind CSS (custom shadcn-style components) |
| Database | rusqlite + `bundled-sqlcipher` (AES-256) |
| Secret storage | `tauri-plugin-stronghold` |
| State management | Zustand (persisted settings store with API key redaction) |
| AI providers | reqwest (async HTTP) |
| PII detection | regex + aho-corasick multi-pattern engine |
---
## Prerequisites
### System Libraries (Linux — Fedora/RHEL)
```bash
sudo dnf install -y \
  glib2-devel gtk3-devel webkit2gtk4.1-devel \
  libsoup3-devel openssl-devel librsvg2-devel
```
### System Libraries (Linux — Debian/Ubuntu)
```bash
sudo apt-get install -y \
  libwebkit2gtk-4.1-dev libssl-dev libgtk-3-dev \
  libayatana-appindicator3-dev librsvg2-dev patchelf pkg-config
```
### Toolchain
```bash
# Rust (minimum 1.88 — required by cookie_store, time, darling)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
# Node.js 22+ (via your package manager)
# Verify:
rustc --version # 1.88+
node --version # 22+
```
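If you want to script the version check above (for example in a setup helper), the comparison can be sketched as follows. The `meetsMinimum` and `parseVersion` helpers are hypothetical and not part of this repository. They only look at the first `major.minor[.patch]` number found in the command output.

```typescript
// Hypothetical helpers for checking `rustc --version` / `node --version` output.
function parseVersion(output: string): number[] | null {
  const m = output.match(/(\d+)\.(\d+)(?:\.(\d+))?/);
  if (!m) return null;
  // A missing patch component is treated as 0.
  return [Number(m[1]), Number(m[2]), Number(m[3] ?? 0)];
}

function meetsMinimum(output: string, minimum: string): boolean {
  const actual = parseVersion(output);
  const required = parseVersion(minimum);
  if (!actual || !required) return false;
  for (let i = 0; i < 3; i++) {
    const a = actual[i] ?? 0;
    const b = required[i] ?? 0;
    if (a !== b) return a > b; // first differing component decides
  }
  return true; // versions are equal
}

console.log(meetsMinimum("rustc 1.88.0", "1.88")); // true
console.log(meetsMinimum("v20.9.0", "22.0"));      // false
```
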
---
## Getting Started
```bash
# Clone
git clone https://github.com/msicie/apollo_nxt-trcaa.git
cd apollo_nxt-trcaa
npm install --legacy-peer-deps
# Development mode (hot reload)
source ~/.cargo/env
cargo tauri dev
# Production build
cargo tauri build
# Output: src-tauri/target/release/bundle/
```
---
## Releases
Pre-built installers are attached to each [tagged release](https://github.com/msicie/apollo_nxt-trcaa/releases):
| Platform | Format | Notes |
|---|---|---|
| Linux amd64 | `.deb`, `.rpm`, `.AppImage` | Standard package or universal AppImage |
| Windows amd64 | `.exe` (NSIS), `.msi` | Cross-compiled via mingw-w64 |
| Linux arm64 | `.deb`, `.rpm`, `.AppImage` | Built natively on arm64 runner |
| macOS ARM64 | `.dmg` | Native build on `macos-latest` |
| macOS Intel | `.dmg` | Native build on `macos-13` |
---
## AI Provider Setup
Launch the app and go to **Settings → AI Providers** to add a provider:
| Provider | API URL | Notes |
|---|---|---|
| OpenAI | `https://api.openai.com/v1` | Requires API key |
| Anthropic | `https://api.anthropic.com` | Requires API key |
| Google Gemini | `https://generativelanguage.googleapis.com` | Requires API key |
| Mistral | `https://api.mistral.ai/v1` | Requires API key |
| Ollama (local) | `http://localhost:11434` | No key needed — fully offline |
| Azure OpenAI | `https://<resource>.openai.azure.com/openai/deployments/<deployment>` | Requires API key |
| **AWS Bedrock (via LiteLLM)** | `http://localhost:8000/v1` | See [LiteLLM + AWS Bedrock](#litellm--aws-bedrock-setup) below |
For offline use, install [Ollama](https://ollama.com) and pull a model:
```bash
ollama pull llama3.2:3b # Good for most hardware (≥8 GB RAM)
ollama pull llama3.1:8b # Better quality (≥16 GB RAM)
```
Or use **Settings → Ollama** to pull models directly from within the app.
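
The RAM guidance in the comments above can be written as a small lookup. This is only a sketch: `recommendModel` is a hypothetical helper, and the app's own hardware detection and recommendation logic is not described in this README.

```typescript
// Hypothetical helper mapping available RAM to the model tags mentioned above.
function recommendModel(ramGiB: number): string | null {
  if (ramGiB >= 16) return "llama3.1:8b"; // better quality
  if (ramGiB >= 8) return "llama3.2:3b";  // good for most hardware
  return null; // below the smallest threshold mentioned in this README
}

console.log(recommendModel(32)); // "llama3.1:8b"
console.log(recommendModel(8));  // "llama3.2:3b"
```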
### LiteLLM + AWS Bedrock Setup
To use Claude via AWS Bedrock (ideal for enterprise environments with existing AWS contracts):
1. **Install LiteLLM:**
```bash
pip install 'litellm[proxy]'
```
2. **Create config file** at `~/.litellm/config.yaml`:
```yaml
model_list:
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/us.anthropic.claude-sonnet-4-6
      aws_region_name: us-east-1
      # Optionally specify aws_profile_name if not using default
general_settings:
  master_key: sk-your-secure-key  # Any value for API auth
```
3. **Start LiteLLM proxy:**
```bash
nohup litellm --config ~/.litellm/config.yaml --port 8000 > ~/.litellm/litellm.log 2>&1 &
```
4. **Configure in Troubleshooting and RCA Assistant:**
- Provider: **OpenAI** (OpenAI-compatible)
- Base URL: `http://localhost:8000/v1`
- API Key: `sk-your-secure-key` (from config)
- Model: `bedrock-claude`
For detailed setup including multiple AWS accounts and Claude Code integration, see the [LiteLLM + Bedrock wiki page](https://github.com/msicie/apollo_nxt-trcaa/wiki/LiteLLM-Bedrock-Setup).
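
Because the proxy is configured as an OpenAI-compatible provider, a request to it follows the usual chat-completions shape. The sketch below only builds the request, using the values from step 4. `buildChatRequest` is a hypothetical helper, not the app's Rust client, and the example prompt is made up.

```typescript
// Hypothetical request builder for an OpenAI-compatible chat completions endpoint.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): ChatRequest {
  return {
    // Strip trailing slashes so ".../v1/" and ".../v1" give the same URL.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  };
}

const chatRequest = buildChatRequest(
  "http://localhost:8000/v1",
  "sk-your-secure-key",
  "bedrock-claude",
  [{ role: "user", content: "Why did the pod enter CrashLoopBackOff?" }],
);
console.log(chatRequest.url); // "http://localhost:8000/v1/chat/completions"
```

Sending it would be a `POST` with these headers and body. The `model` field must match the `model_name` from the LiteLLM config, not the underlying Bedrock model ID.
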
---
## Triage Workflow
```
1. New Issue → Select domain, enter title and severity
2. Log Upload → Drag-and-drop log files, review PII redactions
3. Triage → 5-Whys AI conversation, auto-tracked why levels 1–5
4. Resolution → Review and confirm each root cause and action
5. RCA → Auto-generated RCA document, export as MD or PDF
6. Post-Mortem → Blameless post-mortem document with action items
```
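The workflow above is strictly linear, so it can be modelled as an ordered list of stages plus a capped why-level counter. This is a sketch only: the names `WORKFLOW_STAGES`, `nextStage`, and `advanceWhyLevel` are hypothetical, and the README does not describe how the app detects why levels.

```typescript
// Hypothetical model of the six-step workflow and the 1-5 why-level range.
const WORKFLOW_STAGES = [
  "New Issue",
  "Log Upload",
  "Triage",
  "Resolution",
  "RCA",
  "Post-Mortem",
] as const;
type WorkflowStage = (typeof WORKFLOW_STAGES)[number];

// Returns the stage after `current`, or null when the workflow is finished.
function nextStage(current: WorkflowStage): WorkflowStage | null {
  const idx = WORKFLOW_STAGES.indexOf(current);
  return WORKFLOW_STAGES[idx + 1] ?? null;
}

const MAX_WHY_LEVEL = 5;

// Moves to the next "why", never going past level 5.
function advanceWhyLevel(level: number): number {
  return Math.min(level + 1, MAX_WHY_LEVEL);
}

console.log(nextStage("Triage")); // "Resolution"
console.log(advanceWhyLevel(5));  // 5
```
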
---
## Project Structure
```
tftsr/
├── src-tauri/src/
│   ├── ai/             # AI provider clients (OpenAI, Anthropic, Gemini, Mistral, Ollama)
│   ├── pii/            # PII detection + redaction engine
│   ├── db/             # SQLCipher connection, migrations, models
│   ├── ollama/         # Hardware detection, model recommendations, download manager
│   ├── docs/           # RCA + post-mortem generators, PDF/MD exporters
│   ├── integrations/   # Confluence, ServiceNow, Azure DevOps (v0.2 stubs)
│   ├── audit/          # Audit log writer
│   ├── commands/       # Tauri IPC command handlers
│   ├── lib.rs          # App builder, plugin registration, command handler registration
│   └── state.rs        # AppState (DB connection, settings)
├── src/
│   ├── pages/          # Dashboard, NewIssue, LogUpload, Triage, Resolution, RCA, Postmortem, History, Settings
│   ├── components/     # ChatWindow, TriageProgress, PiiDiffViewer, DocEditor, HardwareReport, ModelSelector, UI
│   ├── stores/         # sessionStore, settingsStore (persisted), historyStore
│   ├── lib/            # tauriCommands.ts (typed IPC wrappers), domainPrompts.ts
│   └── styles/         # Tailwind + CSS custom properties
├── tests/
│   ├── unit/           # Vitest unit tests (PII, session store, settings store)
│   └── e2e/            # WebdriverIO + tauri-driver E2E skeletons
├── docs/wiki/          # Source of truth for GitHub wiki
└── .github/
    └── workflows/
        ├── test.yml          # CI: rustfmt · clippy · cargo test · tsc · vitest (every push/PR)
        ├── release.yml       # Auto tag + release: linux/amd64 + linux/arm64 + windows/amd64 + macOS ARM64 + macOS Intel
        └── build-images.yml  # Build and push pre-baked CI images to ghcr.io
```
---
## Testing
```bash
# Unit tests (Vitest) — 13/13 passing
npm run test:run
# Frontend coverage
npm run test:coverage
# TypeScript type check
npx tsc --noEmit
# Rust checks — 64/64 tests passing
cargo check --manifest-path src-tauri/Cargo.toml
cargo test --manifest-path src-tauri/Cargo.toml
# E2E tests (requires compiled app binary)
TAURI_BINARY_PATH=./src-tauri/target/release/tftsr npm run test:e2e
```
---
## CI/CD — GitHub Actions
The project uses **GitHub Actions** with pre-baked builder images hosted on `ghcr.io/msicie/`.
| Workflow | Trigger | Jobs |
|---|---|---|
| `.github/workflows/test.yml` | Every push / PR targeting `main` | `rust-test` (fmt · clippy · cargo test) · `frontend-test` (tsc · vitest) |
| `.github/workflows/release.yml` | Push to `main` (auto-tag), then `v*` tags | Auto-tag, build linux/amd64 + linux/arm64 + windows/amd64 + macOS ARM64 + macOS Intel, upload to GitHub Releases |
| `.github/workflows/build-images.yml` | Changes to `.docker/**` on `main` | Build and push pre-baked CI images to `ghcr.io/msicie/` |
**Pre-baked CI images:**
| Image | Purpose |
|---|---|
| `ghcr.io/msicie/trcaa-linux-amd64:rust1.88-node22` | Test pipeline + linux/amd64 + windows cross-compile |
| `ghcr.io/msicie/trcaa-linux-arm64:rust1.88-node22` | linux/arm64 release builds |
| `ghcr.io/msicie/trcaa-windows-cross:rust1.88-node22` | Windows amd64 cross-compile |
**Branch protection:** `main` requires a PR with `rust-test` + `frontend-test` + CODEOWNER review before merge.
> See [CI/CD Pipeline wiki](https://github.com/msicie/apollo_nxt-trcaa/wiki/CICD-Pipeline) for full infrastructure docs.
---
## Security
| Concern | Implementation |
|---|---|
| API keys / tokens | AES-256-GCM encrypted at rest (backend), not persisted in browser storage |
| Database at rest | SQLCipher AES-256; key derived via PBKDF2 |
| PII before AI send | Rust-side detection + mandatory user approval in UI |
| Audit trail | Hash-chained audit entries (`prev_hash` + `entry_hash`) for tamper evidence |
| Network | `reqwest` with TLS; HTTP blocked by Tauri capability config |
| Capabilities | Least-privilege: scoped fs access, no arbitrary shell by default |
| CSP | Strict CSP in `tauri.conf.json`; no inline scripts |
| Telemetry | None — zero analytics, crash reporting, or usage tracking |
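
The hash-chained audit trail in the table above can be illustrated with a short sketch. Only the `prev_hash` and `entry_hash` field names come from this README. The `payload` field, the all-zero genesis value, and hashing `prev_hash + payload` are assumptions made for illustration, and the code uses Node's `node:crypto` module rather than the app's Rust backend.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry layout; only prev_hash and entry_hash are named in the README.
interface AuditEntry {
  payload: string;
  prev_hash: string;
  entry_hash: string;
}

// Assumed placeholder used as prev_hash for the very first entry.
const GENESIS_HASH = "0".repeat(64);

function sha256Hex(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Returns a new log with one entry appended and linked to the previous one.
function appendEntry(log: AuditEntry[], payload: string): AuditEntry[] {
  const last = log[log.length - 1];
  const prev_hash = last ? last.entry_hash : GENESIS_HASH;
  const entry_hash = sha256Hex(prev_hash + payload);
  return [...log, { payload, prev_hash, entry_hash }];
}

// Recomputes every hash; any edited, removed, or reordered entry breaks the chain.
function verifyChain(log: AuditEntry[]): boolean {
  let expectedPrev = GENESIS_HASH;
  for (const entry of log) {
    if (entry.prev_hash !== expectedPrev) return false;
    if (entry.entry_hash !== sha256Hex(entry.prev_hash + entry.payload)) return false;
    expectedPrev = entry.entry_hash;
  }
  return true;
}

let auditLog: AuditEntry[] = [];
auditLog = appendEntry(auditLog, "sent sanitized log to provider");
auditLog = appendEntry(auditLog, "sent follow-up question to provider");
console.log(verifyChain(auditLog)); // true
```

Because each `entry_hash` covers the previous entry's hash, changing any earlier payload invalidates every entry after it. That is what makes tampering evident.
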
---
## Database
All data is stored locally in a SQLCipher-encrypted database at:
| OS | Path |
|---|---|
| Linux | `~/.local/share/tftsr/tftsr.db` |
| macOS | `~/Library/Application Support/tftsr/tftsr.db` |
| Windows | `%APPDATA%\tftsr\tftsr.db` |
Override with the `TFTSR_DATA_DIR` environment variable.
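
The lookup described above can be sketched as a pure function. `resolveDbPath` is a hypothetical helper, not the app's Rust code. It assumes that `TFTSR_DATA_DIR` names the directory containing `tftsr.db`, and it does not model other platform conventions such as `XDG_DATA_HOME`.

```typescript
// Hypothetical sketch of the database path lookup from the table above.
type Platform = "linux" | "darwin" | "win32";
type EnvVars = Record<string, string | undefined>;

function defaultDataDir(platform: Platform, env: EnvVars): string {
  switch (platform) {
    case "linux":
      return `${env["HOME"] ?? "~"}/.local/share/tftsr`;
    case "darwin":
      return `${env["HOME"] ?? "~"}/Library/Application Support/tftsr`;
    case "win32":
      return `${env["APPDATA"] ?? "%APPDATA%"}\\tftsr`;
  }
}

function resolveDbPath(platform: Platform, env: EnvVars): string {
  // A non-empty TFTSR_DATA_DIR takes precedence over the platform default.
  const dir = env["TFTSR_DATA_DIR"] || defaultDataDir(platform, env);
  const sep = platform === "win32" ? "\\" : "/";
  return `${dir}${sep}tftsr.db`;
}

console.log(resolveDbPath("linux", { HOME: "/home/alice" }));
// "/home/alice/.local/share/tftsr/tftsr.db"
console.log(resolveDbPath("linux", { HOME: "/home/alice", TFTSR_DATA_DIR: "/srv/tftsr" }));
// "/srv/tftsr/tftsr.db"
```
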
---
## Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| `TFTSR_DATA_DIR` | Platform data dir | Override database location |
| `TFTSR_DB_KEY` | _(none)_ | Database encryption key (required in release builds) |
| `TFTSR_ENCRYPTION_KEY` | _(none)_ | Credential encryption key (required in release builds) |
| `RUST_LOG` | `info` | Tracing log level (`debug`, `info`, `warn`, `error`) |
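
Since two of these variables are required in release builds, a launcher or deployment script may want to check them before starting the app. The sketch below shows one way to do that. `missingReleaseVars` and `effectiveLogLevel` are hypothetical helpers, and how the app itself reacts to a missing key is not described in this README.

```typescript
// Hypothetical pre-flight check based on the table above.
type EnvMap = Record<string, string | undefined>;

const REQUIRED_IN_RELEASE = ["TFTSR_DB_KEY", "TFTSR_ENCRYPTION_KEY"] as const;

// Lists the required variables that are unset or empty.
function missingReleaseVars(env: EnvMap): string[] {
  return REQUIRED_IN_RELEASE.filter((key) => !env[key]);
}

// RUST_LOG falls back to the documented default of "info".
function effectiveLogLevel(env: EnvMap): string {
  return env["RUST_LOG"] ?? "info";
}

console.log(missingReleaseVars({ TFTSR_DB_KEY: "example-key" })); // ["TFTSR_ENCRYPTION_KEY"]
console.log(effectiveLogLevel({})); // "info"
```
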
---
## Implementation Status
| Phase | Description | Status |
|---|---|---|
| 1 | Scaffold & Foundation | ✅ Complete |
| 2 | Security & Database Layer | ✅ Complete |
| 3 | PII Sanitization Engine | ✅ Complete |
| 4 | AI Provider Layer | ✅ Complete |
| 5 | Ollama Integration | ✅ Complete |
| 6 | Log Upload & Analysis | ✅ Complete |
| 7 | 5-Whys Triage Engine | ✅ Complete |
| 8 | RCA & Post-Mortem Generation | ✅ Complete |
| 9 | History & Search | 🔲 Pending |
| 10 | Integrations (Confluence, ServiceNow, ADO) | 🔲 v0.2 |
| 11 | CI/CD Pipeline | ✅ Complete — GitHub Actions, all checks green |
| 12 | Release Packaging | ✅ linux/amd64 · linux/arm64 · windows/amd64 · macOS ARM64 · macOS Intel |
---