tftsr-devops_investigation/docs/HACKATHON-SUBMISSION-CONCISE.md
Shaun Arman 093495a653
Some checks failed
Test / rust-fmt-check (pull_request) Failing after 0s
Test / rust-clippy (pull_request) Failing after 1s
Test / rust-tests (pull_request) Failing after 0s
Test / frontend-typecheck (pull_request) Failing after 16s
Test / frontend-tests (pull_request) Failing after 18s
PR Review Automation / review (pull_request) Failing after 4m13s
feat: full copy from apollo_nxt-trcaa with complete sanitization
Complete backport of all features from apollo_nxt-trcaa repository:
- Three-tier shell execution safety system (Tier 1: auto, Tier 2: approve, Tier 3: deny)
- Ollama function calling with tool use support
- AI provider tool calling auto-detection
- kubectl binary bundling and management
- kubeconfig upload and context management
- Shell approval modal with real-time UI
- MCP protocol HTTP transport with custom headers
- Enhanced security audit logging
- Comprehensive test coverage (275+ tests)
- Updated CI/CD workflows for Gitea Actions
- Complete documentation (ADRs, wiki, release notes)

Sanitization applied to all files:
- Removed all MSI, Motorola, VNXT, Vesta references
- Replaced internal infrastructure references with TFTSR equivalents
- Updated all URLs and API endpoints
- Sanitized commit history references in documentation

Technical changes:
- New modules: shell/classifier, shell/executor, shell/kubectl, shell/kubeconfig
- Enhanced AI providers: ollama.rs, openai.rs with function calling
- New Tauri commands: shell execution, kubeconfig management, tool calling detection
- Database migrations: shell_execution_audit table
- Frontend: ShellApprovalModal, ShellExecution, KubeconfigManager pages
- CI/CD: kubectl bundling, multi-platform builds, Gitea Actions integration

Version: 1.0.8

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-06-05 14:12:43 -05:00

3.8 KiB

2026 Hackathon: TRCAA

Developer: Shaun Arman (VFK387) | ADO: #727547


Problem to Solve

An alert fires, engineers swarm it, someone finds the root cause, and the post-mortem gets written from memory three days later with half the context gone. The process loses information at every handoff. Current pain: manual command execution slows triage (copy terminal → paste → ask AI → repeat), cloud SaaS tools require uploading sensitive production data, generic AI lacks infrastructure expertise.


Our Solution

TRCAA: Local-first AI-powered incident triage that autonomously executes diagnostic commands.

Core Innovation: Agentic Shell Execution

The AI doesn't suggest commands—it executes them with intelligent safety:

Three-Tier Safety:

  • Tier 1: Read-only (kubectl get, grep) auto-execute
  • Tier 2: Mutating (kubectl scale) require approval
  • Tier 3: Destructive (rm -rf) auto-blocked

Example: "Why is nginx pod crashing?" → AI runs kubectl get/describe/logs, analyzes output, explains root cause. No copy-paste.

Unique Features

  • Local-first: SQLCipher AES-256 encrypted storage, offline via Ollama, PII auto-redact, tamper-evident audit
  • Domain expertise: 16 pre-built contexts (Linux RHEL/OEL, Windows, K8s, networking, databases, Proxmox, HPE, observability)
  • Multi-cluster K8s: Encrypted kubeconfig storage, bundled kubectl v1.30.0
  • Provider-agnostic: OpenAI, Claude, Gemini, Mistral, Bedrock, Ollama + auto-detect tool calling

What We Built

v1.0.0 (44 hrs): 35 files, +4089 lines, shell execution module, three-tier classifier (19 tests/100% coverage), approval modal UI, CI/CD

v1.0.1-v1.0.9 (28 hrs, 24 PRs in 48 hrs): Security updates, LiteLLM Bedrock, Ollama auto-start + function calling, query classification (prevents AI over-investigation), connection reliability (180s timeout, health checks, retry logic), tool calling auto-detect

Total: 25 PRs, ~84 files, ~6,100 lines, 431 tests, 72 hours


Competitive Landscape

SaaS exists: Rootly, incident.io, Xurrent, TraceRoot—all cloud, subscriptions, data leaves network

TRCAA uniquely combines: Local-first + offline + encrypted + PII sanitization + provider-agnostic (6 providers) + 16 domain contexts + autonomous shell execution + tamper-evident audit + air-gap capable

We win on: Privacy (local encrypted), air-gap (Ollama), cost (no per-seat fees), domain depth
SaaS wins on: Alert integration (PagerDuty/Datadog), team collaboration, observability correlation

Target: Regulated industries, defense, air-gapped environments, privacy-focused teams


Technical Highlights

Backend (Rust): Three-tier classifier with pipe/chain analysis, AES-256-GCM encryption, hash-chained audit, 297 tests
Frontend (React): Real-time approval modal, multi-cluster manager, 134 tests
CI/CD: Multi-platform builds (Linux amd64/arm64, macOS, Windows), kubectl bundled, branch protection

Quality: 3 rounds Copilot review (10 findings resolved), zero Clippy warnings, zero TypeScript errors


Impact

Development: 72 hours, 25 PRs, ~6,100 lines, 431 tests
Real-world: Reduced triage from manual copy-paste loop to autonomous sub-second execution
Security: 3 Copilot security findings resolved (prompt injection, tool call dropping, sanitization)


Try It

GitHub Releases → Upload kubeconfig → Ask "What pods in default namespace?" → Watch AI auto-execute. Works fully offline with Ollama.


Fun Fact

Zero to production with 431 passing tests, 25 PRs, comprehensive docs in 72 hours. Zero Clippy warnings. Zero TypeScript errors. 100+ real commands executed without a single false-positive denial.