ADR-004: Regex + Aho-Corasick for PII Detection

Status: Accepted
Date: 2025-Q3
Deciders: sarman


Context

Log files submitted for AI analysis may contain sensitive data: IP addresses, emails, bearer tokens, passwords, SSNs, credit card numbers, MAC addresses, phone numbers, and API keys. This data must be detected and redacted before any content leaves the machine via an AI API call.

Requirements:

  • Fast scanning of files up to 50MB
  • Multiple pattern types with different regex complexity
  • Non-overlapping spans (longest match wins on overlap)
  • User-controlled toggle per pattern type
  • Byte-offset tracking for accurate replacement

Decision

Use the Rust regex crate for per-pattern matching, combined with the aho-corasick crate for multi-pattern literal string searching. Detection runs entirely in the Rust backend on the raw log content.


Rationale

Alternatives considered:

| Option | Pros | Cons |
|---|---|---|
| regex + aho-corasick (chosen) | Fast, Rust-native, no external deps, byte-offset accurate | Regex patterns need careful tuning; false positives possible |
| ML-based NER (spaCy, Presidio) | Higher recall for contextual PII | Requires Python runtime, large model files, not offline-friendly |
| Simple string matching | Extremely fast | Too many false negatives on varied formats |
| WASM-based detection | Runs in browser | Slower; log content in JS memory before Rust sees it |

Implementation approach:

  1. 12 regex patterns, each compiled once via lazy_static! (compilation happens on first use and the compiled regex is reused afterwards)
  2. Each pattern returns (start, end, replacement) tuples
  3. All spans from all patterns collected into a flat Vec<PiiSpan>
  4. Spans sorted by start offset
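  5. Overlap resolution: iterate through the sorted spans and skip any span whose start is before the end of the last kept span. This is greedy by start offset: the earliest-starting span wins, and the longest match wins among spans sharing a start offset only if ties are ordered longest first.
  6. Each span is stored in the DB under a UUID and carries an approved flag that is set when the user confirms the redaction
  7. Redaction applies spans in reverse offset order (last span first), so replacing one span does not shift the byte offsets of the spans still to be applied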
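
Steps 3 through 7 can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the PiiSpan field names are assumed from the (start, end, replacement) tuple in step 2, the longest-first tie-break on equal start offsets is an assumption, and the sample spans are written by hand instead of coming from the regex patterns.

```rust
/// Hypothetical span type; field names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
struct PiiSpan {
    start: usize, // byte offset, inclusive
    end: usize,   // byte offset, exclusive
    replacement: String,
}

/// Sort by start offset (longer span first on ties, an assumed tie-break),
/// then drop every span that starts before the end of the last kept span.
fn resolve_overlaps(mut spans: Vec<PiiSpan>) -> Vec<PiiSpan> {
    spans.sort_by(|a, b| a.start.cmp(&b.start).then(b.end.cmp(&a.end)));
    let mut kept: Vec<PiiSpan> = Vec::new();
    let mut current_end = 0usize;
    for span in spans {
        if span.start < current_end {
            continue; // overlaps the previously kept span
        }
        current_end = span.end;
        kept.push(span);
    }
    kept
}

/// Apply spans from last to first so earlier byte offsets stay valid.
/// Offsets must fall on UTF-8 character boundaries, or replace_range panics.
fn redact(text: &str, spans: &[PiiSpan]) -> String {
    let mut out = text.to_string();
    for span in spans.iter().rev() {
        out.replace_range(span.start..span.end, &span.replacement);
    }
    out
}

fn main() {
    let log = "user=bob@example.com ip=10.0.0.1";
    let spans = vec![
        PiiSpan { start: 9, end: 20, replacement: "[HOSTNAME]".into() },
        PiiSpan { start: 5, end: 20, replacement: "[EMAIL]".into() },
        PiiSpan { start: 24, end: 32, replacement: "[IPV4]".into() },
    ];
    let kept = resolve_overlaps(spans);
    println!("{}", redact(log, &kept));
}
```

In this example the hostname span is dropped because it starts inside the email span that was kept first, which is the greedy behaviour described in step 5.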

Why aho-corasick for some patterns: Literal string searches (e.g., password=, api_key=, bearer ) are faster with Aho-Corasick multi-pattern matching than running individual regexes. The regex then validates the captured value portion.
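
The two-stage flow (find literal keywords first, then validate the value that follows) can be illustrated as below. To keep the sketch free of external crates, it swaps in the standard library's str::match_indices for the Aho-Corasick pass and a simple length check for the validating regex; both substitutions, the delimiter set, and the minimum length of 4 are assumptions made for illustration only.

```rust
/// Literal keywords taken from the examples in this ADR.
const KEYWORDS: [&str; 3] = ["password=", "api_key=", "bearer "];

/// Stage 1: locate every keyword occurrence (stand-in for one Aho-Corasick pass).
/// Stage 2: take the value after the keyword and validate it (stand-in for the regex).
/// Returns (start, end) byte offsets of each accepted value, sorted by start.
fn find_secret_values(text: &str) -> Vec<(usize, usize)> {
    // ASCII lowercasing changes no byte lengths, so offsets map back to `text`.
    let lower = text.to_ascii_lowercase();
    let mut hits = Vec::new();
    for kw in KEYWORDS.iter() {
        for (pos, _) in lower.match_indices(*kw) {
            let start = pos + kw.len();
            // The value runs until whitespace or an assumed delimiter.
            let len = text[start..]
                .find(|c: char| c.is_whitespace() || c == '&' || c == ';')
                .unwrap_or(text.len() - start);
            if len >= 4 {
                // Hypothetical validation rule: ignore very short values.
                hits.push((start, start + len));
            }
        }
    }
    hits.sort();
    hits
}

fn main() {
    let text = "password=secret123 api_key=ab Authorization: Bearer eyJhbGc";
    for (start, end) in find_secret_values(text) {
        println!("{}..{} -> {}", start, end, &text[start..end]);
    }
}
```

A real Aho-Corasick automaton finds all keywords in a single pass over the input, whereas this stand-in scans once per keyword; the point here is only the split between cheap literal location and stricter value validation.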


Patterns

| Pattern ID | Type | Example Match |
|---|---|---|
| url_credentials | URL with embedded credentials | https://user:pass@host |
| bearer_token | Authorization headers | Bearer eyJhbGc... |
| api_key | API key assignments | api_key=sk-abc123... |
| password | Password assignments | password=secret123 |
| ssn | Social Security Numbers | 123-45-6789 |
| credit_card | Credit card numbers | 4111 1111 1111 1111 |
| email | Email addresses | user@example.com |
| mac_address | MAC addresses | AA:BB:CC:DD:EE:FF |
| ipv6 | IPv6 addresses | 2001:db8::1 |
| ipv4 | IPv4 addresses | 192.168.1.1 |
| phone | Phone numbers | +1 (555) 123-4567 |
| hostname | FQDNs | db-prod.internal.example.com |

Consequences

Positive:

  • No runtime dependencies — detection works fully offline
  • 50MB file scanned in <500ms on modern hardware
  • Patterns independently togglable via pii_enabled_patterns in settings
  • Byte-accurate offsets enable precise redaction without re-parsing
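
As a rough illustration of per-pattern toggling, the sketch below filters the pattern IDs from the Patterns table against a set of enabled IDs. How pii_enabled_patterns is actually stored and read is not described here, so the HashSet<String> representation and the active_patterns helper are assumptions.

```rust
use std::collections::HashSet;

/// Pattern IDs as listed in the Patterns table.
const PATTERN_IDS: [&str; 12] = [
    "url_credentials", "bearer_token", "api_key", "password",
    "ssn", "credit_card", "email", "mac_address",
    "ipv6", "ipv4", "phone", "hostname",
];

/// Hypothetical helper: returns the IDs of the patterns that should run,
/// given the set of IDs the user has enabled in settings.
fn active_patterns(enabled: &HashSet<String>) -> Vec<&'static str> {
    PATTERN_IDS
        .iter()
        .copied()
        .filter(|id| enabled.contains(*id))
        .collect()
}

fn main() {
    let enabled: HashSet<String> =
        ["ipv4", "email"].iter().map(|s| s.to_string()).collect();
    // Only the enabled patterns are returned; unknown or disabled IDs are skipped.
    println!("{:?}", active_patterns(&enabled));
}
```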

Negative:

  • Regex-based detection has false positives (e.g., version strings matching IPv4 patterns)
  • User must review and approve — not fully automatic (mitigated by UX design)
  • Pattern maintenance required as new credential formats emerge
  • No contextual understanding (a password in a comment vs an active credential look identical)
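
The IPv4 false positive is easy to reproduce. The sketch below uses the standard library's Ipv4Addr parser in place of the project's regex (an assumption made to keep the example dependency-free) to show that a four-part version string is syntactically indistinguishable from an address.

```rust
use std::net::Ipv4Addr;

/// Purely syntactic check: does the token parse as a dotted-quad IPv4 address?
fn looks_like_ipv4(token: &str) -> bool {
    token.parse::<Ipv4Addr>().is_ok()
}

fn main() {
    // A real address and a four-part version string both pass the check.
    println!("192.168.1.1 -> {}", looks_like_ipv4("192.168.1.1"));
    println!("2.4.1.0     -> {}", looks_like_ipv4("2.4.1.0"));
    // A three-part version string does not.
    println!("1.2.3       -> {}", looks_like_ipv4("1.2.3"));
}
```

Without surrounding context such as a "version" label, a syntactic matcher cannot separate the two cases, which is why the user review step matters.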

User safeguard: All redactions require user approval via PiiDiffViewer before the redacted log is written. The original is never sent to AI.