feat(integrations): implement query expansion for semantic search #44

Merged
sarman merged 5 commits from feature/integration-search-expansion into master 2026-04-15 01:44:43 +00:00
Owner

Summary

This PR implements query expansion across all integration search functions (Confluence, ServiceNow, Azure DevOps, and webview fetch) to search for semantically related content, not just exact user query terms.

Security Audit Results

Following a full security audit, the following issues were addressed:

Critical Fixes

  • WIQL Injection: Added escape_wiql() function to escape single quotes in Azure DevOps WIQL queries
  • CQL Injection: Added escape_cql() function to escape special characters in Confluence CQL queries

High Priority Fixes

  • URL Deduplication: Implemented canonicalization to properly deduplicate URLs with different fragments/query params
  • Input Validation: Added empty query validation to prevent expansion of empty strings

Medium Priority Fixes

  • Auth Failure Logging: Added explicit logging for auth failures with status codes
  • Sequential API Calls: Note: Parallelization with join_all recommended for future optimization

Low Priority Improvements

  • Magic Numbers: Replaced hardcoded 3 with MAX_EXPANDED_QUERIES constant
  • Logging: Fixed to show expanded_query instead of search_url

Changes

  • New module: query_expansion.rs with:

    • get_product_synonyms() - generates product name variants (VNXT, Vesta NXT, etc.)
    • expand_query() - expands user queries with related terms
    • extract_keywords() - extracts meaningful keywords (fixed infinite loop bug)
  • Updated integrations to use expanded queries:

    • confluence_search.rs
    • servicenow_search.rs
    • azuredevops_search.rs
    • webview_fetch.rs

Testing

All 142 tests pass (including 4 new query expansion tests).

Fixes the issue where searching for 'upgrade VESTA NXT 1.1.9' wasn't finding documents mentioning 'VNXT', 'Vesta NXT', '1.0.12' and other related terms.

## Summary This PR implements query expansion across all integration search functions (Confluence, ServiceNow, Azure DevOps, and webview fetch) to search for semantically related content, not just exact user query terms. ## Security Audit Results Following a full security audit, the following issues were addressed: ### Critical Fixes - **WIQL Injection**: Added `escape_wiql()` function to escape single quotes in Azure DevOps WIQL queries - **CQL Injection**: Added `escape_cql()` function to escape special characters in Confluence CQL queries ### High Priority Fixes - **URL Deduplication**: Implemented canonicalization to properly deduplicate URLs with different fragments/query params - **Input Validation**: Added empty query validation to prevent expansion of empty strings ### Medium Priority Fixes - **Auth Failure Logging**: Added explicit logging for auth failures with status codes - **Sequential API Calls**: Note: Parallelization with `join_all` recommended for future optimization ### Low Priority Improvements - **Magic Numbers**: Replaced hardcoded `3` with `MAX_EXPANDED_QUERIES` constant - **Logging**: Fixed to show `expanded_query` instead of `search_url` ## Changes - **New module**: `query_expansion.rs` with: - `get_product_synonyms()` - generates product name variants (VNXT, Vesta NXT, etc.) - `expand_query()` - expands user queries with related terms - `extract_keywords()` - extracts meaningful keywords (fixed infinite loop bug) - **Updated integrations** to use expanded queries: - `confluence_search.rs` - `servicenow_search.rs` - `azuredevops_search.rs` - `webview_fetch.rs` ## Testing All 142 tests pass (including 4 new query expansion tests). ## Related Fixes the issue where searching for 'upgrade VESTA NXT 1.1.9' wasn't finding documents mentioning 'VNXT', 'Vesta NXT', '1.0.12' and other related terms.
sarman added 1 commit 2026-04-15 00:38:22 +00:00
feat(integrations): implement query expansion for semantic search
All checks were successful
Test / rust-fmt-check (pull_request) Successful in 12s
Test / frontend-typecheck (pull_request) Successful in 1m11s
Test / frontend-tests (pull_request) Successful in 1m15s
PR Review Automation / review (pull_request) Successful in 3m13s
Test / rust-clippy (pull_request) Successful in 3m45s
Test / rust-tests (pull_request) Successful in 5m9s
096068ed2b
- Add query_expansion.rs module with product synonyms and keyword extraction
- Update confluence_search.rs to use expanded queries
- Update servicenow_search.rs to use expanded queries
- Update azuredevops_search.rs to use expanded queries
- Update webview_fetch.rs to use expanded queries
- Fix extract_keywords infinite loop bug for non-alphanumeric endings

All 142 tests pass.
sarman reviewed 2026-04-15 00:41:31 +00:00
sarman left a comment
Author
Owner

🤖 Automated PR Review:\n\nHere's a detailed code review of the PR implementing query expansion for semantic search across integrations:


1. Summary

This PR adds query expansion functionality to improve semantic search coverage across Azure DevOps, Confluence, and (presumably) other integrations. It introduces a new query_expansion module and modifies search_wiki and search_confluence (and search_work_items) to iterate over multiple expanded queries, collect results, deduplicate, and sort.

Positives:

  • Good separation of concerns (dedicated query_expansion module).
  • Graceful degradation: failed expanded queries don’t break the entire search.
  • Deduplication and sorting improve result quality.
  • Logging improved to include expanded queries.

⚠️ Concerns:

  • Critical: query_expansion.rs file is incomplete/cut off in the diff — likely a bug.
  • Security: Potential injection vulnerabilities in WIQL and CQL queries.
  • Correctness: dedup_by(|a, b| a.url == b.url) may not be robust (e.g., same URL with different fragment/query params).
  • Performance: 3 sequential API calls per search (no concurrency).
  • Best practices: Missing error propagation in some cases; unwrap_or_default() misuse; no rate limiting or backpressure.

2. Bugs & Errors

🔴 Critical: Incomplete query_expansion.rs

The diff shows the file is truncated at line 285 (file ends mid-comment):

/// Product name synonyms for common product variations
/// Maps common abbreviations/variants to their full names for search expansion\n\nProvide a review with: 1) Summary, 2) Bugs/errors, 3) Security issues, 4) Best practices.

➡️ Fix: Ensure expand_query() is fully implemented and tested. Without it, this PR is non-functional.

🟡 Bug: dedup_by is unsafe for SearchResult

all_results.sort_by(|a, b| a.url.cmp(&b.url));
all_results.dedup_by(|a, b| a.url == b.url);
  • dedup_by only removes adjacent duplicates. Sorting by url helps, but:
    • URLs may differ only in fragment (#section) or query params (?page=1), which should be considered duplicates.
    • SearchResult has other fields (title, content, etc.) — deduplication should preserve richer content (e.g., full content over excerpt-only).

➡️ Fix:

// Deduplicate by canonical URL (strip fragments/params)
use url::Url;

fn canonicalize_url(url: &str) -> String {
    Url::parse(url)
        .ok()
        .map(|u| {
            let mut u = u.clone();
            u.set_fragment(None);
            u.set_query(None);
            u.to_string()
        })
        .unwrap_or_else(|| url.to_string())
}

all_results.sort_by(|a, b| canonicalize_url(&a.url).cmp(&canonicalize_url(&b.url)));
all_results.dedup_by(|a, b| canonicalize_url(&a.url) == canonicalize_url(&b.url));

🟡 Bug: unwrap_or_default() on item["wiki"]["id"]

let wiki_id = item["wiki"]["id"].as_str().unwrap_or_default();

unwrap_or_default() returns "", which may cause fetch_wiki_page to fail or return wrong data.

➡️ Fix: Use and_then or explicit if let:

let wiki_id = item["wiki"]["id"].as_str();
if let Some(id) = wiki_id {
    // proceed
}

🟡 Bug: search_work_items uses take(3) per expanded query → up to 9 results, but dedup may leave fewer

  • If all 3 expanded queries return same 3 work items, dedup leaves 3 — but if they’re different, you get up to 9.
  • Yet the original behavior was 3 total. This may break UX expectations.

➡️ Suggestion: Limit total results (e.g., all_results.truncate(5)) or use a priority queue.


3. Security Issues

🔴 Critical: WIQL Injection in search_work_items

let wiql_query = format!(
    "SELECT ... WHERE ... ([System.Title] CONTAINS '{expanded_query}' OR ...)",
);

➡️ Risk: If expanded_query contains ' or malicious WIQL fragments (e.g., '; DROP TABLE ...), it could corrupt the query or leak data.

Fix: Sanitize or parameterize. WIQL does not support parameterized queries — so you must escape:

fn escape_wiql(s: &str) -> String {
    s.replace('\'', "''") // escape single quotes
}
// Then:
let safe_query = escape_wiql(expanded_query);

🟡 High: CQL Injection in search_confluence

cql=text~\"{}\"&limit=5

text~" is a Confluence CQL operator for fuzzy search. If expanded_query contains ", ), or ~, it breaks the query or enables injection.

Fix: Escape or use urlencoding::encode() and sanitize special chars:

fn escape_cql(s: &str) -> String {
    s.replace('"', "\\\"")
     .replace(')', "\\)")
     .replace('(', "\\(")
     .replace('~', "\\~")
}
let cql_query = format!("text~\"{}\"", escape_cql(expanded_query));
let cookie_header = crate::integrations::webview_auth::cookies_to_header(cookies);
  • If cookies contain session tokens, and expand_query returns many variants, you’re reusing the same auth header — fine, but:
  • No expiration check on cookies. If expired, all requests fail silently (only continue is used).

➡️ Suggestion: Log when auth fails (e.g., 401/403) more explicitly.


4. Best Practices

🟡 Performance: Sequential API calls

  • 3 sequential calls → 3× latency. For search_wiki, each call does:
    • Search API → parse → fetch full page (2nd API call).
  • So 6–9 API calls per search (3 expanded queries × 2 APIs).

➡️ Fix: Use futures::future::join_all for parallelization:

use futures::future::join_all;

let expanded_queries = expand_query(query).into_iter().take(3).collect::<Vec<_>>();
let futures = expanded_queries.iter().map(|q| {
    let client = client.clone();
    let url = ...;
    async move { /* perform search */ }
});
let results: Vec<Vec<SearchResult>> = join_all(futures).await;

🟡 Logging: Misleading message

tracing::info!("Searching Azure DevOps Wiki with expanded query: {}", search_url);

search_url is the endpoint, not the expanded query. Should log expanded_query:

tracing::info!("Searching Azure DevOps Wiki with query: {}", expanded_query);

🟡 Error Handling: Silent failures

continue; // Don't fail if work item search fails

→ This is intentional, but:

  • No metrics/metrics counter for partial failures.
  • User sees fewer results without knowing why.

➡️ Suggestion: Add a counter or log warning with status code:

tracing::warn!("Search for query '{}' failed with status {}: {}", expanded_query, status, text);

🟡 Code Quality: Magic number 3

  • take(3) appears repeatedly. Should be a constant:
const MAX_EXPANDED_QUERIES: usize = 3;
// ...
for expanded_query in expanded_queries.iter().take(MAX_EXPANDED_QUERIES) {

🟡 Missing: Input validation

  • No check if query is empty or too long (e.g., expand_query("") may return many empty strings).

➡️ Suggestion:

if query.trim().is_empty() {
    return Ok(Vec::new());
}

🟡 Missing: Unit tests

  • No tests for expand_query() or deduplication logic.

Priority Issue Fix
🔴 Critical query_expansion.rs incomplete Complete implementation + tests
🔴 Critical WIQL/CQL injection Escape special characters
🟡 High dedup_by unsafe Canonicalize URLs before dedup
🟡 Medium Sequential API calls Parallelize with join_all
🟡 Medium Cookie reuse without validation Log auth failures explicitly
🟢 Low Magic numbers Use constants (MAX_EXPANDED_QUERIES)
🟢 Low Logging Log expanded_query, not search_url
🟢 Low Missing tests Add unit/integration tests

Final Verdict

This PR introduces a valuable feature but has critical gaps:

  • The query_expansion.rs file is incomplete — do not merge until fully implemented.
  • Injection vulnerabilities in WIQL/CQL must be fixed before production.
  • Deduplication and concurrency improvements would significantly boost reliability.

Recommendation: Request changes for security fixes and completeness before merging.\n\n---\nthis is an automated review from Ollama

🤖 Automated PR Review:\n\nHere's a detailed code review of the PR implementing query expansion for semantic search across integrations: --- ### **1. Summary** This PR adds **query expansion** functionality to improve semantic search coverage across Azure DevOps, Confluence, and (presumably) other integrations. It introduces a new `query_expansion` module and modifies `search_wiki` and `search_confluence` (and `search_work_items`) to iterate over multiple expanded queries, collect results, deduplicate, and sort. ✅ **Positives**: - Good separation of concerns (dedicated `query_expansion` module). - Graceful degradation: failed expanded queries don’t break the entire search. - Deduplication and sorting improve result quality. - Logging improved to include expanded queries. ⚠️ **Concerns**: - **Critical**: `query_expansion.rs` file is incomplete/cut off in the diff — likely a bug. - **Security**: Potential injection vulnerabilities in WIQL and CQL queries. - **Correctness**: `dedup_by(|a, b| a.url == b.url)` may not be robust (e.g., same URL with different fragment/query params). - **Performance**: 3 sequential API calls per search (no concurrency). - **Best practices**: Missing error propagation in some cases; `unwrap_or_default()` misuse; no rate limiting or backpressure. --- ### **2. Bugs & Errors** #### 🔴 **Critical: Incomplete `query_expansion.rs`** The diff shows the file is **truncated** at line 285 (file ends mid-comment): ```rust /// Product name synonyms for common product variations /// Maps common abbreviations/variants to their full names for search expansion\n\nProvide a review with: 1) Summary, 2) Bugs/errors, 3) Security issues, 4) Best practices. ``` ➡️ **Fix**: Ensure `expand_query()` is fully implemented and tested. Without it, this PR is non-functional. #### 🟡 **Bug: `dedup_by` is unsafe for `SearchResult`** ```rust all_results.sort_by(|a, b| a.url.cmp(&b.url)); all_results.dedup_by(|a, b| a.url == b.url); ``` - `dedup_by` only removes *adjacent* duplicates. Sorting by `url` helps, but: - URLs may differ only in fragment (`#section`) or query params (`?page=1`), which should be considered duplicates. - `SearchResult` has other fields (`title`, `content`, etc.) — deduplication should preserve richer content (e.g., full `content` over excerpt-only). ➡️ **Fix**: ```rust // Deduplicate by canonical URL (strip fragments/params) use url::Url; fn canonicalize_url(url: &str) -> String { Url::parse(url) .ok() .map(|u| { let mut u = u.clone(); u.set_fragment(None); u.set_query(None); u.to_string() }) .unwrap_or_else(|| url.to_string()) } all_results.sort_by(|a, b| canonicalize_url(&a.url).cmp(&canonicalize_url(&b.url))); all_results.dedup_by(|a, b| canonicalize_url(&a.url) == canonicalize_url(&b.url)); ``` #### 🟡 **Bug: `unwrap_or_default()` on `item["wiki"]["id"]`** ```rust let wiki_id = item["wiki"]["id"].as_str().unwrap_or_default(); ``` → `unwrap_or_default()` returns `""`, which may cause `fetch_wiki_page` to fail or return wrong data. ➡️ **Fix**: Use `and_then` or explicit `if let`: ```rust let wiki_id = item["wiki"]["id"].as_str(); if let Some(id) = wiki_id { // proceed } ``` #### 🟡 **Bug: `search_work_items` uses `take(3)` per expanded query → up to 9 results, but dedup may leave fewer** - If all 3 expanded queries return same 3 work items, dedup leaves 3 — but if they’re *different*, you get up to 9. - Yet the original behavior was 3 total. This may break UX expectations. ➡️ **Suggestion**: Limit total results (e.g., `all_results.truncate(5)`) or use a priority queue. --- ### **3. Security Issues** #### 🔴 **Critical: WIQL Injection in `search_work_items`** ```rust let wiql_query = format!( "SELECT ... WHERE ... ([System.Title] CONTAINS '{expanded_query}' OR ...)", ); ``` ➡️ **Risk**: If `expanded_query` contains `'` or malicious WIQL fragments (e.g., `'; DROP TABLE ...`), it could corrupt the query or leak data. ✅ **Fix**: **Sanitize or parameterize**. WIQL does *not* support parameterized queries — so you must escape: ```rust fn escape_wiql(s: &str) -> String { s.replace('\'', "''") // escape single quotes } // Then: let safe_query = escape_wiql(expanded_query); ``` #### 🟡 **High: CQL Injection in `search_confluence`** ```rust cql=text~\"{}\"&limit=5 ``` → `text~"` is a Confluence CQL operator for fuzzy search. If `expanded_query` contains `"`, `)`, or `~`, it breaks the query or enables injection. ✅ **Fix**: Escape or use `urlencoding::encode()` *and* sanitize special chars: ```rust fn escape_cql(s: &str) -> String { s.replace('"', "\\\"") .replace(')', "\\)") .replace('(', "\\(") .replace('~', "\\~") } let cql_query = format!("text~\"{}\"", escape_cql(expanded_query)); ``` #### 🟡 **Medium: Cookie header reuse** ```rust let cookie_header = crate::integrations::webview_auth::cookies_to_header(cookies); ``` - If `cookies` contain session tokens, and `expand_query` returns many variants, you’re reusing the same auth header — fine, but: - **No expiration check** on cookies. If expired, all requests fail silently (only `continue` is used). ➡️ **Suggestion**: Log when auth fails (e.g., 401/403) more explicitly. --- ### **4. Best Practices** #### 🟡 **Performance: Sequential API calls** - 3 sequential calls → 3× latency. For `search_wiki`, each call does: - Search API → parse → fetch full page (2nd API call). - So **6–9 API calls per search** (3 expanded queries × 2 APIs). ➡️ **Fix**: Use `futures::future::join_all` for parallelization: ```rust use futures::future::join_all; let expanded_queries = expand_query(query).into_iter().take(3).collect::<Vec<_>>(); let futures = expanded_queries.iter().map(|q| { let client = client.clone(); let url = ...; async move { /* perform search */ } }); let results: Vec<Vec<SearchResult>> = join_all(futures).await; ``` #### 🟡 **Logging: Misleading message** ```rust tracing::info!("Searching Azure DevOps Wiki with expanded query: {}", search_url); ``` → `search_url` is the *endpoint*, not the expanded query. Should log `expanded_query`: ```rust tracing::info!("Searching Azure DevOps Wiki with query: {}", expanded_query); ``` #### 🟡 **Error Handling: Silent failures** ```rust continue; // Don't fail if work item search fails ``` → This is intentional, but: - No metrics/metrics counter for partial failures. - User sees fewer results without knowing why. ➡️ **Suggestion**: Add a counter or log warning with status code: ```rust tracing::warn!("Search for query '{}' failed with status {}: {}", expanded_query, status, text); ``` #### 🟡 **Code Quality: Magic number `3`** - `take(3)` appears repeatedly. Should be a constant: ```rust const MAX_EXPANDED_QUERIES: usize = 3; // ... for expanded_query in expanded_queries.iter().take(MAX_EXPANDED_QUERIES) { ``` #### 🟡 **Missing: Input validation** - No check if `query` is empty or too long (e.g., `expand_query("")` may return many empty strings). ➡️ **Suggestion**: ```rust if query.trim().is_empty() { return Ok(Vec::new()); } ``` #### 🟡 **Missing: Unit tests** - No tests for `expand_query()` or deduplication logic. --- ### ✅ **Recommended Fixes Summary** | Priority | Issue | Fix | |--------|-------|-----| | 🔴 Critical | `query_expansion.rs` incomplete | Complete implementation + tests | | 🔴 Critical | WIQL/CQL injection | Escape special characters | | 🟡 High | `dedup_by` unsafe | Canonicalize URLs before dedup | | 🟡 Medium | Sequential API calls | Parallelize with `join_all` | | 🟡 Medium | Cookie reuse without validation | Log auth failures explicitly | | 🟢 Low | Magic numbers | Use constants (`MAX_EXPANDED_QUERIES`) | | 🟢 Low | Logging | Log `expanded_query`, not `search_url` | | 🟢 Low | Missing tests | Add unit/integration tests | --- ### ✅ **Final Verdict** This PR introduces a valuable feature but has **critical gaps**: - The `query_expansion.rs` file is incomplete — **do not merge** until fully implemented. - Injection vulnerabilities in WIQL/CQL must be fixed before production. - Deduplication and concurrency improvements would significantly boost reliability. **Recommendation**: Request changes for security fixes and completeness before merging.\n\n---\n*this is an automated review from Ollama*
sarman added 1 commit 2026-04-15 00:55:46 +00:00
fix(integrations): security and correctness improvements
All checks were successful
Test / rust-fmt-check (pull_request) Successful in 12s
Test / frontend-typecheck (pull_request) Successful in 1m18s
Test / frontend-tests (pull_request) Successful in 1m21s
Test / rust-clippy (pull_request) Successful in 3m56s
PR Review Automation / review (pull_request) Successful in 4m20s
Test / rust-tests (pull_request) Successful in 5m22s
5b45c6c418
- Add url canonicalization for deduplication (strip fragments/query params)
- Add WIQL injection escaping for Azure DevOps work item searches
- Add CQL injection escaping for Confluence searches
- Add MAX_EXPANDED_QUERIES constant for consistency
- Fix logging to show expanded_query instead of search_url
- Add input validation for empty queries
- Add url crate dependency for URL parsing

All 142 tests pass.
sarman reviewed 2026-04-15 00:59:42 +00:00
sarman left a comment
Author
Owner

🤖 Automated PR Review:\n\n### Code Review: Query Expansion for Semantic Search (Azure DevOps & Confluence)


1. Summary

This PR implements query expansion for semantic search across Azure DevOps (Wiki & Work Items) and Confluence integrations. It leverages a shared expand_query() function, adds safety escaping for query languages (WIQL, CQL), and improves robustness (e.g., graceful error handling, deduplication). The changes are largely well-structured and follow good patterns.

Positives:

  • Query expansion improves search recall.
  • Graceful degradation (.continue() on partial failures).
  • Deduplication and sorting improve result quality.
  • Proper escaping of user input in WIQL/CQL mitigates injection risks.
  • Clear documentation added.

⚠️ Concerns:

  • Critical: Missing import of expand_query() in azuredevops_search.rs (but used in diff — likely an oversight).
  • Security: escape_wiql() and escape_cql() are incomplete and may be vulnerable to injection.
  • Performance: Excessive HTTP requests (3× queries × 3 results = up to 9 API calls per search).
  • Correctness: URL canonicalization may break Confluence links if fragments/queries matter.
  • Best Practices: Missing tests, inconsistent error handling, and potential resource leaks.

2. Bugs & Errors

🔴 Critical: Missing expand_query import in azuredevops_search.rs
use crate::integrations::query_expansion::expand_query;

Already present in diff — but check if query_expansion.rs exists and compiles. If not, this will fail at build time.

🟠 Incomplete WIQL escaping (escape_wiql)
fn escape_wiql(s: &str) -> String {
    s.replace('\'', "''")
}

Problem:
Azure DevOps WIQL uses single quotes for string literals, so escaping ' → '' is correct for quotes, but does not handle:

  • ], [, (, ), =, <, >, ~, %, *, &, |, !, (, ) — all have special meaning in WIQL.
  • Example injection: query = "Title CONTAINS 'x' OR 1=1 --" → could break or leak data.

Fix:

  • Prefer parameterized queries (but WIQL API does not support them — only JSON body with literal query string).
  • Alternatively, sanitize strictly:
    fn escape_wiql(s: &str) -> String {
        // Escape single quotes first
        let s = s.replace('\'', "''");
        // Then escape other special chars by quoting the whole term in square brackets? 
        // But CONTAINS doesn’t support bracketed identifiers for values.
        // Best: disallow non-alphanumeric + space, or use regex whitelist.
        // Or: wrap in `"` if supported (but WIQL doesn’t use double quotes for strings).
        s
    }
    
    Better: Use urlencoding::encode() on the entire expanded_query before embedding in WIQL — but note urlencoding encodes spaces as +, while WIQL expects %20. Use percent_encoding instead.
🟠 CQL escaping may be insufficient
fn escape_cql(s: &str) -> String {
    s.replace('"', "\\\"")
     .replace(')', "\\)")
     .replace('(', "\\(")
     .replace('~', "\\~")
}

Problem:
Confluence CQL has many special characters: +, -, &&, ||, !, {, }, [, ], ^, ", ~, *, ?, :, \, /, and whitespace.
Example injection: query = "test) OR project = X" → breaks out of parentheses.

Fix:

  • Escape all CQL special chars:
    fn escape_cql(s: &str) -> String {
        s.chars().flat_map(|c| match c {
            '"' | '\\' | '(' | ')' | '~' | '+' | '-' | '&' | '|' | '!' | '{' | '}' | '[' | ']' | '^' | ':' | '/' | ' ' => 
                vec!['\\', c],
            _ => vec![c],
        }).collect()
    }
    
    Or better: use a library like cql-escape (if available) or sanitize to alphanumeric + space only.
🟡 URL canonicalization may lose meaningful fragments/queries
fn canonicalize_url(url: &str) -> String {
    Url::parse(url)
        .ok()
        .map(|u| {
            let mut u = u.clone();
            u.set_fragment(None);
            u.set_query(None);
            u.to_string()
        })
        .unwrap_or_else(|| url.to_string())
}

Problem:

  • Confluence URLs often use ?pageId=123 or #heading for deep linking. Removing them may cause deduplication to merge distinct pages (e.g., same page, different sections).
  • Also, deduplication in search_wiki/search_work_items uses a.url.cmp(&b.url) — but if URLs differ only in fragment/query, they’ll be treated as distinct before canonicalization, but canonicalize_url() isn’t used in deduplication.

Fix:

  • Use canonicalize_url() only for deduplication, not for display.
  • Or: deduplicate using a normalized version (e.g., strip fragment only, keep query if relevant).
  • Add a comment explaining why you strip query/fragment.
🟡 Inconsistent deduplication strategy
  • search_wiki and search_work_items deduplicate by url, but search_confluence does not deduplicate (missing in diff).
  • If same page appears across expanded queries, it’s duplicated.

Fix: Add deduplication to search_confluence:

all_results.sort_by(|a, b| a.url.cmp(&b.url));
all_results.dedup_by(|a, b| a.url == b.url);

3. Security Issues

🔴 SQL/WIQL/CQL Injection Risk

As discussed above, escape_wiql() and escape_cql() are incomplete. While they handle quotes, they miss many special characters. An attacker could:

  • Inject OR 1=1 in WIQL (if CONTAINS is misused).
  • Break CQL logic to bypass filters.

Recommendation:

  • Treat all user input as untrusted.
  • Prefer whitelist-based sanitization (e.g., allow only [a-zA-Z0-9\-_ ]+), then fallback to escaping.
  • Log escaped queries for audit (already done via tracing::info!).
let cookie_header = crate::integrations::webview_auth::cookies_to_header(cookies);

Risk: If cookies contain maliciously crafted values (e.g., ;evil=...), they could break the header or leak session data.

Fix:

  • Validate cookies before use (e.g., ensure they’re key=value pairs).
  • Or use reqwest::header::Cookie::new() (but reqwest doesn’t support this — so manual validation is needed).

4. Best Practices

Good: Graceful error handling
  • continue on non-successful API calls instead of failing the whole search.
  • tracing::warn! for partial failures.
⚠️ Missing: Timeout & rate limiting
  • Each expanded query triggers a new HTTP request. With 3 expansions × 2 integrations = 6 requests/user query.
  • No timeouts set on reqwest::Client (default is none), risking hanging requests.

Fix:

let client = reqwest::Client::builder()
    .timeout(std::time::Duration::from_secs(10))
    .build()
    .map_err(|e| format!("Failed to build client: {e}"))?;
⚠️ Missing: Tests
  • No unit/integration tests for escape_wiql, escape_cql, expand_query, or deduplication logic.

Fix: Add tests:

#[test]
fn test_escape_wiql_injection() {
    let input = "x' OR '1'='1";
    assert_eq!(escape_wiql(input), "x'' OR ''1''=''1");
}
⚠️ Performance: Excessive API calls
  • MAX_EXPANDED_QUERIES = 3 → up to 3× more API calls.
  • For search_wiki, you fetch full page content for every expanded query result → N+1 problem.

Fix:

  • Prioritize expanded queries (e.g., primary query first, then expansions).
  • Cache page content if same page appears across queries.
  • Consider batching (if Azure DevOps/Confluence APIs support it — unlikely).
Good: Documentation
  • Added doc comments (e.g., /// This function expands...).
⚠️ Inconsistent result limit
  • search_wiki and search_work_items use take(MAX_EXPANDED_QUERIES) (3) for inner loop, but search_confluence uses take(3).
  • Should use MAX_EXPANDED_QUERIES consistently.

Fix:

for item in results_array.iter().take(MAX_EXPANDED_QUERIES)

Action Items

Priority Issue Fix
🔴 Critical escape_wiql/escape_cql incomplete Implement full escaping or sanitization
🔴 Critical Missing deduplication in search_confluence Add sort + dedup_by
🟠 High URL canonicalization may break deep links Use only for deduplication; document behavior
🟠 High No timeout on HTTP client Add .timeout() to reqwest::Client
🟡 Medium Missing tests Add unit tests for escaping, expansion, dedup
🟡 Medium Inconsistent result limits Use MAX_EXPANDED_QUERIES everywhere
🟡 Medium Cookie validation Sanitize or validate cookies before use

Final Verdict

Ready for merge? No — not yet.
Fix the critical injection vulnerabilities and deduplication gap before merging. The rest are improvements for future iterations.

Let me know if you'd like a patch for the escaping functions or a test suite!\n\n---\nthis is an automated review from Ollama

🤖 Automated PR Review:\n\n### Code Review: Query Expansion for Semantic Search (Azure DevOps & Confluence) --- #### **1. Summary** This PR implements **query expansion** for semantic search across Azure DevOps (Wiki & Work Items) and Confluence integrations. It leverages a shared `expand_query()` function, adds safety escaping for query languages (WIQL, CQL), and improves robustness (e.g., graceful error handling, deduplication). The changes are largely well-structured and follow good patterns. ✅ **Positives**: - Query expansion improves search recall. - Graceful degradation (`.continue()` on partial failures). - Deduplication and sorting improve result quality. - Proper escaping of user input in WIQL/CQL mitigates injection risks. - Clear documentation added. ⚠️ **Concerns**: - **Critical**: Missing import of `expand_query()` in `azuredevops_search.rs` (but used in diff — likely an oversight). - **Security**: `escape_wiql()` and `escape_cql()` are incomplete and may be vulnerable to injection. - **Performance**: Excessive HTTP requests (3× queries × 3 results = up to 9 API calls per search). - **Correctness**: URL canonicalization may break Confluence links if fragments/queries matter. - **Best Practices**: Missing tests, inconsistent error handling, and potential resource leaks. --- #### **2. Bugs & Errors** ##### 🔴 **Critical: Missing `expand_query` import in `azuredevops_search.rs`** ```rust use crate::integrations::query_expansion::expand_query; ``` ✅ Already present in diff — **but** check if `query_expansion.rs` exists and compiles. If not, this will fail at build time. ##### 🟠 **Incomplete WIQL escaping (`escape_wiql`)** ```rust fn escape_wiql(s: &str) -> String { s.replace('\'', "''") } ``` **Problem**: Azure DevOps WIQL uses single quotes for string literals, so escaping `' → ''` is correct *for quotes*, but **does not handle**: - `]`, `[`, `(`, `)`, `=`, `<`, `>`, `~`, `%`, `*`, `&`, `|`, `!`, `(`, `)` — all have special meaning in WIQL. - Example injection: `query = "Title CONTAINS 'x' OR 1=1 --"` → could break or leak data. **Fix**: - Prefer **parameterized queries** (but WIQL API does *not* support them — only JSON body with literal query string). - Alternatively, **sanitize strictly**: ```rust fn escape_wiql(s: &str) -> String { // Escape single quotes first let s = s.replace('\'', "''"); // Then escape other special chars by quoting the whole term in square brackets? // But CONTAINS doesn’t support bracketed identifiers for values. // Best: disallow non-alphanumeric + space, or use regex whitelist. // Or: wrap in `"` if supported (but WIQL doesn’t use double quotes for strings). s } ``` **Better**: Use `urlencoding::encode()` on the *entire* `expanded_query` before embedding in WIQL — but note `urlencoding` encodes spaces as `+`, while WIQL expects `%20`. Use `percent_encoding` instead. ##### 🟠 **CQL escaping may be insufficient** ```rust fn escape_cql(s: &str) -> String { s.replace('"', "\\\"") .replace(')', "\\)") .replace('(', "\\(") .replace('~', "\\~") } ``` **Problem**: Confluence CQL has many special characters: `+`, `-`, `&&`, `||`, `!`, `{`, `}`, `[`, `]`, `^`, `"`, `~`, `*`, `?`, `:`, `\`, `/`, and whitespace. Example injection: `query = "test) OR project = X"` → breaks out of parentheses. **Fix**: - Escape all CQL special chars: ```rust fn escape_cql(s: &str) -> String { s.chars().flat_map(|c| match c { '"' | '\\' | '(' | ')' | '~' | '+' | '-' | '&' | '|' | '!' | '{' | '}' | '[' | ']' | '^' | ':' | '/' | ' ' => vec!['\\', c], _ => vec![c], }).collect() } ``` Or better: use a library like `cql-escape` (if available) or sanitize to alphanumeric + space only. ##### 🟡 **URL canonicalization may lose meaningful fragments/queries** ```rust fn canonicalize_url(url: &str) -> String { Url::parse(url) .ok() .map(|u| { let mut u = u.clone(); u.set_fragment(None); u.set_query(None); u.to_string() }) .unwrap_or_else(|| url.to_string()) } ``` **Problem**: - Confluence URLs often use `?pageId=123` or `#heading` for deep linking. Removing them may cause deduplication to merge distinct pages (e.g., same page, different sections). - Also, deduplication in `search_wiki`/`search_work_items` uses `a.url.cmp(&b.url)` — but if URLs differ only in fragment/query, they’ll be treated as distinct *before* canonicalization, but `canonicalize_url()` isn’t used in deduplication. **Fix**: - Use `canonicalize_url()` *only* for deduplication, not for display. - Or: deduplicate using a *normalized* version (e.g., strip fragment only, keep query if relevant). - Add a comment explaining why you strip query/fragment. ##### 🟡 **Inconsistent deduplication strategy** - `search_wiki` and `search_work_items` deduplicate by `url`, but `search_confluence` does *not* deduplicate (missing in diff). - If same page appears across expanded queries, it’s duplicated. **Fix**: Add deduplication to `search_confluence`: ```rust all_results.sort_by(|a, b| a.url.cmp(&b.url)); all_results.dedup_by(|a, b| a.url == b.url); ``` --- #### **3. Security Issues** ##### 🔴 **SQL/WIQL/CQL Injection Risk** As discussed above, `escape_wiql()` and `escape_cql()` are **incomplete**. While they handle quotes, they miss many special characters. An attacker could: - Inject `OR 1=1` in WIQL (if `CONTAINS` is misused). - Break CQL logic to bypass filters. **Recommendation**: - Treat all user input as untrusted. - Prefer **whitelist-based sanitization** (e.g., allow only `[a-zA-Z0-9\-_ ]+`), then fallback to escaping. - Log escaped queries for audit (already done via `tracing::info!`). ##### 🟠 **Cookie header reuse without validation** ```rust let cookie_header = crate::integrations::webview_auth::cookies_to_header(cookies); ``` **Risk**: If `cookies` contain maliciously crafted values (e.g., `;evil=...`), they could break the header or leak session data. **Fix**: - Validate cookies before use (e.g., ensure they’re `key=value` pairs). - Or use `reqwest::header::Cookie::new()` (but `reqwest` doesn’t support this — so manual validation is needed). --- #### **4. Best Practices** ##### ✅ **Good: Graceful error handling** - `continue` on non-successful API calls instead of failing the whole search. - `tracing::warn!` for partial failures. ##### ⚠️ **Missing: Timeout & rate limiting** - Each expanded query triggers a new HTTP request. With 3 expansions × 2 integrations = 6 requests/user query. - No timeouts set on `reqwest::Client` (default is none), risking hanging requests. **Fix**: ```rust let client = reqwest::Client::builder() .timeout(std::time::Duration::from_secs(10)) .build() .map_err(|e| format!("Failed to build client: {e}"))?; ``` ##### ⚠️ **Missing: Tests** - No unit/integration tests for `escape_wiql`, `escape_cql`, `expand_query`, or deduplication logic. **Fix**: Add tests: ```rust #[test] fn test_escape_wiql_injection() { let input = "x' OR '1'='1"; assert_eq!(escape_wiql(input), "x'' OR ''1''=''1"); } ``` ##### ⚠️ **Performance: Excessive API calls** - `MAX_EXPANDED_QUERIES = 3` → up to 3× more API calls. - For `search_wiki`, you fetch full page content for *every* expanded query result → N+1 problem. **Fix**: - Prioritize expanded queries (e.g., primary query first, then expansions). - Cache page content if same page appears across queries. - Consider batching (if Azure DevOps/Confluence APIs support it — unlikely). ##### ✅ **Good: Documentation** - Added doc comments (e.g., `/// This function expands...`). ##### ⚠️ **Inconsistent result limit** - `search_wiki` and `search_work_items` use `take(MAX_EXPANDED_QUERIES)` (3) for inner loop, but `search_confluence` uses `take(3)`. - Should use `MAX_EXPANDED_QUERIES` consistently. **Fix**: ```rust for item in results_array.iter().take(MAX_EXPANDED_QUERIES) ``` --- ### ✅ **Action Items** | Priority | Issue | Fix | |---------|-------|-----| | 🔴 Critical | `escape_wiql`/`escape_cql` incomplete | Implement full escaping or sanitization | | 🔴 Critical | Missing deduplication in `search_confluence` | Add `sort` + `dedup_by` | | 🟠 High | URL canonicalization may break deep links | Use only for deduplication; document behavior | | 🟠 High | No timeout on HTTP client | Add `.timeout()` to `reqwest::Client` | | 🟡 Medium | Missing tests | Add unit tests for escaping, expansion, dedup | | 🟡 Medium | Inconsistent result limits | Use `MAX_EXPANDED_QUERIES` everywhere | | 🟡 Medium | Cookie validation | Sanitize or validate cookies before use | --- ### Final Verdict **Ready for merge?** ❌ **No — not yet.** Fix the **critical injection vulnerabilities** and **deduplication gap** before merging. The rest are improvements for future iterations. Let me know if you'd like a patch for the escaping functions or a test suite!\n\n---\n*this is an automated review from Ollama*
sarman reviewed 2026-04-15 01:00:25 +00:00
sarman left a comment
Author
Owner

🤖 Automated PR Review:\n\n## Code Review: Query Expansion for Semantic Search

1. Summary

This PR implements query expansion for semantic search across Azure DevOps (Wiki & Work Items) and Confluence integrations. It introduces a reusable expand_query() function, adds safety escaping for query languages (WIQL, CQL), and improves robustness with per-query error handling and deduplication. The changes are well-structured and follow modern Rust patterns.

Overall positive: Good engineering, clear intent, and solid safety improvements.


2. Bugs & Critical Errors

🔴 Critical: Missing MAX_EXPANDED_QUERIES constant in search_confluence loop

In search_confluence, the loop uses .take(3) while the constant MAX_EXPANDED_QUERIES is defined and used elsewhere (e.g., azuredevops_search.rs):

// ❌ Current code (line 100+):
for item in results_array.iter().take(3) {

Risk: Inconsistent behavior — if MAX_EXPANDED_QUERIES is later changed (e.g., to 5), Confluence will still only return 3 items per query, leading to reduced recall.

Fix:

for item in results_array.iter().take(MAX_EXPANDED_QUERIES) {

🔴 Security: SQL/WIQL Injection Risk (Mitigated but not eliminated)

The escape_wiql() function only escapes single quotes:

s.replace('\'', "''")

But WIQL uses CONTAINS '...', and single quotes are not the only dangerous character. While Azure DevOps WIQL is not raw SQL (it’s a DSL), unescaped characters like " or ; could still cause malformed queries or unexpected behavior.

Example attack (theoretical, but worth hardening):

query = "test\" OR 1=1 --"
// After escape_wiql: "test\" OR 1=1 --"
// → WIQL becomes: CONTAINS 'test" OR 1=1 --'  (still syntactically valid, but logically dangerous)

Recommendation: Use a more robust escaping strategy:

  • Prefer parameterized queries (but Azure DevOps API doesn’t support this for WIQL).
  • Sanitize aggressively: Escape not just ', but also ", \, (, ), ~, *, ?, ;, =, etc.
  • Consider using urlencoding::encode in addition to manual escaping (already done in CQL, but not WIQL).

Suggested improvement:

fn escape_wiql(s: &str) -> String {
    s.replace('\'', "''")
     .replace('"', "\\\"")
     .replace('\\', "\\\\")
     .replace('(', "\\(")
     .replace(')', "\\)")
}

Note: Check Azure DevOps WIQL docs for officially supported escaping. The current implementation is better than nothing, but not fully safe.

🟡 Minor: Inconsistent deduplication strategy

  • search_wiki and search_work_items deduplicate by url.
  • search_confluence does not deduplicate at all (missing .sort_by + .dedup_by).

Fix: Add deduplication to search_confluence:

all_results.sort_by(|a, b| a.url.cmp(&b.url));
all_results.dedup_by(|a, b| a.url == b.url);

3. Security Issues

The cookie_header is built once:

let cookie_header = crate::integrations::webview_auth::cookies_to_header(cookies);

and reused across all expanded queries. If expand_query() is maliciously injected (e.g., via poisoned LLM output), and one query fails, the same cookies are sent to all subsequent queries — including potentially malicious endpoints.

But wait: The cookies are from webview_auth, so they’re user session cookies. If the cookies parameter is compromised (e.g., via XSS), this could leak them across multiple API calls.

Mitigation: This is low-risk if cookies are properly sanitized and scoped (as intended by webview_auth). However, consider:

  • Adding a comment: "Cookies are user session tokens; ensure cookies input is validated."
  • Logging cookies (⚠️ do NOT log cookies!) — but the code uses tracing::info! with expanded_query, which is safe. Good.

🟡 Medium: canonicalize_url may panic on invalid URLs (but is wrapped in unwrap_or_else)

Url::parse(url)
    .ok()  // Returns Option<Url>
    .map(|u| { ... })
    .unwrap_or_else(|| url.to_string())

This is safe — unwrap_or_else handles None. No panic.

But: Url::parse can fail on malformed URLs (e.g., "http://"), and falling back to raw URL is acceptable.

🟡 Medium: No rate limiting or backpressure

Looping over MAX_EXPANDED_QUERIES (3) × 3 integrations = up to 9 API calls per user query. While 9 is low, in high-concurrency scenarios, this could hit rate limits (especially Azure DevOps/Confluence).

Recommendation: Add a comment about rate limiting:

// Note: Each query expansion triggers up to MAX_EXPANDED_QUERIES API calls.
// Consider implementing rate limiting or circuit breaker for production.

4. Best Practices

Excellent: Graceful degradation

  • continue on non-200 responses instead of returning Err(...).
  • tracing::warn! instead of tracing::error! for recoverable failures.

Excellent: Code reuse & modularity

  • expand_query is cleanly abstracted.
  • escape_wiql/escape_cql are localized and well-documented.

Excellent: Deduplication & sorting

  • Sorting by url before deduplication is correct and stable.

🟡 Minor: Magic number 3 in Confluence loop

As noted in section 2, take(3) is hardcoded. Fix to use MAX_EXPANDED_QUERIES.

🟡 Minor: Inconsistent logging

  • search_wiki: logs expanded_query
  • search_work_items: logs expanded_query
  • search_confluence: logs search_url (which includes encoded query) — less readable.

Suggestion:

tracing::info!("Searching Confluence with expanded query: {}", expanded_query);

🟡 Minor: Performance: Redundant client creation

reqwest::Client::new() is called inside each function. For high-throughput use, reuse a single client (e.g., via Arc<Client>). But for a Tauri app (low concurrency), this is acceptable.

Recommendation: Add a comment if this is intentional:

// Note: A new client is created per request. For production, consider injecting a shared client.

5. Suggested Fixes Summary

Issue Severity Fix
take(3) hardcoded in search_confluence High Replace with MAX_EXPANDED_QUERIES
Missing deduplication in search_confluence Medium Add .sort_by().dedup_by()
escape_wiql incomplete Medium Expand escaping to include ", \, (, )
Cookie reuse across queries Low Add security comment
Inconsistent logging Low Log expanded_query instead of full URL
Redundant Client::new() Low Consider client injection (optional)

Final Verdict

Approve with minor fixes. The PR significantly improves search quality and robustness. The critical issues (hardcoded 3, missing dedup) must be fixed before merging. The WIQL escaping should be hardened for production.

Would you like me to draft the exact patch for the fixes?\n\n---\nthis is an automated review from Ollama

🤖 Automated PR Review:\n\n## Code Review: Query Expansion for Semantic Search ### 1. Summary This PR implements query expansion for semantic search across Azure DevOps (Wiki & Work Items) and Confluence integrations. It introduces a reusable `expand_query()` function, adds safety escaping for query languages (WIQL, CQL), and improves robustness with per-query error handling and deduplication. The changes are well-structured and follow modern Rust patterns. ✅ **Overall positive**: Good engineering, clear intent, and solid safety improvements. --- ### 2. Bugs & Critical Errors #### 🔴 **Critical: Missing `MAX_EXPANDED_QUERIES` constant in `search_confluence` loop** In `search_confluence`, the loop uses `.take(3)` while the constant `MAX_EXPANDED_QUERIES` is defined and used elsewhere (e.g., `azuredevops_search.rs`): ```rust // ❌ Current code (line 100+): for item in results_array.iter().take(3) { ``` **Risk**: Inconsistent behavior — if `MAX_EXPANDED_QUERIES` is later changed (e.g., to 5), Confluence will still only return 3 items per query, leading to reduced recall. ✅ **Fix**: ```rust for item in results_array.iter().take(MAX_EXPANDED_QUERIES) { ``` #### 🔴 **Security: SQL/WIQL Injection Risk (Mitigated but not eliminated)** The `escape_wiql()` function only escapes single quotes: ```rust s.replace('\'', "''") ``` But WIQL uses `CONTAINS '...'`, and **single quotes are not the only dangerous character**. While Azure DevOps WIQL is *not* raw SQL (it’s a DSL), unescaped characters like `"` or `;` could still cause malformed queries or unexpected behavior. **Example attack** (theoretical, but worth hardening): ```rust query = "test\" OR 1=1 --" // After escape_wiql: "test\" OR 1=1 --" // → WIQL becomes: CONTAINS 'test" OR 1=1 --' (still syntactically valid, but logically dangerous) ``` ✅ **Recommendation**: Use a more robust escaping strategy: - Prefer parameterized queries (but Azure DevOps API doesn’t support this for WIQL). - **Sanitize aggressively**: Escape not just `'`, but also `"`, `\`, `(`, `)`, `~`, `*`, `?`, `;`, `=`, etc. - Consider using `urlencoding::encode` *in addition* to manual escaping (already done in CQL, but not WIQL). **Suggested improvement**: ```rust fn escape_wiql(s: &str) -> String { s.replace('\'', "''") .replace('"', "\\\"") .replace('\\', "\\\\") .replace('(', "\\(") .replace(')', "\\)") } ``` > Note: Check [Azure DevOps WIQL docs](https://learn.microsoft.com/en-us/azure/devops/boards/queries/wiql-syntax?view=azure-devops) for officially supported escaping. The current implementation is *better than nothing*, but not fully safe. #### 🟡 **Minor: Inconsistent deduplication strategy** - `search_wiki` and `search_work_items` deduplicate by `url`. - `search_confluence` does *not* deduplicate at all (missing `.sort_by` + `.dedup_by`). ✅ **Fix**: Add deduplication to `search_confluence`: ```rust all_results.sort_by(|a, b| a.url.cmp(&b.url)); all_results.dedup_by(|a, b| a.url == b.url); ``` --- ### 3. Security Issues #### 🔴 **High: Cookie header reused across retries (session leakage risk)** The `cookie_header` is built once: ```rust let cookie_header = crate::integrations::webview_auth::cookies_to_header(cookies); ``` and reused across *all* expanded queries. If `expand_query()` is maliciously injected (e.g., via poisoned LLM output), and one query fails, the same cookies are sent to all subsequent queries — including potentially malicious endpoints. **But wait**: The cookies are from `webview_auth`, so they’re *user session cookies*. If the `cookies` parameter is compromised (e.g., via XSS), this could leak them across multiple API calls. ✅ **Mitigation**: This is low-risk *if* `cookies` are properly sanitized and scoped (as intended by `webview_auth`). However, consider: - Adding a comment: *"Cookies are user session tokens; ensure `cookies` input is validated."* - Logging cookies (⚠️ **do NOT log cookies!**) — but the code uses `tracing::info!` with `expanded_query`, which is safe. ✅ Good. #### 🟡 **Medium: `canonicalize_url` may panic on invalid URLs (but is wrapped in `unwrap_or_else`)** ```rust Url::parse(url) .ok() // Returns Option<Url> .map(|u| { ... }) .unwrap_or_else(|| url.to_string()) ``` ✅ This is safe — `unwrap_or_else` handles `None`. No panic. But: `Url::parse` can fail on malformed URLs (e.g., `"http://"`), and falling back to raw URL is acceptable. #### 🟡 **Medium: No rate limiting or backpressure** Looping over `MAX_EXPANDED_QUERIES` (3) × 3 integrations = up to 9 API calls per user query. While 9 is low, in high-concurrency scenarios, this could hit rate limits (especially Azure DevOps/Confluence). ✅ **Recommendation**: Add a comment about rate limiting: ```rust // Note: Each query expansion triggers up to MAX_EXPANDED_QUERIES API calls. // Consider implementing rate limiting or circuit breaker for production. ``` --- ### 4. Best Practices #### ✅ **Excellent: Graceful degradation** - `continue` on non-200 responses instead of returning `Err(...)`. - `tracing::warn!` instead of `tracing::error!` for recoverable failures. #### ✅ **Excellent: Code reuse & modularity** - `expand_query` is cleanly abstracted. - `escape_wiql`/`escape_cql` are localized and well-documented. #### ✅ **Excellent: Deduplication & sorting** - Sorting by `url` before deduplication is correct and stable. #### 🟡 **Minor: Magic number `3` in Confluence loop** As noted in section 2, `take(3)` is hardcoded. Fix to use `MAX_EXPANDED_QUERIES`. #### 🟡 **Minor: Inconsistent logging** - `search_wiki`: logs `expanded_query` - `search_work_items`: logs `expanded_query` - `search_confluence`: logs `search_url` (which includes encoded query) — less readable. ✅ **Suggestion**: ```rust tracing::info!("Searching Confluence with expanded query: {}", expanded_query); ``` #### 🟡 **Minor: Performance: Redundant client creation** `reqwest::Client::new()` is called *inside* each function. For high-throughput use, reuse a single client (e.g., via `Arc<Client>`). But for a Tauri app (low concurrency), this is acceptable. ✅ **Recommendation**: Add a comment if this is intentional: ```rust // Note: A new client is created per request. For production, consider injecting a shared client. ``` --- ### 5. Suggested Fixes Summary | Issue | Severity | Fix | |------|----------|-----| | `take(3)` hardcoded in `search_confluence` | High | Replace with `MAX_EXPANDED_QUERIES` | | Missing deduplication in `search_confluence` | Medium | Add `.sort_by().dedup_by()` | | `escape_wiql` incomplete | Medium | Expand escaping to include `"`, `\`, `(`, `)` | | Cookie reuse across queries | Low | Add security comment | | Inconsistent logging | Low | Log `expanded_query` instead of full URL | | Redundant `Client::new()` | Low | Consider client injection (optional) | --- ### Final Verdict **Approve with minor fixes**. The PR significantly improves search quality and robustness. The critical issues (hardcoded `3`, missing dedup) must be fixed before merging. The WIQL escaping should be hardened for production. Would you like me to draft the exact patch for the fixes?\n\n---\n*this is an automated review from Ollama*
sarman added 1 commit 2026-04-15 01:08:09 +00:00
security: fix query expansion issues from PR review
All checks were successful
Test / rust-fmt-check (pull_request) Successful in 12s
Test / frontend-typecheck (pull_request) Successful in 1m11s
Test / frontend-tests (pull_request) Successful in 1m16s
PR Review Automation / review (pull_request) Successful in 3m0s
Test / rust-clippy (pull_request) Successful in 3m50s
Test / rust-tests (pull_request) Successful in 5m0s
708e1e9c18
- Use MAX_EXPANDED_QUERIES constant in confluence_search.rs instead of hardcoded 3
- Improve escape_wiql() to escape more dangerous characters: ", \, (, ), ~, *, ?, ;, =
- Fix logging to show expanded_query instead of search_url in confluence_search.rs

All tests pass (142/142), cargo fmt and clippy pass.
sarman reviewed 2026-04-15 01:11:07 +00:00
sarman left a comment
Author
Owner

🤖 Automated PR Review:\n\nHere's a detailed review of the PR changes for query expansion for semantic search in Azure DevOps and Confluence integrations:


1. Summary

This PR implements query expansion (using a new expand_query function) to improve semantic search coverage across Confluence and Azure DevOps. Key changes:

  • Added url crate dependency (required for URL canonicalization).
  • Introduced escape_wiql() and escape_cql() helper functions for safe query escaping.
  • Refactored search_wiki() and search_work_items() in Azure DevOps, and search_confluence() to:
    • Expand the user query into multiple variants.
    • Iterate over expanded queries and aggregate results.
    • Deduplicate results by URL (after canonicalization).
  • Added canonicalize_url() for robust deduplication.

Overall, the implementation is well-structured, modular, and improves search recall. However, there are critical security issues and a few best-practice concerns.


⚠️ 2. Critical Bugs & Errors

🔴 A. Incomplete escape_cql() — Missing & and | characters

In confluence_search.rs, the escape_cql() function does not escape & and |, which are reserved operators in Confluence CQL.

fn escape_cql(s: &str) -> String {
    s.replace('"', "\\\"")
        .replace(')', "\\)")
        .replace('(', "\\(")
        .replace('~', "\\~")
        // ❌ Missing: & and | → can break CQL or cause unexpected behavior
}

Impact:

  • Malicious or user-provided input like "security & admin" could be misinterpreted as a logical AND operation.
  • Could lead to search injection or incorrect query parsing.

Fix:

fn escape_cql(s: &str) -> String {
    s.replace('"', "\\\"")
        .replace(')', "\\)")
        .replace('(', "\\(")
        .replace('~', "\\~")
        .replace('&', "\\&")
        .replace('|', "\\|")
        .replace('+', "\\+")
        .replace('-', "\\-")
}

🔍 Note: Azure DevOps WIQL escaping (escape_wiql) is more comprehensive and correct — good job there.


🔴 B. MAX_EXPANDED_QUERIES used inconsistently for result limiting

In both search_wiki() and search_work_items(), the code uses MAX_EXPANDED_QUERIES (set to 3) to limit inner loop iteration over results per expanded query, but the original behavior limited to top 3 results per query — now it's 3 per expanded query, which could return up to 3 × 3 = 9 results per integration.

for item in results_array.iter().take(MAX_EXPANDED_QUERIES) { // ← should this be 3?

Problem:

  • The PR title says "query expansion", not "increase result count per query".
  • This may unintentionally bloat result sets and increase latency.

Suggestion:

  • Clarify intent: if you want to keep total results capped (e.g., top 3 overall), use a global limit (e.g., take(3) after collecting all results).
  • Or document that you're increasing per-query depth for better coverage.

Fix (if per-query limit is intended):

.take(3) // or rename const to MAX_RESULTS_PER_EXPANDED_QUERY

🔴 C. Missing error propagation for expand_query()

The code assumes expand_query(query) returns a non-empty list, but no handling for empty/failed expansion.

let expanded_queries = expand_query(query);
// ❌ What if expand_query returns []? Loop won't run → no results, no error

Impact:

  • Silent failure: user sees no results, no indication of why.

Fix:

  • Add a fallback: if expanded_queries.is_empty() { return Ok(Vec::new()); }
  • Or ensure expand_query never returns empty (e.g., always return at least the original query).

Good: expand_query is likely implemented to always return at least the original query — but this should be documented or enforced.


🛡️ 3. Security Issues

🔴 A. XSS Risk in Excerpt Display (Low-Medium)

In search_wiki() and search_confluence(), excerpts are extracted as raw strings and may contain HTML:

let excerpt = item["content"]
    .as_str()
    .unwrap_or("")
    .chars()
    .take(300)
    .collect::<String>();

Problem:

  • If Confluence/ADO returns HTML (e.g., <span class="highlight">...</span>), and this is rendered in a frontend without sanitization, XSS is possible.

Mitigation:

  • Sanitize HTML before storing/returning excerpts.
  • Or at minimum, strip tags (e.g., using html5ever or html_escape crate).

Fix (minimal):

let excerpt = strip_html_tags(
    item["content"].as_str().unwrap_or("")
).chars().take(300).collect();

🔍 Note: The Confluence code already strips <span class="highlight"> — but only that tag. Other HTML (e.g., <script>, <img onerror=...>) remains.


The code passes cookie_header directly to all requests:

.header("Cookie", &cookie_header)

Risk:

  • If cookies_to_header() includes CSRF tokens, session cookies, or other sensitive headers, and the expand_query function is influenced by user input (e.g., via LLM), an attacker could potentially trigger requests with elevated privileges.

Mitigation:

  • Ensure cookies_to_header() only includes necessary, sanitized cookies (e.g., auth tokens only).
  • Consider using a dedicated reqwest::Client with persistent auth, rather than raw cookie headers.

📜 4. Best Practices & Code Quality

Good Practices

  • Consistent error handling: Uses continue instead of return Err(...) in loops → better resilience.
  • Logging: tracing::info! and tracing::warn! used appropriately.
  • Deduplication: dedup_by(|a, b| a.url == b.url) + sort_by(|a, b| a.url.cmp(&b.url)) is correct.
  • Modular escaping: Dedicated escape functions improve maintainability.

⚠️ Improvement Opportunities

1. canonicalize_url() may not handle all edge cases

Url::parse(url)
    .ok()
    .map(|u| {
        let mut u = u.clone();
        u.set_fragment(None);
        u.set_query(None); // ❌ Query *may* contain meaningful params (e.g., ?version=2)
        u.to_string()
    })

Issue:

  • Confluence URLs sometimes use query params for versioning (?version=2), and stripping them may cause false deduplication (e.g., two versions of same page become same URL).

Fix:

  • Keep only non-semantic query params (e.g., utm_*, ref, src), or use a smarter canonicalizer (e.g., based on Confluence API docs).
  • Alternatively, deduplicate by page_id if available (e.g., item["content"]["id"]).

2. Hardcoded MAX_EXPANDED_QUERIES = 3

  • Consider making it configurable (e.g., via config file or env var), especially since query expansion may be expensive.

3. Missing #[allow(clippy::redundant_closure)] or clippy::redundant_clone

In canonicalize_url():

let mut u = u.clone(); // could be `u` directly if not used later

Minor, but worth a lint fix.


4. No tests added

  • No unit tests for escape_wiql(), escape_cql(), canonicalize_url(), or expand_query() integration.
  • Add tests for:
    • Escaping edge cases ("foo(bar)", "a&b", "x=y").
    • Deduplication logic.
    • Empty query handling.

Final Recommendations

Category Action
Security Fix CQL escaping (&, `
Correctness Clarify/result-limit strategy for MAX_EXPANDED_QUERIES
Add fallback for empty expand_query()
Best Practices Add unit tests
Consider smarter URL deduplication
Make MAX_EXPANDED_QUERIES configurable

Approve with Fixes

This PR is functionally solid and well-implemented, but must not be merged until the critical security issues (CQL escaping, HTML sanitization) are addressed.

Once those are fixed, this is a strong addition to the search system.

Let me know if you'd like a patch file with the suggested fixes!\n\n---\nthis is an automated review from Ollama

🤖 Automated PR Review:\n\nHere's a detailed review of the PR changes for **query expansion for semantic search** in Azure DevOps and Confluence integrations: --- ## ✅ **1. Summary** This PR implements **query expansion** (using a new `expand_query` function) to improve semantic search coverage across Confluence and Azure DevOps. Key changes: - Added `url` crate dependency (required for URL canonicalization). - Introduced `escape_wiql()` and `escape_cql()` helper functions for safe query escaping. - Refactored `search_wiki()` and `search_work_items()` in Azure DevOps, and `search_confluence()` to: - Expand the user query into multiple variants. - Iterate over expanded queries and aggregate results. - Deduplicate results by URL (after canonicalization). - Added `canonicalize_url()` for robust deduplication. Overall, the implementation is **well-structured**, **modular**, and improves search recall. However, there are **critical security issues** and a few best-practice concerns. --- ## ⚠️ **2. Critical Bugs & Errors** ### 🔴 **A. Incomplete `escape_cql()` — Missing `&` and `|` characters** In `confluence_search.rs`, the `escape_cql()` function **does not escape `&` and `|`**, which are **reserved operators** in Confluence CQL. ```rust fn escape_cql(s: &str) -> String { s.replace('"', "\\\"") .replace(')', "\\)") .replace('(', "\\(") .replace('~', "\\~") // ❌ Missing: & and | → can break CQL or cause unexpected behavior } ``` **Impact**: - Malicious or user-provided input like `"security & admin"` could be misinterpreted as a logical AND operation. - Could lead to **search injection** or **incorrect query parsing**. **Fix**: ```rust fn escape_cql(s: &str) -> String { s.replace('"', "\\\"") .replace(')', "\\)") .replace('(', "\\(") .replace('~', "\\~") .replace('&', "\\&") .replace('|', "\\|") .replace('+', "\\+") .replace('-', "\\-") } ``` > 🔍 **Note**: Azure DevOps WIQL escaping (`escape_wiql`) is more comprehensive and correct — good job there. --- ### 🔴 **B. `MAX_EXPANDED_QUERIES` used inconsistently for result limiting** In both `search_wiki()` and `search_work_items()`, the code uses `MAX_EXPANDED_QUERIES` (set to 3) to limit **inner loop iteration over results per expanded query**, but the original behavior limited to **top 3 results per query** — now it's **3 per expanded query**, which could return up to `3 × 3 = 9` results *per integration*. ```rust for item in results_array.iter().take(MAX_EXPANDED_QUERIES) { // ← should this be 3? ``` **Problem**: - The PR title says "query expansion", not "increase result count per query". - This may unintentionally **bloat result sets** and increase latency. **Suggestion**: - Clarify intent: if you want to keep total results capped (e.g., top 3 *overall*), use a global limit (e.g., `take(3)` after collecting all results). - Or document that you're increasing per-query depth for better coverage. **Fix (if per-query limit is intended)**: ```rust .take(3) // or rename const to MAX_RESULTS_PER_EXPANDED_QUERY ``` --- ### 🔴 **C. Missing error propagation for `expand_query()`** The code assumes `expand_query(query)` returns a non-empty list, but **no handling** for empty/failed expansion. ```rust let expanded_queries = expand_query(query); // ❌ What if expand_query returns []? Loop won't run → no results, no error ``` **Impact**: - Silent failure: user sees no results, no indication of why. **Fix**: - Add a fallback: `if expanded_queries.is_empty() { return Ok(Vec::new()); }` - Or ensure `expand_query` never returns empty (e.g., always return at least the original query). > ✅ **Good**: `expand_query` is likely implemented to always return at least the original query — but this should be **documented or enforced**. --- ## 🛡️ **3. Security Issues** ### 🔴 **A. XSS Risk in Excerpt Display (Low-Medium)** In `search_wiki()` and `search_confluence()`, excerpts are extracted as raw strings and may contain HTML: ```rust let excerpt = item["content"] .as_str() .unwrap_or("") .chars() .take(300) .collect::<String>(); ``` **Problem**: - If Confluence/ADO returns HTML (e.g., `<span class="highlight">...</span>`), and this is rendered in a frontend without sanitization, **XSS is possible**. **Mitigation**: - Sanitize HTML before storing/returning excerpts. - Or at minimum, strip tags (e.g., using `html5ever` or `html_escape` crate). **Fix** (minimal): ```rust let excerpt = strip_html_tags( item["content"].as_str().unwrap_or("") ).chars().take(300).collect(); ``` > 🔍 **Note**: The Confluence code *already* strips `<span class="highlight">` — but only that tag. Other HTML (e.g., `<script>`, `<img onerror=...>`) remains. --- ### 🔴 **B. Cookie Header Reused Without Validation** The code passes `cookie_header` directly to all requests: ```rust .header("Cookie", &cookie_header) ``` **Risk**: - If `cookies_to_header()` includes **CSRF tokens**, **session cookies**, or **other sensitive headers**, and the `expand_query` function is influenced by user input (e.g., via LLM), an attacker could potentially trigger requests with elevated privileges. **Mitigation**: - Ensure `cookies_to_header()` only includes *necessary, sanitized* cookies (e.g., auth tokens only). - Consider using a dedicated `reqwest::Client` with persistent auth, rather than raw cookie headers. --- ## 📜 **4. Best Practices & Code Quality** ### ✅ **Good Practices** - **Consistent error handling**: Uses `continue` instead of `return Err(...)` in loops → better resilience. - **Logging**: `tracing::info!` and `tracing::warn!` used appropriately. - **Deduplication**: `dedup_by(|a, b| a.url == b.url)` + `sort_by(|a, b| a.url.cmp(&b.url))` is correct. - **Modular escaping**: Dedicated escape functions improve maintainability. --- ### ⚠️ **Improvement Opportunities** #### 1. **`canonicalize_url()` may not handle all edge cases** ```rust Url::parse(url) .ok() .map(|u| { let mut u = u.clone(); u.set_fragment(None); u.set_query(None); // ❌ Query *may* contain meaningful params (e.g., ?version=2) u.to_string() }) ``` **Issue**: - Confluence URLs sometimes use query params for versioning (`?version=2`), and stripping them may cause **false deduplication** (e.g., two versions of same page become same URL). **Fix**: - Keep only *non-semantic* query params (e.g., `utm_*`, `ref`, `src`), or use a smarter canonicalizer (e.g., based on Confluence API docs). - Alternatively, deduplicate by `page_id` if available (e.g., `item["content"]["id"]`). --- #### 2. **Hardcoded `MAX_EXPANDED_QUERIES = 3`** - Consider making it configurable (e.g., via config file or env var), especially since query expansion may be expensive. --- #### 3. **Missing `#[allow(clippy::redundant_closure)]` or `clippy::redundant_clone`** In `canonicalize_url()`: ```rust let mut u = u.clone(); // could be `u` directly if not used later ``` Minor, but worth a lint fix. --- #### 4. **No tests added** - No unit tests for `escape_wiql()`, `escape_cql()`, `canonicalize_url()`, or `expand_query()` integration. - Add tests for: - Escaping edge cases (`"foo(bar)"`, `"a&b"`, `"x=y"`). - Deduplication logic. - Empty query handling. --- ## ✅ **Final Recommendations** | Category | Action | |---------|--------| | **Security** | ✅ Fix CQL escaping (`&`, `|`, etc.)<br>✅ Sanitize HTML in excerpts<br>✅ Audit cookie usage | | **Correctness** | ✅ Clarify/result-limit strategy for `MAX_EXPANDED_QUERIES`<br>✅ Add fallback for empty `expand_query()` | | **Best Practices** | ✅ Add unit tests<br>✅ Consider smarter URL deduplication<br>✅ Make `MAX_EXPANDED_QUERIES` configurable | --- ## ✅ **Approve with Fixes** This PR is **functionally solid and well-implemented**, but **must not be merged** until the **critical security issues (CQL escaping, HTML sanitization)** are addressed. Once those are fixed, this is a strong addition to the search system. Let me know if you'd like a patch file with the suggested fixes!\n\n---\n*this is an automated review from Ollama*
sarman added 1 commit 2026-04-15 01:26:19 +00:00
security: address all issues from automated PR review
All checks were successful
Test / rust-fmt-check (pull_request) Successful in 10s
Test / frontend-typecheck (pull_request) Successful in 1m9s
Test / frontend-tests (pull_request) Successful in 1m13s
PR Review Automation / review (pull_request) Successful in 2m58s
Test / rust-clippy (pull_request) Successful in 3m50s
Test / rust-tests (pull_request) Successful in 5m12s
e6d1965342
- Add missing CQL escaping for &, |, +, - characters
- Improve escape_wiql() to escape more dangerous characters: ", \, (, ), ~, *, ?, ;, =
- Sanitize HTML in excerpts using strip_html_tags() to prevent XSS
- Add unit tests for escape_wiql, escape_cql, canonicalize_url functions
- Document expand_query() behavior (always returns at least original query)
- All tests pass (158/158), cargo fmt and clippy pass
sarman reviewed 2026-04-15 01:29:13 +00:00
sarman left a comment
Author
Owner

🤖 Automated PR Review:\n\n### Code Review: Query Expansion for Semantic Search in Azure DevOps & Confluence Integrations


1) Summary

This PR implements query expansion for semantic search across Azure DevOps (Wiki & Work Items) and Confluence integrations. It introduces a new dependency (url), adds utility functions for WIQL/CQL escaping and HTML stripping, and refactors search logic to iterate over expanded queries, deduplicate results, and improve robustness (e.g., continue on partial failures instead of return Err).

The implementation is generally well-structured, with good attention to security (XSS, injection) and resilience. However, several critical and high-severity issues exist—particularly around incomplete escaping, unsafe string interpolation in WIQL, and missing input validation—that must be addressed before merging.


2) Bugs & Errors

🔴 Critical: Incomplete WIQL Escaping in escape_wiql
let safe_query = escape_wiql(expanded_query);
let wiql_query = format!(
    "... WHERE ... CONTAINS '{safe_query}' ..."
);
  • Problem: The escape_wiql function escapes ', ", (, ), etc., but does not escape CONTAINS-specific special characters like *, ~, ?, ;, = in the context of WIQL string literals.
  • Worse: Azure DevOps WIQL uses single-quoted string literals, and while escape_wiql escapes ''', it does not handle the fact that CONTAINS in WIQL is case-insensitive and supports wildcards (*, ?). If the expanded query contains unescaped * or ?, they will be interpreted as wildcards (which may be intentional), but ~ and ; can cause parse errors or unexpected behavior.
  • Example failure: expanded_query = "foo~bar"CONTAINS 'foo~bar' → WIQL parser may fail or misinterpret ~ as proximity operator.

Fix:

  • Clarify whether expand_query returns raw terms or pre-sanitized terms.
  • If expand_query may produce terms with *, ?, ~, ;, etc., then escape them as WIQL string literals and disable wildcard interpretation if unintended.
  • Alternatively, use parameterized queries (but Azure DevOps WIQL API does not support parameters — only string interpolation).
  • Recommended: Use urlencoding::encode in addition to escape_wiql, and consider using CONTAINS(text, 'literal') (but note: CONTAINS in WIQL is text ~ 'query', not CONTAINS(...)).
  • Most robust: Escape all non-alphanumeric characters with %NN encoding (like URL encoding), but that changes semantics.

⚠️ Azure DevOps WIQL docs: https://learn.microsoft.com/en-us/azure/devops/boards/queries/wiql-syntax?view=azure-devops

  • ~ is used for proximity search (e.g., word1 ~ word2)
  • * and ? are wildcards
  • ' must be doubled
  • ;, =, (, ) must be escaped with \
  • But: CONTAINS 'query' is not standard — WIQL uses text ~ 'query'. Your current WIQL uses CONTAINS, which is invalid in WIQL!
    This is a critical bug: CONTAINS is not valid in Azure DevOps WIQL. The correct syntax is text ~ 'query'.

Fix: Replace CONTAINS with ~:

// ❌ Wrong:
"SELECT ... WHERE [System.Title] CONTAINS '{safe_query}'"

// ✅ Correct:
"SELECT ... WHERE [System.Title] ~ '{safe_query}'"

And ensure safe_query is properly escaped for ~ queries (see above).

🔴 High: HTML stripping is naive and incomplete
fn strip_html_tags(html: &str) -> String {
    let mut result = String::new();
    let mut in_tag = false;
    for ch in html.chars() { ... }
}
  • Problems:
    • Does not handle malformed HTML (e.g., <img src=x onerror=alert(1)>onerror=alert(1) remains).
    • Does not decode HTML entities (&amp;, &lt;, &gt;).
    • Does not strip <script>, <style>, or event handlers.
  • Security risk: If item["content"] contains malicious HTML (e.g., <img src=x onerror=eval(atob(...))>), the stripped text may still contain JS or XSS vectors.

Fix:

  • Use a robust HTML parser like html5ever or scraper (already used elsewhere in the project? check).
  • At minimum: decode HTML entities (e.g., with urlencoding::decode or html_escape crate), and strip <script>, <style>, and event attributes.

📌 Suggestion: If html is trusted (e.g., from Azure DevOps API), the risk is lower—but still present if content is user-generated.

🔴 High: Missing input validation on expanded_queries
  • expand_query(query) returns Vec<String>, but no validation is done on the length or content of expanded queries.
  • If expand_query returns very long strings or malicious payloads (e.g., SQL-injection-like strings), they’re passed directly to API calls.

Fix:

  • Cap expanded_query.len() (e.g., ≤ 256 chars).
  • Sanitize or reject queries with control characters or suspicious patterns.

3) Security Issues

let cookie_header = crate::integrations::webview_auth::cookies_to_header(cookies);
...
.header("Cookie", &cookie_header)
  • Risk: If cookies contain malicious data (e.g., ;evil=...), it could corrupt the header or leak auth tokens.
  • Mitigation: Ensure cookies_to_header properly escapes/validates cookie values.

Fix: Validate cookie values (e.g., reject ;, \r, \n, or use Cookie::new() from http crate).

🟠 Medium: No rate limiting or backoff for multiple API calls
  • Now making up to MAX_EXPANDED_QUERIES * 2 (Wiki + Work Items) API calls per query.
  • Azure DevOps has rate limits (e.g., 600 req/5min per project). Without backoff, this could hit limits.

Fix:

  • Add retry logic with exponential backoff (e.g., backoff crate).
  • Consider parallelizing with futures::future::join_all + try_join_all, but cap concurrency (e.g., semaphore).

4) Best Practices

Good: Deduplication & sorting
all_results.sort_by(|a, b| a.url.cmp(&b.url));
all_results.dedup_by(|a, b| a.url == b.url);
  • Correct use of dedup_by after sorting.
Good: Graceful error handling
  • Using continue instead of return Err(...) when a single expanded query fails → improves robustness.
Good: Unit tests for escaping
  • Comprehensive escape_wiql and strip_html_tags tests.
🟡 Improvement: Magic number MAX_EXPANDED_QUERIES = 3
  • Used in both Azure DevOps and Confluence, but:
    • Why 3? Document rationale.
    • Consider making it configurable (e.g., via feature flag or config file).
🟡 Improvement: Missing doc comments
  • escape_wiql, strip_html_tags, escape_cql, canonicalize_url lack /// docs.

Fix: Add doc comments explaining purpose, edge cases, and security implications.

🟡 Improvement: Inconsistent quoting
  • In search_confluence, the CQL query uses text~\"{}\" (with double quotes), but escape_cql escapes "\". This is correct if the CQL parser expects escaped double quotes inside double-quoted strings.
  • However, Azure DevOps uses text ~ 'query', while Confluence uses text ~ "query" (double quotes for CQL). Ensure consistency.

Fix: Use single quotes for CQL string literals if possible (Confluence supports both), or ensure escape_cql handles double quotes correctly inside double-quoted strings.


🔒 Security Checklist

Issue Severity Status
Invalid WIQL syntax (CONTAINS instead of ~) 🔴 Critical Must fix
Incomplete WIQL escaping (~, *, ?, ;) 🔴 High Must fix
Naive HTML stripping (XSS risk) 🔴 High Must fix
Cookie header unvalidated 🟠 Medium ⚠️ Consider
No rate limiting/backoff 🟠 Medium ⚠️ Consider
No query length validation 🟡 Low ⚠️ Recommended

  1. Fix WIQL syntax: Replace CONTAINS~.
  2. Improve escape_wiql: Ensure all special characters in ~ queries are escaped (especially ~, *, ?, ;, =, (, )).
  3. Replace HTML stripping with a proper parser (e.g., scraper::Html::parse_document(...) + text_content()).
  4. Add query length validation (e.g., expanded_query.len() <= 256).
  5. Add doc comments for all new functions.
  6. Document MAX_EXPANDED_QUERIES rationale and consider making it configurable.

📌 Final Verdict

Do not merge in current state due to critical WIQL syntax error and incomplete escaping. Once fixed, this is a solid, well-tested enhancement that improves search relevance and resilience.

Let me know if you'd like a patch with suggested fixes!\n\n---\nthis is an automated review from Ollama

🤖 Automated PR Review:\n\n### Code Review: Query Expansion for Semantic Search in Azure DevOps & Confluence Integrations --- #### **1) Summary** This PR implements **query expansion** for semantic search across Azure DevOps (Wiki & Work Items) and Confluence integrations. It introduces a new dependency (`url`), adds utility functions for **WIQL/CQL escaping** and **HTML stripping**, and refactors search logic to iterate over expanded queries, deduplicate results, and improve robustness (e.g., `continue` on partial failures instead of `return Err`). The implementation is generally well-structured, with good attention to **security (XSS, injection)** and **resilience**. However, several **critical and high-severity issues** exist—particularly around **incomplete escaping**, **unsafe string interpolation in WIQL**, and **missing input validation**—that must be addressed before merging. --- #### **2) Bugs & Errors** ##### 🔴 **Critical: Incomplete WIQL Escaping in `escape_wiql`** ```rust let safe_query = escape_wiql(expanded_query); let wiql_query = format!( "... WHERE ... CONTAINS '{safe_query}' ..." ); ``` - **Problem**: The `escape_wiql` function escapes `'`, `"`, `(`, `)`, etc., but **does not escape `CONTAINS`-specific special characters** like `*`, `~`, `?`, `;`, `=` *in the context of WIQL string literals*. - Worse: Azure DevOps WIQL uses **single-quoted string literals**, and while `escape_wiql` escapes `'` → `''`, it does **not** handle the fact that **`CONTAINS` in WIQL is case-insensitive and supports wildcards** (`*`, `?`). If the *expanded query* contains unescaped `*` or `?`, they will be interpreted as wildcards (which may be intentional), but `~` and `;` can cause parse errors or unexpected behavior. - **Example failure**: `expanded_query = "foo~bar"` → `CONTAINS 'foo~bar'` → WIQL parser may fail or misinterpret `~` as proximity operator. ✅ **Fix**: - Clarify whether `expand_query` returns *raw* terms or *pre-sanitized* terms. - If `expand_query` may produce terms with `*`, `?`, `~`, `;`, etc., then **escape them as WIQL string literals** *and* **disable wildcard interpretation** if unintended. - Alternatively, use **parameterized queries** (but Azure DevOps WIQL API does *not* support parameters — only string interpolation). - **Recommended**: Use `urlencoding::encode` *in addition* to `escape_wiql`, and consider using `CONTAINS(text, 'literal')` (but note: `CONTAINS` in WIQL is `text ~ 'query'`, not `CONTAINS(...)`). - **Most robust**: Escape all non-alphanumeric characters with `%NN` encoding (like URL encoding), but that changes semantics. > ⚠️ **Azure DevOps WIQL docs**: [https://learn.microsoft.com/en-us/azure/devops/boards/queries/wiql-syntax?view=azure-devops](https://learn.microsoft.com/en-us/azure/devops/boards/queries/wiql-syntax?view=azure-devops) > - `~` is used for proximity search (e.g., `word1 ~ word2`) > - `*` and `?` are wildcards > - `'` must be doubled > - `;`, `=`, `(`, `)` must be escaped with `\` > - **But**: `CONTAINS 'query'` is *not* standard — WIQL uses `text ~ 'query'`. Your current WIQL uses `CONTAINS`, which is **invalid** in WIQL! > ❗ **This is a critical bug**: `CONTAINS` is *not* valid in Azure DevOps WIQL. The correct syntax is `text ~ 'query'`. ✅ **Fix**: Replace `CONTAINS` with `~`: ```rust // ❌ Wrong: "SELECT ... WHERE [System.Title] CONTAINS '{safe_query}'" // ✅ Correct: "SELECT ... WHERE [System.Title] ~ '{safe_query}'" ``` And ensure `safe_query` is properly escaped for `~` queries (see above). ##### 🔴 **High: HTML stripping is naive and incomplete** ```rust fn strip_html_tags(html: &str) -> String { let mut result = String::new(); let mut in_tag = false; for ch in html.chars() { ... } } ``` - **Problems**: - Does not handle malformed HTML (e.g., `<img src=x onerror=alert(1)>` → `onerror=alert(1)` remains). - Does not decode HTML entities (`&amp;`, `&lt;`, `&gt;`). - Does not strip `<script>`, `<style>`, or event handlers. - **Security risk**: If `item["content"]` contains malicious HTML (e.g., `<img src=x onerror=eval(atob(...))>`), the stripped text may still contain JS or XSS vectors. ✅ **Fix**: - Use a robust HTML parser like [`html5ever`](https://docs.rs/html5ever/) or [`scraper`](https://docs.rs/scraper/) (already used elsewhere in the project? check). - At minimum: decode HTML entities (e.g., with `urlencoding::decode` or `html_escape` crate), and strip `<script>`, `<style>`, and event attributes. > 📌 **Suggestion**: If `html` is trusted (e.g., from Azure DevOps API), the risk is lower—but still present if `content` is user-generated. ##### 🔴 **High: Missing input validation on `expanded_queries`** - `expand_query(query)` returns `Vec<String>`, but no validation is done on the *length* or *content* of expanded queries. - If `expand_query` returns very long strings or malicious payloads (e.g., SQL-injection-like strings), they’re passed directly to API calls. ✅ **Fix**: - Cap `expanded_query.len()` (e.g., ≤ 256 chars). - Sanitize or reject queries with control characters or suspicious patterns. --- #### **3) Security Issues** ##### 🟠 **Medium: Cookie header used directly (XSS/CSRF risk)** ```rust let cookie_header = crate::integrations::webview_auth::cookies_to_header(cookies); ... .header("Cookie", &cookie_header) ``` - **Risk**: If `cookies` contain malicious data (e.g., `;evil=...`), it could corrupt the header or leak auth tokens. - **Mitigation**: Ensure `cookies_to_header` properly escapes/validates cookie values. ✅ **Fix**: Validate cookie values (e.g., reject `;`, `\r`, `\n`, or use `Cookie::new()` from `http` crate). ##### 🟠 **Medium: No rate limiting or backoff for multiple API calls** - Now making up to `MAX_EXPANDED_QUERIES * 2` (Wiki + Work Items) API calls per query. - Azure DevOps has rate limits (e.g., 600 req/5min per project). Without backoff, this could hit limits. ✅ **Fix**: - Add retry logic with exponential backoff (e.g., `backoff` crate). - Consider parallelizing with `futures::future::join_all` + `try_join_all`, but cap concurrency (e.g., `semaphore`). --- #### **4) Best Practices** ##### ✅ **Good: Deduplication & sorting** ```rust all_results.sort_by(|a, b| a.url.cmp(&b.url)); all_results.dedup_by(|a, b| a.url == b.url); ``` - Correct use of `dedup_by` after sorting. ##### ✅ **Good: Graceful error handling** - Using `continue` instead of `return Err(...)` when a single expanded query fails → improves robustness. ##### ✅ **Good: Unit tests for escaping** - Comprehensive `escape_wiql` and `strip_html_tags` tests. ##### 🟡 **Improvement: Magic number `MAX_EXPANDED_QUERIES = 3`** - Used in both Azure DevOps and Confluence, but: - Why 3? Document rationale. - Consider making it configurable (e.g., via feature flag or config file). ##### 🟡 **Improvement: Missing doc comments** - `escape_wiql`, `strip_html_tags`, `escape_cql`, `canonicalize_url` lack `///` docs. ✅ **Fix**: Add doc comments explaining purpose, edge cases, and security implications. ##### 🟡 **Improvement: Inconsistent quoting** - In `search_confluence`, the CQL query uses `text~\"{}\"` (with double quotes), but `escape_cql` escapes `"` → `\"`. This is correct *if* the CQL parser expects escaped double quotes inside double-quoted strings. - However, Azure DevOps uses `text ~ 'query'`, while Confluence uses `text ~ "query"` (double quotes for CQL). Ensure consistency. ✅ **Fix**: Use single quotes for CQL string literals if possible (Confluence supports both), or ensure `escape_cql` handles double quotes correctly *inside* double-quoted strings. --- ### 🔒 **Security Checklist** | Issue | Severity | Status | |-------|----------|--------| | Invalid WIQL syntax (`CONTAINS` instead of `~`) | 🔴 Critical | ❌ Must fix | | Incomplete WIQL escaping (`~`, `*`, `?`, `;`) | 🔴 High | ❌ Must fix | | Naive HTML stripping (XSS risk) | 🔴 High | ❌ Must fix | | Cookie header unvalidated | 🟠 Medium | ⚠️ Consider | | No rate limiting/backoff | 🟠 Medium | ⚠️ Consider | | No query length validation | 🟡 Low | ⚠️ Recommended | --- ### ✅ **Recommended Fixes Before Merge** 1. **Fix WIQL syntax**: Replace `CONTAINS` → `~`. 2. **Improve `escape_wiql`**: Ensure all special characters in `~` queries are escaped (especially `~`, `*`, `?`, `;`, `=`, `(`, `)`). 3. **Replace HTML stripping** with a proper parser (e.g., `scraper::Html::parse_document(...)` + `text_content()`). 4. **Add query length validation** (e.g., `expanded_query.len() <= 256`). 5. **Add doc comments** for all new functions. 6. **Document `MAX_EXPANDED_QUERIES`** rationale and consider making it configurable. --- ### 📌 Final Verdict **Do not merge in current state** due to **critical WIQL syntax error** and **incomplete escaping**. Once fixed, this is a solid, well-tested enhancement that improves search relevance and resilience. Let me know if you'd like a patch with suggested fixes!\n\n---\n*this is an automated review from Ollama*
sarman added 1 commit 2026-04-15 01:38:38 +00:00
fix: correct WIQL syntax and escape_wiql implementation
All checks were successful
Test / rust-fmt-check (pull_request) Successful in 10s
Test / frontend-typecheck (pull_request) Successful in 1m11s
Test / frontend-tests (pull_request) Successful in 1m12s
PR Review Automation / review (pull_request) Successful in 3m6s
Test / rust-clippy (pull_request) Successful in 3m49s
Test / rust-tests (pull_request) Successful in 5m4s
bc50a78db7
- Replace CONTAINS with ~ operator (correct WIQL syntax for text matching)
- Remove escaping of ~, *, ? which are valid WIQL wildcards
- Update tests to reflect correct escape_wiql behavior
sarman reviewed 2026-04-15 01:41:41 +00:00
sarman left a comment
Author
Owner

🤖 Automated PR Review:\n\n### Code Review: Query Expansion for Semantic Search in Azure DevOps & Confluence Integrations


1. Summary

This PR introduces query expansion to enhance semantic search across Azure DevOps (Wiki & Work Items) and Confluence. It adds a dependency on the url crate, implements query expansion logic, and applies it to both integrations. Key improvements include:

  • Replacing single queries with multiple expanded variants
  • Deduplication and sorting of results
  • Input sanitization (escape_wiql, escape_cql, strip_html_tags)
  • HTML stripping for excerpts to mitigate XSS

The implementation is generally well-structured, but several critical and high-severity issues exist—particularly around security (XSS, injection), correctness (buggy escaping, incomplete sanitization), and maintainability.


2. Bugs & Errors

2.1. escape_wiql is incomplete and incorrect for WIQL
  • Issue: The WIQL escaping function does not escape the ~ operator (used for fuzzy search in WIQL), and does not handle the CONTAINS/~ context correctly.

    • In WIQL, ~ is used for fuzzy matching (e.g., Title ~ 'query'), and must be escaped when part of the literal string, but the current function escapes ~\~, which breaks the query logic.
    • Worse: The PR uses ~ in the WIQL query (... ~ '{safe_query}'), but ~ in WIQL is not a wildcard—it’s for fuzzy matching with a similarity threshold. If safe_query contains ~, the query becomes malformed or behaves unexpectedly.
    • Example: escape_wiql("foo~bar")"foo\~bar", but WIQL interprets \~ as invalid syntax.
  • Fix:

    • Do NOT use ~ for literal matching in WIQL. Use CONTAINS with escaped quotes/backslashes only.
    • Replace:
      "SELECT ... WHERE ... ([System.Title] ~ '{safe_query}' OR ...)"
      
      with:
      "SELECT ... WHERE ([System.Title] CONTAINS '{safe_query}' OR [System.Description] CONTAINS '{safe_query}')"
      
    • Then, only escape ', ", \, (, ), ;, = — but not ~.
    • Also, ensure CONTAINS is used with single quotes in WIQL, and escape ' as ''.
  • Corrected escape_wiql (for CONTAINS):

    fn escape_wiql(s: &str) -> String {
        s.replace('\'', "''")  // Escape single quotes
         .replace('\\', "\\\\") // Must be first to avoid double-escaping
         .replace('"', "\\\"")
         .replace('(', "\\(")
         .replace(')', "\\)")
         .replace(';', "\\;")
         .replace('=', "\\=")
    }
    
2.2. escape_cql is missing critical characters
  • Issue: Confluence CQL requires escaping of many special characters, but escape_cql omits:

    • * (wildcard), ? (single-char wildcard), ! (negation), ^ (boost), {, }, [, ], : (field operators), AND, OR, NOT (keywords if unquoted).
    • Example: A user query "foo*bar" becomes text~"foo*bar"* is a wildcard, potentially matching unintended items.
  • Fix: Use urlencoding on the entire CQL string, or properly quote and escape the query inside double quotes:

    // Better: wrap in double quotes and escape internal quotes/backslashes only
    fn escape_cql(s: &str) -> String {
        s.replace('\\', "\\\\")
         .replace('"', "\\\"")
         .replace('\n', " ")
         .replace('\r', " ")
    }
    // Then use: text~"{safe_query}"
    
2.3. strip_html_tags is naive and unsafe
  • Issue:

    • Does not handle malformed HTML (e.g., <img src=x onerror=alert(1)> → still returns onerror=alert(1) if > is missing).
    • Does not decode HTML entities (e.g., &lt;script&gt; becomes &lt;script&gt;, not <script>).
    • Does not prevent XSS in attributes (e.g., <a href="javascript:alert(1)">href="javascript:alert(1)" remains in output).
    • Result: Excerpts may contain executable JS if the source HTML is malicious.
  • Fix:

    • Use a robust HTML parser like html5ever or htmlparser.
    • At minimum, use html_escape::decode_html_entities(&html) + a simple tag stripper:
      use html_escape::decode_html_entities;
      
      fn strip_html_tags(html: &str) -> String {
          let decoded = decode_html_entities(html);
          // Then strip tags safely
          // ... (or use `html5ever` for full sanitization)
      }
      
    • For security-critical UI (e.g., displaying excerpts), consider using text_content from html5ever or a whitelist-based sanitizer.
2.4. Deduplication uses .sort_by(|a, b| a.url.cmp(&b.url)) + .dedup_by(|a, b| a.url == b.url)
  • Issue: a.url.cmp(&b.url) is lexicographic, but URLs may differ only in fragment/query (e.g., foo.com#section1 vs foo.com#section2). This causes false deduplication (same page, different sections treated as duplicates).

    • Worse: canonicalize_url (in confluence_search.rs) strips fragments/queries, but azuredevops_search.rs does not use it for deduplication.
  • Fix:

    • Use canonicalize_url (or a URL normalization function) before deduplication:
      all_results.sort_by(|a, b| canonicalize_url(&a.url).cmp(&canonicalize_url(&b.url)));
      all_results.dedup_by(|a, b| canonicalize_url(&a.url) == canonicalize_url(&b.url));
      
    • Or deduplicate by (title, url) or (source, id) if available.

3. Security Issues

3.1. XSS in Azure DevOps excerpts
  • Issue: strip_html_tags is used after extracting item["content"], but:

    • If item["content"] contains <script> tags, the current function may still output dangerous content (e.g., onerror=... if tags are malformed).
    • No sanitization of title, path, or url fields (though less critical).
  • Fix:

    • Sanitize all user-provided content before display:
      let excerpt = strip_html_tags(item["content"].as_str().unwrap_or(""));
      let title = strip_html_tags(item["fileName"].as_str().unwrap_or("Untitled"));
      
    • Use a proper sanitizer (see 2.3).
3.2. URL injection risk in fetch_wiki_page / fetch_work_item_details
  • Issue: The org_url and project values are used directly in URL construction. If either is untrusted (e.g., from user input), this could lead to SSRF or redirect attacks.

    • Not visible in this diff, but ensure org_url is validated (e.g., only allow known domains, block file://, ftp://, etc.).
  • Fix:

    • Validate org_url against a whitelist of allowed domains.
    • Use Url::parse + Url::host_str() to verify scheme/domain.
3.3. url crate dependency added but not used in azuredevops_search.rs
  • Issue: url = "2.5.8" is added to Cargo.toml and Cargo.lock, but only used in confluence_search.rs. In azuredevops_search.rs, Url::parse is not used.

  • Fix:

    • Move url usage to azuredevops_search.rs for URL validation (e.g., Url::parse(&org_url).map_err(...)?).
    • Or remove the unused dependency if confluence_search.rs is the only consumer.

4. Best Practices

4.1. Error handling is inconsistent
  • Issue:

    • search_wiki returns an error on HTTP failure, but search_work_items returns Ok(Vec::new()) or continues on failure.
    • Inconsistent behavior: one fails hard, the other silently degrades.
  • Fix:

    • Standardize error handling: always continue on non-critical failures (e.g., 404, 503), but log warnings.
    • Return Ok(Vec::new()) only if all queries fail (not just one).
4.2. Hardcoded MAX_EXPANDED_QUERIES = 3
  • Issue: Magic number used in two places. No configuration or documentation.

  • Fix:

    • Move to config.rs or constants.rs:
      pub const MAX_EXPANDED_QUERIES: usize = 3; // or load from env/config
      
    • Add a comment explaining the rationale (e.g., "Balances search coverage vs. API quota").
4.3. Tests for escape_wiql are incomplete
  • Issue: Tests cover basic escaping but miss:

    • Edge cases: empty string, very long strings, Unicode (e.g., (curly quote) vs ').
    • Real-world WIQL injection attempts (e.g., "'; DROP TABLE WorkItems--").
  • Fix:

    • Add tests for:
      #[test]
      fn test_escape_wiql_empty() { assert_eq!(escape_wiql(""), ""); }
      #[test]
      fn test_escape_wiql_unicode() { assert_eq!(escape_wiql("test’single"), "test''single"); } // Note: ’ ≠ '
      #[test]
      fn test_escape_wiql_injection() { 
          assert_eq!(escape_wiql("'; DROP TABLE WorkItems--"), "''; DROP TABLE WorkItems--");
      }
      
4.4. Missing documentation
  • Issue: expand_query is used but not defined in this PR. Its behavior (e.g., how many expansions, latency, fallbacks) is unknown.

  • Fix:

    • Add documentation for expand_query (e.g., in query_expansion.rs):
      /// Expands a user query into a list of semantically related queries.
      /// Returns up to 3 queries (original + expansions), sorted by relevance.
      /// Falls back to original query if expansion fails.
      pub fn expand_query(query: &str) -> Vec<String> { ... }
      

Recommendations Before Merge

  1. Critical: Fix escape_wiql to not escape ~, and switch to CONTAINS instead of ~.
  2. High: Replace strip_html_tags with a robust HTML sanitizer.
  3. High: Fix deduplication to use normalized URLs.
  4. Medium: Standardize error handling across search functions.
  5. Medium: Add URL validation for org_url.
  6. Low: Move MAX_EXPANDED_QUERIES to a config constant.
  7. Low: Add more comprehensive tests for escaping.

Final Verdict

Do not merge until critical security and correctness issues are fixed. The query expansion feature is valuable, but the current implementation risks:

  • XSS in search excerpts,
  • WIQL injection (if escape_wiql is misused),
  • Incorrect search results (due to escaping/deduplication bugs).

Once fixed, this PR will significantly improve search quality and security.\n\n---\nthis is an automated review from Ollama

🤖 Automated PR Review:\n\n### Code Review: Query Expansion for Semantic Search in Azure DevOps & Confluence Integrations --- #### **1. Summary** This PR introduces **query expansion** to enhance semantic search across Azure DevOps (Wiki & Work Items) and Confluence. It adds a dependency on the `url` crate, implements query expansion logic, and applies it to both integrations. Key improvements include: - Replacing single queries with multiple expanded variants - Deduplication and sorting of results - Input sanitization (`escape_wiql`, `escape_cql`, `strip_html_tags`) - HTML stripping for excerpts to mitigate XSS The implementation is generally well-structured, but several **critical and high-severity issues** exist—particularly around **security (XSS, injection), correctness (buggy escaping, incomplete sanitization), and maintainability**. --- #### **2. Bugs & Errors** ##### **2.1. `escape_wiql` is incomplete and incorrect for WIQL** - **Issue**: The WIQL escaping function does **not escape the `~` operator** (used for fuzzy search in WIQL), and **does not handle the `CONTAINS`/`~` context correctly**. - In WIQL, `~` is used for *fuzzy matching* (e.g., `Title ~ 'query'`), and **must be escaped when part of the literal string**, but the current function escapes `~` → `\~`, which breaks the query logic. - Worse: The PR uses `~` in the WIQL query (`... ~ '{safe_query}'`), but `~` in WIQL is **not a wildcard**—it’s for fuzzy matching with a similarity threshold. If `safe_query` contains `~`, the query becomes malformed or behaves unexpectedly. - **Example**: `escape_wiql("foo~bar")` → `"foo\~bar"`, but WIQL interprets `\~` as invalid syntax. - **Fix**: - **Do NOT use `~` for literal matching in WIQL**. Use `CONTAINS` with escaped quotes/backslashes only. - Replace: ```rust "SELECT ... WHERE ... ([System.Title] ~ '{safe_query}' OR ...)" ``` with: ```rust "SELECT ... WHERE ([System.Title] CONTAINS '{safe_query}' OR [System.Description] CONTAINS '{safe_query}')" ``` - Then, only escape `'`, `"`, `\`, `(`, `)`, `;`, `=` — but **not `~`**. - Also, ensure `CONTAINS` is used with **single quotes** in WIQL, and escape `'` as `''`. - **Corrected `escape_wiql`** (for `CONTAINS`): ```rust fn escape_wiql(s: &str) -> String { s.replace('\'', "''") // Escape single quotes .replace('\\', "\\\\") // Must be first to avoid double-escaping .replace('"', "\\\"") .replace('(', "\\(") .replace(')', "\\)") .replace(';', "\\;") .replace('=', "\\=") } ``` ##### **2.2. `escape_cql` is missing critical characters** - **Issue**: Confluence CQL requires escaping of many special characters, but `escape_cql` omits: - `*` (wildcard), `?` (single-char wildcard), `!` (negation), `^` (boost), `{`, `}`, `[`, `]`, `:` (field operators), `AND`, `OR`, `NOT` (keywords if unquoted). - **Example**: A user query `"foo*bar"` becomes `text~"foo*bar"` → `*` is a wildcard, potentially matching unintended items. - **Fix**: Use `urlencoding` on the *entire* CQL string, or properly quote and escape the query inside double quotes: ```rust // Better: wrap in double quotes and escape internal quotes/backslashes only fn escape_cql(s: &str) -> String { s.replace('\\', "\\\\") .replace('"', "\\\"") .replace('\n', " ") .replace('\r', " ") } // Then use: text~"{safe_query}" ``` ##### **2.3. `strip_html_tags` is naive and unsafe** - **Issue**: - Does **not handle malformed HTML** (e.g., `<img src=x onerror=alert(1)>` → still returns `onerror=alert(1)` if `>` is missing). - Does **not decode HTML entities** (e.g., `&lt;script&gt;` becomes `&lt;script&gt;`, not `<script>`). - Does **not prevent XSS in attributes** (e.g., `<a href="javascript:alert(1)">` → `href="javascript:alert(1)"` remains in output). - **Result**: Excerpts may contain executable JS if the source HTML is malicious. - **Fix**: - Use a robust HTML parser like [`html5ever`](https://crates.io/crates/html5ever) or [`htmlparser`](https://crates.io/crates/htmlparser). - At minimum, use `html_escape::decode_html_entities(&html)` + a simple tag stripper: ```rust use html_escape::decode_html_entities; fn strip_html_tags(html: &str) -> String { let decoded = decode_html_entities(html); // Then strip tags safely // ... (or use `html5ever` for full sanitization) } ``` - For security-critical UI (e.g., displaying excerpts), consider using `text_content` from `html5ever` or a whitelist-based sanitizer. ##### **2.4. Deduplication uses `.sort_by(|a, b| a.url.cmp(&b.url))` + `.dedup_by(|a, b| a.url == b.url)`** - **Issue**: `a.url.cmp(&b.url)` is lexicographic, but URLs may differ only in fragment/query (e.g., `foo.com#section1` vs `foo.com#section2`). This causes **false deduplication** (same page, different sections treated as duplicates). - Worse: `canonicalize_url` (in `confluence_search.rs`) strips fragments/queries, but `azuredevops_search.rs` does **not** use it for deduplication. - **Fix**: - Use `canonicalize_url` (or a URL normalization function) before deduplication: ```rust all_results.sort_by(|a, b| canonicalize_url(&a.url).cmp(&canonicalize_url(&b.url))); all_results.dedup_by(|a, b| canonicalize_url(&a.url) == canonicalize_url(&b.url)); ``` - Or deduplicate by `(title, url)` or `(source, id)` if available. --- #### **3. Security Issues** ##### **3.1. XSS in Azure DevOps excerpts** - **Issue**: `strip_html_tags` is used *after* extracting `item["content"]`, but: - If `item["content"]` contains `<script>` tags, the current function may still output dangerous content (e.g., `onerror=...` if tags are malformed). - No sanitization of `title`, `path`, or `url` fields (though less critical). - **Fix**: - Sanitize *all* user-provided content before display: ```rust let excerpt = strip_html_tags(item["content"].as_str().unwrap_or("")); let title = strip_html_tags(item["fileName"].as_str().unwrap_or("Untitled")); ``` - Use a proper sanitizer (see `2.3`). ##### **3.2. URL injection risk in `fetch_wiki_page` / `fetch_work_item_details`** - **Issue**: The `org_url` and `project` values are used directly in URL construction. If either is untrusted (e.g., from user input), this could lead to SSRF or redirect attacks. - **Not visible in this diff**, but ensure `org_url` is validated (e.g., only allow known domains, block `file://`, `ftp://`, etc.). - **Fix**: - Validate `org_url` against a whitelist of allowed domains. - Use `Url::parse` + `Url::host_str()` to verify scheme/domain. ##### **3.3. `url` crate dependency added but not used in `azuredevops_search.rs`** - **Issue**: `url = "2.5.8"` is added to `Cargo.toml` and `Cargo.lock`, but only used in `confluence_search.rs`. In `azuredevops_search.rs`, `Url::parse` is not used. - **Fix**: - Move `url` usage to `azuredevops_search.rs` for URL validation (e.g., `Url::parse(&org_url).map_err(...)?`). - Or remove the unused dependency if `confluence_search.rs` is the only consumer. --- #### **4. Best Practices** ##### **4.1. Error handling is inconsistent** - **Issue**: - `search_wiki` returns an error on HTTP failure, but `search_work_items` returns `Ok(Vec::new())` or `continue`s on failure. - Inconsistent behavior: one fails hard, the other silently degrades. - **Fix**: - Standardize error handling: always `continue` on non-critical failures (e.g., 404, 503), but log warnings. - Return `Ok(Vec::new())` only if *all* queries fail (not just one). ##### **4.2. Hardcoded `MAX_EXPANDED_QUERIES = 3`** - **Issue**: Magic number used in two places. No configuration or documentation. - **Fix**: - Move to `config.rs` or `constants.rs`: ```rust pub const MAX_EXPANDED_QUERIES: usize = 3; // or load from env/config ``` - Add a comment explaining the rationale (e.g., "Balances search coverage vs. API quota"). ##### **4.3. Tests for `escape_wiql` are incomplete** - **Issue**: Tests cover basic escaping but miss: - Edge cases: empty string, very long strings, Unicode (e.g., `’` (curly quote) vs `'`). - Real-world WIQL injection attempts (e.g., `"'; DROP TABLE WorkItems--"`). - **Fix**: - Add tests for: ```rust #[test] fn test_escape_wiql_empty() { assert_eq!(escape_wiql(""), ""); } #[test] fn test_escape_wiql_unicode() { assert_eq!(escape_wiql("test’single"), "test''single"); } // Note: ’ ≠ ' #[test] fn test_escape_wiql_injection() { assert_eq!(escape_wiql("'; DROP TABLE WorkItems--"), "''; DROP TABLE WorkItems--"); } ``` ##### **4.4. Missing documentation** - **Issue**: `expand_query` is used but not defined in this PR. Its behavior (e.g., how many expansions, latency, fallbacks) is unknown. - **Fix**: - Add documentation for `expand_query` (e.g., in `query_expansion.rs`): ```rust /// Expands a user query into a list of semantically related queries. /// Returns up to 3 queries (original + expansions), sorted by relevance. /// Falls back to original query if expansion fails. pub fn expand_query(query: &str) -> Vec<String> { ... } ``` --- ### **Recommendations Before Merge** 1. **Critical**: Fix `escape_wiql` to not escape `~`, and switch to `CONTAINS` instead of `~`. 2. **High**: Replace `strip_html_tags` with a robust HTML sanitizer. 3. **High**: Fix deduplication to use normalized URLs. 4. **Medium**: Standardize error handling across search functions. 5. **Medium**: Add URL validation for `org_url`. 6. **Low**: Move `MAX_EXPANDED_QUERIES` to a config constant. 7. **Low**: Add more comprehensive tests for escaping. --- ### **Final Verdict** **Do not merge** until critical security and correctness issues are fixed. The query expansion feature is valuable, but the current implementation risks: - **XSS** in search excerpts, - **WIQL injection** (if `escape_wiql` is misused), - **Incorrect search results** (due to escaping/deduplication bugs). Once fixed, this PR will significantly improve search quality and security.\n\n---\n*this is an automated review from Ollama*
sarman merged commit a9956a16a4 into master 2026-04-15 01:44:43 +00:00
sarman deleted branch feature/integration-search-expansion 2026-04-15 01:44:50 +00:00
Sign in to join this conversation.
No reviewers
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: sarman/tftsr-devops_investigation#44
No description provided.