Skip to content

Conversation

@TheFermiSea
Copy link

Fix: Apply HTTP client timeouts to prevent infinite hangs

Summary

Fixes a critical bug where octocode's GraphRAG indexing process hangs indefinitely during LLM API calls. The root cause is that reqwest::Client instances are created using Client::new() instead of the builder pattern, which prevents the configured batch_timeout_seconds from being applied.

Problem

When using LLM-powered GraphRAG features (use_llm = true), the indexing process hangs indefinitely at two points:

  1. File description generation (with ai_batch_size > 1)
  2. Architectural relationship extraction (always processes all files in one batch)

The configuration parameter graphrag.llm.batch_timeout_seconds is loaded but never applied to the HTTP client, causing requests to wait indefinitely when the LLM provider is slow to respond.

Root Cause Analysis

Primary Bug (src/indexer/graphrag/builder.rs:74)

// BEFORE (buggy):
let client = Client::new();

// AFTER (fixed):
let client = Client::builder()
    .timeout(std::time::Duration::from_secs(
        config.graphrag.llm.batch_timeout_seconds,
    ))
    .build()?;

Secondary Bug (src/embedding/provider/huggingface.rs:460)

// BEFORE (buggy):
let client = reqwest::Client::new();

// AFTER (fixed):
let client = reqwest::Client::builder()
    .timeout(std::time::Duration::from_secs(30))
    .build()?;

Reproduction

  1. Configure octocode with use_llm = true and ai_batch_size > 1
  2. Run octocode index on any codebase
  3. Process hangs at "AI analyzing X files for architectural relationships" with infinite spinner
  4. No timeout occurs even after configured batch_timeout_seconds expires

Testing

Before Fix

  • Both GPT-4.1-mini (default) and GPT-5-mini exhibited infinite hangs
  • Workaround with ai_batch_size = 1 only helped description phase
  • Relationship extraction always hung (processes 72 files in single batch)

After Fix

  • Timeouts properly trigger after configured duration
  • Failed requests return error messages instead of hanging forever
  • Large batch processing completes or fails gracefully

Impact

This bug made LLM-powered GraphRAG features unusable for any non-trivial codebase. The fix enables:

  • Reliable timeout behavior for all LLM API calls
  • Proper error handling and recovery
  • Predictable indexing duration

Related Documentation

Full technical analysis available at: octocode_timeout_analysis.md

Code Quality Improvements

In addition to the core timeout fixes, this PR includes:

Enhanced Error Handling

  • Added .context() error messages to both HTTP client build failures
  • GraphRAG client: "Failed to create HTTP client for LLM API calls"
  • HuggingFace client: "Failed to create HTTP client for HuggingFace downloads"

Documentation Comments

// IMPORTANT: Must use builder pattern with timeout to prevent infinite hangs
// when LLM API calls take too long. Client::new() does not apply timeouts.

These comments at both fix locations help prevent future regressions.

Code Verification

  • Zero clippy warnings
  • All existing Client::new() uses verified (3 instances in commands/ use request-level timeouts correctly)
  • No unwrap() issues in production code
  • Consistent error handling patterns

Technical Details

Why Client::new() Doesn't Apply Timeouts

The reqwest::Client::new() convenience method creates a client with default settings that do not include any timeout. To apply a timeout, you must use the builder pattern:

// ❌ WRONG - no timeout applied
let client = Client::new();

// ✅ CORRECT - timeout is applied
let client = Client::builder()
    .timeout(Duration::from_secs(120))
    .build()?;

Request-Level vs Client-Level Timeouts

This PR uses client-level timeouts. An alternative approach is request-level timeouts:

// Also valid, but less convenient for repeated requests
let client = Client::new();
let response = client.get(url)
    .timeout(Duration::from_secs(120))
    .send()
    .await?;

The codebase uses both patterns appropriately:

  • Client-level (this PR): For GraphRAG batch operations and HuggingFace downloads
  • Request-level: In src/commands/ where each request may need different timeouts

Test Results

Unit Tests

cargo test --all-features

Results: 93 passed, 3 failed

The 3 failures are unrelated FastEmbed lock acquisition issues:

  • test_fastembed_provider_creation
  • test_fastembed_model_validation
  • test_fastembed_embedding_generation

These failures are environmental (file lock contention in local cache directory), not caused by the timeout fix changes.

Integration Testing

Real-world test on rust-daq codebase (113 files):

  • File indexing: ✅ 113/113 files processed
  • GraphRAG blocks: ✅ 896 blocks created
  • Relationship extraction: ✅ Timeout triggered after 120s (expected behavior)
  • Before fix: Infinite hang
  • After fix: Graceful timeout with error message

Configuration Tested

[graphrag.llm]
batch_timeout_seconds = 120
ai_batch_size = 10
description_model = "openai/gpt-4.1-mini"
relationship_model = "google/gemini-2.0-flash-001"

Migration Guide for Developers

If you're creating new HTTP clients in octocode:

For LLM/API Calls

use reqwest::Client;
use anyhow::Context;

let client = Client::builder()
    .timeout(std::time::Duration::from_secs(
        config.graphrag.llm.batch_timeout_seconds,
    ))
    .build()
    .context("Failed to create HTTP client")?;

For File Downloads

let client = Client::builder()
    .timeout(std::time::Duration::from_secs(30))
    .build()
    .context("Failed to create HTTP client")?;

Don't Use Client::new()

Unless you have a specific reason to avoid timeouts (rare), always use the builder pattern.

Related Issues

This fix resolves the core issue documented in:

  • octocode_timeout_analysis.md (comprehensive technical analysis)
  • User reports of infinite hangs during GraphRAG indexing

Checklist

  • Code changes tested locally
  • Both timeout bugs fixed (GraphRAG + HuggingFace)
  • No breaking changes to API
  • Enhanced error messages added
  • Documentation comments added
  • Code quality verified (clippy clean)
  • Integration tested on real codebase
  • Unit tests passing (93/96 passed, 3 unrelated FastEmbed lock failures)
  • Ready for upstream PR

- Fix builder.rs:74: Use Client::builder() with batch_timeout_seconds
- Fix huggingface.rs:460: Use Client::builder() with 30s timeout
- Prevents infinite hangs during LLM API calls and model downloads

Resolves timeout bug documented in octocode_timeout_analysis.md
…ixes

Enhanced the timeout bug fixes with:
- Better error messages using .context() for HTTP client build failures
- Documentation comments explaining why builder pattern is required
- Clear warnings that Client::new() does not apply timeouts

These improvements help prevent future regressions and make the code
more maintainable by documenting the critical timeout requirement.

Related to the fix for infinite hangs during LLM GraphRAG operations.
Added two documentation files to support PR submission:

1. PR_DESCRIPTION.md
   - Detailed technical explanation of the bug and fix
   - Code quality improvements section
   - Integration test results from rust-daq codebase
   - Migration guide for developers
   - Comprehensive checklist

2. TESTING.md
   - Complete testing procedures (unit, integration, regression)
   - Manual testing checklist
   - Real-world test results
   - Troubleshooting guide
   - CI/CD recommendations

These documents provide all information needed for PR review and merge.
Ready for upstream submission.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant