fix: migrate OpenAI provider to use Responses API #11674
devbyteai wants to merge 6 commits into Significant-Gravitas:dev
Conversation
Updates the OpenAI provider in llm.py to use the newer Responses API (responses.create) instead of the legacy Chat Completions API (chat.completions.create).

Changes:
- Replace chat.completions.create with responses.create
- Use 'input' parameter instead of 'messages'
- Use 'max_output_tokens' instead of 'max_completion_tokens'
- Parse response.output_text instead of choices[0].message.content
- Use input_tokens/output_tokens for token usage tracking
- Add helper functions for extracting reasoning and tool calls from the Responses API response format
- Update tests to mock the new API format

The chat service is not migrated, as it uses OpenRouter, which requires the Chat Completions API format.

Closes Significant-Gravitas#11624
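For illustration, here is a minimal sketch of the parameter and response-field changes with the OpenAI Python SDK (v1.66.0+); the function name and values below are placeholders, not the actual code in llm.py.

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def call_responses_api(model: str, system_prompt: str, messages: list[dict]) -> dict:
    """Sketch of the Chat Completions -> Responses API changes described above."""
    response = await client.responses.create(
        model=model,
        instructions=system_prompt,      # system message moves out of the message list
        input=messages,                  # was `messages=` in chat.completions.create
        max_output_tokens=1024,          # was `max_completion_tokens`
    )
    return {
        "response": response.output_text,                    # was choices[0].message.content
        "prompt_tokens": response.usage.input_tokens,        # was usage.prompt_tokens
        "completion_tokens": response.usage.output_tokens,   # was usage.completion_tokens
    }
```

The same response object also carries `output` items of type `function_call` and `reasoning`, which is what the new helper functions parse.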
Walkthrough

Migrates the OpenAI LLM provider from chat.completions.create to the responses.create API, with helper functions for parsing reasoning and tool calls, updated parameter mappings, and aligned test structures.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: 4 passed, 1 failed (1 warning)
This PR targets the … branch. Automatically setting the base branch to …
Wasn't in my first commit, dunno wtf is up with this.
Changes Summary

This PR migrates the native OpenAI API integration from the deprecated Chat Completions endpoint to the new Responses API (v1.66.0+). It introduces two new helper functions to parse Responses API output, updates the OpenAI provider section to use the new endpoint, and adjusts parameter mapping and response parsing accordingly.

Type: feature
Components Affected: backend.blocks.llm (OpenAI provider), test.test_llm (test mocks), LLM infrastructure
Risk Areas:
- Response field mapping differences: output_text vs message.content, input_tokens/output_tokens vs prompt_tokens/completion_tokens
- Tool call extraction: handling call_id vs id field in Responses API output
- Reasoning extraction: handling both string and array summary formats in reasoning output
- Error handling: new try/catch for openai.APIError (follows the Anthropic pattern)
- System message extraction: moved from messages to the instructions parameter
- Response format parameter mapping: response_format to the text parameter
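To make the tool-call and reasoning risk areas concrete, here is a rough, illustrative sketch of how such helpers could guard against call_id vs id and string vs list summary shapes; it is not the PR's actual implementation.

```python
import json
from typing import Any


def extract_responses_api_tool_calls(response: Any) -> list[dict] | None:
    """Collect function calls from Responses API output items (illustrative sketch)."""
    tool_calls = []
    for item in getattr(response, "output", []) or []:
        if getattr(item, "type", None) == "function_call":
            tool_calls.append({
                # Responses API uses `call_id`; fall back to `id` just in case.
                "id": getattr(item, "call_id", None) or getattr(item, "id", None),
                "name": item.name,
                "arguments": json.loads(item.arguments or "{}"),
            })
    return tool_calls or None


def extract_responses_api_reasoning(response: Any) -> str | None:
    """Join reasoning summaries, tolerating both string and list formats (illustrative sketch)."""
    parts = []
    for item in getattr(response, "output", []) or []:
        if getattr(item, "type", None) == "reasoning":
            summary = getattr(item, "summary", None)
            if isinstance(summary, str):
                parts.append(summary)
            elif isinstance(summary, list):
                parts.extend(getattr(s, "text", str(s)) for s in summary)
    return "\n".join(parts) or None
```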
Review Summary
Validated 78 issues: 42 kept, 36 filtered. Issues found: 42. 💬 See 17 individual line comments for details. 📊 23 unique issue types across 42 locations.

📋 Full issue list

🔴 CRITICAL - Array bounds check missing for response.choices[0] (3 occurrences)
Agent: bugs | Category: bug
Why this matters: Accessing empty arrays throws IndexError/undefined access.

🔴 CRITICAL - Redundant Optional usage with union type syntax
Agent: python | Category: quality
Why this matters: Type hints enable IDE autocomplete and catch type errors early.
Description: Line 326 uses both 'Optional[List[ToolContentBlock]]' and '| None' on the same field, creating redundancy and confusion. Optional[X] is equivalent to X | None.
Suggestion: Change to: tool_calls: list[ToolContentBlock] | None = None
Confidence: 98%

🟠 HIGH - Redundant and illogical None comparison (2 occurrences)
Agent: python | Category: bug
Why this matters: Redundant None check indicates a logic error or dead branch.
🟠 HIGH - Sequential async calls in loop instead of parallel gathering
Agent: performance | Category: performance
Why this matters: N+1 queries cause severe performance degradation.
Description: The _run method in AITextSummarizerBlock processes chunks sequentially using a for loop with await on each chunk. With multiple chunks, this becomes O(n) serialized operations when they could be parallelized.
Suggestion: Use asyncio.gather() or asyncio.TaskGroup to summarize all chunks concurrently, reducing wall-clock time from O(n*t) to O(t) (see the sketch after this list).
Confidence: 88%

🟠 HIGH - Inconsistent mock setup for async HTTP calls (2 occurrences)
Agent: microservices | Category: bug
Why this matters: Live I/O introduces slowness, nondeterminism, and external failures unrelated to the code.

🟠 HIGH - Error logged without sufficient context for debugging
Agent: bugs | Category: bug
Why this matters: Context-free errors are impossible to trace in production logs.
Description: When catching openai.APIError, the error message only includes the error text but lacks context about what model was being called, the input parameters, or the operation context. This makes debugging and monitoring difficult in production logs.
Suggestion: Include relevant context in the error message: add the llm_model value, a summary of input messages, and operation context. Example: f'OpenAI Responses API error for model {llm_model.value}: {str(e)}'
Confidence: 72%

🟠 HIGH - God Function with 329 lines handling 8 different LLM providers
Agent: architecture | Category: quality
Why this matters: Framework coupling makes code harder to test and migrate.
Description: The llm_call function is a monolithic function with completely separate logic paths for 8 LLM providers. Each provider has its own client instantiation, request building, response parsing, error handling, and token counting. This violates the Single Responsibility Principle.
Suggestion: Refactor using the Strategy pattern: create an abstract LLMProvider base class with provider-specific implementations. Use a factory to instantiate the correct provider based on model metadata.
Confidence: 75%

🟠 HIGH - Expensive regex substitution on every parse failure
Agent: performance | Category: performance
Description: Line 1048 executes re.sub(r'[A-Za-z0-9]', '*', response_text) to censor responses for logging on every parse failure. The regex is recompiled each time and operates on potentially large text.
Suggestion: Pre-compile the regex at module level: CENSOR_PATTERN = re.compile(r'[A-Za-z0-9]'). Better: only censor a fixed-size snippet instead of the entire response (see the sketch after these lists).
Confidence: 85%

🟠 HIGH - Missing timeout on external service call
Agent: python | Category: bug
Description: The oai_client.responses.create() call at line 525 has no explicit timeout parameter. Compare with the anthropic client at line 575, which properly includes timeout=600.
Suggestion: Add a timeout parameter to the responses_params dictionary before making the call, similar to the Anthropic call.
Confidence: 90%

🟠 HIGH - Incomplete error handling for external service call
Agent: python | Category: bug
Description: The oai_client.responses.create() call only catches openai.APIError. Other exceptions like asyncio.TimeoutError and ConnectionError are not caught.
Suggestion: Expand the try-except to catch additional exceptions: asyncio.TimeoutError, ConnectionError, etc.
Confidence: 80%

🟠 HIGH - Overly broad exception handling
Agent: python | Category: quality
Description: A bare Exception clause catches all exceptions, including SystemExit and KeyboardInterrupt. It should catch specific exception types.
Suggestion: Replace 'except Exception as e' with specific exception types like (openai.APIError, anthropic.APIError, ValueError).
Confidence: 90%

🟠 HIGH - Input parameter 'input_data' is modified in-place (2 occurrences)
Agent: python | Category: quality
🟠 HIGH - Duplicate import in function body (5 occurrences)
Agent: python | Category: quality

🟡 MEDIUM - Test mocks internal method - limited integration coverage (2 occurrences)
Agent: testing | Category: quality
Why this matters: Improper mocks make tests brittle and unreliable.

🟡 MEDIUM - Missing return type annotation (2 occurrences)
Agent: python | Category: quality
Why this matters: Type hints prevent whole classes of bugs and improve IDE/refactor support.

🟡 MEDIUM - Double loop through response.content blocks (2 occurrences)
Agent: performance | Category: performance
🟡 MEDIUM - Debug print statements in test code
Agent: python | Category: quality
Description: Debug print() statements left in a test function. Should be removed or replaced with logging.
Suggestion: Replace with logger.debug() calls or use pytest's caplog fixture, or remove if no longer needed.
Confidence: 85%

🟡 MEDIUM - Test assertion comments contradict expected behavior
Agent: refactoring | Category: quality
Description: Line 187 states 'llm_call_count is only set on success, so it shows 1' but line 190 asserts llm_call_count == 2. The comment explanation is misleading - it actually should be retry_count + 1 = 2, as stated in line 190's inline comment.
Suggestion: Remove or correct the misleading comment at line 187. The assertion at line 190 with its inline comment 'retry_count + 1 = 1 + 1 = 2' is correct, but line 187 contradicts this.
Confidence: 75%

🟡 MEDIUM - User input formatted into prompts via Jinja2
Agent: security | Category: security
Description: User-supplied prompt_values are formatted into system and user prompts using Jinja2 SandboxedEnvironment (via fmt.format_string()). While sandboxed, autoescape=False at initialization allows potential content manipulation in prompts.
Suggestion: Consider enabling autoescape for the TextFormatter used in LLM blocks, or validate that prompt_values don't contain suspicious patterns. The SandboxedEnvironment mitigates template injection but doesn't prevent prompt manipulation.
Confidence: 65%

🟡 MEDIUM - User input directly embedded into LLM prompts
Agent: security | Category: security
Description: User-supplied input_data.focus and input_data.source_data are directly embedded into prompts using f-strings. While this is common for LLM applications, sensitive data could be sent to external providers.
Suggestion: Consider implementing PII detection or data sanitization for sensitive patterns before embedding user input into prompts sent to external LLM providers.
Confidence: 62%

🟡 MEDIUM - Missing Input Validation for max_tokens Parameter (2 occurrences)
Agent: security | Category: security

🟡 MEDIUM - OpenAI Responses API Call Without Explicit Timeout (6 occurrences)
Agent: security | Category: security

🟡 MEDIUM - Test assertions too vague to catch bugs (2 occurrences)
Agent: testing | Category: quality
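As referenced in the "Sequential async calls in loop" entry above, here is a minimal sketch of the suggested asyncio.gather() pattern, using a stand-in summarize_chunk coroutine rather than the block's real method:

```python
import asyncio


async def summarize_chunk(chunk: str) -> str:
    # Placeholder for the per-chunk LLM call made by the summarizer block.
    await asyncio.sleep(0)  # simulate I/O
    return chunk[:50]


async def summarize_all(chunks: list[str]) -> list[str]:
    # Sequential version: total time is roughly the sum of each call's latency.
    #   return [await summarize_chunk(c) for c in chunks]
    # Parallel version: calls run concurrently, so wall-clock time is roughly one call's latency.
    return await asyncio.gather(*(summarize_chunk(c) for c in chunks))


if __name__ == "__main__":
    print(asyncio.run(summarize_all(["first chunk", "second chunk"])))
```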
ℹ️ 25 issues outside PR diff

🔴 CRITICAL - Redundant Optional usage with union type syntax (details as above)

🟠 HIGH - Redundant and illogical None comparison (2 occurrences, details as above)

🟠 HIGH - Sequential async calls in loop instead of parallel gathering (details as above)

🟠 HIGH - Expensive regex substitution on every parse failure (details as above)

🟠 HIGH - Overly broad exception handling (details as above)

🟠 HIGH - Input parameter 'input_data' is modified in-place (2 occurrences, details as above)
🟡 MEDIUM - Test mocks internal method - limited integration coverage (2 occurrences, details as above)

🟡 MEDIUM - Missing type annotation for error parameter
Agent: python | Category: quality
Why this matters: Type hints prevent whole classes of bugs and improve IDE/refactor support.
Description: Parameter 'error' in method 'invalid_response_feedback' lacks a type annotation. This method accepts different error types and should document them.
Suggestion: Add a type annotation: 'def invalid_response_feedback(self, error: Union[ValueError, JSONDecodeError, str], ...) -> str:'
Confidence: 70%

🟡 MEDIUM - Double loop through response.content blocks
Agent: performance | Category: performance
Description: The function loops through resp.content twice: lines 582-597 for tool_use and lines 605-608 for thinking type extraction.
Suggestion: Combine into a single iteration through resp.content, checking for both types in one pass.
Confidence: 70%

🟡 MEDIUM - Import inside function body (3 occurrences)
Agent: python | Category: quality

🟡 MEDIUM - Test assertion comments contradict expected behavior (details as above)

🟡 MEDIUM - User input formatted into prompts via Jinja2 (details as above)

🟡 MEDIUM - User input directly embedded into LLM prompts (details as above)

🟡 MEDIUM - Missing Range Validation for chunk_overlap Parameter
Agent: security | Category: security
Description: chunk_overlap has ge=0 but no upper bound. No cross-field validation ensures overlap < max_tokens, which could create invalid chunk configurations.
Suggestion: Add validation to ensure chunk_overlap < max_tokens to prevent invalid chunking behavior.
Confidence: 70%

🟡 MEDIUM - Groq API Call Without Explicit Timeout (5 occurrences)
Agent: security | Category: security

🟡 MEDIUM - Test assertions too vague to catch bugs
Agent: testing | Category: quality
Description: The test uses assertions like …
Suggestion: Replace …
Confidence: 75%
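For the "Expensive regex substitution on every parse failure" entry flagged in both lists, a small sketch of the suggested module-level pattern plus a bounded snippet; the constant names are hypothetical:

```python
import re

# Compile once at import time instead of on every parse failure.
CENSOR_PATTERN = re.compile(r"[A-Za-z0-9]")
MAX_SNIPPET = 500  # only log a bounded snippet of the response


def censor_for_logging(response_text: str) -> str:
    """Mask alphanumerics in a truncated snippet before logging."""
    return CENSOR_PATTERN.sub("*", response_text[:MAX_SNIPPET])


print(censor_for_logging("secret answer 42"))  # -> "****** ****** **"
```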
- Use getattr with fallback for tool call ID extraction
- Add return type annotation to get_parallel_tool_calls_param
- Add timeout (600s) to OpenAI Responses API call
- Add model context to error messages
- Handle TimeoutError in addition to APIError
- Remove duplicate imports in test file
- Remove debug print statements in test file
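A hedged sketch of the timeout and error-context changes this commit describes; the wrapper name, the ValueError re-raise, and the exact message wording are assumptions:

```python
import openai
from openai import AsyncOpenAI

client = AsyncOpenAI()


async def create_response_with_guardrails(model: str, responses_params: dict):
    """Call the Responses API with a timeout and model context in error messages (sketch)."""
    try:
        return await client.responses.create(
            model=model,
            timeout=600,  # mirror the 600s timeout already used for the Anthropic client
            **responses_params,
        )
    except (openai.APIError, TimeoutError) as e:
        # Add the model to the message so production logs can be traced back to a specific call.
        raise ValueError(f"OpenAI Responses API error for model {model}: {e}") from e
```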
The raw_response field is used by smart_decision_maker.py for conversation history. It expects a message dict with 'role' and 'content' keys, not the raw Response object.

- Construct message dict with role='assistant' and content
- Include tool_calls in OpenAI format when present
- Fixes multi-turn conversation and retry logic
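A minimal sketch of the assistant message dict described in this commit; the helper name is hypothetical, and the tool_calls shape follows the standard OpenAI Chat Completions format:

```python
def build_raw_response_message(output_text: str, tool_calls: list[dict] | None) -> dict:
    """Build the dict expected by smart_decision_maker.py's conversation history."""
    message: dict = {"role": "assistant", "content": output_text}
    if tool_calls:
        # Re-encode tool calls in the Chat Completions style the caller expects.
        message["tool_calls"] = [
            {
                "id": call["id"],
                "type": "function",
                "function": {"name": call["name"], "arguments": call["arguments"]},
            }
            for call in tool_calls
        ]
    return message


print(build_raw_response_message("Done.", [{"id": "call_1", "name": "lookup", "arguments": "{}"}]))
```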
3. [OpenAI Reasoning Docs](https://platform.openai.com/docs/guides/reasoning)
4. [Simon Willison's Comparison](https://simonwillison.net/2025/Mar/11/responses-vs-chat-completions/)
5. [OpenAI Python SDK v1.66.0 Release](https://github.com/openai/openai-python/releases/tag/v1.66.0)
This pull request has conflicts with the base branch; please resolve them so we can evaluate the pull request.
Summary
Migrates OpenAI native API calls from the deprecated `chat.completions.create` endpoint to the new `responses.create` endpoint, as recommended by OpenAI.

Fixes #11624
Changes
Core Changes
- The OpenAI provider in `llm.py` now uses `client.responses.create()`
- New `extract_responses_api_reasoning()` helper to parse reasoning output (handles both string and array summary formats)
- New `extract_responses_api_tool_calls()` helper to parse function calls
- System messages are passed via the `instructions` parameter (Responses API requirement)

Parameter Mapping (Chat Completions → Responses API)
- `messages` → `input` (non-system messages only)
- System messages → `instructions` parameter
- `max_completion_tokens` → `max_output_tokens`
- `response_format={...}` → `text={"format": {...}}`
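As a worked example of the mapping above, here is how a JSON-output request might translate at the parameter-dict level; the concrete prompt and token values are illustrative:

```python
# Chat Completions style (before):
chat_params = {
    "messages": [
        {"role": "system", "content": "Reply in JSON."},
        {"role": "user", "content": "List three colors."},
    ],
    "max_completion_tokens": 256,
    "response_format": {"type": "json_object"},
}

# Responses API style (after), following the mapping above:
responses_params = {
    "instructions": "Reply in JSON.",  # system message pulled out of the message list
    "input": [{"role": "user", "content": "List three colors."}],  # non-system messages only
    "max_output_tokens": 256,
    "text": {"format": {"type": "json_object"}},  # was response_format
}
```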
Response Parsing (Chat Completions → Responses API)

- `choices[0].message.content` → `output_text`
- `usage.prompt_tokens` → `usage.input_tokens`
- `usage.completion_tokens` → `usage.output_tokens`
- `choices[0].message.tool_calls` → `output` items with `type="function_call"`

Compatibility
SDK Version
API Compatibility
- `llm_call()` function signature - UNCHANGED
- `LLMResponse` class structure - UNCHANGED

Provider Impact
- `openai` - YES, modified (native OpenAI - uses Responses API)
- `anthropic` - NO (different SDK entirely)
- `groq` - NO (third-party API, Chat Completions compatible)
- `open_router` - NO (third-party API, Chat Completions compatible)
- `llama_api` - NO (third-party API, Chat Completions compatible)
- `ollama` - NO (uses the ollama SDK)
- `aiml_api` - NO (third-party API, Chat Completions compatible)
- `v0` - NO (third-party API, Chat Completions compatible)

Dependent Blocks Verified
- `smart_decision_maker.py` (Line 508) - uses: response, tool_calls, prompt_tokens, completion_tokens, reasoning - COMPATIBLE
- `ai_condition.py` (Line 113) - uses: response, prompt_tokens, completion_tokens, prompt - COMPATIBLE
- `perplexity.py` - does not use llm_call (uses a different API) - NOT AFFECTED

Streaming Service
`backend/server/v2/chat/service.py` is NOT affected - it uses OpenRouter by default, which requires the Chat Completions API format.

Testing
Test File Updates
- Updated `test_llm.py` mocks to use `output_text` instead of `choices[0].message.content`
- Mock `output` array for tool calls
- Mock `usage.input_tokens` / `usage.output_tokens`
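A hedged sketch of what a Responses-API-shaped mock might look like with `unittest.mock`; attribute names follow the bullets above, and the fixture is illustrative rather than copied from test_llm.py:

```python
from unittest.mock import AsyncMock, MagicMock


def make_mock_response(text: str = "Hello!") -> MagicMock:
    """Build an object shaped like a Responses API result for test mocks."""
    response = MagicMock()
    response.output_text = text        # replaces choices[0].message.content
    response.output = []               # function_call / reasoning items would go here
    response.usage.input_tokens = 10   # replaces usage.prompt_tokens
    response.usage.output_tokens = 5   # replaces usage.completion_tokens
    return response


mock_client = MagicMock()
mock_client.responses.create = AsyncMock(return_value=make_mock_response())
```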
Verification Performed

- Tool call ID extraction uses `call_id` with a fallback to `id`
- Reasoning extraction handles both string and array `summary` formats

Recommended Manual Testing
- JSON output mode (`force_json_output=True`)

Files Modified
1. `autogpt_platform/backend/backend/blocks/llm.py`
   - `extract_responses_api_reasoning()` helper
   - `extract_responses_api_tool_calls()` helper
   - OpenAI provider now calls `responses.create`
   - System messages passed via the `instructions` parameter
2. `autogpt_platform/backend/backend/blocks/test/test_llm.py`

References
Checklist
Changes
Code Quality