
test: add e2e tests for responses api #1983

Open

rcogal wants to merge 4 commits into main from test/responses-api-e2e

Conversation

@rcogal
Collaborator

@rcogal rcogal commented Apr 7, 2026

Summary

  • Adds apps/gateway/src/responses.e2e.ts with basic end-to-end coverage for POST /v1/responses (the OpenAI-compatible Responses API)
  • Three test cases parameterized over openai/gpt-4o-mini and anthropic/claude-haiku-4-5:
    1. single-turn — basic request, asserts resp_* id, output text, usage tokens, log persisted
    2. multi-turn — uses previous_response_id to verify the gateway reconstructs prior context (model recalls a name across turns)
    3. tool calls — full round-trip: model emits a function_call, we send back a function_call_output via previous_response_id, assert the final answer uses the tool result
  • Addresses Luca's feedback that the Responses API needs basic e2e coverage for multi-turn conversations and tool calls

Test plan

  • All 8 tests pass locally (~6s) against both models
  • CI e2e workflow picks up the new file automatically (matches **/*.e2e.ts glob, no workflow changes needed)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added end-to-end tests for the Responses endpoint covering single-turn and multi-turn conversations, tool-invocation flows, response formatting, token usage reporting, and request/response linkage across turns.
  • Bug Fixes
    • Prevented creation of empty assistant messages when tool responses include empty content.
  • Improvements
    • Folded consecutive tool-call items and any immediately following assistant text into a single assistant message for cleaner, more natural output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

Walkthrough

Adds a new Vitest end-to-end test suite for POST /v1/responses and updates Responses↔Chat conversion logic: (1) avoid emitting empty assistant message content when content is blank, and (2) fold consecutive function_call items with trailing assistant message text into a single assistant message.

Changes

Cohort / File(s): Summary

  • E2E Test Suite (apps/gateway/src/responses.e2e.ts): New Vitest e2e tests exercising POST /v1/responses: env/app bootstrap, shared e2e hooks, a postResponses helper, output/function_call extractors, and test groups for smoke, single-turn, multi-turn (previous_response_id), and tool-call flows (including function_call and function_call_output assertions).
  • Convert chat -> responses (apps/gateway/src/responses/tools/convert-chat-to-responses.ts): Only emit an assistant message output item when message.content is non-null and contains non-whitespace characters (trim() !== ""); prevents adding empty assistant message items for tool-call responses.
  • Convert responses -> chat (apps/gateway/src/responses/tools/convert-responses-to-chat.ts): Aggregate consecutive function_call/tool-call items into a single assistant message and fold any immediately following assistant message text into that message. Adds a local helper, extractTextFromContent, to concatenate content parts.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Gateway as Gateway (app)
    participant Provider as Model Provider
    participant Tool as Tool/Adapter
    participant Log as Logging

    Client->>Gateway: POST /v1/responses (input, tools?, headers)
    Gateway->>Log: log request id
    Gateway->>Provider: forward request (select model)
    Provider-->>Gateway: response (may include tool_call / function_call)
    Gateway->>Tool: invoke tool if tool_call present
    Tool-->>Gateway: tool result (function_call_output)
    Gateway->>Gateway: convert Responses↔Chat (aggregate function_call + assistant text; omit empty assistant content)
    Gateway->>Log: validate/log response by request id
    Gateway-->>Client: 200 OK (response payload)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested labels

auto-merge

Suggested reviewers

  • smakosh
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 57.14%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'test: add e2e tests for responses api' directly and clearly describes the main change: adding new end-to-end tests for the responses API.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
apps/gateway/src/responses.e2e.ts (2)

68-70: Remove the empty smoke test to keep signal-to-noise high.

test("empty") doesn’t validate behavior and adds runtime/maintenance noise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/responses.e2e.ts` around lines 68 - 70, Remove the useless
smoke test named test("empty") from apps/gateway/src/responses.e2e.ts: locate
the test block with the literal test("empty", ...) and delete it (or replace it
with a meaningful assertion for responses e2e if coverage is required) so the
test suite no longer contains a no-op test that adds noise.

112-245: Extend log assertions to multi-turn and tool-call flows.

Only single-turn asserts persisted logging. Adding validateLogByRequestId for both requests in multi-turn/tool-call flows would make the new coverage more complete for /v1/responses.

Proposed diff
```diff
@@
 			expect(secondRes.status).toBe(200);
 			const text = getOutputText(secondJson);
 			expect(text.toLowerCase()).toContain("ada");
+			await validateLogByRequestId(firstRequestId);
+			await validateLogByRequestId(secondRequestId);
 		},
 	);
@@
 			expect(
 				finalText.includes("sunny") ||
 					finalText.includes("72") ||
 					finalText.includes("weather"),
 			).toBe(true);
+			await validateLogByRequestId(firstRequestId);
+			await validateLogByRequestId(secondRequestId);
 		},
 	);
 });
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/responses.e2e.ts` around lines 112 - 245, The tests
"responses multi-turn $model" and "responses tool calls $model" currently only
assert persisted logging for the final request; update both tests to call
validateLogByRequestId for each requestId (firstRequestId and secondRequestId)
after receiving and asserting the corresponding response. Specifically, in the
multi-turn test add validateLogByRequestId(firstRequestId) after asserting
firstRes and validateLogByRequestId(secondRequestId) after asserting secondRes;
in the tool-call test add validateLogByRequestId(firstRequestId) after the
firstRes assertions and validateLogByRequestId(secondRequestId) after the
secondRes assertions. Use the existing helpers generateTestRequestId,
postResponses and validateLogByRequestId to locate where to insert these checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3fe31f2d-ec6d-47dd-8be8-eed0825e6c4e

📥 Commits

Reviewing files that changed from the base of the PR and between 0b74294 and f9fde80.

📒 Files selected for processing (1)
  • apps/gateway/src/responses.e2e.ts

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
apps/gateway/src/responses.e2e.ts (2)

84-86: Remove the placeholder empty test.

This assertion doesn’t validate gateway behavior and adds noise to E2E results.

🧹 Proposed cleanup
```diff
-	test("empty", () => {
-		expect(true).toBe(true);
-	});
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/responses.e2e.ts` around lines 84 - 86, Remove the
placeholder test named "empty" in responses.e2e.ts: locate the test("empty",
...) block and delete it so the E2E suite contains only meaningful assertions;
ensure no other tests depend on this placeholder and run the test suite to
confirm no references remain to the "empty" test.

128-170: Add log-persistence assertions to multi-turn and tool-call flows too.

Single-turn already verifies logging; applying the same check here will catch regressions in chained/tool execution paths.

📌 Proposed additions
```diff
 			expect(secondRes.status).toBe(200);
 			const text = getOutputText(secondJson);
 			expect(text.toLowerCase()).toContain("ada");
+
+			await validateLogByRequestId(firstRequestId);
+			await validateLogByRequestId(secondRequestId);
 		},
 	);
@@
 			expect(
 				finalText.includes("sunny") ||
 					finalText.includes("72") ||
 					finalText.includes("weather"),
 			).toBe(true);
+
+			await validateLogByRequestId(firstRequestId);
+			await validateLogByRequestId(secondRequestId);
 		},
 	);
```
Also applies to: 173-260

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/responses.e2e.ts` around lines 128 - 170, The multi-turn
test (test.each over responsesTestModels) is missing the same log-persistence
assertions used in the single-turn test; add the identical checks after the
first and second response handling (i.e., after postResponses calls for firstRes
and secondRes) to verify the request/response were persisted to logs. Locate the
multi-turn flow around test.each, postResponses, generateTestRequestId and
getOutputText and mirror the single-turn assertions (check stored log entries or
API log endpoint and verify fields like request_id/response_id, model,
input/output text) so both turns and tool-call flows assert persisted logs just
like the single-turn test. Ensure the same helper/assertion utilities used in
single-turn are reused to keep behavior consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 042a7285-a41f-4b11-aed4-b0e49867bca5

📥 Commits

Reviewing files that changed from the base of the PR and between 22176ad and 7ae94b9.

📒 Files selected for processing (1)
  • apps/gateway/src/responses.e2e.ts

Two bugs in the Responses API conversion layer caused multi-turn tool-call
flows to fail on strict providers (deepseek, aws-bedrock, kimi, novita,
alibaba) with "assistant message with tool_calls must be followed by tool
messages":

1. convert-chat-to-responses emitted an empty `message` output item when
   providers returned content: "" alongside tool_calls. On replay this
   became a stray assistant message that separated tool_calls from the
   tool result.

2. convert-responses-to-chat split tool_calls and accompanying assistant
   text into two separate chat messages. They must live on the same
   assistant message in chat completions; otherwise the tool result no
   longer immediately follows the tool_calls assistant.

Also denylists bytedance/gpt-oss-120b in the e2e tool-call test: that
adapter does not emit stable tool_call ids across requests, which is a
separate provider-level issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
apps/gateway/src/responses.e2e.ts (1)

41-43: Filter the denylist before sampling one model per provider.

With the current order, a denylisted first model removes that provider from this suite entirely instead of falling back to another eligible model later in toolCallModels. Filtering first keeps the intended “one eligible model per provider” behavior stable as the list evolves.

♻️ Possible change
```diff
-const responsesToolCallModels = oneModelPerProvider(toolCallModels).filter(
-	(m) => !TOOL_CALL_DENYLIST.has(m.model),
-);
+const responsesToolCallModels = oneModelPerProvider(
+	toolCallModels.filter((m) => !TOOL_CALL_DENYLIST.has(m.model)),
+);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/responses.e2e.ts` around lines 41 - 43, The current code
calls oneModelPerProvider(toolCallModels) then filters out denylisted models,
which can drop a provider entirely if its first sampled model is denylisted;
instead filter toolCallModels by TOOL_CALL_DENYLIST first and then call
oneModelPerProvider so each provider is sampled from eligible models. Update the
expression that constructs responsesToolCallModels to pass oneModelPerProvider a
pre-filtered array (use toolCallModels.filter(m =>
!TOOL_CALL_DENYLIST.has(m.model))) so the sampling happens only over allowed
models.
apps/gateway/src/responses/tools/convert-chat-to-responses.ts (1)

115-125: Please pin this branch with a direct regression test.

The code change looks right, but the new E2E suite does not guarantee a provider will emit the exact content: "" + tool_calls shape that broke replay here. A small unit test around convertChatResponseToResponses() would make this regression deterministic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/responses/tools/convert-chat-to-responses.ts` around lines
115 - 125, Add a deterministic unit test to pin this regression by exercising
convertChatResponseToResponses(): construct a chat response where an assistant
message has content: "" (or whitespace) alongside tool_calls and assert that the
resulting responses do NOT include an empty assistant message item (i.e., the
empty/whitespace content is skipped) and that tool_calls and tool result linkage
(previous_response_id behavior) remain intact; ensure the test covers both empty
string and whitespace-only content cases so future E2E changes cannot
reintroduce the bug.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ba29803f-3353-4fbb-8758-a931027e95ec

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae94b9 and 99c78f1.

📒 Files selected for processing (3)
  • apps/gateway/src/responses.e2e.ts
  • apps/gateway/src/responses/tools/convert-chat-to-responses.ts
  • apps/gateway/src/responses/tools/convert-responses-to-chat.ts
✅ Files skipped from review due to trivial changes (1)
  • apps/gateway/src/responses/tools/convert-responses-to-chat.ts

Comment on lines +227 to +246

```typescript
expect(firstRes.status).toBe(200);
const fnCall = getFunctionCall(firstJson);
expect(fnCall).toBeDefined();
expect(fnCall?.name).toBe("get_weather");
expect(typeof fnCall?.call_id).toBe("string");
expect(typeof fnCall?.arguments).toBe("string");
const parsedArgs = JSON.parse(fnCall?.arguments ?? "{}");
expect(typeof parsedArgs.city).toBe("string");
expect(parsedArgs.city.toLowerCase()).toContain("san francisco");

const secondRequestId = generateTestRequestId();
const secondRes = await postResponses(
	{
		model,
		previous_response_id: firstJson.id,
		input: [
			{
				type: "function_call_output",
				call_id: fnCall?.call_id,
				output: "72F and sunny",
```
Contributor


⚠️ Potential issue | 🟡 Minor

Assert the replay IDs before sending turn two.

Right now this test only proves the identifiers are typed as strings. If firstJson.id or fnCall.call_id is empty/missing, the second request no longer validates the replay path as strongly as the test name suggests. Please assert both values are non-empty before using them.

🧪 Possible change
```diff
 			expect(firstRes.status).toBe(200);
+			expect(typeof firstJson.id).toBe("string");
+			expect(firstJson.id.startsWith("resp_")).toBe(true);
 			const fnCall = getFunctionCall(firstJson);
 			expect(fnCall).toBeDefined();
 			expect(fnCall?.name).toBe("get_weather");
 			expect(typeof fnCall?.call_id).toBe("string");
+			expect(fnCall?.call_id).toBeTruthy();
 			expect(typeof fnCall?.arguments).toBe("string");
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 expect(firstRes.status).toBe(200);
+expect(typeof firstJson.id).toBe("string");
+expect(firstJson.id.startsWith("resp_")).toBe(true);
 const fnCall = getFunctionCall(firstJson);
 expect(fnCall).toBeDefined();
 expect(fnCall?.name).toBe("get_weather");
 expect(typeof fnCall?.call_id).toBe("string");
+expect(fnCall?.call_id).toBeTruthy();
 expect(typeof fnCall?.arguments).toBe("string");
 const parsedArgs = JSON.parse(fnCall?.arguments ?? "{}");
 expect(typeof parsedArgs.city).toBe("string");
 expect(parsedArgs.city.toLowerCase()).toContain("san francisco");
 const secondRequestId = generateTestRequestId();
 const secondRes = await postResponses(
 	{
 		model,
 		previous_response_id: firstJson.id,
 		input: [
 			{
 				type: "function_call_output",
 				call_id: fnCall?.call_id,
 				output: "72F and sunny",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/responses.e2e.ts` around lines 227 - 246, The test currently
only checks types but not that identifiers are present; before calling
generateTestRequestId() and postResponses(...) assert that firstJson.id and
fnCall.call_id are non-empty (e.g., use expect(firstJson.id).toBeTruthy() and
expect(fnCall?.call_id).toBeTruthy()) so the replay path is validated; add these
assertions just after obtaining fnCall from getFunctionCall(...) and before
using those values in the second request to ensure the test fails if the IDs are
missing.

@rcogal rcogal requested review from smakosh and steebchen April 7, 2026 15:42