@amolmjoshi93
Contributor

@amolmjoshi93 amolmjoshi93 commented Jan 5, 2026

Fixes production Encoding::CompatibilityError in rack/query_parser.rb
Closes #2031

Summary by CodeRabbit

  • Bug Fixes

    • Prevented encoding errors from malformed or non‑UTF‑8 request data by adding request sanitization early in request processing; query strings and request bodies are now normalized to valid UTF‑8 to avoid encoding-related failures.
  • Tests

    • Added comprehensive tests for encoding sanitization covering multiple encodings, invalid byte sequences, and request-body/stream behaviors.

@coderabbitai
coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

Adds a new Rails middleware, EncodingSanitizer, that sanitizes request environment strings and wraps rack.input to coerce or scrub non-UTF-8 data into valid UTF-8 before downstream middlewares run.

Changes

| Cohort / File(s) | Summary |
|---|---|
| EncodingSanitizer Middleware: config/initializers/encoding_sanitizer.rb | New EncodingSanitizer middleware with initialize(app) and call(env). Sanitizes QUERY_STRING, REQUEST_URI, PATH_INFO, HTTP_REFERER; wraps env['rack.input'] with SanitizedInput (subclass of SimpleDelegator) that ensures read, gets, and each yield UTF-8-safe data. Inserted before ActionDispatch::Static (runs prior to Rack::MethodOverride). |
| Middleware Test Suite: spec/middleware/encoding_sanitizer_spec.rb | New RSpec coverage for query string and POST body sanitization (UTF-8 pass-through, UTF-16LE conversion, invalid byte handling, nil envs), SanitizedInput behaviors (read, gets, each, rewind, close), and middleware ordering assertions. |
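
The flow summarized above can be sketched in plain Ruby without Rack or Rails loaded. This is a minimal illustration, not the PR's code: the class name SketchSanitizer, the helper to_utf8, and the lambda app are hypothetical, and only the env-key half of the middleware is shown (the rack.input wrapper is omitted).

```ruby
# Hypothetical sketch of a sanitizing middleware; names are illustrative only.
class SketchSanitizer
  KEYS = %w[QUERY_STRING REQUEST_URI PATH_INFO HTTP_REFERER].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    KEYS.each do |key|
      value = env[key]
      next unless value.is_a?(String)
      # Leave strings that are already valid UTF-8 untouched.
      next if value.encoding == Encoding::UTF_8 && value.valid_encoding?

      env[key] = to_utf8(value)
    end
    @app.call(env)
  end

  private

  # Transcode to UTF-8, dropping bytes that cannot be converted.
  def to_utf8(value)
    value.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
  rescue EncodingError
    value.dup.force_encoding(Encoding::UTF_8).scrub("")
  end
end

# A Rack app is any object that responds to call(env).
app = ->(env) { [200, {}, [env["QUERY_STRING"]]] }
env = { "QUERY_STRING" => "test=value".encode(Encoding::UTF_16LE) }
status, _headers, body = SketchSanitizer.new(app).call(env)
```

Downstream code in this sketch sees a UTF-8 query string, which is the property the middleware is meant to guarantee before any parser runs.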

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant EncodingSanitizer as "EncodingSanitizer\n(middleware)"
  participant ActionDispatch as "ActionDispatch::Static"
  participant MethodOverride as "Rack::MethodOverride"
  participant RackInput as "rack.input\n(SanitizedInput)"
  participant App

  Client->>EncodingSanitizer: HTTP request (env, raw body)
  note right of EncodingSanitizer: sanitize env keys\nwrap env['rack.input'] -> SanitizedInput
  EncodingSanitizer->>ActionDispatch: forward sanitized env
  ActionDispatch->>MethodOverride: forward env
  MethodOverride->>RackInput: read POST body (read/each/gets)
  RackInput-->>MethodOverride: sanitized UTF-8 chunks
  MethodOverride->>App: forward parsed request
  App-->>Client: HTTP response

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hop through bytes both odd and small,
Turn strange encodings tidy and all,
Replace the broken, mend each line,
Make UTF‑8 neat, one nibble at a time,
Now requests sleep safe in my warren of code.

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding encoding error handling for UTF-16LE in Rack request parsing via a middleware. |
| Linked Issues check | ✅ Passed | All requirements from issue #2031 are met: EncodingSanitizer middleware sanitizes QUERY_STRING, REQUEST_URI, PATH_INFO, HTTP_REFERER and wraps rack.input to handle UTF-16LE data, converting to valid UTF-8. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to addressing issue #2031: the middleware implementation and comprehensive test coverage with no unrelated modifications. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Fix all issues with AI Agents 🤖
In @config/initializers/encoding_sanitizer.rb:
- Around line 36-42: In the force_utf8 method, remove the redundant call to
force_encoding(Encoding::UTF_8) on the success path since
encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "") already
returns a UTF-8 string; keep the rescue branch as-is
(value.dup.force_encoding(Encoding::UTF_8).scrub("")) to handle encoding errors.
Ensure you only delete the trailing .force_encoding(Encoding::UTF_8) after the
encode call in force_utf8 and run tests to confirm behavior remains correct.
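
To see why the trailing call is redundant, the following short check uses only core String methods. The sample strings are made up for illustration; the two expressions mirror the success path and the rescue branch quoted in the comment above.

```ruby
# Success path: encode already returns a string tagged as UTF-8.
utf16 = "test=value".encode(Encoding::UTF_16LE)
converted = utf16.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")

puts converted.encoding          # the result's encoding, without any force_encoding
puts converted.valid_encoding?

# Rescue branch: retag the bytes as UTF-8, then drop whatever is invalid.
broken = "abc\xFF".b
scrubbed = broken.dup.force_encoding(Encoding::UTF_8).scrub("")
puts scrubbed
```

Calling force_encoding(Encoding::UTF_8) on `converted` would only relabel a string that already carries that label, so removing it should not change behavior.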
🧹 Nitpick comments (4)
config/initializers/encoding_sanitizer.rb (2)

11-23: Consider adding error handling around sanitization logic.

The middleware performs encoding operations without rescue blocks in the call method. If an unexpected encoding error occurs during sanitization that isn't caught by the rescue clauses in force_utf8, it could cause the middleware to raise an exception and crash the request. Consider wrapping the sanitization logic in a top-level rescue block to ensure the middleware degrades gracefully.

🔎 Proposed defensive error handling
 def call(env)
+  begin
     # Sanitize URL-related env vars
     %w[QUERY_STRING REQUEST_URI PATH_INFO HTTP_REFERER].each do |key|
       sanitize_encoding(env, key)
     end
 
     # Wrap rack.input to sanitize POST body
     if env["rack.input"]
       env["rack.input"] = SanitizedInput.new(env["rack.input"])
     end
+  rescue => e
+    # Log error but don't crash the request
+    Rails.logger.error("EncodingSanitizer error: #{e.message}")
+  end
 
   @app.call(env)
 end

45-75: Consider implementing method delegation for complete rack.input compatibility.

The SanitizedInput wrapper implements read, gets, each, rewind, and close, but rack.input may have additional methods like size, pos, pos=, eof?, string, etc. that downstream Rack middleware or parsers might expect. Missing these could cause NoMethodError exceptions.

🔎 Proposed enhancement using SimpleDelegator
+require 'delegate'
+
 # Wrapper for rack.input that sanitizes encoding on read
-class SanitizedInput
+class SanitizedInput < SimpleDelegator
   def initialize(input)
-    @input = input
+    super(input)
   end
 
   def read(*args)
-    data = @input.read(*args)
+    data = __getobj__.read(*args)
     return data unless data.is_a?(String)
 
     sanitize(data)
   end
 
   def gets(*args)
-    data = @input.gets(*args)
+    data = __getobj__.gets(*args)
     return data unless data.is_a?(String)
 
     sanitize(data)
   end
 
   def each(&block)
-    @input.each { |line| block.call(sanitize(line)) }
-  end
-
-  def rewind
-    @input.rewind
-  end
-
-  def close
-    @input.close if @input.respond_to?(:close)
+    __getobj__.each { |line| block.call(sanitize(line)) }
   end

This ensures all other methods are delegated automatically.
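
As a rough check of that claim, the sketch below wraps a StringIO and calls methods the wrapper never defines. SketchInput and its scrub-based sanitize are hypothetical stand-ins, not the PR's SanitizedInput.

```ruby
require "delegate"
require "stringio"

# Hypothetical wrapper: only read is overridden; everything else is delegated.
class SketchInput < SimpleDelegator
  def read(*args)
    data = __getobj__.read(*args)
    data.is_a?(String) ? sanitize(data) : data
  end

  private

  # Assumed sanitization strategy for this sketch: retag as UTF-8 and scrub.
  def sanitize(data)
    data.dup.force_encoding(Encoding::UTF_8).scrub("")
  end
end

io = SketchInput.new(StringIO.new("ab\xFFc".b))

size_before = io.size   # not defined on SketchInput, answered by StringIO
at_start    = io.eof?   # likewise delegated
body        = io.read   # overridden, so the invalid byte is dropped
io.rewind               # delegated; the stream can be read again
```

SimpleDelegator forwards any method the subclass does not define to the wrapped object, which is why size, eof?, and rewind work here without explicit definitions.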

spec/middleware/encoding_sanitizer_spec.rb (2)

25-41: Verify content preservation after encoding conversion.

The test confirms that UTF-16LE is converted to valid UTF-8, but doesn't verify that the actual content is preserved correctly. Consider adding an assertion to check the decoded value matches the original semantic content.

🔎 Proposed enhancement
 it "converts to valid UTF-8" do
   # Simulate UTF-16LE encoded string
   utf16_string = "test=value".encode(Encoding::UTF_16LE)
   env = {
     "QUERY_STRING" => utf16_string,
     "REQUEST_URI" => "/test",
     "PATH_INFO" => "/test"
   }
 
   status, response_env, _body = middleware.call(env)
 
   expect(status).to eq(200)
   expect(response_env["QUERY_STRING"].encoding).to eq(Encoding::UTF_8)
   expect(response_env["QUERY_STRING"]).to be_valid_encoding
+  expect(response_env["QUERY_STRING"]).to include("test")
+  expect(response_env["QUERY_STRING"]).to include("value")
 end

73-132: Add test coverage for gets and close methods.

The test suite covers read, rewind, and each, but the SanitizedInput class also implements gets and close methods that lack test coverage. Adding tests for these methods would ensure complete coverage of the wrapper's behavior.

🔎 Proposed additional tests
describe "#gets" do
  it "sanitizes line-by-line reads" do
    body = "line1\nline2"
    input = StringIO.new(body)
    sanitized = described_class.new(input)

    first_line = sanitized.gets
    expect(first_line).to be_valid_encoding
    expect(first_line).to eq("line1\n")
  end
end

describe "#close" do
  it "delegates close to underlying input" do
    input = StringIO.new("test")
    sanitized = described_class.new(input)

    expect(input).to receive(:close)
    sanitized.close
  end
end
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bc587a0 and 92a6223.

📒 Files selected for processing (2)
  • config/initializers/encoding_sanitizer.rb
  • spec/middleware/encoding_sanitizer_spec.rb
🧰 Additional context used
🧬 Code graph analysis (1)
spec/middleware/encoding_sanitizer_spec.rb (1)
config/initializers/encoding_sanitizer.rb (4)
  • call (11-23)
  • read (50-55)
  • rewind (68-70)
  • each (64-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Running tests
🔇 Additional comments (7)
config/initializers/encoding_sanitizer.rb (4)

1-9: LGTM!

Standard middleware initialization pattern with clear documentation.


27-34: LGTM!

Efficient early returns avoid unnecessary sanitization for already-valid UTF-8 strings.


90-93: LGTM!

Correct insertion point ensures encoding is sanitized before downstream middleware parses the request.


78-86: This inconsistency is by design and not a bug.

The two methods use different encoding strategies intentionally based on their input sources. The force_utf8 method handles URL-related environment variables that are already decoded as strings by the web server, while the sanitize method handles raw bytes from rack.input. The ASCII_8BIT intermediate step in sanitize is a standard Ruby pattern for safely handling unknown or mixed binary data from the network, and this approach is correctly applied to the riskier input type. Both methods share identical error handling with fallback to force_encoding(Encoding::UTF_8).scrub(""). The comment on line 81 explicitly documents this intentional strategy: "Force to binary first, then encode to UTF-8."

Likely an incorrect or invalid review comment.
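
For reference, the two strategies named in this comment can be compared in plain Ruby. The source does not show the PR's sanitize method, so the binary-then-encode expression below is an assumption based on the quoted code comment ("Force to binary first, then encode to UTF-8"); the sample string is made up.

```ruby
# Bytes of a valid UTF-8 string, tagged as binary the way a raw body might be.
raw = "caf\u00E9=1".b

# Assumed body strategy: treat as ASCII-8BIT, then transcode to UTF-8,
# replacing anything that has no mapping with an empty string.
via_binary = raw.dup.force_encoding(Encoding::ASCII_8BIT)
                .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")

# Fallback strategy quoted in the review: retag as UTF-8 and scrub invalid bytes.
via_scrub = raw.dup.force_encoding(Encoding::UTF_8).scrub("")

puts via_binary
puts via_scrub
```

Under these assumptions the two results differ for non-ASCII input: bytes above 0x7F have no mapping from ASCII-8BIT to UTF-8 and are replaced, while the scrub path keeps well-formed multibyte sequences. Whether that matters depends on the actual implementation, which is worth confirming against the PR's spec.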

spec/middleware/encoding_sanitizer_spec.rb (3)

1-23: LGTM!

Well-structured test setup. The mock app returning env as the response body enables easy verification of sanitization effects.


43-71: LGTM!

Good coverage of edge cases including invalid byte sequences and nil values. These tests ensure robustness in production scenarios.


134-150: LGTM! Middleware ordering is correctly verified.

These tests ensure the middleware is positioned correctly to sanitize encoding before other middleware parses the request. Note that these tests depend on the presence of ActionDispatch::Static and Rack::MethodOverride in the middleware stack.

…ncodingcompatibilityerror-incompatible-character-encodings-utf-16le-and-utf-8
…ncodingcompatibilityerror-incompatible-character-encodings-utf-16le-and-utf-8
- Add error handling in call method to prevent request crashes
- Refactor SanitizedInput to use SimpleDelegator for better compatibility
- Add test coverage for gets and close methods
- Verify content preservation in UTF-16LE encoding test
- Ensure all rack.input methods are properly delegated

All 13 tests passing with no diagnostics.
Development

Successfully merging this pull request may close these issues.

Rack throws `Encoding::CompatibilityError: incompatible character encodings: UTF-16LE and UTF-8`

2 participants