fix: sanitize lone surrogates from stdin to prevent UnicodeEncodeError#768

Open
octo-patch wants to merge 1 commit intoTheR1D:mainfrom
octo-patch:fix/issue-667-surrogate-unicode-error

@octo-patch

Fixes #667

Problem

On Windows, piping binary content (e.g. git diff on a repository with binary files) through stdin causes Python's text-mode stdin reader to produce lone surrogate characters via the surrogateescape error handler. These surrogates propagate into the JSON payload sent to the API, and httpx raises:

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1297-1298: surrogates not allowed

This affects any workflow that pipes content containing non-UTF-8 bytes to sgpt, most commonly on Windows but potentially on other systems with non-UTF-8 locales.
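The failure can be reproduced without Windows or sgpt. The following is a minimal sketch, assuming only that the input reached Python through a decoder using surrogateescape; the sample bytes and variable names are illustrative and are not taken from the sgpt code base.

```python
# Bytes that are not valid UTF-8, standing in for binary data in a piped diff.
raw = b"diff \xff\xfe binary"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
assert "\udcff" in text

# Strict UTF-8 encoding (what happens when the request body is serialized)
# rejects lone surrogates.
error_reason = None
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    error_reason = exc.reason

print(error_reason)  # surrogates not allowed
```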

Solution

After collecting all stdin lines into the stdin string, apply a single-pass sanitization:

stdin = stdin.encode("utf-8", errors="replace").decode("utf-8")

This replaces each lone surrogate with a question mark (?) before the string is incorporated into the prompt and eventually serialized to JSON. Note that errors="replace" substitutes ? when encoding; the Unicode replacement character (U+FFFD) is produced only when decoding.
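To make the effect of the round trip concrete, here is a small sketch. The `sanitize` helper is hypothetical (the PR applies the expression inline to the stdin string); the point is that valid non-ASCII text survives unchanged, while each lone surrogate is substituted by the encoder's replace handler.

```python
def sanitize(text: str) -> str:
    """Hypothetical helper wrapping the one-line round trip from this PR."""
    return text.encode("utf-8", errors="replace").decode("utf-8")


dirty = "caf\u00e9 \udcff end"   # valid non-ASCII text plus one lone surrogate
clean = sanitize(dirty)

clean.encode("utf-8")            # no longer raises UnicodeEncodeError
print(clean)                     # café ? end
```

If keeping U+FFFD in the output matters, a different approach would be needed (for example replacing surrogate code points explicitly); that is outside what this PR changes.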

Testing

  • Added test_stdin_with_surrogate_characters in tests/test_default.py which injects a mock stdin containing a lone surrogate (\udcff) and verifies:
    1. The CLI exits with code 0 (no crash).
    2. The message content passed to the LLM is valid UTF-8 (.encode("utf-8") does not raise).
  • All 30 existing tests continue to pass (pytest tests/ --ignore=tests/_integration.py).
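The shape of such a regression test can be sketched as follows. This is not the test added in tests/test_default.py: `read_piped_input` is a hypothetical stand-in for the CLI's stdin collection, and the real test drives the CLI and inspects the message passed to the LLM instead.

```python
import io


def read_piped_input(stream) -> str:
    """Hypothetical stand-in for the CLI's stdin collection, with the fix applied."""
    collected = "".join(stream)  # gather all lines from the stream
    return collected.encode("utf-8", errors="replace").decode("utf-8")


def test_stdin_with_surrogate_characters():
    # io.StringIO plays the role of the mocked stdin carrying a lone surrogate.
    fake_stdin = io.StringIO("binary \udcff data\n")
    content = read_piped_input(fake_stdin)

    # The sanitized content must be encodable as UTF-8 without raising.
    content.encode("utf-8")
    assert "\udcff" not in content


test_stdin_with_surrogate_characters()
```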

fix: sanitize lone surrogates from stdin to prevent UnicodeEncodeError (fixes TheR1D#667)

On Windows, piping binary content such as `git diff` output through stdin
causes Python's text-mode stdin reader to produce lone surrogate characters
(e.g. \udcff) via the surrogateescape error handler.  These surrogates then
propagate into the JSON payload, and httpx raises:

    UnicodeEncodeError: 'utf-8' codec can't encode characters: surrogates not allowed

Fix: round-trip the collected stdin string through UTF-8 with errors='replace'
before building the prompt, so that each lone surrogate is replaced with '?'.
A regression test is included that injects a surrogate-bearing mock stdin and
asserts both that the CLI exits cleanly and that the message passed to the LLM
is valid UTF-8.