fix: sanitize lone surrogates from stdin to prevent UnicodeEncodeError #768
Open
octo-patch wants to merge 1 commit into TheR1D:main
(fixes TheR1D#667)

On Windows, piping binary content such as `git diff` output through stdin causes Python's text-mode stdin reader to produce lone surrogate characters (e.g. `\udcff`) via the surrogateescape error handler. These surrogates then propagate into the JSON payload, and httpx raises:

    UnicodeEncodeError: 'utf-8' codec can't encode characters: surrogates not allowed

Fix: encode the collected stdin string with `errors='replace'` before building the prompt, replacing any lone surrogates with the Unicode replacement character.

A regression test is included that injects a surrogate-bearing mock stdin and asserts both that the CLI exits cleanly and that the message passed to the LLM is valid UTF-8.
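To illustrate the failure mode described above, here is a minimal sketch that does not touch sgpt or httpx. It assumes the piped bytes are decoded with the `surrogateescape` error handler, as the commit message states; the sample bytes and variable names are made up for illustration.

```python
# Bytes that are not valid UTF-8, standing in for binary content in a piped diff.
raw = b"diff \xff\xfe binary"

# With surrogateescape, each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
assert "\udcff" in text and "\udcfe" in text

# Strict UTF-8 encoding, which is what serializing a request body amounts to,
# rejects lone surrogates.
try:
    text.encode("utf-8")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```

The decoded string looks like ordinary text until something tries to encode it, which is why the error surfaces far from stdin handling.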
Fixes #667
Problem
On Windows, piping binary content (e.g. `git diff` on a repository with binary files) through stdin causes Python's text-mode stdin reader to produce lone surrogate characters via the `surrogateescape` error handler. These surrogates propagate into the JSON payload sent to the API, and httpx raises:

`UnicodeEncodeError: 'utf-8' codec can't encode characters: surrogates not allowed`

This affects any workflow that pipes content containing non-UTF-8 bytes to `sgpt`, most commonly on Windows but potentially on other systems with non-UTF-8 locales.

Solution
After collecting all stdin lines into the `stdin` string, a single sanitization pass encodes it with `errors='replace'`. This replaces any lone surrogate characters with the Unicode replacement character (`U+FFFD`) before the string is incorporated into the prompt and eventually serialized to JSON.

Testing
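Adds `test_stdin_with_surrogate_characters` in `tests/test_default.py`, which injects a mock stdin containing a lone surrogate (`\udcff`) and verifies:

- the CLI exits cleanly;
- the message passed to the LLM is valid UTF-8 (`.encode("utf-8")` does not raise).

Test command: `pytest tests/ --ignore=tests/_integration.py`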
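For reviewers, a minimal sketch of the sanitization step and the property the test checks. This is not the code from the diff; the function names `sanitize_roundtrip` and `sanitize_fffd` are hypothetical. One detail worth checking against the actual change: in CPython, `str.encode(..., errors="replace")` substitutes `?` for unencodable characters, while `U+FFFD` is what the `replace` handler emits on decoding. If `U+FFFD` is the intended marker, an explicit substitution such as the second function may be needed.

```python
import re

# Matches any code point in the surrogate range U+D800..U+DFFF.
_SURROGATES = re.compile(r"[\ud800-\udfff]")


def sanitize_roundtrip(text: str) -> str:
    """Encode/decode round trip; the encoder's 'replace' handler emits '?'."""
    return text.encode("utf-8", errors="replace").decode("utf-8")


def sanitize_fffd(text: str) -> str:
    """Replace each surrogate code point with U+FFFD explicitly."""
    return _SURROGATES.sub("\ufffd", text)


# Hypothetical stdin content carrying a lone surrogate, as in the regression test.
dirty = "diff \udcff binary"

for cleaned in (sanitize_roundtrip(dirty), sanitize_fffd(dirty)):
    # Mirrors the test's second assertion: encoding must not raise.
    cleaned.encode("utf-8")
```

Either variant makes the string safe to serialize; they differ only in which placeholder ends up in the prompt.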