fix: sanitize lone surrogates from stdin to prevent UnicodeEncodeError by octo-patch · Pull Request #768 · TheR1D/shell_gpt

octo-patch · 2026-04-25T02:41:08Z

Fixes #667

Problem

On Windows, piping binary content (e.g. git diff on a repository with binary files) through stdin causes Python's text-mode stdin reader to produce lone surrogate characters via the surrogateescape error handler. These surrogates propagate into the JSON payload sent to the API, and httpx raises:

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1297-1298: surrogates not allowed

This affects any workflow that pipes content containing non-UTF-8 bytes to sgpt, most commonly on Windows but potentially on other systems with non-UTF-8 locales.
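The failure mode described above can be reproduced without sgpt or httpx. The following is a minimal sketch, assuming the stream decodes with the surrogateescape handler; the sample bytes and variable names are illustrative and not taken from the repository.

```python
# Bytes that are not valid UTF-8, standing in for binary content in a diff.
raw = b"diff --git a/img.png b/img.png\n\xff\xfe binary payload\n"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
has_surrogates = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)

# Strict UTF-8 encoding, which any HTTP client must perform, rejects them.
try:
    text.encode("utf-8")
    error_reason = None
except UnicodeEncodeError as exc:
    error_reason = exc.reason

print(has_surrogates, error_reason)
```

The string itself is a valid Python str; the error only appears at the point where it has to become UTF-8 bytes, which is why the crash surfaces inside the HTTP layer rather than where stdin is read.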

Solution

After collecting all stdin lines into the stdin string, apply a single-pass sanitization:

stdin = stdin.encode("utf-8", errors="replace").decode("utf-8")

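This replaces each lone surrogate with a replacement marker before the string is incorporated into the prompt and eventually serialized to JSON. Because the replace handler is applied while encoding, the marker is an ASCII question mark (?), not U+FFFD, which Python's replace handler emits only when decoding.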
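As a rough illustration of the round trip, the sketch below wraps the one-line fix in a helper. The name sanitize_text is hypothetical and does not appear in the PR. Note that in CPython the replace handler substitutes `?` when encoding.

```python
def sanitize_text(text: str) -> str:
    """Hypothetical helper: make a str safe to encode as strict UTF-8."""
    # Lone surrogates cannot be encoded, so the replace handler swaps each
    # one for "?"; decoding the resulting bytes back is then always safe.
    return text.encode("utf-8", errors="replace").decode("utf-8")


dirty = "before \udcff after"
clean = sanitize_text(dirty)

# Well-formed text, including non-ASCII characters, passes through unchanged.
untouched = sanitize_text("héllo ✓")

print(repr(clean), repr(untouched))
```

The round trip is lossy only for characters that could not have been sent anyway, so ordinary input is unaffected.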

Testing

Added test_stdin_with_surrogate_characters in tests/test_default.py which injects a mock stdin containing a lone surrogate (\udcff) and verifies:
1. The CLI exits with code 0 (no crash).
2. The message content passed to the LLM is valid UTF-8 (.encode("utf-8") does not raise).

All 30 existing tests continue to pass (pytest tests/ --ignore=tests/_integration.py).
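The shape of such a regression test might look like the sketch below. It is not the test added in tests/test_default.py: collect_stdin is a hypothetical stand-in for the CLI's stdin handling, and io.StringIO replaces the mocked stdin, so the example runs without the CLI or an LLM client.

```python
import io


def collect_stdin(stream) -> str:
    """Hypothetical stand-in for the CLI code that gathers piped input."""
    text = "".join(stream)
    # Same sanitization step as the fix: drop anything UTF-8 cannot encode.
    return text.encode("utf-8", errors="replace").decode("utf-8")


def test_surrogate_is_sanitized() -> None:
    # A mock stdin whose content carries a lone surrogate.
    fake_stdin = io.StringIO("diff line\n\udcff binary\n")
    content = collect_stdin(fake_stdin)

    # Would raise UnicodeEncodeError if a surrogate survived.
    content.encode("utf-8")
    assert "\udcff" not in content


test_surrogate_is_sanitized()
```

Asserting on the encodability of the final message, rather than on a specific replacement character, keeps the test valid if the sanitization strategy changes later.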

fixes TheR1D#667)

On Windows, piping binary content such as `git diff` output through stdin causes Python's text-mode stdin reader to produce lone surrogate characters (e.g. \udcff) via the surrogateescape error handler. These surrogates then propagate into the JSON payload, and httpx raises:

UnicodeEncodeError: 'utf-8' codec can't encode characters: surrogates not allowed

Fix: encode the collected stdin string with errors='replace' before building the prompt, replacing any lone surrogates with a replacement marker ('?' when encoding).

A regression test is included that injects a surrogate-bearing mock stdin and asserts both that the CLI exits cleanly and that the message passed to the LLM is valid UTF-8.

octo-patch wants to merge 1 commit into TheR1D:main from octo-patch:fix/issue-667-surrogate-unicode-error
