Add a utf8() mode that allows byte/UTF-8 strings as input & output. #78
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
MAINTAINER: See what you think of this. I’ll add documentation updates if you’re amenable to the change itself.
JSON::PP has a number of options that indicate a desire to facilitate different applications’ nonstandard needs. For example, latin1() caters to applications that use Latin-1 encoding rather than UTF-8, which violates the JSON specification.
Some nontrivial Perl applications forgo character decoding. Their authors/maintainers may not know “perlunitut”’s recommended workflow, or the application may simply not care about Unicode. Either way, in such applications it’s ideal for a JSON encoder & decoder to forgo the usual UTF-8 decode/encode steps.
utf8(0) almost achieves this. It falls over, though, if the JSON document contains a Unicode character escape (e.g., "\u00e9"), which JSON::PP decodes as Perl "\xe9". This causes an inconsistency in the decode logic: "é" in UTF-8 will yield a different result from "\u00e9".
Ordinarily it works to do encode_utf8( JSON::PP->new->utf8->decode(..) ), but that falls over if applications need to allow non-UTF-8 sequences in JSON inputs.
In short, a need exists for this Perl string:
… to decode to "\xff\xc3\xa9\xc3\xa9".
This changeset adds a solution to this problem by changing utf8() from a simple flag to an enum: the existing chars-in-chars-out (0) and bytes-in-chars-out (1) options, plus a new bytes-in-bytes-out option. Named constants are added to avoid “magic numbers”.