Conversation

@FGasper (Contributor) commented Sep 7, 2022

MAINTAINER: See what you think of this. I’ll add documentation updates if you’re amenable to the change itself.


JSON::PP has a number of options that indicate a desire to facilitate different applications’ nonstandard needs. For example, latin1() caters to applications that use Latin-1 encoding rather than UTF-8, which violates the JSON specification.

Some nontrivial Perl applications forgo character decoding. Their authors/maintainers may not know the workflow that “perlunitut” recommends, or the application may simply not care about Unicode. Either way, in such applications it’s ideal for a JSON encoder/decoder to forgo the usual UTF-8 decode/encode steps.

utf8(0) almost achieves this. It falls over, though, if the JSON document contains a Unicode character escape (e.g., "\u00e9"), which JSON::PP decodes as Perl "\xe9". This causes an inconsistency in the decode logic: "é" in UTF-8 will yield a different result from "\u00e9".
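To make the inconsistency concrete, here is a minimal sketch (it assumes current JSON::PP behavior as described above, and uses allow_nonref so a bare JSON string is accepted as a top-level document):

```perl
use strict;
use warnings;
use JSON::PP ();

my $json = JSON::PP->new->utf8(0)->allow_nonref;

# Raw UTF-8 bytes for "é" pass through as the two characters \xc3\xa9,
# since utf8(0) performs no UTF-8 decoding of its input …
my $from_bytes  = $json->decode(qq<"\xc3\xa9">);

# … while the escape \u00e9 decodes to the single character \xe9:
my $from_escape = $json->decode(qq<"\\u00e9">);

# Two different Perl strings for the same logical character:
print $from_bytes eq $from_escape ? "same\n" : "different\n";  # prints "different"
```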

Ordinarily it works to do encode_utf8( JSON::PP->new->utf8->decode(..) ), but that falls over if applications need to allow non-UTF-8 sequences in JSON inputs.
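A sketch of that workaround, and of where it breaks (assuming JSON::PP’s documented behavior that utf8(1) rejects input that is not valid UTF-8):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use JSON::PP ();

# The usual round-trip: utf8(1) decodes UTF-8 bytes to characters,
# then encode_utf8() turns the result back into bytes.
my $chars = JSON::PP->new->utf8->allow_nonref->decode(qq<"\xc3\xa9">);
my $bytes = encode_utf8($chars);    # back to the bytes \xc3\xa9

# But utf8(1) requires valid UTF-8 input, so a stray \xff byte is fatal:
my $ok = eval { JSON::PP->new->utf8->allow_nonref->decode(qq<"\xff">); 1 };
print $ok ? "decoded\n" : "died\n";  # the \xff input is expected to die
```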

In short, a need exists for this Perl string:

qq<"\xff\xc3\xa9\\u00e9">

… to decode to "\xff\xc3\xa9\xc3\xa9".

This changeset adds a solution to this problem by changing utf8() from a simple flag to an enum: the existing chars-in-chars-out (0) and bytes-in-chars-out (1) options, plus a new bytes-in-bytes-out option. Named constants are added to avoid “magic numbers”.
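Usage under the proposed change might look like the sketch below. The numeric value and the constant name here are hypothetical placeholders for illustration, not necessarily the names the patch introduces:

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical enum values for utf8():
#   0 — chars in,  chars out  (existing utf8(0))
#   1 — bytes in,  chars out  (existing utf8(1))
#   2 — bytes in,  bytes out  (the proposed new mode; value illustrative)
my $json = JSON::PP->new->utf8(2)->allow_nonref;

# In bytes-in-bytes-out mode, undecodable bytes pass through untouched,
# and \u escapes are emitted as UTF-8 bytes, so:
#   qq<"\xff\xc3\xa9\\u00e9">  →  "\xff\xc3\xa9\xc3\xa9"
```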

@charsbar (Collaborator) commented Sep 7, 2022

I understand your point, but if the change breaks compatibility with JSON::XS, it's unacceptable: JSON::PP is essentially a fallback module for JSON::XS. I am also reluctant to add a new mode if it's for JSON::PP only. Could you discuss this with the JSON::XS maintainer first?
