Devel-IPerl-Plugin-Perlbrew
bin/perlbrewise-spec view on Meta::CPAN
C<Cpanel::JSON::XS> throws an exception.
=back
=head3 DESERIALIZATION
For deserialization there are only two cases to consider: either
nonstandard tagging was used, in which case C<allow_tags> decides,
or objects cannot automatically be deserialized, in which
case you can use postprocessing or the C<filter_json_object> or
C<filter_json_single_key_object> callbacks to get some real objects out of
your JSON.
This section only considers the tagged value case: If a tagged JSON object
is encountered during decoding and C<allow_tags> is disabled, a parse
error will result (as if tagged values were not part of the grammar).
If C<allow_tags> is enabled, C<Cpanel::JSON::XS> will look up the C<THAW> method
of the package/classname used during serialization (it will not attempt
to load the package as a Perl module). If there is no such method, the
decoding will fail with an error.
Otherwise, the C<THAW> method is invoked with the classname as first
argument, the constant string C<JSON> as second argument, and all the
values from the JSON array (the values originally returned by the
C<FREEZE> method) as remaining arguments.
The method must then return the object. While technically you can return
any Perl scalar, you might have to enable the C<allow_nonref> setting to
make that work in all cases, so it is better to return an actual blessed
reference.
As an example, let's implement a C<THAW> function that regenerates the
C<My::Object> from the C<FREEZE> example earlier:
   sub My::Object::THAW {
      my ($class, $serializer, $type, $id) = @_;
      $class->new (type => $type, id => $id)
   }
See the L</SECURITY CONSIDERATIONS> section below. Allowing external
JSON objects to be deserialized into Perl objects is usually a very bad
idea.
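As a minimal sketch of the full round trip, the following script pairs
a C<THAW> method like the one above with a matching C<FREEZE>. The C<new>
constructor, the C<FREEZE> body and the C<type> and C<id> fields are
assumptions made for this illustration only:

```perl
use strict;
use warnings;
use Cpanel::JSON::XS;

package My::Object;

# Hypothetical constructor: stores the named arguments in a blessed hash.
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Called by encode when allow_tags is enabled; returns the values to store.
sub FREEZE {
    my ($self, $serializer) = @_;
    return ($self->{type}, $self->{id});
}

# Called by decode when allow_tags is enabled; rebuilds the object from
# the values that FREEZE returned.
sub THAW {
    my ($class, $serializer, $type, $id) = @_;
    return $class->new (type => $type, id => $id);
}

package main;

# The same allow_tags setting is used for encoding and decoding.
my $codec = Cpanel::JSON::XS->new->allow_tags;

my $text = $codec->encode ([My::Object->new (type => "widget", id => 42)]);
my $copy = $codec->decode ($text)->[0];

print ref ($copy), " $copy->{type} $copy->{id}\n";
```

Only enable C<allow_tags> on the decoding side for input you trust, for
the reasons given in the paragraph above.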
=head1 ENCODING/CODESET FLAG NOTES
The interested reader might have seen a number of flags that signify
encodings or codesets - C<utf8>, C<latin1>, C<binary> and
C<ascii>. There seems to be some confusion on what these do, so here
is a short comparison:
C<utf8> controls whether the JSON text created by C<encode> (and expected
by C<decode>) is UTF-8 encoded or not, while C<latin1> and C<ascii> only
control whether C<encode> escapes character values outside their respective
codeset range. None of these flags conflict with each other, although
some combinations make less sense than others.
Care has been taken to make all flags symmetrical with respect to
C<encode> and C<decode>, that is, texts encoded with any combination of
these flag values will be correctly decoded when the same flags are used
- in general, if you use different flag settings while encoding vs. when
decoding you likely have a bug somewhere.
Below comes a verbose discussion of these flags. Note that a "codeset" is
simply an abstract set of character-codepoint pairs, while an encoding
takes those codepoint numbers and I<encodes> them, in our case into
octets. Unicode is (among other things) a codeset, UTF-8 is an encoding,
and ISO-8859-1 (= latin 1) and ASCII are both codesets I<and> encodings at
the same time, which can be confusing.
=over 4
=item C<utf8> flag disabled
When C<utf8> is disabled (the default), then C<encode>/C<decode> generate
and expect Unicode strings, that is, characters with high ordinal Unicode
values (> 255) will be encoded as such characters, and likewise such
characters are decoded as-is, no changes to them will be done, except
"(re-)interpreting" them as Unicode codepoints or Unicode characters,
respectively (to Perl, these are the same thing in strings unless you do
funny/weird/dumb stuff).
This is useful when you want to do the encoding yourself (e.g. when you
want to have UTF-16 encoded JSON texts) or when some other layer does
the encoding for you (for example, when printing to a terminal using a
filehandle that transparently encodes to UTF-8 you certainly do NOT want
to UTF-8 encode your data first and have Perl encode it another time).
=item C<utf8> flag enabled
If the C<utf8> flag is enabled, C<encode>/C<decode> will encode all
characters using the corresponding UTF-8 multi-byte sequence, and will
expect your input strings to be encoded as UTF-8, that is, no "character"
of the input string may have a value > 255, as UTF-8 does not allow
that.
The C<utf8> flag therefore switches between two modes: disabled means you
will get a Unicode string in Perl, enabled means you get a UTF-8 encoded
octet/binary string in Perl.
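The difference between the two modes can be sketched by encoding the
same one-character string both ways and comparing lengths. The variable
names are made up for this illustration:

```perl
use strict;
use warnings;
use Cpanel::JSON::XS;

my $data = ["\x{20ac}"];   # one character: U+20AC EURO SIGN

# utf8 disabled (the default): the result is a Perl character string.
my $chars = Cpanel::JSON::XS->new->encode ($data);

# utf8 enabled: the result is a string of UTF-8 encoded octets.
my $octets = Cpanel::JSON::XS->new->utf8->encode ($data);

# U+20AC takes three octets in UTF-8, so the octet string is two longer.
printf "characters: %d, octets: %d\n", length $chars, length $octets;
```

This should report 5 characters and 7 octets: the brackets and quotes
are the same in both texts, and only the euro sign differs in size.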
=item C<latin1>, C<binary> or C<ascii> flags enabled
With C<latin1> (or C<ascii>) enabled, C<encode> will escape
characters with ordinal values > 255 (> 127 with C<ascii>) and encode
the remaining characters as specified by the C<utf8> flag.
With C<binary> enabled, ordinal values > 255 are illegal.
If C<utf8> is disabled, then the result is also correctly encoded in those
character sets (as both are proper subsets of Unicode, meaning that a
Unicode string with all character values < 256 is the same thing as an
ISO-8859-1 string, and a Unicode string with all character values < 128 is
the same thing as an ASCII string in Perl).
If C<utf8> is enabled, you still get a correct UTF-8-encoded string,
regardless of these flags, just some more characters will be escaped using
C<\uXXXX> than before.
Note that ISO-8859-1-I<encoded> strings are not compatible with UTF-8
encoding, while ASCII-encoded strings are. That is because the ISO-8859-1
encoding is NOT a subset of UTF-8 (despite the ISO-8859-1 I<codeset> being
a subset of Unicode), while ASCII is.
Surprisingly, C<decode> will ignore these flags and so treat all input
values as governed by the C<utf8> flag. If it is disabled, this allows you
to decode ISO-8859-1- and ASCII-encoded strings, as both are strict subsets
of Unicode. If it is enabled, you can correctly decode UTF-8 encoded strings.
So none of C<latin1>, C<binary> or C<ascii> is incompatible with the
C<utf8> flag - they only govern whether the JSON output engine escapes a
character or not.
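As a small sketch of this behaviour, the script below encodes one string
with C<ascii> and with C<latin1>, then decodes both results with a plain
decoder that has none of these flags set. The sample string and variable
names are chosen for this illustration only:

```perl
use strict;
use warnings;
use Cpanel::JSON::XS;

# U+00E9 is inside Latin-1, U+20AC is outside it; both are outside ASCII.
my $data = ["\x{e9}\x{20ac}"];

# ascii: both characters are above 127, so both are escaped.
my $ascii = Cpanel::JSON::XS->new->ascii->encode ($data);

# latin1: only U+20AC is above 255, so only it is escaped.
my $latin1 = Cpanel::JSON::XS->new->latin1->encode ($data);

# decode does not need these flags: with utf8 disabled, both texts
# decode back to the original two-character string.
my $codec = Cpanel::JSON::XS->new;
print "round trip ok\n"
    if $codec->decode ($ascii)->[0] eq $data->[0]
    && $codec->decode ($latin1)->[0] eq $data->[0];
```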
The main use for C<latin1> or C<binary> is to relatively efficiently
store binary data as JSON, at the expense of breaking compatibility
with most JSON decoders.
The main use for C<ascii> is to force the output to not contain characters
with values > 127, which means you can interpret the resulting string
as UTF-8, ISO-8859-1, ASCII, KOI8-R or just about any character set and
JSON syntax is based on how literals are represented in javascript (the
not-standardized predecessor of ECMAscript) which is presumably why it is
called "JavaScript Object Notation".
However, JSON is not a subset (and also not a superset of course) of
ECMAscript (the standard) or javascript (whatever browsers actually
implement).
If you want to use javascript's C<eval> function to "parse" JSON, you
might run into parse errors for valid JSON texts, or the resulting data
structure might not be queryable:
One of the problems is that U+2028 and U+2029 are valid characters inside
JSON strings, but are not allowed in ECMAscript string literals, so the
following Perl fragment will not output something that can be guaranteed
to be parsable by javascript's C<eval>:
   use Cpanel::JSON::XS;
   print encode_json [chr 0x2028];
The right fix for this is to use a proper JSON parser in your javascript
programs, and not rely on C<eval> (see for example Douglas Crockford's
F<json2.js> parser).
If this is not an option, you can, as a stop-gap measure, simply encode to
ASCII-only JSON:
   use Cpanel::JSON::XS;
   print Cpanel::JSON::XS->new->ascii->encode ([chr 0x2028]);
Note that this will enlarge the resulting JSON text quite a bit if you
have many non-ASCII characters. You might be tempted to run some regexes
to only escape U+2028 and U+2029, e.g.:
   # DO NOT USE THIS!
   my $json = Cpanel::JSON::XS->new->utf8->encode ([chr 0x2028]);
   $json =~ s/\xe2\x80\xa8/\\u2028/g; # escape U+2028
   $json =~ s/\xe2\x80\xa9/\\u2029/g; # escape U+2029
   print $json;
Note that I<this is a bad idea>: the above only works for U+2028 and
U+2029 and thus only for fully ECMAscript-compliant parsers. Many existing
javascript implementations, however, have issues with other characters as
well - using C<eval> naively simply I<will> cause problems.
Another problem is that some javascript implementations reserve
some property names for their own purposes (which probably makes
them non-ECMAscript-compliant). For example, Iceweasel reserves the
C<__proto__> property name for its own purposes.
If that is a problem, you could try to filter the resulting JSON
output for these property strings, e.g.:
   $json =~ s/"__proto__"\s*:/"__proto__renamed":/g;
This works because C<__proto__> is not valid outside of strings, so every
occurrence of C<"__proto__"\s*:> must be a string used as property name.
Unicode non-characters between U+FFFD and U+10FFFF are either decoded
to the recommended U+FFFD REPLACEMENT CHARACTER (see Unicode PR #121:
Recommended Practice for Replacement Characters) or, in the binary or
relaxed mode, left as is, keeping the illegal non-characters as before.
Raw non-Unicode characters outside the valid Unicode range now fail to
parse, because "A string is a sequence of zero or more Unicode
characters" (RFC 7159 section 1) and "JSON text SHALL be encoded in
Unicode" (RFC 7159 section 8.1). We now use the UTF8_DISALLOW_SUPER
flag when parsing Unicode.
If you know of other incompatibilities, please let me know.
=head2 JSON and YAML
You often hear that JSON is a subset of YAML. However, I<in general, there
is no way to configure JSON::XS to output a data structure as valid YAML>
that works in all cases. If you really must use Cpanel::JSON::XS to
generate YAML, you should use this algorithm (subject to change in
future versions):
   my $to_yaml = Cpanel::JSON::XS->new->utf8->space_after (1);
   my $yaml = $to_yaml->encode ($ref) . "\n";
This will I<usually> generate JSON texts that also parse as valid
YAML.
=head2 SPEED
It seems that JSON::XS is surprisingly fast, as shown in the following
tables. They have been generated with the help of the C<eg/bench> program
in the JSON::XS distribution, to make it easy to compare on your own
system.
Along with L<Data::MessagePack> and L<Sereal>, JSON::XS is one of the
fastest serializers, because JSON and JSON::XS do not support backrefs (no
graph structures), only trees. Storable supports backrefs,
i.e. graphs. Data::MessagePack encodes its data in binary (as Storable
does) and supports only a very simple subset of JSON.
First comes a comparison between various modules using
a very short single-line JSON string (also available at
L<http://dist.schmorp.de/misc/json/short.json>).
   {"method": "handleMessage", "params": ["user1",
   "we were just talking"], "id": null, "array":[1,11,234,-5,1e5,1e7,
   1, 0]}
It shows the number of encodes/decodes per second (JSON::XS uses
the functional interface, while Cpanel::JSON::XS/2 uses the OO interface
with pretty-printing and hash key sorting enabled, Cpanel::JSON::XS/3 enables
shrink. JSON::DWIW/DS uses the deserialize function, while JSON::DWIW::FJ
uses the from_json method). Higher is better:
   module        |     encode |     decode |
   --------------|------------|------------|
   JSON::DWIW/DS |  86302.551 | 102300.098 |
   JSON::DWIW/FJ |  86302.551 |  75983.768 |
   JSON::PP      |  15827.562 |   6638.658 |