Adopting simdjson for JSON::XS?

Peter Juhasz peter.juhasz at comnica.com
Sun Mar 19 19:58:19 CET 2023


Hi,

would you be interested in a patch that adds the simdjson library as an
alternative decoder to JSON::XS?

Simdjson is a relatively recent C++ library that promises fast JSON
parsing by virtue of using SIMD instructions found in recent CPUs
(https://github.com/simdjson/simdjson). Looking at their list of
language bindings, I noticed that a certain language is conspicuously
missing, so I wanted to remedy that. I used JSON::XS as a basis because
it's a module I rely on at work and I wanted to keep its interface.

I have some preliminary results:

      test_case     length   original   simdjson  diff%
------------------------------------------------------------
      long.json      18446   10888.62   13232.14  21.52
  longkeys.json   10000302      50.87     141.65 178.45
     short.json        130  521287.77  528995.26   1.48
   twitter.json     631515     254.92     375.07  47.14

Here, short.json and long.json are your test cases, twitter.json comes
from the simdjson example directory and longkeys.json is a silly thing
I've generated with the command `perl -MJSON::XS -le 'my $k = "a" x
1e5; my $x={}; for (1..50) {$x->{$k} = $k; $k++;} print encode_json($x)
' > longkeys.json`. Length is in bytes and the numbers in the original
and simdjson columns are the number of decodes per second.

For short or structure-heavy inputs the cost of housekeeping and
allocation of Perl structures may erase the benefits, but for documents
with long keys (especially with lots of Unicode and escaped characters)
the speedup is very real.

There are some incompatibilities and drawbacks, though:

- allow_tags is not and cannot be supported by this parser, because
simdjson is a strict JSON parser
- similarly, relaxed mode is not supported
- object filtering is not supported (though perhaps it can be managed)
- incremental parsing is not supported, since simdjson needs the entire
JSON document in memory (it does have a facility for decoding a stream
of more than one document, e.g. NDJSON, but I haven't explored that)
- the error messages are different
- it's C++ so compilation times are longer (but this doesn't affect
users)
- it's C++ so it's disgusting and needs a recent compiler
- it uses more memory

However, parsing well-formed documents works and produces output
identical to legacy mode, including decode_prefix mode. For now, I've
made the new parsing mode opt-in with a new, explicit switch (which
means that it's available only with the OO interface).

Now, if you are not interested in incorporating this functionality into
JSON::XS (and given that the library is C++, I suspect you won't be),
would you object to a fork? I'm reluctant to publish a forked version,
though; the world doesn't need yet another Perl JSON module.

And finally, a partially related nitpick that may be regarded as a bug:
the input '11111111111111111111111e1111111111111111111' is parsed as 0
by JSON::XS. The documentation does say that "[n]umbers containing a
fractional or exponential part will always be represented as numeric
(floating point) values, possibly at a loss of precision", but in this
case all of the precision is lost. Perhaps it would be better to
decode floating point numbers that don't fit into a double as strings:
round-trip fidelity is lost either way, but less information is lost
as a string. (Admittedly, this is an extreme edge case.)

best regards,
Peter Juhasz


