JSON5

To paraphrase that old adage about writing cryptographic libraries:

The first rule of writing a JSON library is that you do not write a JSON library.

—Wiser men than us

Of course, we are not wise but, armed with our limited knowledge, we are dangerous.

In truth, I didn’t want to write a JSON library. Isn’t the point that someone else has written one already? Indeed, what’s the pressing reason for getting involved with JSON at all?

Well, for good or for ill, JSON has become a de facto data interchange format for REST-oriented systems. And we’ll probably want to talk to REST-oriented systems at some point so our card is marked.

JSON itself, now RFC8259, appears in the guise of a machine data interchange format except that, as is the way, it has become a configuration format and is obligingly human-editable.

Humans are rubbish, though, and they:

  • like to throw in a comment or two as aide-mémoire and, despite an early iteration that did support comments, JSON does not support comments

  • like to delete lines of configuration, leaving trailing commas behind, which are illegal in JSON

  • like to use regular identifiers as the names in JSON objects rather than strings.

So JSON5 was created to accommodate these things and some other fettling around unrepresentable numbers that JSON explicitly denies.

So, job #1, find a JSON5 library.

In my first pass I could see that JSON5 involves Unicode and so if the JSON5 library doesn’t make heavy use of Unicode then it isn’t likely to be correct. Hmm, that rules a few out.

You also notice that whilst JSON5 is read in, by and large, only JSON is printed back out. I don’t think that is hugely unreasonable but it is interesting.

I then hit upon Simon Schoenenberger’s standalone C library for JSON5. Slightly more important is his Unicode character lookup table work. You can read more about my re-imagining of it in Unicode Summary Information and its changes to how Unicode is handled in Idio.

Parser

Enthused, I immediately broke rule one, above, and started writing a JSON5 parser which turns out, thanks to the use of ECMAScript Identifiers, to be far more tiresome than hoped.

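As an interesting aside, ECMAScript lets you use Unicode characters in Identifier names, which is good, right? Even more interestingly, it lets you use Unicode Escape Sequences in Identifier names: going by the grammar alone, \u00337 is the, otherwise illegal, Identifier 37. Illegal because an Identifier must start with (broadly) a “Letter”, yet the grammar also lets it start with a Unicode Escape Sequence. (ECMAScript 5.1 does add, in prose, that an escape cannot be used to put a character into an Identifier that would otherwise be illegal there, so a strict parser ought to reject this one.)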

It also has a throwback to JavaScript’s UTF-16 roots in that such Unicode Escape Sequences can only encode code points in the Basic Multilingual Plane (code points up to and including U+FFFF). Above that you must use the UTF-16 high-surrogate + low-surrogate two-step shuffle.
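
As a rough illustration of that two-step shuffle, here is a minimal sketch in Python (not the C of the library itself, and combine_surrogates is a made-up name) that folds a high/low surrogate pair back into a single code point:

```python
def combine_surrogates(high, low):
    """Fold a UTF-16 surrogate pair into one code point above U+FFFF."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: U+%04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: U+%04X" % low)
    # Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# \uD83C\uDFBC is how \uHHHH escapes have to spell U+1F3BC
print("U+%X" % combine_surrogates(0xD83C, 0xDFBC))
```

A real tokenizer would also have to decide what to do with a lone surrogate; this sketch simply refuses anything that isn't a well-formed pair.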

In the end, though, we have a passable, inefficient, JSON5 parser which, doubtless, fails in extremis.

The code is set up to be used standalone or as part of an extension to Idio.

As it can be used standalone, the code enforces C’s numeric limitations.

Limits

The JSON[5] format is a bit vague on limits, particularly for numbers.

Should there be any limits? Maybe? It’s meant to be a machine data interchange format after all!

There does appear to be tacit acceptance that in practice C might be a limiting factor. This is lightly addressed in RFC8259, which notes merely that, as IEEE 754 binary64 (double precision) numbers are widely implemented:

good interoperability can be achieved by implementations that expect no more precision or range than these provide

Integers

Integers, ultimately, ECMAScript NumericLiterals, are an unbounded sequence of decimal or hexadecimal digits and the decimal variant can have a signed exponent of an equally arbitrary number of digits.

In practice, though, my suspicion is virtually everything interacting with JSON is 64-bit bounded. What should we make of the withExponent example of 123e-456?

The current code handles that badly: I calculate, *ahem*, 0, as the JSON5 processing part stores decimal numbers as a C int64_t and shifting 123 down by 456 orders of magnitude doesn’t leave much behind.

The JSON RFC implies the use of IEEE 754 binary64 double precision numbers even for integers, with the effect that integers are bounded to, roughly, +/- 2^53.

Of note, IEEE 754 binary64 supports slightly fewer (significant) decimal digits, 15-17, in its mantissa than C’s native int64_t, which manages 18-19.
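
The 2^53 bound is easy to poke at. A small sketch in Python, whose float is the same IEEE 754 binary64 as C’s double on common platforms (survives_double is a hypothetical helper name, nothing from the library):

```python
import sys


def survives_double(n):
    """True if integer n is unchanged by a trip through a binary64 double."""
    return int(float(n)) == n


INT64_MAX = 2**63 - 1

print(survives_double(2**53))      # still exactly representable
print(survives_double(2**53 + 1))  # rounds to a neighbouring double
print(survives_double(INT64_MAX))  # an int64_t holds it, a double does not

# Guaranteed decimal digits in a double versus digits in INT64_MAX
print(sys.float_info.dig, len(str(INT64_MAX)))
```

The point being that an int64_t-based reader and a double-based reader will disagree about anything beyond 2^53, even though both are "64-bit".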

Of course, Bignums are a thing but are my bignums anything like your bignums? We defined Idio to support arbitrary significant digits (albeit they get normalised to 18 frequently) and a 32-bit exponent. What do your bignums support, assuming you have any? How can we share information reliably?

We’re back to wishy-washy commentary from the RFC:

A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

I can’t help but think that the standard of a machine data interchange format ought to, maybe, define some limits. Offer some options for extensibility, maybe?

Floating Point Numbers

Again, many systems will be using C’s double, which is limited to decimal exponents of roughly +308 down to -324 (counting subnormals) and so won’t stand up to the arbitrary magnitude and precision possible in JSON5.

Interestingly, that means we couldn’t have stored 123e-456 as a floating point number either!
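
To make that concrete, here is a hedged sketch in Python (again standing in for C’s double, and classify is an invented name) that compares the exact decimal value of a number’s text with what a binary64 conversion makes of it:

```python
from decimal import Decimal
import math


def classify(text):
    """Report whether a decimal literal survives conversion to binary64."""
    exact = Decimal(text)    # arbitrary-precision view of the text
    approx = float(text)     # what a double-based reader would store
    if math.isinf(approx):
        return "overflow"    # magnitude too large for a double
    if approx == 0.0 and exact != 0:
        return "underflow"   # non-zero text collapsed to zero
    return "ok"


for text in ("123e-456", "1E400", "1.5e300", "5e-324"):
    print(text, classify(text))
```

Note that "ok" here only means the magnitude fits; it says nothing about digits lost to rounding, which is the RFC’s other worry with 3.141592653589793238462643383279.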

Implementation

Reading

JSON5 is designed for data interchange so the ultimate goal is for us to extract our local interpretation of the textual message.

In the very first instance, I am assuming that the input stream is UTF-8 and I’ve chosen to use the same multi-width technique as for Idio strings meaning I can extract individual code points without subsequent re-interpretation of the UTF-8 input stream.

This means we’re already facing a double memory allocation. In the first instance we need to allocate memory for the UTF-8 input stream and then we allocate memory for a multi-width array of code points based on that input stream.

To keep the code standalone, it falls back on C’s limitations, especially for numbers.

The “unicode string” can then be tokenized using the JSON5 grammar. This is where the ECMAScript rules for Identifiers and Numbers come into play. The result of this should be a chain of JSON5 tokens (identifiers, punctuation, strings and numbers) each of which has an associated JSON5 value.

We can then parse the chain of tokens to validate the input stream from which we should be able to derive a single JSON5 value – the bound up collection of values.

No tokens is a fail and any left over tokens is a fail, of course.
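
The shape of that token-chain pass can be sketched as below. This is Python rather than the library’s C, and the token representation (bare punctuation strings plus (kind, value) tuples) and every name in it are assumptions made for the illustration:

```python
class JSON5Error(ValueError):
    pass


def parse_tokens(tokens):
    """Turn a complete chain of tokens into exactly one value."""
    if not tokens:
        raise JSON5Error("no tokens")
    value, pos = _value(tokens, 0)
    if pos != len(tokens):
        raise JSON5Error("left over tokens")
    return value


def _peek(tokens, pos):
    if pos >= len(tokens):
        raise JSON5Error("unexpected end of tokens")
    return tokens[pos]


def _value(tokens, pos):
    tok = _peek(tokens, pos)
    if tok == "[":
        return _array(tokens, pos + 1)
    if tok == "{":
        return _object(tokens, pos + 1)
    if isinstance(tok, tuple) and tok[0] in ("string", "number", "literal"):
        return tok[1], pos + 1
    raise JSON5Error("unexpected token %r" % (tok,))


def _array(tokens, pos):
    items = []
    while _peek(tokens, pos) != "]":
        value, pos = _value(tokens, pos)
        items.append(value)
        if _peek(tokens, pos) == ",":
            pos += 1                      # a trailing comma is fine in JSON5
        elif _peek(tokens, pos) != "]":
            raise JSON5Error("expected , or ]")
    return items, pos + 1


def _object(tokens, pos):
    members = {}
    while _peek(tokens, pos) != "}":
        name = _peek(tokens, pos)
        if not (isinstance(name, tuple) and name[0] in ("string", "identifier")):
            raise JSON5Error("expected a member name")
        if _peek(tokens, pos + 1) != ":":
            raise JSON5Error("expected :")
        value, pos = _value(tokens, pos + 2)
        members[name[1]] = value          # identifiers are treated as strings
        if _peek(tokens, pos) == ",":
            pos += 1
        elif _peek(tokens, pos) != "}":
            raise JSON5Error("expected , or }")
    return members, pos + 1
```

For example, parse_tokens(["{", ("identifier", "foo"), ":", ("string", "bar"), ",", "}"]) gives a one-member dictionary, whereas an empty chain or two values back to back raise JSON5Error.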

The usual JSON[5] we see is an object, { ... }, although the shortest valid JSON5 could be a single digit decimal number, {}, [] or ""/'' with NaN a close contender.

Strings (and identifiers) are reasonably straightforward, although you need to be leery of the various escape sequences allowed, which make validating the input stream more tiresome.

ECMAScript Identifiers can have

  • UnicodeEscapeSequences in them, \uHHHH

    including the UTF-16 high-surrogate + low-surrogate pairs

whereas strings can have any of the more common

  • C-style, \n

    including \q, for any non-special code point q, meaning q itself

  • HexEscapeSequences, \xHH

  • the same UnicodeEscapeSequences as for Identifiers

  • and escaped LineTerminatorSequences as JSON5 users get to experience multi-line strings (albeit with the LineContinuation bodge). Woo!

What this means is that we have to reallocate identifiers and strings again to accommodate any escape sequences. Technically, an escape sequence will result in a shorter string (the two code points of \\, the four code points of a HexEscapeSequence and the six or twelve code points of a UnicodeEscapeSequence each reduce to one code point) but the chances are the escape sequence is generating a wider character than the (commonly) ASCII single-byte-width source text, so the whole identifier/string value needs widening.
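
The escape handling for strings might look something like the following. It is a simplified sketch in Python, not the library’s C: unescape and _hex are invented names, it works on the text between the quotes, and it doesn’t police every corner (\0 followed by a digit, for one).

```python
SINGLE = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
          "t": "\t", "v": "\v", "0": "\0"}
LINE_TERMINATORS = "\n\r\u2028\u2029"


def _hex(body, start, count):
    digits = body[start:start + count]
    if len(digits) != count or any(d not in "0123456789abcdefABCDEF" for d in digits):
        raise ValueError("bad escape sequence")
    return int(digits, 16)


def unescape(body):
    """Resolve the escape sequences in the body of a JSON5 string."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        esc = body[i + 1]
        if esc == "x":                               # \xHH
            out.append(chr(_hex(body, i + 2, 2)))
            i += 4
        elif esc == "u":                             # \uHHHH, maybe a pair
            cp = _hex(body, i + 2, 4)
            i += 6
            if 0xD800 <= cp <= 0xDBFF and body[i:i + 2] == "\\u":
                low = _hex(body, i + 2, 4)
                if 0xDC00 <= low <= 0xDFFF:
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                    i += 6
            out.append(chr(cp))
        elif esc in "123456789":
            raise ValueError("decimal digit escapes are not allowed")
        elif esc in LINE_TERMINATORS:                # LineContinuation
            i += 2
            if esc == "\r" and body[i:i + 1] == "\n":
                i += 1
        elif esc in SINGLE:                          # C-style, \n etc.
            out.append(SINGLE[esc])
            i += 2
        else:                                        # \q means q
            out.append(esc)
            i += 2
    return "".join(out)
```

Python strings hide the width question entirely, of course; in C the result of each branch is what decides whether the whole value has to be widened.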

Given that JSON5 identifiers are only used as alternatives to strings for the “member names” in objects (and differ only in the rules about their construction and escape sequences), treating them subsequently as strings seems to be OK.

Number values, as noted, are limited to C’s int64_t and double.

Overall, then, we’re left with a json5_value_t * to do something with.

Into Idio

Given that JSON5 allows for some literals, strings, numbers, objects and arrays, I think we have a reasonably straightforward translation into Idio symbols, strings, numbers, hash tables and arrays.

There don’t appear to be any requirements for those JSON5 values that lie beyond those we use in Idio ourselves, other than constraints on numbers. Hopefully, Idio’s limits lie beyond C’s so we should be able to accommodate anything in a json5_value_t.

Of course, the downside of going via C values is that we could have accommodated a wider set of JSON5 numbers directly, largely as Bignums, although I’m struggling to see a use case.

I’m thinking of the 123e-456 example which requires that the sender have some expectation that the receiver can store such a value. Does the sender know the capabilities of the receiver? It’s not like there was an HTTP- or SSH-style exchange of headers to agree a set of acceptable limits.

In the meanwhile, we get to allocate strings again as, whilst the multi-width “unicode string” format is the same, the json5_value_t is likely to get freed shortly after conversion into an Idio string so we can’t be lazy.

Numbers, of course, will need the “full” conversion into fixnums and bignums.

Writing

Going the other way, generating JSON5 from an Idio value is slightly less traumatic albeit we hit the JSON5/JSON conundrum.

What is the receiver expecting? We’re not in the HTTP-negotiation loop so we don’t know what format is being agreed. In the first instance, I think we can only safely generate JSON, like everyone else.

That’s hardly the end of the world: the JSON5 format was constructed for the convenience of humans, not machines, albeit legal JSON5 terms are not necessarily legal JSON terms.

In the first instance, there is no need to translate the Idio value into json5_value_t values; we merely need to validate it and then print it. Validation here means checking that the value consists of JSON elements (largely objects and arrays), although there’s that hinted-at nuance.

If we read in JSON5, NaN is a valid value. It is not valid JSON – which only accepts null, true and false literals. JSON5’s NaN could become JSON’s "NaN" with attendant mis-interpretation – did I mean the IEEE-754 NaN or the (ostensibly random three-letter) string "NaN"?

We could, of course, claim that NaN is invalid JSON – which it is – which means that the read/tweak/write loop becomes the more risky read/tweak/barf loop.

I’m more tempted to fail, as there is more legitimacy in failing to generate JSON than in second-guessing how to mangle JSON5’s extras into JSON.

An alternative is to upgrade the output to JSON5 if we see any JSON5-specific elements and hope that the next guy is JSON5-aware.

I guess the defining approach would be to have two generators, albeit largely the same code, which validate and generate JSON and JSON5 distinctly.
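
A sketch of what "largely the same code" could mean, with Python values standing in for Idio ones and a single flag choosing the dialect. The generate name and its json5 parameter are inventions for this illustration, not the Idio API:

```python
import json
import math


def generate(value, json5=False):
    """Emit JSON, or JSON5 when asked; fail rather than mangle."""
    if value is None:
        return "null"
    if value is True:
        return "true"
    if value is False:
        return "false"
    if isinstance(value, float) and not math.isfinite(value):
        if not json5:
            # read/tweak/barf: JSON has no spelling for NaN or Infinity
            raise ValueError("%r has no JSON representation" % value)
        if math.isnan(value):
            return "NaN"
        return "Infinity" if value > 0 else "-Infinity"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return json.dumps(value)          # borrow the standard string escaping
    if isinstance(value, list):
        return "[" + ",".join(generate(v, json5) for v in value) + "]"
    if isinstance(value, dict):
        members = []
        for name, v in value.items():
            if not isinstance(name, str):
                raise TypeError("member names must be strings")
            members.append(json.dumps(name) + ":" + generate(v, json5))
        return "{" + ",".join(members) + "}"
    raise TypeError("cannot represent %r" % type(value).__name__)
```

With this, generate(float("nan")) raises ValueError whereas generate(float("nan"), json5=True) returns NaN, which is the whole of the difference between the two generators in this sketch.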

One interesting side-effect of our choice of processing identifier-style “member names” as strings is that we are unable to distinguish them as (originally) identifiers when writing. Indeed we can’t distinguish between (originally) single- and double-quoted strings.

Another string-issue is that we are unable to reproduce any of the original escapes from the input stream. The Example 2 of '\uD83C\uDFBC', which is U+1F3BC, MUSICAL SCORE, will be emitted as a UTF-8 sequence: 🎼.

That’s not wrong but it’s not Art. How might we choose to re-escape code points?

Pernicious RFC detail for strings: the “control characters” U+0000 through U+001F MUST be escaped. I can’t say as I’ve ever noticed that although, in my defence, I don’t recall ever having had a snippet of JSON where I expected to have a control character in a string. It must be true. Of all implementations.
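
One possible answer to the re-escaping question, sketched in Python under the assumption that we want the mandatory control character escapes always and \uHHHH escapes for everything non-ASCII only on request (escape_json_string and ascii_only are hypothetical names):

```python
SHORT = {'"': '\\"', "\\": "\\\\", "\b": "\\b", "\f": "\\f",
         "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def escape_json_string(s, ascii_only=False):
    """Quote s as a JSON string, escaping what the RFC says MUST be escaped."""
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if ch in SHORT:
            out.append(SHORT[ch])
        elif cp < 0x20:
            # U+0000 through U+001F must never appear raw
            out.append("\\u%04X" % cp)
        elif ascii_only and cp > 0xFFFF:
            # beyond the BMP: back to a surrogate pair
            v = cp - 0x10000
            out.append("\\u%04X\\u%04X" % (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)))
        elif ascii_only and cp > 0x7E:
            out.append("\\u%04X" % cp)
        else:
            out.append(ch)
    out.append('"')
    return "".join(out)


print(escape_json_string("\U0001F3BC"))
print(escape_json_string("\U0001F3BC", ascii_only=True))
```

Neither form recovers what the original author typed; it merely picks a consistent policy for what goes out.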

Another side-effect is that hexadecimal numbers in the original will have been translated to integers and will not remain flagged as hexadecimal when we come to output them, with the examples of -0xC0FFEE and 0xdecaf becoming the less obvious -12648430 and 912559 respectively.
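
If we did want to keep the radix, the value would have to carry a flag alongside the integer. A hedged sketch in Python (read_integer and write_integer are made-up names, and hexadecimal output would be JSON5-only in any case):

```python
def read_integer(text):
    """Return (value, was_hex) for a JSON5 decimal or hexadecimal integer."""
    body = text.lstrip("+-")
    is_hex = body[:2].lower() == "0x"
    return int(text, 16 if is_hex else 10), is_hex


def write_integer(value, is_hex=False):
    """Print an integer, in hexadecimal if the flag was preserved."""
    if not is_hex:
        return str(value)
    sign = "-" if value < 0 else ""
    return "%s0x%X" % (sign, abs(value))


value, was_hex = read_integer("-0xC0FFEE")
print(write_integer(value))           # the radix is gone
print(write_integer(value, was_hex))  # the radix is kept, via the flag
```

Even then the original letter case is lost: 0xdecaf would come back as 0xDECAF unless we stored that as well, at which point we are most of the way to keeping the source text.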

None of these side-effects should matter as JSON[5] is a machine data interchange format. But humans like to read the output.

From Idio

With a mechanism to read JSON5 from an external UTF-8 source – presumably a file (descriptor) – it shouldn’t be too much effort to reuse the existing code to read JSON5 from an Idio string.

Even if we need to reallocate the space (as neither the JSON5 code nor the Idio code knows how long the other will maintain the memory allocation), at least the underlying multi-width “unicode string” format is the same so it is a quick copy.

That means we should be able to say:

v-in := json5/parse-string "
{
  foo : \"bar\"
}
"

Albeit what we really want to do is pass Idio hash tables etc.:

v-out := json5/generate {
  foo : "bar"
}

The question is, what is v-out? It’s a string, the UTF-8 representation of the JSON5 encoding of the Idio hash table (as a JSON5 object).

What we are saying here, though, is that there is no native JSON5 type in Idio. That’s not to say that json5_value_t etc. have disappeared; it’s that they are not accessible to Idio. They were used, fleetingly, in translation, like a (normal) programming language’s Abstract Syntax Tree.

We read a JSON[5] input stream from an external source (file or string), have it validated and a local representation of the data returned.

When we convert Idio values to JSON[5], we don’t get some cascading JSON[5] structure (we already had that in the Idio value); we get the UTF-8 representation of the wire-ready JSON[5] (output) stream.

Last built at 2024-12-21T07:11:01Z+0000 from 463152b (dev)