.. include:: ../../global.rst

.. _USI:

***************************
Unicode Summary Information
***************************

I hit upon Simon Schoenenberger's `Unicode character lookup table `_
work whilst chasing around a JSON5_ library and have re-imagined it.

The basic premise is that you can walk through
:file:`utils/Unicode/UnicodeData.txt` and extract:

* the General Category

* the *relative* Uppercase, Lowercase and Titlecase mappings

* and Numeric Value

.. sidebox::

   430 will increase by several tens or so when we start getting picky
   about Properties etc..

   For example, the ``ASCII_Hex_Digit`` Property affects ASCII a-f and
   ASCII A-F and therefore makes them unlike the remaining ASCII
   lowercase and uppercase letters.

and discover that, instead of ~34k entries in
:file:`utils/Unicode/UnicodeData.txt` (representing 150k valid code
points), there are just 430 variations.

For example, the usual ASCII uppercase code points all become:

    ``Lu 0 32 0 0``

where the 32 means the corresponding lowercase code point can be
derived from this one by adding 32.  Similarly, the ASCII lowercase
code points all become ``Ll -32 0 -32 0``.  The trailing zero is the
numeric value which, as the ASCII uppercase and lowercase letters are
not classed as numeric, will not be used.  The C static structure
initialisation needs something, though!

In fact, there are 194 such ``Lu`` Unicode code points and 204 such
``Ll`` Unicode code points if we're just using
:file:`utils/Unicode/UnicodeData.txt`.  Similarly, there are 76 ``Lu``
and ``Ll`` code points where the uppercase and lowercase characters
differ by 40 rather than 32.

Whilst 430 variations sounds good, we still have to get from a code
point to one of them -- and we don't want a 1114112 (#x110000) entry
lookup table.

At this point Schoenenberger splits the code points into pages of 256
code points and does a double indirection from a ``PageIndex[cp >> 8]``
into an ``InfoIndex[page][cp & 0xff]`` to reference one of the 430 (529
in his case) variations.  Those two tables come to some 43k words,
which is a fraction of the nominal size.

The trick he's pulling here is that when you divide the entire code
point table up into groups of 256 code points then even those groups,
which you might think were fairly random collections of references to
the 430 unique variations, end up with duplicates.  And this isn't
counting Planes 4 through 13 which will all map to an Unassigned
variation.
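For illustration, a two-level lookup of that shape might be sketched as
follows -- ``PageIndex``, ``InfoIndex`` and ``variation_index`` here
are hypothetical stand-ins with the semantics described above, not
Schoenenberger's actual tables:

.. code-block:: c

   #include <stdint.h>

   #define PAGE_SHIFT 8                 /* pages of 256 code points */
   #define PAGE_MASK  0xff

   /* hypothetical generated tables */
   extern const uint8_t  PageIndex[0x110000 >> PAGE_SHIFT];
   extern const uint16_t InfoIndex[][1 << PAGE_SHIFT];

   static inline uint16_t variation_index (uint32_t cp)
   {
       /* first indirection: which de-duplicated page? */
       uint8_t page = PageIndex[cp >> PAGE_SHIFT];

       /* second indirection: which of the ~430 variations? */
       return InfoIndex[page][cp & PAGE_MASK];
   }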
Unicode Character Database
==========================

For everything, see http://www.unicode.org/reports/tr44/.  The relevant
files have been copied into :file:`utils/Unicode`.

Broadly, Unicode have defined 17 Planes each with a potential for
65,536 code points.  That makes any mapping at least 1114112 (#x110000)
entries long.

Blocks within those Planes are assigned for groups of code points with
a common ancestry.  For example, code points ``U+0000`` through
``U+007F`` are ASCII.  Some of those blocks are complete because the
group is well defined, eg. ASCII, whereas others include unassigned
code points leaving room for later additions.  The order in which those
blocks are assigned is (probably) an editorial whim.

Note: even though Unicode Planes 4-13 are currently unassigned, see
`Unicode Planes `_, it isn't clear that Unicode will stick to 17
planes.  Given that we have to read the Unicode UCD files we might as
well leave the actual number of Unicode code points/Planes dynamic,
based on reading the files.

Further note: given that those 10 planes are unassigned, perhaps we
should invent a sparse char-set format for char-sets, saving at least
650k bits per char-set.

Now defined as an array of seventeen :ref:`bitsets`.

UnicodeData.txt
---------------

:file:`utils/Unicode/UnicodeData.txt` is a long list (34k entries as of
Unicode 13.0.0) of individual code points and code point ranges (whose
individual code point Properties are derivable within the range).

Each code point has a *General Category*: Letter lowercase, ``Ll``,
Letter uppercase, ``Lu``, Number digit, ``Nd``, Symbol Math, ``Sm``,
etc..

The set of code points is not contiguous and the current highest
numbered code point is ``U+10FFFD``, Plane 16 Private Use, Last.
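For a flavour of what walking the file involves, here is a rough sketch
of splitting a :file:`UnicodeData.txt` line into its fifteen
``;``-separated fields and pulling out the parts we care about -- this
is only an illustration, not the real generator.  Note that
:manpage:`strtok(3)` is no use here as it skips the many empty fields:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   int main (void)
   {
       char line[BUFSIZ];

       while (fgets (line, sizeof (line), stdin)) {
           char *fields[15];
           int nf = 0;
           char *s = line;

           /* walk the ';'s by hand: empty fields are significant */
           while (nf < 15) {
               fields[nf++] = s;
               char *semi = strchr (s, ';');
               if (NULL == semi) break;
               *semi = '\0';
               s = semi + 1;
           }

           if (nf < 15) continue;

           uint32_t cp = strtoul (fields[0], NULL, 16);
           char *category = fields[2];  /* eg. Lu, Ll, Nd, Sm */

           /*
            * fields 12-14 are the Uppercase/Lowercase/Titlecase
            * mappings: empty means "maps to itself", ie. a relative
            * offset of 0
            */
           int32_t uc_off = fields[12][0] ? (int32_t) strtoul (fields[12], NULL, 16) - (int32_t) cp : 0;
           int32_t lc_off = fields[13][0] ? (int32_t) strtoul (fields[13], NULL, 16) - (int32_t) cp : 0;

           printf ("%04X %s uc%+d lc%+d\n", cp, category, uc_off, lc_off);
       }

       return 0;
   }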
DerivedCoreProperties.txt
-------------------------

:file:`utils/Unicode/DerivedCoreProperties.txt` is a fairly long list
(12k lines) which summarises collections of code points which share a
common *Property*: ``Alphabetic``, ``Math``, etc..

For example, the *Property* ``Lowercase`` is the collection of the
*Category* ``Ll`` plus the *Property* ``Other_Lowercase``.
``Other_Lowercase``, in case you were wondering, includes the likes of
``U+00AA, FEMININE ORDINAL INDICATOR`` and ``U+00BA, MASCULINE ORDINAL
INDICATOR``.

:file:`utils/Unicode/DerivedCoreProperties.txt` defines other
Properties such as ``ID_Start`` (the first code point of an identifier)
and ``ID_Continue`` (subsequent code points for identifiers).  I'm not
sure where they get their information from as surely such a set is
language-specific?  You are welcome to read `UNICODE IDENTIFIER AND
PATTERN SYNTAX `_ to understand their thinking.

PropList.txt
------------

:file:`utils/Unicode/PropList.txt` is a fairly short file with more
Properties.  Here we're specifically looking to pick up on the
``White_Space`` Property.

If you did trust their ``ID_Start`` then you might want to be aware of
their ``Other_ID_Start``, defined in this file, too.

Again, it's hard to see how they can clearly define the
``Pattern_White_Space`` and ``Pattern_Syntax`` Properties.

GraphemeBreakProperty.txt
-------------------------

:file:`utils/Unicode/GraphemeBreakProperty.txt` is a fairly short file
with more breaks-between-things Properties.

Reserved
--------

Reserved code points are a feature.  In particular, they are not listed
in :file:`utils/Unicode/UnicodeData.txt` but are referenced in the
three Properties files.  That might not be an issue except that
:file:`utils/Unicode/GraphemeBreakProperty.txt` lists them as the
*Property* ``Control`` -- which is a Property we are interested in.

Of course, so long as we only create summary information for code
points in :file:`utils/Unicode/UnicodeData.txt` then we're good.

Summarising
===========

Across these four files there are 65 Properties:

    ``ASCII_Hex_Digit`` ``Alphabetic`` ``Bidi_Control`` ``CR``
    ``Case_Ignorable`` ``Cased`` ``Changes_When_Casefolded``
    ``Changes_When_Casemapped`` ``Changes_When_Lowercased``
    ``Changes_When_Titlecased`` ``Changes_When_Uppercased`` ``Control``
    ``Dash`` ``Default_Ignorable_Code_Point`` ``Deprecated``
    ``Diacritic`` ``Extend`` ``Extender`` ``Grapheme_Base``
    ``Grapheme_Extend`` ``Grapheme_Link`` ``Hex_Digit`` ``Hyphen``
    ``IDS_Binary_Operator`` ``IDS_Trinary_Operator`` ``ID_Continue``
    ``ID_Start`` ``Ideographic`` ``Join_Control`` ``L`` ``LF`` ``LV``
    ``LVT`` ``Logical_Order_Exception`` ``Lowercase`` ``Math``
    ``Noncharacter_Code_Point`` ``Other_Alphabetic``
    ``Other_Default_Ignorable_Code_Point`` ``Other_Grapheme_Extend``
    ``Other_ID_Continue`` ``Other_ID_Start`` ``Other_Lowercase``
    ``Other_Math`` ``Other_Uppercase`` ``Pattern_Syntax``
    ``Pattern_White_Space`` ``Prepend``
    ``Prepended_Concatenation_Mark`` ``Quotation_Mark`` ``Radical``
    ``Regional_Indicator`` ``Sentence_Terminal`` ``Soft_Dotted``
    ``SpacingMark`` ``T`` ``Terminal_Punctuation``
    ``Unified_Ideograph`` ``Uppercase`` ``V`` ``Variation_Selector``
    ``White_Space`` ``XID_Continue`` ``XID_Start`` ``ZWJ``

and (the) 29 `General Category Values `_ covering Letters, Marks,
Numbers, Punctuation, Symbols, Separators and Others.

The 65 Properties are vexing.  Firstly, they wouldn't fit as bit-fields
in a 64-bit integer, but we also need to take care as the *groupings of
Categories*, such as ``Letter``, are, presumably, slightly disjoint
from the *Property* ``Alphabetic`` (in
:file:`utils/Unicode/DerivedCoreProperties.txt`) which is defined as
``Uppercase + Lowercase + Lt + Lm + Lo + Nl + Other_Alphabetic``.

Ultimately, we're looking to define a set of testable C bit-fields for
*our* various purposes, not necessarily serving any Unicode question.

SRFI-14 asks for quite a lot of char-sets which are derived as (with P
for Property, C for Category):

.. csv-table:: SRFI-14 char-set definitions
   :widths: auto
   :align: left

   lower-case, P Lowercase
   upper-case, P Uppercase
   title-case, C Lt
   letter, P Alphabetic
   digit, C Nd
   graphic, C L* + C N* + C M* + C S* + C P*
   white-space, P White_Space
   punctuation, C P*
   symbol, C S*
   hex-digit, P ASCII_Hex_Digit
   blank, C Zs + 0009
   control, P Control
   regional-indicator, P Regional_Indicator
   extend-or-spacing-mark, P Extend + P SpacingMark
   hangul-l, P L
   hangul-v, P V
   hangul-t, P T
   hangul-lv, P LV
   hangul-lvt, P LVT

JSON5, through ECMAScript and, in particular, the definition of
`Identifier `_, wants to know the *Categories* ``Lu`` ``Ll`` ``Lt``
``Lm`` ``Lo`` ``Nl`` ``Mn`` ``Mc`` ``Nd`` ``Pc`` and the *Property*
``ZWJ`` (and ``ZWNJ``) although it defines those last two as fixed
values.

Schoenenberger also distinguishes:

* between decimal and fractional numeric values, as the fractional part
  is to be represented by a string, say, "1/4"

  That's useful.  `Note that `_:

      (8) If the character has the property value Numeric_Type=Numeric,
      then the ``Numeric_Value`` of that character is represented with
      a positive or negative integer or rational number in this field

  and that, as of 13.0.0, there are no negative integers and a single
  negative rational.

* where the upper/lower/title case causes an expansion in the number of
  code points

  We'll skip that, today.

Across the needs of SRFI-14 and JSON5 we have 24 or so distinguishable
flags (27 if we include the "expands" cases) with zero (flags) meaning
"Unassigned".  Plenty of room in a 32-bit flags bit-field.
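Sketching that flags field out -- the names here are hypothetical and
the generated :file:`src/usi.h` is authoritative -- each char-set
membership test then becomes a single mask operation:

.. code-block:: c

   #include <stdint.h>

   typedef enum {
       IDIO_USI_FLAG_Titlecase_Letter   = 1 << 0,   /* C Lt */
       IDIO_USI_FLAG_Letter             = 1 << 1,   /* P Alphabetic */
       IDIO_USI_FLAG_Decimal_Number     = 1 << 2,   /* C Nd */
       IDIO_USI_FLAG_White_Space        = 1 << 3,   /* P White_Space */
       IDIO_USI_FLAG_Punctuation        = 1 << 4,   /* C P* */
       IDIO_USI_FLAG_Symbol             = 1 << 5,   /* C S* */
       IDIO_USI_FLAG_ASCII_Hex_Digit    = 1 << 6,   /* P ASCII_Hex_Digit */
       IDIO_USI_FLAG_Control            = 1 << 7,   /* P Control */
       IDIO_USI_FLAG_Regional_Indicator = 1 << 8,   /* P Regional_Indicator */
       /* ... one bit per remaining flag, 24 or so in all */
   } idio_USI_flag_t;

   /* zero flags means "Unassigned" */

   static inline int usi_isa (uint32_t flags, uint32_t mask)
   {
       return (flags & mask) ? 1 : 0;
   }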
----

We can create nominal C bit-field flags for those and then create our
(originally 430) variations by generating the stringified static C
structure elements:

.. code-block:: c

   { flags, Category, UC offset, LC offset, TC offset, Numeric Value }

to get 465 "summary" code points.  As there are more than 256 of them,
references to these must be two bytes, that is, the entries of the page
table must be ``uint16_t``.  Even without the flags field we still had
the original 430 which, again, is more than 256.

We now need a page indirection table to get us from a code point (one
of #x110000) to a page in the page table and this indirection table
will be :samp:`#x110000 / {page size}` entries.

When we generate page tables (of :samp:`{page size}` references to the
465 variants) we get some duplication among the "used" pages -- pages
used by valid code points -- sometimes a dramatic amount (see the
**#pg** vs **used pg** columns below).  Again, the number of unique
pages determines whether one or two bytes are required for the page
indirection table's references into the page table.  We can calculate
the size of the page table in bytes (:samp:`{unique pages} * {bytes per
variant reference} * {page size}`).

The total number of bytes we require in the C code point to variation
implementation is the sum of those two tables.

----

The next question comes about that choice of a page size of 256.  What
if we vary it a bit?  Let's try some numbers:

(these numbers are approximate as the code evolves)

.. csv-table:: effects of varying page size
   :header: pgsz, pages, #pg, used pg, bytes, (bytes formula)
   :widths: auto
   :align: left

   1024, 1088, 55, 62, 113728, ( 1088 * 1 + 55 * 2 * 1024)
   512, 2176, 89, 104, 93312, ( 2176 * 1 + 89 * 2 * 512)
   256, 4352, 146, 183, 79104, ( 4352 * 1 + 146 * 2 * 256)
   128, 8704, 242, 329, 70656, ( 8704 * 1 + 242 * 2 * 128)
   64, 17408, 398, 616, 85760, ( 17408 * 2 + 398 * 2 * 64)
   32, 34816, 595, 1175, 107712, ( 34816 * 2 + 595 * 2 * 32)
   16, 69632, 745, 2286, 163104, ( 69632 * 2 + 745 * 2 * 16)

Here,

* **pgsz** is our variable page size

* **pages** is how many **pgsz** pages there are for #x110000 code
  points

* **#pg** is the number of *unique* **used pg**

  Notice how, for smaller and smaller page sizes, the chance of seeing
  duplicates increases dramatically.  The downside is the increase in
  the number of pages because of the small page size.

* **used pg** are pages with valid code points in them

* **bytes** and (bytes formula)

  As noted, if the number of pages you need (**#pg**) is less than 256
  you only need one byte offsets in the initial page indirection table
  (for Schoenenberger, ``PageIndex[cp >> 8]``).

  The second indirection table (for Schoenenberger,
  ``InfoIndex[page][cp & 0xff]``) always uses a two byte offset (into
  the 465 entry summary information table) multiplied by the number of
  pages you require (**#pg**) and the size of the pages (**pgsz**).

From which we can see a sweet spot, for this arrangement of flags etc.,
around a page size of 128, requiring 70656 bytes of C static tables.
Schoenenberger used a page size of 256 for his arrangement.

The final sizing element is fairly fixed as it is 465 times the size of
an ``idio_USI_t`` holding the summarised code point information.  For
the upper/lower/title case offsets, some of the relative numbers are
over 32k meaning we can't use an ``int16_t``; it must be an
``int32_t``, which wastes a bit of space as the vast majority are 0 or
small offsets.  An ``idio_USI_t``, then, is 32 bytes, and therefore the
variants table is 465 times that, 14880 bytes.

So a grand total of 85536 bytes (70656 + 14880) to store everything
*we* need to know about Unicode.
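Pulling the sizing together, a summary entry might look something like
the following sketch -- the field names and the ``idio_USI_variants``
table name are illustrative, and the generated :file:`src/usi.h` is the
real thing:

.. code-block:: c

   #include <stdint.h>

   typedef struct idio_USI_s {
       uint32_t flags;       /* the 32-bit flags bit-field */
       uint32_t category;    /* one of the 29 General Categories */
       int32_t cases[3];     /* relative Upper/Lower/Titlecase offsets */
       union {
           int64_t dec;      /* integral Numeric_Value */
           const char *frac; /* fractional Numeric_Value, eg. "1/4" */
       } num;
   } idio_USI_t;             /* 4+4+12 (+4 padding) +8 = 32 bytes on LP64 */

   /* the 465 variants: 465 * 32 = 14880 bytes */
   extern const idio_USI_t idio_USI_variants[465];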
C Tables
========

With this information generated by :file:`bin/gen-usi-c` as
:file:`src/usi.[ch]` we can now do a number of useful things:

#. we could create authentic Unicode equivalents to
   :manpage:`isdigit(3)` and friends -- although we don't -- assuming
   the Categories and Properties we have chosen are enough to answer
   all the questions we have

#. we can have :file:`src/usi-wrap.c` generate the SRFI-14 char-sets on
   the fly during startup

   This saves a huge amount of time as :program:`idio` doesn't have to
   read and parse (and then still have to generate) the SRFI-14
   char-sets from :file:`lib/unicode.full.idio` -- which, in turn, no
   longer needs to exist.

.. include:: ../../commit.rst