Unicode Summary Information

I hit upon Simon Schoenenberger’s Unicode character lookup table work whilst chasing around a JSON5 library and have re-imagined it.

The basic premise is that you can walk through utils/Unicode/UnicodeData.txt and extract:

  • the General Category

  • the relative Uppercase, Lowercase and Titlecase mappings

  • and Numeric Value

and discover that, instead of ~34k entries in utils/Unicode/UnicodeData.txt (representing 150k valid code points), there are just 430 variations.

For example, the usual ASCII uppercase code points all become: Lu 0 32 0 0 where the 32 means the corresponding lowercase code point can be derived from this by adding 32. Similarly, the ASCII lowercase code points all become Ll -32 0 -32 0.

The trailing zero is the numeric value which, as the ASCII uppercase and lowercase letters are not classed as numeric, will not be used. The C static structure initialisation needs something, though!

In fact, there are 194 such Lu Unicode code points and 204 such Ll Unicode code points if we’re just using utils/Unicode/UnicodeData.txt. Similarly, there are 76 Lu and Ll code points where the uppercase and lowercase character differ by 40 rather than 32.
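For instance, the raw utils/Unicode/UnicodeData.txt entries for U+0041 and U+0061 look like this (semicolon-separated fields; the last three are the Uppercase, Lowercase and Titlecase mappings):

    0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
    0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041

with #x61 - #x41 being 32, hence the relative offsets of 32 and -32.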

Whilst 430 variations sounds good, we still have to get from a code point to there – and we don’t want a 1114112 (#x110000) entry lookup table.

At this point Schoenenberger splits the code points into pages of 256 code points and does a double indirection from a PageIndex[cp >> 8] into an InfoIndex[page][cp & 0xff] to reference one of the 430 (529 in his case) variations. Those two tables come to some 43k words, which is a fraction of the nominal size.

The trick he’s pulling here is that when you divide the entire code point table up into groups of 256 code points, even those groups – which you might expect to be fairly random collections of references to the 430 unique variations – end up duplicating one another. And that’s without counting Planes 4 through 13, which all map to an Unassigned variation.
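A minimal sketch of that double indirection in C, reusing Schoenenberger’s table names (the element types and the split into a one-byte page index and two-byte info entries are illustrative, not necessarily his):

    #include <stdint.h>

    #define UNICODE_PAGE_SHIFT 8                    /* pages of 256 code points */
    #define UNICODE_PAGE_MASK  0xff

    /* PageIndex maps a code point's page number to one of the unique pages;
       InfoIndex maps (unique page, offset) to one of the ~430/529 variations */
    extern const uint8_t  PageIndex[0x110000 >> UNICODE_PAGE_SHIFT];
    extern const uint16_t InfoIndex[][1 << UNICODE_PAGE_SHIFT];

    static inline uint16_t unicode_variation (uint32_t cp)
    {
        uint8_t page = PageIndex[cp >> UNICODE_PAGE_SHIFT];

        return InfoIndex[page][cp & UNICODE_PAGE_MASK];
    }

Two array lookups and some masking gets any of the #x110000 code points to its variation.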

Unicode Character Database

For everything, see http://www.unicode.org/reports/tr44/. The relevant files have been copied into utils/Unicode.

Broadly, Unicode have defined 17 Planes, each with a potential for 65,536 code points. That makes any mapping at least 1114112 (#x110000) entries long. Blocks within those Planes are assigned for groups of code points with a common ancestry. For example, code points U+0000 through U+007F are ASCII. Some of those blocks are complete because the group is well defined, eg. ASCII, whereas others include unassigned code points leaving room for later additions. The order in which those blocks are assigned is (probably) an editorial whim.

Note: even though Unicode Planes 4-13 are currently unassigned, see Unicode Planes, it isn’t clear that Unicode will stick to 17 planes. Given that we have to read the Unicode UCD files we might as well leave the actual number of Unicode code points/Planes dynamic based on reading the files.

Further note: given that those 10 planes are unassigned, perhaps we should invent a sparse format for char-sets, saving at least 650k bits (10 planes of 65,536 bits) per char-set. Char-sets are now defined as an array of seventeen Bitsets.

UnicodeData.txt

utils/Unicode/UnicodeData.txt is a long list (34k entries as of Unicode 13.0.0) of individual code points and code point ranges (whose individual code point Properties are derivable within the range). Each code point has a General Category: Letter lowercase, Ll, Letter uppercase, Lu, Number digit, Nd, Symbol math, Sm, etc.

The set of code points is not contiguous and the current highest numbered code point is U+10FFFD, Plane 16 Private Use, Last.

DerivedCoreProperties.txt

utils/Unicode/DerivedCoreProperties.txt is a fairly long list (12k lines) which summarises collections of code points which share a common Property: Alphabetic, Math, etc. For example, the Property Lowercase is the collection of the Category Ll plus the Property Other_Lowercase.

Other_Lowercase, in case you were wondering, includes the likes of U+00AA, FEMININE ORDINAL INDICATOR, and U+00BA, MASCULINE ORDINAL INDICATOR.
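For reference, and trimmed of their trailing # comments, the entries in utils/Unicode/DerivedCoreProperties.txt pair a code point, or a range of code points, with a Property, along these lines:

    0061..007A    ; Lowercase
    00AA          ; Lowercase

(the other property files follow much the same layout).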

utils/Unicode/DerivedCoreProperties.txt defines other Properties such as ID_Start (the first code point of an identifier) and ID_Continue (subsequent code points for identifiers). I’m not sure where they get their information from as surely such a set is language-specific? You are welcome to read UNICODE IDENTIFIER AND PATTERN SYNTAX to understand their thinking.

PropList.txt

utils/Unicode/PropList.txt is a fairly short file with more Properties. Here we’re specifically looking to pick up on the White_Space Property.

If you did trust their ID_Start then you might want to be aware of their Other_ID_Start, defined in this file, too.

Again, it’s hard to see how they can clearly define Pattern_White_Space and Pattern_Syntax Properties.

GraphemeBreakProperty.txt

utils/Unicode/GraphemeBreakProperty.txt is a fairly short file with more breaks-between-things Properties.

Reserved

Reserved code points are a feature. In particular, they are not listed in utils/Unicode/UnicodeData.txt but are referenced in the three properties files. That might not be an issue except that utils/Unicode/GraphemeBreakProperty.txt lists them with the Property Control – which is a Property we are interested in.

Of course, so long as we only create summary information for code points in utils/Unicode/UnicodeData.txt then we’re good.

Summarising

Across these four files there are 65 Properties:

ASCII_Hex_Digit Alphabetic Bidi_Control CR Case_Ignorable Cased Changes_When_Casefolded Changes_When_Casemapped Changes_When_Lowercased Changes_When_Titlecased Changes_When_Uppercased Control Dash Default_Ignorable_Code_Point Deprecated Diacritic Extend Extender Grapheme_Base Grapheme_Extend Grapheme_Link Hex_Digit Hyphen IDS_Binary_Operator IDS_Trinary_Operator ID_Continue ID_Start Ideographic Join_Control L LF LV LVT Logical_Order_Exception Lowercase Math Noncharacter_Code_Point Other_Alphabetic Other_Default_Ignorable_Code_Point Other_Grapheme_Extend Other_ID_Continue Other_ID_Start Other_Lowercase Other_Math Other_Uppercase Pattern_Syntax Pattern_White_Space Prepend Prepended_Concatenation_Mark Quotation_Mark Radical Regional_Indicator Sentence_Terminal Soft_Dotted SpacingMark T Terminal_Punctuation Unified_Ideograph Uppercase V Variation_Selector White_Space XID_Continue XID_Start ZWJ

and the 29 General Category Values covering Letters, Marks, Numbers, Punctuation, Symbols, Separators and Others.

The 65 Properties are vexing. Firstly, they wouldn’t fit as bit-fields in a 64-bit integer; secondly, we need to take care as the groupings of Categories, such as Letter, are, presumably, slightly disjoint from the Property Alphabetic (in utils/Unicode/DerivedCoreProperties.txt) which is defined as Uppercase + Lowercase + Lt + Lm + Lo + Nl + Other_Alphabetic.

Ultimately, we’re looking to define a set of testable C bitfields for our various purposes, not necessarily serving any Unicode question.

SRFI-14 asks for quite a lot of char-sets which are derived as (with P for Property, C for Category):

SRFI-14 char-set definitions

  char-set                 derivation
  lower-case               P Lowercase
  upper-case               P Uppercase
  title-case               C Lt
  letter                   P Alphabetic
  digit                    C Nd
  graphic                  C L* + C N* + C M* + C S* + C P*
  white-space              P White_Space
  punctuation              C P*
  symbol                   C S*
  hex-digit                P ASCII_Hex_Digit
  blank                    C Zs + 0009
  control                  P Control
  regional-indicator       P Regional_Indicator
  extend-or-spacing-mark   P Extend + P SpacingMark
  hangul-l                 P L
  hangul-v                 P V
  hangul-t                 P T
  hangul-lv                P LV
  hangul-lvt               P LVT
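To make that notation concrete, a derivation like graphic’s C L* + C N* + C M* + C S* + C P* only needs the first letter of the General Category. A sketch, assuming a hypothetical accessor (usi_category_of() is a stand-in, not the real API) that returns the two-character Category code:

    #include <stdint.h>

    /* hypothetical stand-in returning the General Category, eg. "Lu", "Nd", "Zs" */
    extern const char *usi_category_of (uint32_t cp);

    static int srfi14_graphic_p (uint32_t cp)
    {
        switch (usi_category_of (cp)[0]) {
        case 'L': case 'N': case 'M': case 'S': case 'P':
            return 1;
        default:
            return 0;
        }
    }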

JSON5, via ECMAScript and, in particular, its definition of Identifier, wants to know about Categories Lu Ll Lt Lm Lo Nl Mn Mc Nd Pc and the Property ZWJ (and ZWNJ), although it defines those last two as fixed values.

Schoenenberger also distinguishes:

  • between decimal and fractional numeric values, as a fractional value has to be represented by a string, say, “1/4”

    That’s useful.

    Note that:

    (8) If the character has the property value Numeric_Type=Numeric, then the Numeric_Value of that character is represented with a positive or negative integer or rational number in this field

    and that, as of 13.0.0, there are no negative integers and a single negative rational.

  • where the upper/lower/title case causes an expansion in the number of code points

    We’ll skip that, today.

Across the needs of SRFI-14 and JSON5 we have 24 or so distinguishable flags (27 if we include the “expands” cases) with zero flags meaning “Unassigned”. Plenty of room in a 32-bit flags bit-field.
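A nominal set of such flags might be sketched as follows (the names and bit assignments are illustrative, not those of src/usi.h):

    /* illustrative flag bits, one per distinguishable SRFI-14/JSON5 need;
       a code point with no flags set is Unassigned */
    typedef enum {
        USI_FLAG_Titlecase_Letter   = 1 << 0,    /* C Lt */
        USI_FLAG_Letter             = 1 << 1,    /* P Alphabetic */
        USI_FLAG_Uppercase          = 1 << 2,    /* P Uppercase */
        USI_FLAG_Lowercase          = 1 << 3,    /* P Lowercase */
        USI_FLAG_Decimal_Number     = 1 << 4,    /* C Nd */
        USI_FLAG_White_Space        = 1 << 5,    /* P White_Space */
        USI_FLAG_Punctuation        = 1 << 6,    /* C P* */
        USI_FLAG_Symbol             = 1 << 7,    /* C S* */
        USI_FLAG_ASCII_Hex_Digit    = 1 << 8,    /* P ASCII_Hex_Digit */
        USI_FLAG_Control            = 1 << 9,    /* P Control */
        USI_FLAG_Regional_Indicator = 1 << 10,   /* P Regional_Indicator */
        USI_FLAG_Extend             = 1 << 11,   /* P Extend */
        USI_FLAG_SpacingMark        = 1 << 12,   /* P SpacingMark */
        /* ... and so on, up to the 24 (or 27) flags */
    } USI_flag_t;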


We can create nominal C bit-field flags along those lines and then regenerate our (originally 430) variations as static C structure initialisers:

{ flags, Category, UC offset, LC offset, TC offset, Numeric Value }

to get 465 distinct “summary” variations.
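One plausible C shape for those elements (the field types and ordering are a guess; the real idio_USI_t in src/usi.h may differ) could be:

    #include <stdint.h>

    typedef struct idio_USI_s {
        uint32_t flags;           /* the ~24 flag bits sketched above */
        uint32_t category;        /* General Category: Lu, Ll, Nd, ... */
        int32_t  cases[3];        /* Uppercase, Lowercase, Titlecase offsets */
        union {
            int64_t     dec;      /* integer Numeric_Value */
            const char *frac;     /* fractional Numeric_Value as a string, eg. "1/4" */
        } numeric;
    } idio_USI_t;                 /* 4 + 4 + 12 (+ padding) + 8 = 32 bytes on a
                                     typical 64-bit platform */

so the ASCII uppercase variation, Lu 0 32 0 0, would become an initialiser along the lines of { flags, category_Lu, { 0, 32, 0 }, { 0 } }.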

As there are more than 256 of them, references to them must be two bytes, that is, the page table entries must be uint16_t. Even without the flags field we had the original 430 variations which, again, is more than 256.

We now need a page indirection table to get us from a code point (one of #x110000) to a page in the page table; this indirection table will have #x110000 / page size entries.

When we generate page tables (each of 256 references to the 465 variants) we find that many of the “used” pages – those used by valid code points – are duplicates of one another, sometimes dramatically so (see the #pg vs used pg columns below).

Again, the number of unique pages we have determines whether one or two bytes are required for the page indirection table’s references into the page table.

We can calculate the size of the page table in bytes (unique pages * two bytes per variant reference * page size).

The total number of bytes we require for the C code-point-to-variation implementation is the sum of the sizes of those two tables.
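As a concrete check of that arithmetic, here is the page size 256 row from the table below (the variable names are just for illustration):

    #include <stdio.h>

    int main (void)
    {
        unsigned int pgsz   = 256;
        unsigned int pages  = 0x110000 / pgsz;          /* 4352 page indirection entries */
        unsigned int unique = 146;                      /* #pg: unique pages required */
        unsigned int ref_sz = (unique <= 256) ? 1 : 2;  /* one-byte refs while #pg fits */
        unsigned int total  = pages * ref_sz + unique * 2 * pgsz;

        printf ("%u bytes\n", total);                   /* 4352 * 1 + 146 * 2 * 256 = 79104 */
        return 0;
    }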


The next question concerns that choice of a page size of 256. What if we vary it a bit? Let’s try some numbers:

(these numbers are approximate as the code evolves)

effects of varying page size

  pgsz   pages   #pg   used pg    bytes   (bytes formula)
  1024    1088    55        62   113728   ( 1088 * 1 +  55 * 2 * 1024)
   512    2176    89       104    93312   ( 2176 * 1 +  89 * 2 *  512)
   256    4352   146       183    79104   ( 4352 * 1 + 146 * 2 *  256)
   128    8704   242       329    70656   ( 8704 * 1 + 242 * 2 *  128)
    64   17408   398       616    85760   (17408 * 2 + 398 * 2 *   64)
    32   34816   595      1175   107712   (34816 * 2 + 595 * 2 *   32)
    16   69632   745      2286   163104   (69632 * 2 + 745 * 2 *   16)

Here,

  • pgsz is our variable page size

  • pages is how many pgsz pages there are for #x110000 code points

  • #pg is the number of unique used pages

    Notice how, for smaller and smaller page sizes, the chance of seeing duplicates increases dramatically.

    The downside is the increase in the number of pages because of the small page size.

  • used pg are pages with valid code points in them

  • bytes and formula

    As noted, if the number of unique pages you need (#pg) is no more than 256 you only need one-byte offsets in the initial page indirection table (for Schoenenberger, PageIndex[cp >> 8]).

    The second indirection table (for Schoenenberger, InfoIndex[page][cp & 0xff]) always uses two-byte offsets (into the 465-entry summary information table), so its size is two bytes multiplied by the number of pages you require (#pg) and the page size (pgsz).

From this we can see a sweet spot, for this arrangement of flags etc., around a page size of 128, requiring 70656 bytes in C static tables. Schoenenberger used a page size of 256 for his arrangement.

The final sizing element is fairly fixed as it is 465 times the size of an idio_USI_t holding the summarised code point information. For the upper/lower/title case offsets, some of the relative numbers are over 32k, meaning we can’t use an int16_t; it must be an int32_t, which wastes a bit of space as the vast majority are 0 or small offsets.

An idio_USI_t, then, is 32 bytes, and therefore the variants table is 465 times that, 14880 bytes.

So a grand total of 85536 bytes (70656 + 14880) to store everything we need to know about Unicode.

C Tables

With this information generated by bin/gen-usi-c as src/usi.[ch] we can now do a number of useful things:

  1. we could create authentic Unicode equivalents to isdigit(3) and friends – although we don’t – assuming the Categories and Properties we have chosen are enough to answer all the questions we have

  2. we can have src/usi-wrap.c generate the SRFI-14 char-sets on the fly during startup (see the sketch after this list).

    This saves a huge amount of time as idio doesn’t have to read and parse (and then still have to generate) the SRFI-14 char-sets from lib/unicode.full.idio – which, in turn, no longer needs to exist.
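A heavily simplified sketch of that second idea, assuming a hypothetical flags accessor and char-set-populating function (the real src/usi-wrap.c is more involved):

    #include <stdint.h>

    /* hypothetical stand-ins for the real APIs */
    extern uint32_t usi_flags_of (uint32_t cp);           /* summary flags for cp */
    extern void     charset_add (void *cs, uint32_t cp);  /* add cp to a char-set */

    #define USI_FLAG_Lowercase (1 << 3)                   /* illustrative bit */

    /* build char-set:lower-case by scanning every code point at startup */
    static void build_lower_case (void *char_set, uint32_t max_cp)
    {
        uint32_t cp;

        for (cp = 0; cp < max_cp; cp++) {
            if (usi_flags_of (cp) & USI_FLAG_Lowercase) {
                charset_add (char_set, cp);
            }
        }
    }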
