Unicode Summary Information¶
I hit upon Simon Schoenenberger’s Unicode character lookup table work whilst chasing around a JSON5 library and have re-imagined it.
The basic premise is that you can walk through
utils/Unicode/UnicodeData.txt and extract:

- the General Category
- the relative Uppercase, Lowercase and Titlecase mappings
- the Numeric Value

and discover that, instead of ~34k entries in
utils/Unicode/UnicodeData.txt (representing 150k valid code points),
there are just 430 variations.

Note: that 430 will increase by several tens or so when we start
getting picky about Properties etc.. For example, the ASCII_Hex_Digit
Property affects ASCII a-f and ASCII A-F and therefore makes them
unlike the remaining ASCII lowercase and uppercase letters.
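The extraction can be sketched with a little field-splitting: UnicodeData.txt is semicolon-separated, with field 2 the General Category, field 8 the Numeric Value and fields 12/13/14 the simple Uppercase/Lowercase/Titlecase mappings. This is an illustrative helper, not the project's actual parser:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy field number `field` (0-based, semicolon-separated) of a
 * UnicodeData.txt line into buf.  Returns buf, or NULL if the line
 * has too few fields.
 */
static const char *udt_field (const char *line, int field, char *buf, size_t buflen)
{
    const char *p = line;

    for (int i = 0; i < field; i++) {
	p = strchr (p, ';');
	if (p == NULL) {
	    return NULL;
	}
	p++;
    }

    const char *e = strchr (p, ';');
    size_t n = e ? (size_t) (e - p) : strlen (p);
    if (n >= buflen) {
	n = buflen - 1;
    }
    memcpy (buf, p, n);
    buf[n] = '\0';

    return buf;
}
```

So for the real U+0041 line, field 2 gives "Lu" and field 13 (the simple lowercase mapping) gives "0061".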
For example, the usual ASCII uppercase code points all become: Lu 0 32
0 0, where the 32 means the corresponding lowercase code point can be
derived from this one by adding 32. Similarly, the ASCII lowercase
code points all become Ll -32 0 -32 0.
The trailing zero is the numeric value which, as the ASCII uppercase and lowercase letters are not classed as numeric, will not be used. The C static structure initialisation needs something, though!
In fact, there are 194 such Lu Unicode code points and 204 such Ll
Unicode code points if we’re just using
utils/Unicode/UnicodeData.txt. Similarly, there are 76 Lu and Ll code
points where the uppercase and lowercase characters differ by 40
rather than 32.
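The offset encoding can be sketched as below. The structure and names are illustrative (the project's own layout comes later); the point is that a single shared variation, plus the code point itself, is enough to derive the case mappings:

```c
#include <assert.h>
#include <stdint.h>

/*
 * One "variation": a General Category plus relative case offsets and
 * a numeric value.  Adding an offset to the code point gives the
 * corresponding cased code point.
 */
typedef struct {
    const char *category;	/* "Lu", "Ll", ... */
    int32_t uc_offset;		/* add to get the uppercase code point */
    int32_t lc_offset;		/* add to get the lowercase code point */
    int32_t tc_offset;		/* add to get the titlecase code point */
    int32_t numeric;		/* unused (0) for letters */
} variation_t;

/* the shared ASCII uppercase variation: Lu 0 32 0 0 */
static const variation_t ascii_Lu = { "Lu", 0, 32, 0, 0 };

/* the shared ASCII lowercase variation: Ll -32 0 -32 0 */
static const variation_t ascii_Ll = { "Ll", -32, 0, -32, 0 };

static uint32_t to_lower (uint32_t cp, const variation_t *v)
{
    return cp + v->lc_offset;
}

static uint32_t to_upper (uint32_t cp, const variation_t *v)
{
    return cp + v->uc_offset;
}
```

All 26 ASCII uppercase letters (and 168 others) reference the one ascii_Lu entry, which is where the compression comes from.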
Whilst 430 variations sounds good, we still have to get from a code point to there – and we don’t want a 1114112 (#x110000) entry lookup table.
At this point Schoenenberger splits the code points into pages of 256
code points and does a double indirection from a PageIndex[cp >> 8]
into an InfoIndex[page][cp & 0xff] to reference one of the 430 (529 in
his case) variations. Those two tables come to some 43k words, which
is a fraction of the nominal size.
The trick he’s pulling here is that when you divide the entire code point table up into groups of 256 code points then even those groups, which you might think were fairly random collections of references to the 430 unique variations, end up with duplicates. And that’s without considering Planes 4 through 13 which will all map to an Unassigned variation.
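A toy version of that double indirection, with hand-rolled illustrative data rather than the generated tables, shows the mechanism: duplicate pages (here, two all-Unassigned pages) share a single InfoIndex row:

```c
#include <assert.h>
#include <stdint.h>

enum { UNASSIGNED = 0, ASCII_UPPER = 1 };

/*
 * Pages 1, 2 and 3 contain no assigned code points in this toy so
 * they all map to InfoIndex row 0; only page 0 gets its own row.
 */
static const uint8_t PageIndex[4] = { 1, 0, 0, 0 };

static const uint16_t InfoIndex[2][256] = {
    [0] = { 0 },			/* every entry UNASSIGNED */
    [1] = { [0x41] = ASCII_UPPER,	/* 'A' */
	    [0x5a] = ASCII_UPPER },	/* 'Z' */
};

/* the double indirection: PageIndex[cp >> 8] then [cp & 0xff] */
static uint16_t usi_lookup (uint32_t cp)
{
    return InfoIndex[PageIndex[cp >> 8]][cp & 0xff];
}
```

The full tables are the same shape, just with 430-odd variation indices and a PageIndex covering all #x1100 pages.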
Unicode Character Database¶
For everything, see http://www.unicode.org/reports/tr44/. The
relevant files have been copied into utils/Unicode.
Broadly, Unicode have defined 17 Planes, each with a potential for
65,536 code points. That makes any mapping at least 1114112 (#x110000)
entries long. Blocks within those Planes are assigned for groups of
code points with a common ancestry. For example, code points U+0000
through U+007F are ASCII. Some of those blocks are complete because
the group is well defined, eg. ASCII, whereas others include
unassigned code points leaving room for later additions. The order in
which those blocks are assigned is (probably) an editorial whim.
Note: even though Unicode Planes 4-13 are currently unassigned (see Unicode Planes), it isn’t clear that Unicode will stick to 17 planes. Given that we have to read the Unicode UCD files anyway, we might as well leave the actual number of Unicode code points/Planes dynamic, based on reading the files.
Further note: given that those 10 planes are unassigned, perhaps we should invent a sparse char-set format, saving at least 650k bits per char-set. A char-set is now defined as an array of seventeen Bitsets.
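The sparse idea can be sketched as one optional bitset per plane: an empty plane costs a NULL pointer rather than 65,536 bits. These names and the layout are illustrative, not the actual Bitset implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NPLANES    17
#define PLANE_BITS 65536

typedef struct {
    uint64_t *planes[NPLANES];	/* NULL => no members in that plane */
} sparse_charset_t;

static void charset_add (sparse_charset_t *cs, uint32_t cp)
{
    uint32_t plane = cp >> 16;
    if (plane >= NPLANES) {
	return;
    }
    if (cs->planes[plane] == NULL) {
	/* allocate the plane's bitset lazily */
	cs->planes[plane] = calloc (PLANE_BITS / 64, sizeof (uint64_t));
    }
    cs->planes[plane][(cp & 0xffff) / 64] |= UINT64_C (1) << (cp % 64);
}

static int charset_has (sparse_charset_t *cs, uint32_t cp)
{
    uint32_t plane = cp >> 16;
    if (plane >= NPLANES ||
	cs->planes[plane] == NULL) {
	return 0;
    }
    return (cs->planes[plane][(cp & 0xffff) / 64] >> (cp % 64)) & 1;
}
```

A char-set touching only Planes 0 and 1 then costs two 8KB bitsets plus seventeen pointers, rather than seventeen bitsets.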
UnicodeData.txt¶
utils/Unicode/UnicodeData.txt is a long list (34k entries as of
Unicode 13.0.0) of individual code points and code point ranges (whose
individual code point Properties are derivable within the range). Each
code point has a General Category: Letter lowercase, Ll, Letter
uppercase, Lu, Number digit, Nd, Symbol Math, Sm, etc..

The set of code points is not contiguous and the current highest
numbered code point is U+10FFFD, Plane 16 Private Use, Last.
DerivedCoreProperties.txt¶
utils/Unicode/DerivedCoreProperties.txt is a fairly long list (12k
lines) which summarises collections of code points which share a
common Property: Alphabetic, Math, etc.. For example, the Property
Lowercase is the collection of the Category Ll plus the Property
Other_Lowercase.

Other_Lowercase, in case you were wondering, includes the likes of
U+00AA, FEMININE ORDINAL INDICATOR and U+00BA, MASCULINE ORDINAL
INDICATOR, for example.

utils/Unicode/DerivedCoreProperties.txt defines other Properties such
as ID_Start (the first code point of an identifier) and ID_Continue
(subsequent code points for identifiers). I’m not sure where they get
their information from as surely such a set is language-specific? You
are welcome to read UNICODE IDENTIFIER AND PATTERN SYNTAX to
understand their thinking.
PropList.txt¶
utils/Unicode/PropList.txt is a fairly short file with more
Properties. Here we’re specifically looking to pick up on the
White_Space Property.

If you did trust their ID_Start then you might want to be aware of
their Other_ID_Start, defined in this file, too.

Again, it’s hard to see how they can clearly define the
Pattern_White_Space and Pattern_Syntax Properties.
GraphemeBreakProperty.txt¶
utils/Unicode/GraphemeBreakProperty.txt is a fairly short file with
more breaks-between-things Properties.
Reserved¶
Reserved code points are a feature. In particular, they are not listed
in utils/Unicode/UnicodeData.txt but are referenced in the three
properties files. That might not be an issue except that
utils/Unicode/GraphemeBreakProperty.txt lists them as Property Control
– which is a Property we are interested in.

Of course, so long as we only create summary information for code
points in utils/Unicode/UnicodeData.txt then we’re good.
Summarising¶
Across these four files there are 65 Properties:
ASCII_Hex_Digit
Alphabetic
Bidi_Control
CR
Case_Ignorable
Cased
Changes_When_Casefolded
Changes_When_Casemapped
Changes_When_Lowercased
Changes_When_Titlecased
Changes_When_Uppercased
Control
Dash
Default_Ignorable_Code_Point
Deprecated
Diacritic
Extend
Extender
Grapheme_Base
Grapheme_Extend
Grapheme_Link
Hex_Digit
Hyphen
IDS_Binary_Operator
IDS_Trinary_Operator
ID_Continue
ID_Start
Ideographic
Join_Control
L
LF
LV
LVT
Logical_Order_Exception
Lowercase
Math
Noncharacter_Code_Point
Other_Alphabetic
Other_Default_Ignorable_Code_Point
Other_Grapheme_Extend
Other_ID_Continue
Other_ID_Start
Other_Lowercase
Other_Math
Other_Uppercase
Pattern_Syntax
Pattern_White_Space
Prepend
Prepended_Concatenation_Mark
Quotation_Mark
Radical
Regional_Indicator
Sentence_Terminal
Soft_Dotted
SpacingMark
T
Terminal_Punctuation
Unified_Ideograph
Uppercase
V
Variation_Selector
White_Space
XID_Continue
XID_Start
ZWJ
and the 29 General Category Values covering Letters, Marks, Numbers, Punctuation, Symbols, Separators and Others.
Having 65 Properties is vexing. Firstly, they wouldn’t fit as
bit-fields in a 64-bit integer, but we also need to take care as the
groupings of Categories, such as Letter, are, presumably, slightly
disjoint from the Property Alphabetic (in
utils/Unicode/DerivedCoreProperties.txt) which is defined as Uppercase
+ Lowercase + Lt + Lm + Lo + Nl + Other_Alphabetic.
Ultimately, we’re looking to define a set of testable C bitfields for our various purposes, not necessarily serving any Unicode question.
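Those testable bitfields can be sketched as a set of flag bits; the particular values and the subset shown here are illustrative, since the real assignments are chosen by the generator:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative flag bits for a few of the Properties we care about.
 * The real generator picks its own set and ordering.
 */
#define USI_FLAG_Lowercase		(1u << 0)
#define USI_FLAG_Uppercase		(1u << 1)
#define USI_FLAG_Alphabetic		(1u << 2)
#define USI_FLAG_White_Space		(1u << 3)
#define USI_FLAG_ASCII_Hex_Digit	(1u << 4)

/* zero flags means "Unassigned" */
static int usi_has_flag (uint32_t flags, uint32_t flag)
{
    return (flags & flag) == flag;
}
```

A code point's summary entry then answers a Property question with a single mask test, no file parsing at runtime.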
SRFI-14 asks for quite a lot of char-sets which are derived as (with P for Property, C for Category):

| char-set | derivation |
|---|---|
| lower-case | P Lowercase |
| upper-case | P Uppercase |
| title-case | C Lt |
| letter | P Alphabetic |
| digit | C Nd |
| graphic | C L* + C N* + C M* + C S* + C P* |
| white-space | P White_Space |
| punctuation | C P* |
| symbol | C S* |
| hex-digit | P ASCII_Hex_Digit |
| blank | C Zs + 0009 |
| control | P Control |
| regional-indicator | P Regional_Indicator |
| extend-or-spacing-mark | P Extend + P SpacingMark |
| hangul-l | P L |
| hangul-v | P V |
| hangul-t | P T |
| hangul-lv | P LV |
| hangul-lvt | P LVT |
JSON5 through ECMAScript and, in particular, the definition of
Identifier, wants to know Categories Lu, Ll, Lt, Lm, Lo, Nl, Mn, Mc,
Nd, Pc and Property ZWJ (and ZWNJ) although it defines those last two
as fixed values.
Schoenenberger also distinguishes:

- between decimal and fractional numeric values, as the fractional part is to be represented by a string, say, “1/4”

  That’s useful. The Unicode documentation notes:

  (8) If the character has the property value Numeric_Type=Numeric, then the Numeric_Value of that character is represented with a positive or negative integer or rational number in this field

  and, as of 13.0.0, there are no negative integers and a single negative rational.

- where the upper/lower/title case causes an expansion in the number of code points

  We’ll skip that, today.
Across the needs of SRFI-14 and JSON5 we have 24 or so distinguishable flags (27 if we included the “expands” cases) with zero (flags) meaning “Unassigned”. Plenty of room in a 32-bit flags bit-field.
We can create nominal C bit-field flags for those and then create our (originally 430) variations by generating the stringified static C structure elements:
{ flags, Category, UC offset, LC offset, TC offset, Numeric Value }
to get 465 “summary” variations.
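A hand-written approximation of a few such generated elements is below; the real src/usi.c output differs in detail (field types, flag values and entry order are the generator's choice):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FLAG_Uppercase	(1u << 0)	/* illustrative flag values */
#define FLAG_Lowercase	(1u << 1)

/* { flags, Category, UC offset, LC offset, TC offset, Numeric Value } */
typedef struct {
    uint32_t flags;
    const char *category;
    int32_t uc_offset;
    int32_t lc_offset;
    int32_t tc_offset;
    const char *numeric;	/* a string so "1/4" is representable */
} usi_summary_t;

static const usi_summary_t usi_summaries[] = {
    { 0,              "Cn",   0,  0,   0, "" },	/* Unassigned */
    { FLAG_Uppercase, "Lu",   0, 32,   0, "" },	/* ASCII A-Z and friends */
    { FLAG_Lowercase, "Ll", -32,  0, -32, "" },	/* ASCII a-z and friends */
};
```

Each valid code point's InfoIndex entry is just an index into this 465-element array.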
As there’s more than 256, references to these must be two bytes, that
is the entire page table must be uint16_t. Even without the flags
field we still had the original 430 which, again, is more than 256.

We now need a page indirection table to get us from a code point (one
of #x110000) to a page in the page table and this indirection table
will be #x110000 / page size entries.
When we generate page tables (of 256 references to the 465 variants) we get some duplication of the “used” pages – used by valid code points – sometimes dramatically different (see the #pg vs used pg columns below).
Again, how many unique pages we have determines whether one or two bytes are required for the page indirection table references into the page table.
We can calculate the number of bytes in the page table (unique pages *
bytes per page ref * page size).
The total number of bytes we require in the C code point to variation implementation is the sum of those two tables.
The next question comes about that choice of a page size of 256. What if we vary it a bit? Let’s try some numbers:
(these numbers are approximate as the code evolves)
| pgsz | pages | #pg | used pg | bytes | (bytes formula) |
|---|---|---|---|---|---|
| 1024 | 1088 | 55 | 62 | 113728 | (1088 * 1 + 55 * 2 * 1024) |
| 512 | 2176 | 89 | 104 | 93312 | (2176 * 1 + 89 * 2 * 512) |
| 256 | 4352 | 146 | 183 | 79104 | (4352 * 1 + 146 * 2 * 256) |
| 128 | 8704 | 242 | 329 | 70656 | (8704 * 1 + 242 * 2 * 128) |
| 64 | 17408 | 398 | 616 | 85760 | (17408 * 2 + 398 * 2 * 64) |
| 32 | 34816 | 595 | 1175 | 107712 | (34816 * 2 + 595 * 2 * 32) |
| 16 | 69632 | 745 | 2286 | 163104 | (69632 * 2 + 745 * 2 * 16) |
Here,

- pgsz is our variable page size

- pages is how many pgsz pages there are for #x110000 code points

- #pg is the number of unique used pages

  Notice how, for smaller and smaller page sizes, the chance of seeing duplicates increases dramatically. The downside is the increase in the number of pages because of the small page size.

- used pg are pages with valid code points in them

- bytes and formula

  As noted, if the number of pages you need (#pg) is less than 256 you only need one byte offsets in the initial page indirection table (for Schoenenberger, PageIndex[cp >> 8]). The second indirection table (for Schoenenberger, InfoIndex[page][cp & 0xff]) is always a two byte offset (into the 465 entry summary information table) multiplied by the number of pages you require (#pg) and the size of the pages (pgsz).
From which we can see a sweet spot, for this arrangement of flags etc., around a page size of 128 requiring 70656 bytes in C static tables. Schoenenberger used a page size of 256 for his arrangement.
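The bytes column can be reproduced from pgsz and the #pg counts; this small function (its name is mine) encodes the formula just described:

```c
#include <assert.h>

/*
 * Total C table bytes for a given page size: the page indirection
 * table has #x110000 / pgsz entries of 1 or 2 bytes (2 once there
 * are 256 or more unique pages) plus the page table of unique_pages
 * pages, each pgsz two-byte references into the summary table.
 */
static unsigned table_bytes (unsigned pgsz, unsigned unique_pages)
{
    unsigned pages = 0x110000 / pgsz;
    unsigned ref_bytes = (unique_pages < 256) ? 1 : 2;

    return pages * ref_bytes + unique_pages * 2 * pgsz;
}
```

Plugging in the table's #pg values gives back its bytes column, including the pgsz 128 minimum of 70656.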
The final sizing element is fairly fixed as it is 465 times the size
of an idio_USI_t holding the summarised code point information.

For the upper/lower/title case offsets, some of the relative numbers
are over 32k meaning we can’t use an int16_t; therefore it must be an
int32_t, which wastes a bit of space as the vast majority are 0 or
small offsets.

An idio_USI_t, then, is 32 bytes, and therefore the variants table is
465 times that, 14880 bytes.
So a grand total of 85536 bytes (70656 + 14880) to store everything we need to know about Unicode.
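On an LP64 platform a structure along the lines above does come out at 32 bytes; this is a sketch with illustrative field names, not the real idio_USI_t definition from src/usi.h:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Five 4-byte fields, 4 bytes of padding to align the 8-byte
 * pointer, then the pointer itself: 32 bytes on LP64.
 */
typedef struct {
    uint32_t flags;		/* the ~24 Property flags */
    uint32_t category;		/* one of the 29 General Categories */
    int32_t uc_offset;		/* int32_t: some offsets exceed 32k */
    int32_t lc_offset;
    int32_t tc_offset;
    /* (4 bytes of padding here on LP64) */
    const char *numeric;	/* numeric value, eg. "1/4" */
} usi_entry_t;
```

465 such entries gives the 14880 bytes of the variants table.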
C Tables¶
With this information generated by bin/gen-usi-c as src/usi.[ch] we
can now do a number of useful things:

- we could create authentic Unicode equivalents to isdigit(3) and
  friends – although we don’t – assuming the Categories and Properties
  we have chosen are enough to answer all the questions we have

- we can have src/usi-wrap.c generate the SRFI-14 char-sets on the fly
  during startup. This saves a huge amount of time as idio doesn’t
  have to read and parse (and then still have to generate) the SRFI-14
  char-sets from lib/unicode.full.idio – which, in turn, no longer
  needs to exist.
Last built at 2024-12-21T07:11:04Z+0000 from 463152b (dev)