Bitsets¶
Bitsets exist because regular expression handling required Unicode support and that in turn meant we needed something to handle yes/no state information for a set 221 bits wide. Is this a lowercase letter?
That code doesn’t actually use bitsets blindly as it a) uses charsets which b) take advantage of several Unicode planes being unassigned to create a sparse array of bitsets.
Of course, now bitsets exist you can use them for any set of on/off flags.
I’ve lost my reference for where I got the inspiration for the implementation from – an answer on Stack Overflow, no doubt – but the implementation is entirely self-inflicted.
In C we can bit-twiddle to our hearts content on things up to a machine word in size, after that we’re out of luck. Even a single Unicode plane of 216 bits is a bit more than our 32-bit or 64-bit machine word will cope with.
The answer is, of course, an array of machine words with some suitable indexing to get us to the right word and then we bit-twiddle with the best of them.
Far more effort goes into reading and writing bitsets in a reader-friendly format, though. The need for a reader format derives from the observation that Unicode isn’t fixed – there’ll be another version along soon – and as good language implementers we should be using the latest and greatest and we’d rather auto-generate these tables than fix them into C.
Implementation¶
It was an array of unsigned long
but I got annoyed in writing
tests so normalised it to 32 bit width words.
As noted above, the implementation is an array of uint32_t
s:
typedef uint32_t idio_bitset_word_t;
typedef struct idio_bitset_s {
size_t size;
idio_bitset_word_t *words;
} idio_bitset_t;
#define IDIO_BITSET_SIZE(BS) ((BS)->u.bitset.size)
#define IDIO_BITSET_WORDS(BS,i) ((BS)->u.bitset.words[i])
where bit indexes start at zero in the usual way.
We want a handy macro:
#define IDIO_BITSET_BITS_PER_WORD (CHAR_BIT * sizeof (idio_bitset_word_t))
and then we’re off. The constructor has the basic trick of
determining the number of idio_bitset_word_t
s in the array by
using our handy IDIO_BITSET_BITS_PER_WORD
and adding one:
IDIO idio_bitset (size_t size)
{
IDIO bs = idio_gc_get (IDIO_TYPE_BITSET);
IDIO_BITSET_SIZE (bs) = size;
bs->u.bitset.words = NULL;
if (size) {
size_t n = size / IDIO_BITSET_BITS_PER_WORD + 1;
IDIO_GC_ALLOC (bs->u.bitset.words, n * sizeof (idio_bitset_word_t));
memset (bs->u.bitset.words, 0UL, n * sizeof (idio_bitset_word_t));
}
return bs;
}
We then have the broad swathe of expected operations on a bitset all
of which use a simple index, bit
:
IDIO idio_bitset_set (IDIO bs, size_t bit)
IDIO idio_bitset_clear (IDIO bs, size_t bit)
IDIO idio_bitset_ref (IDIO bs, size_t bit)
These will all use the same trick to get the correct
idio_bitset_word_t
in the array and then the bit within that
idio_bitset_word_t
is simply bit %
IDIO_BITSET_BITS_PER_WORD
.
Of course they have their Idio equivalents:
bitset-set! bs bit
etc..
The regular expression code wants to perform some operations on
bitsets as a whole and so there’s various: merge-bitset
,
and-bitset
, ior-bitset
, xor-bitset
and not-bitset
(to
toggle the entire bitset) all of which iterate over the
idio_bitset_word_t
s performing the operation as required.
There’s another interesting operation, subtract-bitset
. Here, in
Unicode terms, you might imagine having some derived charset and then
removing all the letters from it.
Equality for bitsets (beyond eq?
) is just as involved as you have
to iterate through comparing each idio_bitset_word_t
.
A final, much more functional operation is
bitset-for-each-set bs func
which, reflecting on the name,
is a slightly clumsy way of asking for a function to be run for each
bit that is set in the bitset. The function will be called with the
(integer) value of the bit that is set.
Given a 3-bit bitset with the first and last bits set, then by calling:
bitset-for-each-set bs f
we would expect the following calls to be invoked:
f 0
f 2
Reading and Writing¶
I’m combining these sections into one as the implementations are relatively complex and we’re only really concerned about the format.
The basic reader format is:
#B{ size ... }
size
is a decimal number. The other indexing numbers coming
are all hexadecimal. I decided to keep the size as decimal as it is
a bit more human friendly. It’s a moot point.
The bits are displayed as 0
s and 1
s in blocks of eight (or
less if we’re at the end of the bitset) to help casual human scanning.
So our 3-bit bitset might look like #B{ 3 101 }
.
The bits’ order is left-to-right and are indexed starting at zero. So
#B{ 3 100 }
means the first bit (with index 0) is set and the
other two are not. #B{ 3 001 }
means the last bit (with index 2)
is set and the other two are not.
So, you can imagine, the first pass for Unicode printed out several
lines with at least a million characters per line – I guess it
would be 221 0
s and 1
s plus another 218
spaces between the blocks of 8.
Even less didn’t like that and would lock up (going backwards). Not… great.
Careful analysis revealed quite a lot of consecutive zeroes, so we can eliminate those so long as we prefix the next block of not-all-zeroes with an offset which, given we’re chunking into groups of 8, is a hexadecimal number. Once we’ve picked up the current offset then the next block’s offset is implicitly 8 further on.
If we take our 3-bit example and insert 8 leading zeroes, instead of
#B{ 11 00000000 101 }
, we might have: #B{ 11 8:101 }
.
Which is good and less is a bit happier.
Now we notice long ranges of consecutive ones which we can chunk
together with a range identifier (hex, again): first
block-last block
where we are saying that it is all ones from the
start of the first
block to the end of the last
block. If the range is only one block you end up with the curious
looking 80-80
, say, which means bits are all ones from index 0x80
through to 0x87.
Another 11-bit example might be, then, #B{ 11 0-0 8:101 }
which
says that all 11 bits except the 10th (index 9, of course!)
are set.
So the variations are:
first-last
is saying all-ones from the start offirst
to the end oflast
.The next offset is implicitly
last + 8
offset:...
where the offset of the first0
or1
in...
isoffset
The next offset is implicitly
offset + 8
...
where the offset is implied based on one of the above or simply incremented from the previous block.
ASCII¶
By way of example, we can use the familiar ASCII character set to illustrate the variations. ASCII is 128 characters so we need 128-bit bitsets to represent various categories.
The lowercase letters would be #B{ 128 60:01111111 68-70
11100000 }
which reads as the seven starting at 0x61 plus 0x68-0x77
and then the next three: 0x78, 0x79, 0x7A. Which looks about right
(he says, carefully checking ascii(7)).
Similarly:
digits is
#B{ 128 30-30 11000000 }
hex-digits is
#B{ 128 30-30 11000000 01111110 60:01111110 }
which should be 0-9 and a-f and A-Fwhitespace is
#B{ 128 8:01111100 20:10000000 }
which looks like horizontal tab, newline, vertical tab, form feed, carriage return and space.punctuation is
#B{ 128 20:01110111 11101111 38:00110001 10000000 58:00011101 78:00010100 }
Notice a couple of gaps in the first two blocks:
U+0024 (DOLLAR SIGN) is not marked as punctuation. It is in the
Sc
category, ie. a Currency Symbol.U+002B (PLUS SIGN) is in the
Sm
category, Math Symbol.
The problem, here, in terms of the definitions of constituents of sets is that these are Unicode’s definitions of categories which may vary from expectations.
Unicode Char-sets¶
A quick digression while we’re here.
Unicode produces a wealth of character set data in its UCD (from which you usually
want the ucd
subdirectory).
This is a ton of information. Following note relating to Scheme’s SRFI-14 (Character-set library) there are three useful documents (which are duplicated within the Idio source code).
’sThe breakdown of each field of each file is documented under Property Definitions – you’ll need to search down for each file.
utils/Unicode/UnicodeData.txt
is the main list of all allocated Unicode code points.It uses ranges in case you’re wondering why it isn’t 150k lines long.
This is the primary source and for us has some key fields:
(field 2) General_Category: letter, digit, punctuation etc. see General Category Values
Note that the category is specific:
Lu
is letter uppercase andLl
is letter lowercase but that first character,L
of the category is also used for further classification in the form ofL*
meaning any kind of letter.(field 12) Simple_Uppercase_Mapping
(field 13) Simple_Lowercase_Mapping
If you’re wondering about Titlecase, Unicode’s FAQ entry Q: What is titlecase? How is it different from uppercase? and Grammar Monster might help explain.
utils/Unicode/PropList.txt
puts more characters into categories where their General_Category was something else.For example, U+0009 through U+000D are nominally ASCII Control characters, in the Category
Cc
but are tagged as White_Space inutils/Unicode/PropList.txt
.utils/Unicode/DerivedCoreProperties.txt
is similar toutils/Unicode/PropList.txt
If we parse all of that (see
utils/extract-unicode-char-sets.idio
) we can build up some
tables (of tables …).
As noted, the sparse char sets are an array of 17 Unicode planes where
each plane is represented by a bitset or #f
.
A bitset for a plane is 216 (65,536) entries – meaning an
underlying array of 211 or 210
idio_bitset_word_t
s! That’s partly why we’re keen to make the
array sparse. Why would we have planes 4 through 13 waste 8kB each
when we know there’s nothing in them.
ASCII, BMP0, full¶
Loading in the full Unicode set is a lot of data. For each character set (lowercase, uppercase, etc. – there are 39 in total!) we are using 56kB of memory in bitsets to do the mapping.
I mused with the idea of allowing the user to use only ASCII or only
the Basic Multilingual Plane (plane 0) or the full Unicode set and
added unnecessary complexity to
utils/extract-unicode-char-sets.idio
to do so.
The various outputs are in lib/unicode.*.idio
.
At the moment we’re stuck with full Unicode only.
Operations¶
- function bitset? value¶
is value a bitset
- function make-bitset size¶
construct a bitset of size bits
- function bitset-size bs¶
return the size of bitset bs
- function bitset-set! bs bit¶
set the bit indexed by bit in bitset `bs
- function bitset-clear! bs bit¶
clear the bit indexed by bit in bitset `bs
- function bitset-ref bs bit¶
return the bit indexed by bit in bitset `bs
- function copy-bitset bs¶
return a copy of bitset bs
- function merge-bitset [bs ...]¶
return a bitset that is the bitwise Inclusive OR of the supplied bitsets
If the bitsets are a different size then a
^rt-bitset-size-mismatch-error
condition will be raised.If no bitsets are supplied the result is
#n
.Note
This is
ior-bitset
…
- function and-bitset [bs ...]¶
return a bitset that is the bitwise AND of the supplied bitsets
If the bitsets are a different size then a
^rt-bitset-size-mismatch-error
condition will be raised.If no bitsets are supplied the result is
#n
.
- function ior-bitset [bs ...]¶
return a bitset that is the bitwise Inclusive OR of the supplied bitsets
If the bitsets are a different size then a
^rt-bitset-size-mismatch-error
condition will be raised.If no bitsets are supplied the result is
#n
.
- function xor-bitset [bs ...]¶
return a bitset that is the bitwise Exclusive OR of the supplied bitsets
If the bitsets are a different size then a
^rt-bitset-size-mismatch-error
condition will be raised.If no bitsets are supplied the result is
#n
.
- function not-bitset bs¶
return a bitset that is the bitwise complement the bitset bs
- function subtract-bitset [bs ...]¶
return a bitset that is the result of removing the bits set in subsequent bitsets from the first
If the bitsets are a different size then a
^rt-bitset-size-mismatch-error
condition will be raised.If no bitsets are supplied the result is
#n
.
- function equal-bitset? [bs ...]¶
return
#t
if all the supplied bitsets are bitwise identical,#f
otherwiseIf no bitsets are supplied the result is
#t
. (Should that be the case?)
- function bitset-for-each-set bs func¶
invoke func on each bit in bitset bs that is set
The argument to func will be the index of the set bit
Last built at 2024-12-21T07:11:04Z+0000 from 463152b (dev)