Numbers and Constants

We will be using numbers and constants quite a bit in Idio. Do we really want to create a (relatively) enormous data structure to house a 32-bit integer? Or for #t, #f and #n – the only constants I can think of? Seems a bit wasteful.

Do we need 32-bit integers anyway? I suppose I’ve written scripts that have looped around several thousand times. I keep reading that there’s a million or so Unicode code points – 221 to be precise – if we can handle that “cheaply” then we might be onto something.

We can, of course, albeit we rely on an observation about modern C compilers in that they allocate memory on the heap on machine word boundaries. A 32-bit machine will allocate memory on a 4 byte boundary and a 64-bit machine on an 8 byte boundary.

(We’ll generally deal with 32-bit examples and reference 64-bit systems where they differ.)

If everything is allocated on a 4 byte boundary then the last two bits (three bits for 64-bit) are always zero:

xxxxxxxx xxxxxxxx xxxxxxxx xxxxxx00

A pointer to properly allocated value (ie. an IDIO) will always end 00. That suggests there’s room for us to be a bit sneaky for some kinds of types.

Suppose we were to artificially craft a “pointer” that used one of the other three bit combinations, other than 00, then we could use the remaining 30 bits for our value.

It would require that everywhere where we previously performed some test against a value’s o->type we would now have to test those bottom two bits first, handle the three special cases and otherwise fall back to the generic o->type.

Numbers

Let’s try an example. Suppose we were to use 01 for integers, hold that, for signed integers. This gives us 30 bits for the integer, less one bit for the sign, leaving us with 229 bits. That’s a pretty big number, ±536 million or so.

Assuming twos-complement, 1 and -1 would be:

00000000 00000000 00000000 00000101

11111111 11111111 11111111 11111101

from which we could squish an integer, i into an IDIO “pointer”:

int i = 1;
IDIO o = (i << 2) | 0x01;

and retrieve it with:

IDIO o = ...;
int i = ((intptr_t) o) >> 2;

Actually, that >> 2 will extract a complaint about arithmetic right shift from some C compilers as it makes an assumption that the top (sign) bit will remain the same (rather than be set to 0, say).

This is called a “tagged” type – we’re using the bottom two bits as a tag – and, here, the whole thing for integers, is called a fixed width number or fixnum.

We don’t want bit-twiddle ourselves all day so there’s a couple of C macros to help:

int i1 = 1;
IDIO o = IDIO_FIXNUM (i1);

int i2 = IDIO_FIXNUM_VAL (o);

The elephant in the room for fixnums is what happens if I want… ±537 million (bwah hah hah!)? No problem, we’ll use Bignums instead. We need to handle fixnum to/from bignum conversions but that’s OK.

In the meanwhile we have a relatively simple mechanism for handling a quite large range of numbers. Plenty for most shell-ish purposes.

fixnum.c

src/fixnum.c hosts not just the basic fixnum arithmetic but also is the front end for most arithmetic meaning it has to decide whether any of the numbers being processed are bignums and promote all the fixnums to bignums if that is the case.

There is also the Scheme-ish arithmetic style that, say, + for addition is not a binary function as in C, a + b, but is an n-ary function, here, meaning it takes zero or more arguments.

Fixnum arithmetic can overflow the limits of a fixnum, FIXNUM-MAX + 1 is an obvious case, resulting in a shift to bignum arithmetic.

Idio> bignum? (FIXNUM-MAX + 1)
#t

Implementation

Most of the arithmetic operations can be grouped and take a similar structure which can be abstracted into (rather large) C macros.

For fixnum comparisons there is a IDIO_DEFINE_FIXNUM_CMP_PRIMITIVE_(cname,cmp) which takes a C name snippet, cname, and a C comparison operator, cmp and generates a function.

IDIO_DEFINE_FIXNUM_CMP_PRIMITIVE_(le,<=) will generate a less-than-or-equal-to function called idio_fixnum_primitive_le wielding the C <= operator.

These are called in the next group of macros to provide fixnum variants of the bignum hand-coded idio_bignum_primitive_le and friends.

*

For general arithmetic or comparisons we need to check for any bignums in the argument list and call either the fixnum or the bignum variant of the arithmetic implementation. These checks are generic(-ish) and so are encapsulated in C macros.

Division is always converted to bignums – let’s not waste any time!

In all cases we are creating the (front-end) primitive and correspondingly we need to specific the Idio name, a C name snippet, an arity and varargs. In fact our arithmetic macro is simply generating a regular IDIO_DEFINE_PRIMITIVEx macro.

We’ve just said that our Scheme-ish arithmetic is n-ary but there is a subtle twist in that addition and multiplication can work with no arguments (resulting in 0 and 1, respectively) but subtraction and division require at least one argument.

There’s a bit more subtlety still in that:

  • for one argument there is an implied default value: (+ n) is implicitly 0 + n, ditto subtraction, and (/ n) is implicitly 1 / n, ditto multiplication

  • for subtraction and division with more than one argument the first is operated on by the rest: (- m n) is m - n and (/ m n) is m / n

    In other words, the implied default value is not used.

IDIO_DEFINE_ARITHMETIC_PRIMITIVE0V(name,cname) is used as IDIO_DEFINE_ARITHMETIC_PRIMITIVE0V ("+", add) creating an Idio primitive called + which will call idio_defprimitive_add().

Similarly IDIO_DEFINE_ARITHMETIC_CMP_PRIMITIVE1V(name,cname) is used as IDIO_DEFINE_ARITHMETIC_CMP_PRIMITIVE1V ("le", le) for an Idio primitive le calling idio_defprimitive_le.

Constants

Back to other possible limited width simple types. We know we have a bunch of constants, in fact, it’ll transpire we have half-a-dozen or so groups of constants (think: enumerated values) one of which, Unicode, we know uses 21 bits.

Quick bit of maths… 30 bits of space on a 32-bit system, minus 21 bits for Unicode, say, gives us 9 bits worth of different constant types. Hmm, I’m not sure we have 512 different constant types or, rather more importantly, even want to consider having 512 different types of constants. Far too many to remember!

Let’s flip the maths around and suggest we have 3 bits of constant types (up to 8, then) meaning each constant type could have up to 27 bits worth of space. Definitely room for Unicode’s 21 bits in one and I fancy we could squeeze our trio of regular constants, #t, #f and #n into another.

And if it turns out that 8 different constant types isn’t enough then we can revisit this and make it 16 or 32 or … and still have room for what is almost certainly our biggest group of constants, Unicode.

Let’s try that, then, with 10 as the tag and ccc being our 3 bits of “constant type differentiator”:

xxxxxxxx xxxxxxxx xxxxxxxx xxxccc10

In practice, for Idio constants, being the first out of the block, we’ll use 000 for ccc giving us a combined suffix of 00010. That leaves us to define some Idio constants:

idio.h
#define idio_S_nil           (0 << 5) | 0x00010
#define idio_S_true          (4 << 5) | 0x00010
#define idio_S_false         (5 << 5) | 0x00010

The other constant types are (ccc):

  • (001) reader tokens – like lexer tokens, they’re for flagging up states and it’s certainly easier to pass them around as full IDIO values.

  • (010) intermediate code tokens – stuff – we’ll get there!

  • (011) Unicode code points – of course!

There are some macros for the above but they’ve become slightly involved as I’ve abstracted out the number of bits and the shifts and so on.

We’ve still room for a third tagged type. I’ve not settled on one, though. I’m partial to a flonum type. Yes, a floating point number squeezed into 30 bits. IEEE 754, or something, with small mantissa and exponent. I don’t use floating point enough to warrant the effort especially when what little floating point I use is more than adequately handled by Bignums. One day.

In the meanwhile, the tag is reserved by the IDIO_PLACEHOLDER_TYPE.

Address Space

Remember, of course, that no current (early 2020s!) x86 CPU architecture lets you address more than 48 bits of physical address space anyway, see Virtual Address space.

Although I see Linux now supports ARM64’s Memory Tagging Extension which precludes the use of bits 59-56.

There is further commentary on pointer tagging with reference to hardware support for masking high-order bits. The summary being that the existing implementations (March 2022) are less than ideal for Linux (preventing the kernel from validating pointers) precluding their use.

The corollary to this is Andy Hertzfeld’s Mea Culpa decision to put flag bits in the top 8 bits of the original Macintosh, given that the 32 bit Motorola 68000 used a 24 bit address bus.

The Macintosh II used the Motorola 68020 which used all 32 bits and precluded any such efficiency. The real problem being that the flag placement trick had crept into third party applications and it took a year or so to upgrade the software base to be “32 bit clean.”

NaN Boxing

As a slight diversion, we can cogitate on NaN boxing. Bob Nystrom’s Crafting Interpreters covers this nicely in his NaN Boxing section so we’ll just highlight the gist.

Everything (-ish) is a C double (or can be stuffed into the equivalent 8 bytes). IEEE 754 allows for some exceptional numbers, notably, Not-a-Number (NaN) instances, for example the result of dividing by zero. Thanks to the way NaNs are implemented, there’s quite a lot of possible NaNs available and only a few of which are defined.

The trick is to extend the set of “quiet” NaNs (as opposed to “signalling” NaNs, eg. division by zero). The quiet NaNs can use some 50 bits of mantissa (plus the sign bit, if you care).

The 50 (or 51) bits is interesting because, if you recall from earlier, current x86_64 systems can only address 48 bits of address space. Hmm, 48 is less than 50, for sure, so we could put our pointers in the mantissa bits of a quiet NaN. Not just that but those mantissa bits are in the same position as the address bits so there’s no bit shifting required. Whoa!

Reading

Reading numbers is a bit more awkward so we’ll leave that to Bignums – consider that we can’t differentiate between 3 and 3.14 based on the 3 until we’ve read the whole number in completely. Therefore all numbers are read as bignums and “downgraded” as appropriate.

We don’t read in many constants – most of them are internally generated by the reader or evaluator etc.. We do read in a few though and they all start with #.

The reader has a large switch statement which, having found a #, can case the next byte read:

Reading constants and characters

byte

result

n

idio_S_nil

t

idio_S_true

f

idio_S_false

\

idio_read_character(... IDIO_READ_CHARACTER_EXTENDED)

U

idio_read_unicode(...)

idio_read_character(..., int kind) can now do some decision making of its own. It reads the next (UTF-8) character (don’t forget ASCII characters are valid UTF-8 characters):

  • If kind was IDIO_READ_CHARACTER_SIMPLE then #\X has X converted into a Unicode code point: #\ħ will result in the Unicode code point U+0127 (LATIN SMALL LETTER H WITH STROKE).

  • Otherwise if the character was { then we go on to read in a Scheme-ish named character, eg. #\{newline} – aka U+000A or C’s '\n'.

    There isn’t a particularly comprehensive set of named characters (think: a few test cases) but since we construct our Unicode tables from Unicode source material there’s nothing to stop us creating the complete set of named Unicode characters such that #\{LATIN SMALL LETTER H WITH STROKE} gets us U+0127.

  • else it was a Unicode code point (again)

    (Reading that back it could do with some refactoring!)

idio_read_unicode() checks that the next character is + then reads hexadecimal characters (technically it calls the same generic hexadecimal number reading code as #xhhhh does, see Non-base-10 numbers) and creates a Unicode code point from the number: #U+0127 creates the same U+0127 Unicode code point as #\ħ above.

Writing

Writing fixnums is quite easy as we know they are intptr_t (as they fit inside an IDIO “pointer”) so we can print them back out with printf ("%td", IDIO_FIXNUM_VAL (o)).

Except that’s a bit too easy and a bit too restrictive. In particular, that will print out a decimal number. What if we want a hexadecimal or octal number?

Here, the code that converts Idio types to strings can query the system as to whether the idio-print-conversion-format symbol has been set and, assuming it is one of the usual suspects (Xbdox), print it out.

b? Yes, of course. Generate the binary representation of our fixnum:

Idio> n := 17
Idio> format "%b\n" n
"10001\n"

For the constants we can capture the individual values and return specific strings:

util.c
case IDIO_TYPE_CONSTANT_IDIO_MARK:
{
    intptr_t v = IDIO_CONSTANT_TOKEN_VAL (o);

    switch (v) {
        case IDIO_CONSTANT_NIL:    t = "#n";        break;
        case IDIO_CONSTANT_UNDEF:  t = "#<undef>";  break;
        case IDIO_CONSTANT_UNSPEC: t = "#<unspec>"; break;
        case IDIO_CONSTANT_EOF:    t = "#<eof>";    break;
        case IDIO_CONSTANT_TRUE:   t = "#t";        break;
        case IDIO_CONSTANT_FALSE:  t = "#f";        break;
        case IDIO_CONSTANT_VOID:   t = "#<void>";   break;
        case IDIO_CONSTANT_NAN:    t = "#<NaN>";    break;

where t is a temporary variable to help calculate the final result (which is complicated by Unicode). You can also see other internal values getting a rendering that is quite readable to us programmers but “impossible” for the reader to consume (#< is specifically invalid for this reason).

Operations

Numbers

function fixnum? value

is value a fixnum

function integer? value

is value a fixnum or an integer bignum

function number? value

is value a fixnum or a bignum

function floor number

the floor of a number is the integral value less than or equal to the number

function remainder number

the remainder of a number is the value minus the floor of the number

function quotient a b

the quotient of a and b is a / b

function le n [...]
function lt n [...]
function eq n [...]
function ge n [...]
function gt n [...]

perform numeric comparisons between the arguments (a minimum of one) and return #f if any adjacent pair of arguments fails the comparison

lt n1 n2 n3 n4 is equivalent to:

(and (lt n1 n2)
     (lt n2 n3)
     (lt n3 n4))

The default result is #t.

Notice that the function names are alphabetic rather than the traditional arithmetic symbols, <=, <, ==, >= and >. This is to maintain consistency and avoid semantic clashes with our (preferred) use of angle brackets for IO redirection.

You may recall that Bash’s [[ builtin command uses the same operators: -le, -lt, -eq, -ge and -gt.

eq, this numeric comparison, adds to the naming confusion with eq?, eqv? and equal?.

function + n [...]
function - n [...]
function * n [...]
function / n [...]

perform the usual arithmetic functions of add, subtract, multiply and divide

function integer->char integer

[deprecated]

convert an integer to a character – limited to the range of fixnums

function integer->unicode integer

convert an integer to a Unicode code point

Constants

By and large there are no specific operations that you can perform on a constant.

However, see Characters for special cases.

Last built at 2024-09-07T06:11:28Z+0000 from 463152b (dev)