Numbers and Constants¶
We will be using numbers and constants quite a bit in Idio.
Do we really want to create a (relatively) enormous data structure to
house a 32-bit integer? Or for #t
, #f
and #n
– the only
constants I can think of? Seems a bit wasteful.
Do we need 32-bit integers anyway? I suppose I’ve written scripts that have looped around several thousand times. I keep reading that there’s a million or so Unicode code points – 221 to be precise – if we can handle that “cheaply” then we might be onto something.
We can, of course, albeit we rely on an observation about modern C compilers in that they allocate memory on the heap on machine word boundaries. A 32-bit machine will allocate memory on a 4 byte boundary and a 64-bit machine on an 8 byte boundary.
(We’ll generally deal with 32-bit examples and reference 64-bit systems where they differ.)
If everything is allocated on a 4 byte boundary then the last two bits (three bits for 64-bit) are always zero:
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxx00
A pointer to properly allocated value (ie. an IDIO
) will always
end 00
. That suggests there’s room for us to be a bit sneaky for
some kinds of types.
Suppose we were to artificially craft a “pointer” that used one of the
other three bit combinations, other than 00
, then we could use the
remaining 30 bits for our value.
It would require that everywhere where we previously performed some
test against a value’s o->type
we would now have to test those
bottom two bits first, handle the three special cases and otherwise
fall back to the generic o->type
.
Numbers¶
Let’s try an example. Suppose we were to use 01
for integers,
hold that, for signed integers. This gives us 30 bits for the
integer, less one bit for the sign, leaving us with 229 bits.
That’s a pretty big number, ±536 million or so.
Assuming twos-complement, 1 and -1 would be:
00000000 00000000 00000000 00000101
11111111 11111111 11111111 11111101
from which we could squish an integer, i
into an IDIO
“pointer”:
int i = 1;
IDIO o = (i << 2) | 0x01;
and retrieve it with:
IDIO o = ...;
int i = ((intptr_t) o) >> 2;
Actually, that >> 2
will extract a complaint about arithmetic
right shift
from some C compilers as it makes an assumption
that the top (sign) bit will remain the same (rather than be set to 0,
say).
This is called a “tagged” type – we’re using the bottom two bits as a tag – and, here, the whole thing for integers, is called a fixed width number or fixnum.
We don’t want bit-twiddle ourselves all day so there’s a couple of C macros to help:
int i1 = 1;
IDIO o = IDIO_FIXNUM (i1);
int i2 = IDIO_FIXNUM_VAL (o);
The elephant in the room for fixnums is what happens if I want… ±537 million (bwah hah hah!)? No problem, we’ll use Bignums instead. We need to handle fixnum to/from bignum conversions but that’s OK.
In the meanwhile we have a relatively simple mechanism for handling a quite large range of numbers. Plenty for most shell-ish purposes.
fixnum.c¶
src/fixnum.c
hosts not just the basic fixnum arithmetic but
also is the front end for most arithmetic meaning it has to decide
whether any of the numbers being processed are bignums and promote all
the fixnums to bignums if that is the case.
There is also the Scheme-ish arithmetic style that, say,
+
for addition is not a binary function as in C,
a + b
, but is an n-ary function, here, meaning it
takes zero or more arguments.
Fixnum arithmetic can overflow the limits of a fixnum,
FIXNUM-MAX + 1
is an obvious case, resulting in a shift to
bignum arithmetic.
Idio> bignum? (FIXNUM-MAX + 1)
#t
Implementation¶
Most of the arithmetic operations can be grouped and take a similar structure which can be abstracted into (rather large) C macros.
For fixnum comparisons there is a
IDIO_DEFINE_FIXNUM_CMP_PRIMITIVE_(cname,cmp)
which takes a
C name snippet, cname
, and a C comparison
operator, cmp
and generates a function.
IDIO_DEFINE_FIXNUM_CMP_PRIMITIVE_(le,<=)
will generate a
less-than-or-equal-to function called idio_fixnum_primitive_le
wielding the C <=
operator.
These are called in the next group of macros to provide fixnum
variants of the bignum hand-coded idio_bignum_primitive_le
and
friends.
*
For general arithmetic or comparisons we need to check for any bignums in the argument list and call either the fixnum or the bignum variant of the arithmetic implementation. These checks are generic(-ish) and so are encapsulated in C macros.
Division is always converted to bignums – let’s not waste any time!
In all cases we are creating the (front-end) primitive and
correspondingly we need to specific the Idio name, a
C name snippet, an arity and varargs. In fact our arithmetic
macro is simply generating a regular IDIO_DEFINE_PRIMITIVEx
macro.
We’ve just said that our Scheme-ish arithmetic is n-ary but there is a subtle twist in that addition and multiplication can work with no arguments (resulting in 0 and 1, respectively) but subtraction and division require at least one argument.
There’s a bit more subtlety still in that:
for one argument there is an implied default value:
(+ n)
is implicitly0 + n
, ditto subtraction, and(/ n)
is implicitly1 / n
, ditto multiplicationfor subtraction and division with more than one argument the first is operated on by the rest:
(- m n)
ism - n
and(/ m n)
ism / n
In other words, the implied default value is not used.
IDIO_DEFINE_ARITHMETIC_PRIMITIVE0V(name,cname)
is used as
IDIO_DEFINE_ARITHMETIC_PRIMITIVE0V ("+", add)
creating an
Idio primitive called +
which will call
idio_defprimitive_add()
.
Similarly IDIO_DEFINE_ARITHMETIC_CMP_PRIMITIVE1V(name,cname)
is
used as IDIO_DEFINE_ARITHMETIC_CMP_PRIMITIVE1V ("le", le)
for an
Idio primitive le
calling idio_defprimitive_le
.
Constants¶
Back to other possible limited width simple types. We know we have a bunch of constants, in fact, it’ll transpire we have half-a-dozen or so groups of constants (think: enumerated values) one of which, Unicode, we know uses 21 bits.
Quick bit of maths… 30 bits of space on a 32-bit system, minus 21 bits for Unicode, say, gives us 9 bits worth of different constant types. Hmm, I’m not sure we have 512 different constant types or, rather more importantly, even want to consider having 512 different types of constants. Far too many to remember!
Let’s flip the maths around and suggest we have 3 bits of constant
types (up to 8, then) meaning each constant type could have up to 27
bits worth of space. Definitely room for Unicode’s 21 bits in one and
I fancy we could squeeze our trio of regular constants, #t
, #f
and #n
into another.
And if it turns out that 8 different constant types isn’t enough then we can revisit this and make it 16 or 32 or … and still have room for what is almost certainly our biggest group of constants, Unicode.
Let’s try that, then, with 10
as the tag and ccc
being our 3
bits of “constant type differentiator”:
xxxxxxxx xxxxxxxx xxxxxxxx xxxccc10
In practice, for Idio constants, being the first out of the
block, we’ll use 000
for ccc
giving us a combined suffix of
00010
. That leaves us to define some Idio constants:
#define idio_S_nil (0 << 5) | 0x00010
#define idio_S_true (4 << 5) | 0x00010
#define idio_S_false (5 << 5) | 0x00010
The other constant types are (ccc
):
(
001
) reader tokens – like lexer tokens, they’re for flagging up states and it’s certainly easier to pass them around as fullIDIO
values.(
010
) intermediate code tokens – stuff – we’ll get there!(
011
) Unicode code points – of course!
There are some macros for the above but they’ve become slightly involved as I’ve abstracted out the number of bits and the shifts and so on.
—
We’ve still room for a third tagged type. I’ve not settled on one, though. I’m partial to a flonum type. Yes, a floating point number squeezed into 30 bits. IEEE 754, or something, with small mantissa and exponent. I don’t use floating point enough to warrant the effort especially when what little floating point I use is more than adequately handled by Bignums. One day.
In the meanwhile, the tag is reserved by the
IDIO_PLACEHOLDER_TYPE
.
Address Space¶
Although, somewhat unheralded, some Intel chips from 2018 can address 57 bits with an extra page table.
Remember, of course, that no current (early 2020s!) x86 CPU architecture lets you address more than 48 bits of physical address space anyway, see Virtual Address space.
Although I see Linux now supports ARM64’s Memory Tagging Extension which precludes the use of bits 59-56.
There is further commentary on pointer tagging with reference to hardware support for masking high-order bits. The summary being that the existing implementations (March 2022) are less than ideal for Linux (preventing the kernel from validating pointers) precluding their use.
—
The corollary to this is Mea Culpa decision to put flag bits in the top 8 bits of the original Macintosh, given that the 32 bit Motorola 68000 used a 24 bit address bus.
’sThe Macintosh II used the Motorola 68020 which used all 32 bits and precluded any such efficiency. The real problem being that the flag placement trick had crept into third party applications and it took a year or so to upgrade the software base to be “32 bit clean.”
NaN Boxing¶
As a slight diversion, we can cogitate on NaN boxing. Crafting Interpreters covers this nicely in his NaN Boxing section so we’ll just highlight the gist.
’sEverything (-ish) is a C double (or can be stuffed into the equivalent 8 bytes). IEEE 754 allows for some exceptional numbers, notably, Not-a-Number (NaN) instances, for example the result of dividing by zero. Thanks to the way NaNs are implemented, there’s quite a lot of possible NaNs available and only a few of which are defined.
The trick is to extend the set of “quiet” NaNs (as opposed to “signalling” NaNs, eg. division by zero). The quiet NaNs can use some 50 bits of mantissa (plus the sign bit, if you care).
The 50 (or 51) bits is interesting because, if you recall from earlier, current x86_64 systems can only address 48 bits of address space. Hmm, 48 is less than 50, for sure, so we could put our pointers in the mantissa bits of a quiet NaN. Not just that but those mantissa bits are in the same position as the address bits so there’s no bit shifting required. Whoa!
Reading¶
Reading numbers is a bit more awkward so we’ll leave that to
Bignums – consider that we can’t differentiate between 3
and 3.14
based on the 3
until we’ve read the whole number in
completely. Therefore all numbers are read as bignums and
“downgraded” as appropriate.
We don’t read in many constants – most of them are internally
generated by the reader or evaluator etc.. We do read in a few though
and they all start with #
.
The reader has a large switch
statement which, having found a
#
, can case
the next byte read:
byte |
result |
---|---|
|
|
|
|
|
|
|
|
|
|
idio_read_character(..., int kind)
can now do some decision making
of its own. It reads the next (UTF-8) character (don’t forget ASCII
characters are valid UTF-8 characters):
If
kind
wasIDIO_READ_CHARACTER_SIMPLE
then#\X
hasX
converted into a Unicode code point:#\ħ
will result in the Unicode code point U+0127 (LATIN SMALL LETTER H WITH STROKE).Otherwise if the character was
{
then we go on to read in a Scheme-ish named character, eg.#\{newline}
– aka U+000A or C’s'\n'
.There isn’t a particularly comprehensive set of named characters (think: a few test cases) but since we construct our Unicode tables from Unicode source material there’s nothing to stop us creating the complete set of named Unicode characters such that
#\{LATIN SMALL LETTER H WITH STROKE}
gets us U+0127.else it was a Unicode code point (again)
(Reading that back it could do with some refactoring!)
idio_read_unicode()
checks that the next character is +
then
reads hexadecimal characters (technically it calls the same generic
hexadecimal number reading code as #xhhhh
does, see Non-base-10
numbers)
and creates a Unicode code point from the number: #U+0127
creates
the same U+0127 Unicode code point as #\ħ
above.
Writing¶
Writing fixnums is quite easy as we know they are intptr_t
(as
they fit inside an IDIO
“pointer”) so we can print them back out
with printf ("%td", IDIO_FIXNUM_VAL (o))
.
Except that’s a bit too easy and a bit too restrictive. In particular, that will print out a decimal number. What if we want a hexadecimal or octal number?
Here, the code that converts Idio types to strings can query
the system as to whether the idio-print-conversion-format
symbol
has been set and, assuming it is one of the usual suspects
(Xbdox
), print it out.
b
? Yes, of course. Generate the binary representation of our
fixnum:
Idio> n := 17
Idio> format "%b\n" n
"10001\n"
For the constants we can capture the individual values and return specific strings:
case IDIO_TYPE_CONSTANT_IDIO_MARK:
{
intptr_t v = IDIO_CONSTANT_TOKEN_VAL (o);
switch (v) {
case IDIO_CONSTANT_NIL: t = "#n"; break;
case IDIO_CONSTANT_UNDEF: t = "#<undef>"; break;
case IDIO_CONSTANT_UNSPEC: t = "#<unspec>"; break;
case IDIO_CONSTANT_EOF: t = "#<eof>"; break;
case IDIO_CONSTANT_TRUE: t = "#t"; break;
case IDIO_CONSTANT_FALSE: t = "#f"; break;
case IDIO_CONSTANT_VOID: t = "#<void>"; break;
case IDIO_CONSTANT_NAN: t = "#<NaN>"; break;
where t
is a temporary variable to help calculate the final result
(which is complicated by Unicode). You can also see other internal
values getting a rendering that is quite readable to us programmers
but “impossible” for the reader to consume (#<
is specifically
invalid for this reason).
Operations¶
Numbers¶
- function fixnum? value¶
is value a fixnum
- function integer? value¶
is value a fixnum or an integer bignum
- function number? value¶
is value a fixnum or a bignum
- function remainder number¶
the remainder of a number is the value minus the floor of the number
- function quotient a b¶
the quotient of a and b is a / b
- function le n [...]¶
- function lt n [...]¶
- function eq n [...]¶
- function ge n [...]¶
- function gt n [...]¶
perform numeric comparisons between the arguments (a minimum of one) and return
#f
if any adjacent pair of arguments fails the comparisonlt n1 n2 n3 n4 is equivalent to:
(and (lt n1 n2) (lt n2 n3) (lt n3 n4))
The default result is
#t
.Notice that the function names are alphabetic rather than the traditional arithmetic symbols,
<=
,<
,==
,>=
and>
. This is to maintain consistency and avoid semantic clashes with our (preferred) use of angle brackets for IO redirection.You may recall that Bash’s
[[
builtin command uses the same operators:-le
,-lt
,-eq
,-ge
and-gt
.eq
, this numeric comparison, adds to the naming confusion witheq?
,eqv?
andequal?
.
- function + n [...]¶
- function - n [...]¶
- function * n [...]¶
- function / n [...]¶
perform the usual arithmetic functions of add, subtract, multiply and divide
- function integer->char integer¶
[deprecated]
convert an integer to a character – limited to the range of fixnums
- function integer->unicode integer¶
convert an integer to a Unicode code point
Constants¶
By and large there are no specific operations that you can perform on a constant.
However, see Characters for special cases.
Last built at 2024-12-21T07:11:05Z+0000 from 463152b (dev)