Variable Lookups

We can’t just “not look variables up” any more, that doesn’t make any sense. What we need is a better plan for handling variables. The plan is, to some degree, much like you would expect for a language like C where the use of a variable name is a thin veneer over the use of a memory location. You can imagine the C compiler allocating slots in memory, big enough to hold a value of type t and then whenever the variable’s name is referenced it substitutes in an access of that memory location.

To, hopefully, few people’s surprise that is exactly what we are going to do. We’ll have a big array of values in the VM and then whenever the evaluator sees a variable name it will reference its “name to value array index” table and replace the reference to the variable name with a reference to the variable’s index into this value array.

This doesn’t happen for all variable names. Lexically scoped variables are already handled with a frame lookup system, this only affects “free variables.”

So, our example of:

n := 3

define (foo) {
  n + 1
}

would become a nominal:

n := 3               ; n is allocated index 99, say

define (foo) {
  #vi:99 + 1
}

where #vi:index is our symbolism for this value array index.

There’s a couple of cases that make life a bit more interesting.

Modules

Modules in the context of variable lookups are going to act as a form of lookup restrictions. You can see what is in the module’s “top level” and what is in the exports of it’s imported modules (noting that everybody imports from the Idio module and all of its variables are exported automatically).

That means that (cross-importing aside) variable n in module m1 is different to variable n in module m2. Exactly what you’d expect.

In turn, as a side-effect, each module has to maintain a set of variable names that have been defined at “top level” in that module alongside the value index used for it.

If a variable name is used but is, as yet, undefined (in the current top level or the exports on the imported modules) then it will be added to the current module’s top level (rather than necessarily to the Idio module’s top level).

Of course that means we have to maintain a sense of what the current module actually is. In principle that’s no biggie but in practice it means being a little underhand. The problem lies in who is doing the recording of the current module. It should be the concern of the VM because it’s running the code, right? Actually, not really, the VM expects to be given value indices in the byte code and only really maintains the current module in case someone asks for it.

In practice, the evaluator really needs to know what the current module is in order that it can find the right value index: it needs to know if the source code is changing from module m1 to module m2 in order that it can figure out the correct value index for n to be put in the byte code for the VM.

Unfortunately, the only way the evaluator can do that is to identify the “change module” function call and read the argument itself. That, as a side-effect, means you cannot use a variable (or an expression) when changing module, you must use the module’s true (symbolic) name – we can’t expect the evaluator to rely on someone pre-evaluating arguments for it, can we?

Therefore, this won’t work:

mod := 'm1
module mod

because the evaluator isn’t really evaluating, it’s inferring meaning from the forms and can only see the symbol mod, so it thinks you’re changing to module mod and has no notion that m1 is involved in any way.

In the meanwhile, the evaluator can build up a symbol table for the module containing some useful information associated with each name created in the module.

The information is for internal use but currently consists of:

  • the variable type – probably toplevel in this instance

  • the VM constants index of name – so we don’t need to store the variable name supercalifragilisticexpialidocious in lots of places but just a small-ish number instead.

  • the VM value index – so we have somewhere to store the variable’s value

  • the module the symbol was originally bound in – this is clearly useless in the face of re-use but is handy for debugging

  • a string indicating the entity that created this

The constants index is a bit tricky. Nominally, each name is added uniquely to the VM’s constants index, so might be number eleventy billion (-ish). Quite correct. However, for this module, this might be the first constant ever referenced, so, uh, number one.

Encoding the numbers one and eleventy billion take different numbers of bytes. That might be of interest. Both numbers might be moot, anyway! We’ll come back to this.

You can access this information with find-symbol – noting that it takes a symbol as an argument, not a value:

Idio> find-symbol 'read
(#<CONST predef> 594 908 #<module libc> "idio_predef_extend")

Suggesting (to me) that the symbol was first created in the libc module as a predefined element (ie. a primitive) by the C function idio_predef_extend(). The symbol read is #594 in the constants table and this maps to value #908 in the values table.

You can dump out the VM’s constants and values tables if you pass the --vm-reports option to idio then search for those indices in vm-constants and vm-values (created when idio exits cleanly).

Forward Lookups

If we revisit the first example, we can legitimately re-write it as:

define (foo) {
  n + 1
}

n := 3

Hmm. When the evaluator works its way through the definition of foo it’ll find reference to the variable n which it hasn’t seen a declaration for (yet).

However, if n exists as one of the exported symbols from the modules that this module imports from then we’re about to have a problem. The evaluator will find n amongst those exports and use it, say #vi:e.

If we now come across the declaration of n a few lines later, about to be dubbed as #vi:i, say, it’s too late, we’ve already bound the n in foo to #vi:e. Whoops!

On the other hand, if n was not defined amongst the exports of imported modules then we can:

  • create a new value – initialised to some sentinel value, #<undef>, say

  • add an index in the name to value index table, #vi:k

  • use an opcode to access it marking that this is in preparation for an impending declaration

and carry on.

Now, when we hit the declaration of n, the lookup of the variable name will give us #vi:k and we’re good to go.

There’s a couple of boundary cases here:

  1. if we never see a declaration of n then the VM, because of the special opcode used which can check if the value is still #<undef>, will/should impute your descendency from rodents

    Although how should we deal with this (other than demeaning elderberries with your familial history)?

  2. repeated declarations of n (at the top level) do not give rise to new indices

    n := 10
    n := 20
    

    is functionally no different to:

    n := 10
    n = 20
    

Pre-Compiled Modules

There are no pre-compiled modules as I type – I aborted an attempt but the thinking prevails.

Suppose we can pre-compile modules as standalone entities that another Idio instance can load in and run.

Suppose our module looks like:

module m

n := 3

define (foo) {
  n + 1
}

Hmm. n will have been resolved to some value index, #vi:i, say.

However, under these module-compiling circumstances, we do not want to resolve n to this #vi:i because i is true of this currently running instance of Idio. If we stop and start Idio then the perturbation of values – imagine the modules are loaded in a different order – might mean that n should resolve to #vi:j.

What we really should be doing is storing a reference to the symbol n so that when we load in and run the pre-compiled module then we can determine that we need to lookup n under the current circumstances and generate the corresponding #vi:j then.

This becomes a double de-reference: once for name to value index and then to get the value itself. It looks like it would be useful to subsequently store this symbol to value index lookup in a per compiled-module “local value index” table.

Of course we don’t want to store symbols in the byte code either – as they are a very non-uniform size, unlike numbers which are merely slightly non-uniform – so we need a table of symbols now, while we are evaluating/compiling the source code, as well as well as the compiled code’s local value index table becoming a symbol table index to value index table…. Keeping up?

OK, instead of n being replaced with a value index in the generated byte code, we need to replace it with a symbol index, #si:x, say. So, what is x? Well, if we were back in the main compilation engine (rather than pre-compiling a module), we might use the VM’s “constant index” – so, arguably a #ci:x. The constants table is a big table of all the constants (no, seriously?) that the evaluator has ever seen. Symbols are unique and n is the same symbol whether it is in module m1 or m2.

However, in our pre-compilation mode, much like the value index, we can’t predict what order the constants are going to appear in a future Idio engine so we need a per-module per-compilation “name to constants index” table to be attached to the pre-compiled code – a symbol table. This makes it more like a #mci:x – a (pre-compiled) module constants index.

Here, x will be a small-ish integer and not the eleventy billion at the time of compilation.

So, for pre-compilation, we need to build a local symbol table of free variables used in the code and we substitute in the byte code a constant index, #ci:x and x is the index into the symbol table to recover the free variable’s name.

At run-time, we can use the index x into the compiled module’s symbol table to find the original name, n. We can look n up in the running VM to find the associated value index as seen by the current module, #vi:i.

However, we only need to lookup the value index once, after which it will always keep the same value index in the currently running VM. We should to store that lookup in a per-module value index table. (In fact we should look it up there first in case we have some Frankenstein module made from multiple loads.)

We can play another little trick. The byte code has #ci:x embedded in it. Suppose we always go straight to the per-module value index table. We can differentiate between the first and subsequent iterations by initialising the index here to -x.

At every access of the local value index table we can then perform a test: is the value negative?

  • if negative then access the symbol stored in the local symbol table at (positive) x and use the Idio engine’s lookup mechanisms to figure out if the symbol exists in the current module or an export of its imported modules or whether we need to create a new value index.

    Either way, we store the resultant (always positive) VM value index over the top of the initial -x local symbol table index value.

    Continue on to access the variable with the value index

  • if positive, ie. we’ve just done the above, then use the value index to access the variable

We have one observation, here, in that the per-module value index table is initialised to -x (a small-ish, perhaps single-digit, integer) but will be replaced by a potentially large VM value index (say, eleventy billion). The table needs to handle that.

The chances are that this per-module value index table is a fixed-width number and so the replacement value (an Idio fixnum, probably, so maybe eleventy billion is out of the question on a 32-bit system, just roll with it) will fit happily but the observation is that it can’t be a clever run-length encoded value – the sort of thing we use in the byte code.

That does suggest a limit on the number of possible variables. The chances are it’ll be a 32 bit entity (as 16 bits and therefore 65,536 variables seems too small and three bytes is too awkward) giving ample room for most programs’ variables. Obviously, this may need review.

In our small code block above it looks like we would have a single entry symbol table with n being in slot 1 and therefore the generated byte code will reference #ci:1.

Let’s use a slightly bigger example to flesh this out:

module m

a := 1
b := 2
n := 3

define (foo) {
  n + 1
}

So, not much bigger.

Our pre-compiled module, then, might have a local symbol table like:

1: a
2: b
3: n
4: foo

and our function foo, will have had n resolved to #mci:3 from that table – where we are now using a (pre-compilation) module constant index.

The local value index table is initialised to:

1: -1
2: -2
3: -3
4: -4

-3 says look up index 3 in the symbol table giving us n. We can now resolve that through the normal module import/exports mechanisms to get some value index, #vi:i, say, and then rewrite the local value index table as:

1: -1
2: -2
3: i
4: -4

Now we can access the value correctly with i and any subsequent use of #mci:3 will find the value index pre-resolved to i as well.

Module Value Table

index

initial

subsequent

1

-1

vi-a

2

-2

vi-b

3

-3

vi-n

Module Symbol Table

index

value

1

a

2

b

3

n

All this double-dereferencing and extra per-module symbol and value index tables seems a bit over the top. It does allow us to not have to re-write the byte code of pre-compiled modules (a suggestion in LiSP amongst others) which in turn allows us to maintain some integrity over it which has to be a good thing.

The only part that is modified is the value index table – which needn’t be part of the distributed code, anyway.

Perhaps, not hugely surprisingly, that all sounds very much like the business of the Unix link loader part of whose job is to resolve any symbol references.

Eli Bendersky has written a document on Position Independent Code in shared libraries which covers this sort of thing for x86 (real computers). What it boils down to though is that the generated code references a GOT associated with the code and the GOT is re-written to contain the actual memory location of the variable per this instance.

Augmenting Modules

And there’s more!

Suppose our little snippet of code was in addition to an already existing body of code under the guise of module m.

Should adding to an existing module be allowed? Should it be disallowed?

Skipping those vital questions, let’s suppose we’re augmenting module m with the function foo. Our pre-compiled code block has referenced “top level” variables a, b, n and foo. That’s great for the snippet but how does anyone else find out about these names?

It sounds like we also need to add those names, if necessary, to the module’s symbol table in the main Idio engine which, depending on what is stored there, might give us access to the associated value index without us having to look it up. Our snippet of code is always going to be (double) dereferencing through its local value index table but we have the opportunity to resolve any of those value indices with ones that already exist in the main module’s symbol table.

Forward Lookups II

Hmm, we had a problem with forward lookups where if no declaration of n was ever made we would have left a #vi:k in the byte code and have the Idio engine shouting from the ramparts.

This double dereferencing sounds plausible in this scenario to. If n is not defined then change the dereference from a #vi:k to a #ci:n and the byte code can at least retrieve the variable name from the constants table and throw that in the torrent of abuse directed at your antecedents.

Review

It looks like we have three varieties of lookups:

  1. non-pre-compiled code where we have definitively looked up the instance of a variable

    Here we can use a VM value index directly.

  2. non-pre-compiled code where we have a forward reference to a variable name

    Here we could use a value index (where we had initialised the value to some sentinel – which we check).

    Although we should use a VM constants index (causing a double dereference) in order that we can report the variable name. We still have to check the value has been defined.

  3. pre-compiled code where we don’t know about constant or value indices in the future Idio instance

    Here we will eventually use a per-compiled module local symbol and local value index tables. The local value index is initialised to -n so we can use the symbol table to get the variable name and then look it up, re-writing the local value index table in the process.

It also looks like we have a per-module symbol table with details of names that have been defined in its top level.

Last built at 2024-12-21T07:11:06Z+0000 from 463152b (dev)