Functions

We’ve two types of functions, primitives and closures, which are both abstractions to encapsulate parameterised behaviour.

Primitives are C code which have several broad categorisations:

  1. they are the only things that can manipulate C memory

    Which sounds a bit obvious but at some point it’ll get lost.

  2. they bootstrap the user-level environment

    Quite a lot of the bootstrap code is then wrappered with more exotic forms of the same thing. This is partly because doing anything complicated in C gets a bit awkward.

  3. they are quick

    Again, it sounds a bit obvious but after writing your glorious Idio function you might just have to bite the bullet and write a C version to get any performance.

  4. they are our interface to another world

    Whether that is system calls or inter-language calls.

Closures are Idio code.

The VM has to be able to cope with both types of functions as invocable things but there is an interesting reality-warping difference: primitives are instantaneous (mostly). So far as the VM is concerned, the moment a primitive is called the answer is back in its hands.

Primitives

Primitives are far harder to conceptualise and implement. So we’ll start with them.

VM Usage

Rather than working from the bottom up let’s ask the question, what does the VM need?

Well, it’ll want to know this is a primitive, we’ll need a pointer to the C function, we’ll need to know how many formal parameters it is expecting and whether it is expecting a variable number of parameters, over and above the formal parameters.

And we can probably chuck in a C string for a name:

gc.h
typedef struct idio_primitive_s {
    struct idio_s *grey;
    struct idio_s *(*f) ();           /* don't declare args */
    char *name;
    uint8_t arity;
    char varargs;
} idio_primitive_t;

#define IDIO_PRIMITIVE_GREY(P)       ((P)->u.primitive->grey)
#define IDIO_PRIMITIVE_F(P)          ((P)->u.primitive->f)
#define IDIO_PRIMITIVE_NAME(P)       ((P)->u.primitive->name)
#define IDIO_PRIMITIVE_ARITY(P)      ((P)->u.primitive->arity)
#define IDIO_PRIMITIVE_VARARGS(P)    ((P)->u.primitive->varargs)

As the comment notes, we don’t declare any arguments for the C function pointer as that will vary from primitive to primitive.

I’ve limited the number of formal parameters, arity, to 255. That should be OK for most people. In practice, you need to edit idio_vm_invoke() in src/vm.c to have more than five formal parameters.

varargs is a boolean and could, perhaps, should, be a type-specific flag. *shrugs*

I’ve skipped how one of these is created. That’s a complication we can come back to.

*

The actual function call needs to gather together the arguments to be able to say, broadly, for a binary function:

result = (IDIO_PRIMITIVE_F (prim)) (arg1, arg2, args);

The arguments will have been marshalled up by the VM ready for a generic function call, actually, marshalled up for a closure call. For a primitive we have to un-marshall them with args being the remaining arguments beyond the formal ones.

Wait! What happened with varargs? Ah, dirty secret time, all functions, closures and primitives, are always called with varargs arguments. The varargs field is used to syntactically verify the call point, not the invocation.

Bah! Even that’s not true. Fixed argument primitives are encoded such that they can be called directly by the VM with just their fixed argument, er, arguments. Varargs primitives go through the full function call interface (marshalled arguments and all).

C Construction

This is a good deal more involved.

We have an idea of what the VM wants, now we need to construct a primitive.

Obviously we need to write the actual code (duh!) but we simultaneously need to record the primitive’s pointer, arity, varargs and name – and that’s just for the VM. We might also want to record meta-information like the primitive’s signature and documentation string.

I like the idea of some sort of Type Inference in which case it is a requirement that the primitive functions fully declare their parameter and return types. This is to bootstrap the whole type inference system.

More stuff to record!

That sort of data, describing the primitive, looks like:

gc.h
typedef struct idio_primitive_desc_s {
    struct idio_s *(*f) ();
    char *name;
    uint8_t arity;
    char varargs;
    char *sigstr;
    char *docstr;
} idio_primitive_desc_t;

To make that recording happen I’ve cobbled together some increasingly complex C macros which require the developer to be on song.

When a primitive function is written we won’t use a stock C function signature, instead we’ll use a macro. So, rather than:

IDIO idio_pair_primitive (IDIO h, IDIO t)
{
    return idio_pair (h, t);         /* the actual constructor */
}

we use (in its simplest form):

IDIO_DEFINE_PRIMITIVE2 ("pair", pair, (IDIO h, IDIO t))
{
    return idio_pair (h, t);
}

Let’s break this down:

  • IDIO_DEFINE_PRIMITIVE2, from src/idio.h, says I’m defining a primitive that takes 2 formal parameters.

    IDIO_DEFINE_PRIMITIVE2V says two formal parameters and varargs.

    (You can guess the rest of the variations. Hopefully.)

  • "pair" is the C string name which will become the function’s Idio name

    In this case, I can expect to find an Idio function called pair:

    Idio> pair
    #<PRIM pair>
    
  • pair is the C function name snippet that will be prefixed with idio_defprimitive_

    Clearly, it should be unique amongst all other primitives which, if you keep it trivially aligned with the function’s Idio name should be OK.

    Notice that this newly created C function, idio_defprimitive_pair does not now clash with the pair constructor, idio_pair. It’s hardly in a separate namespace but so long as developers don’t call their regular C functions idio_defprimitive_name then we’re pretty safe.

  • (IDIO h, IDIO t) is the formal parameter list to the C function.

IDIO_DEFINE_PRIMITIVEx do two key things:

  1. create a static primitive function description structure for use later

  2. prefix the actual function code with an appropriate header

In combination, that means:

IDIO_DEFINE_PRIMITIVE2 ("foo-idio", foo_C, (T1 a1, T2 a2))
{
  ...
}

will expand to:

IDIO idio_defprimitive_foo_C (T1 a1, T2 a2);
static struct idio_primitive_desc_s idio_primitive_data_foo_C = {
   idio_defprimitive_foo_C,
   "foo-idio",
   2,
   0,
   "",
   ""
};
IDIO idio_defprimitive_foo_C (T1 a1, T2 a2)
{
  ...
}

which looks pretty promising.

The next variants have a _DS suffix meaning they expect a signature string and a documentation string.

The two strings are used when constructing user-friendly information for help and debugging.

The signature string for a closure is, in essence, the formal parameter list and so it should be similar for a primitive. This is straightforward for non-varargs functions, like pair where the signature string can be just h t.

For varargs functions, the interpretation varies considerably. For some functions the varargs might be “more of the same” but for others it might be optional parameters, like with display which takes a formal parameter, the value to be displayed, and an optional argument, encoded as varargs, for the handle to display to. If no handle is passed, it defaults to the current output handle.

I’d lean towards a signature string of v [handle] but this ability to be more specific about the meaning of varargs parameters is inconsistent with closures which can only report v & args from the formal parameters. Dunno.

The documentation string is for use with help where it will get printed out. I have this fanciful idea that the documentation string might be processed by, say, rst2txt (ReStructuredText to text) and so I’ve encoded most documentation strings with that in mind. What I haven’t done is made any progress in doing the processing.

The actual pair PRIMITIVE looks like:

IDIO_DEFINE_PRIMITIVE2_DS ("pair", pair, (IDIO h, IDIO t), "h t", "\
create a `pair` from `h` and `t`     \n\
")
{
    IDIO_ASSERT (h);
    IDIO_ASSERT (t);

    return idio_pair (h, t);
}

Note the assertions of the parameters being passed in. The only difference is that we’ve filled in the sigstr and docstr fields.

You can immediately guess that the non-_DS variants simply call the _DS variants with "" for the signature and documentation strings.

*

So far, then, we’ve organised to create a static instance of a struct idio_primitive_desc_s and defined a C function. We need to do something with it otherwise the pair of them will just hang about in the executable doing nothing useful as no-one knows to call them.

That requires an explicit use the of primitive’s description structure although we don’t need to know anything about that directly, we just need to know the C function name snippet and we can work out the rest from there.

You may recall all of the src/*.c files relating to types have a similar bootstrap structure including:

void idio_pair_add_primitives ()
{
}

into which we need to “add” our primitive:

void idio_pair_add_primitives ()
{
    IDIO_ADD_PRIMITIVE (pair);
}

IDIO_ADD_PRIMITIVE(pair) is another C macro which expands into:

void idio_pair_add_primitives ()
{
    idio_add_primitive (&idio_primitive_data_pair,
                        idio_vm_constants,
                        __FILE__,
                        __LINE__);
}

and idio_add_primitive() does the dirty business of adding a new symbol (derived from the C string name) and have it reference the idio_primitive_t (itself constructed from the struct idio_primitive_desc_s).

In principle, this arrangement allows us to construct a primitive function “manually” as separate from the above sequence of “automated” construction. That said, I have yet to create a primitive manually so this dance through a static object feels like a bit of a waste.

That’s it. For primitives, then, we have a two-step shimmy, a C macro at the start of the function’s implementation and another macro to add it into the Idio engine.

There are, of course, a few variations on the theme. Normally, primitives are added to the Idio module and implicitly exported from that module. However, we might want to add a primitive to another module.

libc is a module wrappering a number of libc system calls where the result is useful as an Idio value (and thus requires some data representation transmutation). libc’s systems call names frequently clash with “normal” Idio usage – read is our canonical example. So, here, we want to add the primitive to a different module (libc rather than Idio).

Separately, we might want to export that primitive from the module. Indeed, all of the primitives in libc are exported. By exporting the name we can call it “long-hand” from elsewhere:

libc/getrlimit libc/RLIMIT_NOFILE

If getrlimit (and RLIMIT_NOFILE) was not exported from libc then the above call would fail.

For what it’s worth, getrlimit(2) will update a C struct rlimit. For us, the getrlimit primitive has then transcribed the results into an Idio struct-rlimit with a couple of familiar looking fields:

#<SI struct-rlimit rlim_cur:1024 rlim_max:524288>

Had we stored the result in a variable and quizzed the (simplistic) help system it would have told us a bit more about the structure instance:

Idio> x := libc/getrlimit libc/RLIMIT_NOFILE
#<SI struct-rlimit rlim_cur:1024 rlim_max:524288>
Idio> help x
struct-instance: x

struct-type: struct-rlimit

fields (rlim_cur rlim_max)

Closures

Closures are a bit more tricky to describe without knowing how the VM works. But let’s give it a go.

In essence, a closure needs two things:

  • some code (duh!)

  • an environment within which that code is run – from which we access the free variables (non-lexical variables) in the function

In our case we have compiled our code for a byte compiler which means everything is a bit more like assembly language programming and the code becomes a program counter (PC) such that we jump to that PC and keep processing.

The environment, here, is the combination of the lexical context at the point of creation and the module the closure was created in. Remember that many closures are created with the intent of managing private variables. It is quite unlike, say, C.

Documentation

Like Emacs Lisp we should allow the user to write a documentation string for their function which should be the third argument of four: function formals docstr body or define (name formals) docstr body.

If there are only three arguments then the body takes precedence – if all you want is for your function to return a string then that seems legitimate.

The use of docstr will need to adhere to our “single line” reader processing – which is partly why strings are allowed to be multi-line. So we’ll have something like:

common.idio
define (atom? x) "predicate to test if object is an atom

:param x: object to test" {
  not (pair? x)
}

Notice how the multi-line documentation string continues the “single line” from the end of the declaration through to the start of the body form.

At the moment the documentation string is not processed though I would expect some ReStructuredText-style formatting to handle parameter definitions and cross-references.

Implementation

Like primitives there is a two-step process to manipulating closures. We need to create one and then use it.

The data we need for a closure looks like:

gc.h
typedef struct idio_closure_s {
    struct idio_s *grey;
    size_t code_pc;
    size_t code_len;
    struct idio_s *frame;
    struct idio_s *env;
} idio_closure_t;

#define IDIO_CLOSURE_GREY(C)       ((C)->u.closure->grey)
#define IDIO_CLOSURE_CODE_PC(C)    ((C)->u.closure->code_pc)
#define IDIO_CLOSURE_CODE_LEN(C)   ((C)->u.closure->code_len)
#define IDIO_CLOSURE_FRAME(C)      ((C)->u.closure->frame)
#define IDIO_CLOSURE_ENV(C)        ((C)->u.closure->env)

Let’s take a look:

  • code_pc is the program counter for this closure

    It is a size_t because it is in index into the byte array of compiled code.

  • code_len is, not surprisingly, the length of the byte compiled code for this closure

    It’s not really used in anger but turns out to be quite handy if you want to disassemble the code for just one closure (otherwise the disassembler doesn’t know when to stop).

  • frame is a reference to the lexical context

  • env is a reference to the module the closure was created in

The frame and env require a little more explanation. When we run a closure, in our mind’s eye we look at the source code and can see two things:

  1. the lexical environment – which the closure body may or may not use

  2. the arguments passed into the function by the calling code

Let’s try an about-as-complicated-as-it-gets example. Here in module foo we create a function f which uses both “top level” variables from module foo as well as lexical variables defined outside of the function as well as the parameters passed in:

file foo.idio
module foo
export (f)

; top level variable in *foo*
this := 1

f = {

  ; lexical variable only visible inside this block
  that := 2

  function (the-other) {
    + this that the-other
  }
}

and

file bar.idio
module bar
import foo

n := 3

; call the "remote" function
f n

We have a reasonable feeling that the result of calling f 3 in module bar should be 6.

But wait! These two sets of information are disjoint. We have a set of lexical information which we can see by looking at the source that is defined in foo.idio and therefore the closure must have access to it and at the same time we have this set of parameters being passed in from who knows where (bar.idio). How does that external information get linked in with our local lexical information?

Well, that’s the nature of a linked list of frames of parameters. At the time of creation of the closure we stash the current linked list of frames – which will be all the parameters leading up to the definition of the closure in the source code – in the closure structure, the frame field.

The protocol is that when someone is due to invoke a closure they will construct a frame and fill it with the evaluated (or unevaluated for templates!) values and leave it in the val register.

The act of invocation does a sequence of framey things:

  1. it stashes the current frame (of the caller) on the stack

  2. it replaces the current frame with the stored frame of the closure, the original lexical environment of the closure

  3. it calls the closure:

    1. the very first thing the closure does is verify the arity is correct by checking the number of args in the frame in val

    2. the next thing it does is link the frame in val into the current frame (for the function defined in foo.idio) which we just swapped with the original lexical environment frame (from the call point in bar.idio)

      At this point, the frame tree looks like the values from the caller in the top-most frame and then everything the closure saw at the point of its definition:

      frame 0: [ 3 ]         ; from the call in bar
      frame 1: [ {that} ]    ; from the block in foo
      frame 2: #n            ; no other frames! {this} is in *env*
      
    3. at some point the closure has computed a value and left it in val and RETURNs

  4. the calling code’s frame is restored from the stack

    At this point we are exactly as before the call to the closure except we have had the frame of evaluated arguments in val replaced with the computed result of the closure.

Having said all that about the frame, we can repeat it in kind for env, the “top-level” environment that the closure was defined in:

  1. stash the caller’s top-level env on the stack

  2. it replaces the current env with the stored env of the closure, the original lexical top-level of the closure

  3. it calls the closure

    Now we can find this in the environment of the function f defined in module foo.

    In fact the (stored) environment is just the name of the environment at the time of definition, ie. foo in this case. That variables defined in this environment are an attribute of the module.

    So, even though f was called in bar, because f was defined in foo the current environment is toggled to foo and we can (successfully) find this in the known variables of foo – even though the call is “live” in bar.

  4. the calling code’s env is restored from the stack

Creation

Creating a closure is a little bit of art. In the first instance, we don’t have to but we will encode the creation of the closure as well as the code for the implementation of the closure in the output byte code. Subsequently, we have a generated code stream that means if we could store the byte code out as a loadable module then we have the creation code embedded in it.

When we hit a function abstraction:

function (a b) {
    a + b
}

we have two things:

  • a function prototype: this example takes two formal parameters and no varargs

  • the function implementation or body

In addition we have the lexical context (we should have the lexical context, we’re doing the evaluation!) and the current module.

In combination we have everything we require to satisfy the need to create a closure. Remember, though, with a function declaration, we’re only looking to define the function, not run it, so:

x := 1
define (y a b) {
    a + b
}
z := x + 1

is the pseudo-code:

define x as the result of evaluating 1

define y as the result of evaluation the function definition
function (a b) { a + b } into a closure value

define z as the result of evaluating x + 1

In other words, we’re not running the function during this process, just establishing it as something that can be run.

A closure value is, from the data structure above, a program counter, a code length, a future frame (of arguments) and a pointer to an environment.

The process is:

  1. we’ll create the code for the body in a temporary array

    From that we can calculate the code_len.

    Actually, the body:

    • is prefaced with an arity check and something to link the argument frames

    • has a RETURN instruction tacked on the end

    but that’s by the by.

  2. in the normal byte code array we’ll add the CREATE-CLOSURE instruction which requires an offset to the code_pc (which will be after the upcoming JUMP instruction)

    plus some standard fields:

    • the code length

    • a reference to the signature string

    • a reference to the document string

    All of which we will need to be able to construct the struct idio_closure_s.

  3. since we are only creating the closure, we don’t want to run it right now, we’ll add a JUMP instruction followed by the code length (again) so that we (during creation) jump over the function body.

    This means that when this creation code is run, we’ll see:

    1. CREATE-CLOSURE with a code_pc (and its three arguments: code length, signature string and documentation string) then a

    2. JUMP to beyond the end of the body

    3. the definition of z in the example, above.

    The CREATE-CLOSURE has been given the offset to the code_pc and can create the closure value. Neat!

    However we do it, we want to jump over the function body so as not to run it.

  4. We can now copy the function body from the temporary array

    The first byte of which (now in the normal byte code array) is code_pc.

    But wait, this is now an unknown number of bytes beyond the CREATE-CLOSURE instruction because we’re not sure how many bytes it took to encode the JUMP instruction. How do we determine what offset should have been?

    Of course, the answer is that we’re expecting this to happen, so made sure to look at the code length and figure out that either:

    1. it is a short jump in which case the jump length is encoded in a byte and the overall offset is two bytes (including one for the jump instruction)

    2. it is a long jump in which case we encode the jump length in (yet another) temporary byte code array and can say that the offset is one plus the length of this temporary array – plus one because the jump instruction takes one byte!

    The code generator can now add the correct offset to the code_pc after the CREATE-CLOSURE instruction.

Another way of looking at it is this abstraction of the generated byte code:

1...
2CREATE-CLOSURE (length of #3) sigstr docstr
3JUMP to #5
4function body (including a trailing RETURN)
5...

When we come to run the closure, the closure value has had code_pc set to #4 and will stop processing before it hits #5 – the definition of z – because it has a RETURN statement stamped on the end.

Local Assignments

We saw in let how let statements can be transformed into closed function calls. Woo! Great Art is being made.

Except it’s pretty inefficient. Every function call requires some saving and restoration of the current state, the creation of and assignment to a frame and the closed function call itself.

For every “local” variable. It doesn’t look pretty.

So I’ve created a Bit of a Hack™. The goal is to say that, where we are creating a “local” variable in the context of a function body, then do a bit of shuffling so that the variable is treated like an extra, unseen parameter to the function:

define (foo a) {
   b := 1
   a + 1
}

will have had created a frame for a and space for the not-necessarily-used varargs – which we’ll call v: #[ *a* *v* ]. We know that access to a is made with efficient SHALLOW-ARGUMENT-REF0-style calls in the VM.

When we see b being defined in the context of a function body then:

  • extend the list of names in the nametree for this level with b – we’ll need to insert a dummy name, #f for the varargs argument otherwise we’ll get an off-by-one error.

  • we’ll extend the frame with a slot for b, #[ *a* *v* *b* ]

  • and set that slot with a SHALLOW-ARGUMENT-SET2

and then carry on with further references to b now being to the otherwise unknown parameter b in slot #2 of the frame.

As we’ve updated the nametree then any enclosed blocks which make reference to b will successfully find it and reference with DEEP-ARGUMENT-REF i 2 as appropriate.

Top-Level Block

In the case of a top-level block such as:

{
  a := 1
  b := 2
  a + b
}

there is no enclosing function body for the definition of a so we have to revert to the previously described let mechanism.

The defintion of b, however, is now inside a function body, created by the implied let and so b can be handled as a local assignment.

Operations

function? value

is value a function, ie. a primitive or a closure

Last built at 2024-12-21T07:11:05Z+0000 from 463152b (dev)