Idio Feel

Unsurprisingly, Idio takes a lot of its programming language cues from Scheme.

Data Types

This isn’t a comprehensive list – that can wait until we cover the implementation details.

What we’re looking to describe here are the entities the reader is willing to consume and construct objects from.

Simple Data Types

Atoms: in Scheme, these are the things that are not s-exps.

Scheme introduces more exotic numbers with a leading #. I’ve used # as the character to introduce all “funny” types. In one sense # is used by the reader to go into “weird stuff” mode.

Of interest, from that, when printing out internal types which have no reader representation – in other words the printed form has some clues for a developer but isn’t any use in (re-)creating the value – they too start with a # and generally look like #<...>. If a # is followed by a < then the reader will issue an error.

nil/NULL

#n is about as short as we can get.

nil (NULL, None, …) is often used as the “no value” result which raises the idea of an Option type.

Anything that helps the programmer in a complex world has to be a good thing although adding an option type on its own is complexity for complexity’s sake. What the language implementation needs to make this worthwhile is the ability to enforce the testing of the option type in the caller’s code.

I don’t think we’re there.

Booleans

#t and #f.

Numbers

We won’t have a full Scheme number tower (think of a class-hierarchy where integers are rationals (fractions!) are reals are complex numbers) as I can’t believe it’s quite necessary for a shell. Integers and floating point numbers are common enough.

1, -2, 3.14 and 1e100 (for anyone googling).

Ignoring some of the more esoteric numbers we could possibly input, some simple examples are:

  • #d for decimal numbers: #d10 (yes, a duplicate of the default form!)

  • #x for hexadecimal numbers: #xff

  • #o for octal numbers: #o007

  • #b for binary numbers: #b00001111

The reader code that handles these allows for any numeric base up to 36 (the usual 0-9 followed by a-z/A-Z) although there aren’t any trivial reader input mechanisms for the other bases.
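
For example, each of these reader forms denotes a plain integer – decimal values in the comments:

#d10            ; 10
#xff            ; 255
#o007           ; 7
#b00001111      ; 15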

Strings

"foo"

Hardly unexpected! Multi-line, though, as that’s normal, right?

str := "Line 1
Line 2
Line 3"

and we should be able to use familiar escape sequences:

printf "hi!\n"

We should be able to poke about in strings and get individual index references back in the usual way.
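
For example – a sketch, using the dot-style indexing that appears under dot operator below, with string indexes running from 0:

s := "hello"
s.0             ; #\h
s.4             ; #\o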

There is no single-quoted string, partly as ' is used by the reader as a macro but also because " isn’t doing any variable interpolation. "$PATH" is just five characters.

We can use interpolated strings but the obvious C-style sprintf function (cf. format in Python) does a lot of heavy lifting for us.

I have included the notion of a substring in Idio. My thinking is that a lot of the time we’re splitting strings into fields in the shell and rather than re-allocate memory for all of these why not have an offset and length into the original string? It works pretty well although how efficient it is is hard to tell. There is obviously the pathological case where you only want the single character substring of a two GB file….
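
As a sketch – assuming a Scheme-style substring function taking start and end indexes – such a substring shares the original string’s storage rather than copying it:

s := "hello world"
sub := substring s 6 11      ; "world" – an offset and length into s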

Although not mentioned previously, Idio should be able to handle Unicode in strings. Unicode is, uh, a complication and we will dedicate some time to it later.

I’d like to say Idio can handle Unicode everywhere but that’s a work in progress.

Of course there is an implication for not being ASCII any more: Idio expects its source code and inputs to be encoded in UTF-8.

Characters

We’ll mention it in passing, but it becomes a thing later: we need to be able to handle characters.

*ducks*

It is too soon in this text to be discussing it but the idea of a “character” is slipping away in the modern age.

Here, what we really mean is, given Idio is Unicode-oriented, that we want to be able to read in a Unicode code point.

While Unicode code points are just integers – there’s potentially 2²¹ of them – they are a distinct set from regular integers: 1, 2, 3… not least because some of them are invalid.

This would break a few Scheme-ish libraries which might try to manipulate a C-ish c + 1 for some character c, making the assumption that characters are ASCII.

Anyway, we need a reader input form: #U+hhhh.

Where hhhh is the commonly used Unicode code point hexadecimal number.

The traditional Scheme reader input form for characters is: #\X for some X so, with our near-ASCII hats on:

#\A
#\b
#\1
#\ħ

will create instances of the (now Unicode) characters, A, b, 1 and ħ (U+0127 (LATIN SMALL LETTER H WITH STROKE)).

For Unicode code points in general we get into a bit of a mess. The #\X format will always work (assuming that X is valid UTF-8) but the person viewing the source code might not see anything useful. In fact, I’m assuming you can see the correct glyph for ħ and not some substitute “box character.”

The problem lies in your viewer’s (text editor, web viewer) ability to draw the corresponding code point’s glyph. I don’t know if you have appropriate support for displaying glyphs outside of the “usual” ASCII ranges. My editors and fonts (I’m largely using DejaVu Sans and X11) don’t do a very good job outside of the Unicode BMP plane (the first 65,536 codepoints) and I wouldn’t know if they did a decent job within that code plane.

So, by and large, we’re probably better off using the #U+127 format for “exotic” characters in order that we give other users an outside chance of figuring out what we’re up to.
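
So the two reader forms for our earlier example would be:

#\ħ             ; fine, if your editor renders the glyph
#U+127          ; the same code point, in plain ASCII source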

Exotic Base Types

I haven’t needed, yet, any further exotic fundamental types. A fundamental type is one where there is some performance or efficiency gain to be had for the construction (and deconstruction for printing) of what is, fundamentally, a string in the source code.

GUID

Post-creation, are these used in any way other than to compare them or print them out?

Commonly manipulated forms include (for the same GUID):

{123e4567-e89b-12d3-a456-426652340000}
(123e4567-e89b-12d3-a456-426652340000)
123e4567-e89b-12d3-a456-426652340000
123e4567e89b12d3a456426652340000
urn:uuid:123e4567-e89b-12d3-a456-426652340000

IP Addresses

My screen-scraping quite often results in CIDR notation – an IP address and prefix length – for both IPv4 and IPv6.

Is there a need for a fundamental type? Are they manipulated often enough? Maybe. I’ve survived for a while in the shell without them though chunking an IPv4 network up into /20 blocks was a bit annoying – and I did convert everything into 32-bit numbers. Chunking my IPv6 address might be more fun.

IPv6 addresses have many forms too:

fe80:0000:0000:0000:01ff:fe23:4567:890a
fe80:0000:0000:0000:1ff:fe23:4567:890a
fe80:0:0:0:1ff:fe23:4567:890a
[fe80::1ff:fe23:4567:890a]
fe80::1ff:fe23:4567:890a
fe80::1ff:fe23:4567:890a%3

Pathnames

Pathnames are a constant vexation and I can’t quite decide how to fix the problem. What problem? Well, the problem is that pathnames should probably be treated specially.

Most pathnames that we use are functionally strings albeit ones we frequently pass as symbols (or words in Bash):

$ ls "${filename}"

Here, ls, is read by Bash as a word and Idio might read it in as a symbol and, assuming the symbol ls isn’t bound to a value, would give us a… symbol for the element in functional position – our future command to be executed. Even in Bash, "${filename}" undergoes parameter expansion and quote removal to leave you with the word foo.txt, or whatever. Obviously(?), "${filename}" is a valid construction in Idio with the value, er, "${filename}" – you probably just wanted filename.

In both cases, however, the intent is that we find both ls and foo.txt in the filesystem which means they must be converted to C strings (being careful about ASCII NULs!) before ls is found on the user’s PATH (via a lot of calls to access(2)) and then ls(1) will ultimately stat(2) foo.txt.

Is there value at this point in handling these elements as special strings? That’s where I’m not sure. I have a sense that we should but we don’t.

File globbing will return a list of Idio strings. As things stand we are assuming the encoding is UTF-8 which is so so wrong – I mean, only technically wrong.

There are no encoding specifications for *nix filenames. About the nearest you’ll get to a specification is that a *nix filename is a sequence of bytes excluding U+0000 (ASCII NUL) and a directory entry also excluding U+002F (SOLIDUS – forward slash). How you interpret those bytes is up to you. Or, rather, how you avoid interpreting those bytes defines you.

So, in that sense, we shouldn’t be treating filenames coming from or going to the filesystem as anything other than an opaque array of C characters. Symbols and strings we get from Idio source code will, by definition, be Unicode code points (originally encoded as UTF-8) but after that it’s all a bit free.

So, taking an example, I want a file called © 2021. We’re already in trouble with the first character! Should that be the ISO-8859-1 character number 0xA9? Hmm, if someone is using the ISO-8859-2 encoding, a listing is likely to show them a Š, Unicode’s U+0160 (LATIN CAPITAL LETTER S WITH CARON). Any UTF-8 decoding will get an error. This character encoding mixup is called Mojibake.

Mind you, the problems of interpreting/displaying bytes for files in the wrong code page has been true since forever (well, post ISO-8859 and other implementations in the 1980s).

There’s another, slightly more insidious, problem with displaying arbitrary strings of bytes in that, quite often, those sequences of bytes can control the terminal. If you type ls and your terminal jumps into an Alternate Character Set then you’ll be most aggrieved.

In fact, how do we even create such a character? An Idio string, "©", will have assumed UTF-8 in the source which, when recreated/deconstructed into UTF-8 is a two byte sequence, 0xC2 0xA9.

Well, as it happens, our C API lets us create arbitrary C base types with the C/integer-> n char function albeit we probably don’t want to create filenames character by character.

If we want to use Idio strings as the basis for filenames (hint: we do), we also have the problem of “wide” characters in Unicode-based strings, ie. those where the Unicode code point is more than 0xFF. We can, of course, simply use the UTF-8 encoding as the filename but then we’re mixing up encodings (remember, the filesystem has none) with the danger of retrieving a filename from the filesystem which has an invalid UTF-8 encoding, like the 0xA9 in our ISO 8859-1 © 2021.

Hmm. What’s to do?

  • In the first instance, if the user supplies an Idio string as a filename then we’ll use the UTF-8 encoding of that string to access or create the file.

    Idio> "© 2021"
    "© 2021"
    Idio> "\ua9 2021"
    "© 2021"
    Idio> "\xc2\xa9 2021"
    "© 2021"
    
  • If we retrieve filenames from the file system then they will be in a “pathname” encoding, ie. just a stream of bytes.

Of course, we should allow a user to create a “pathname” encoded filename (string) for special purposes.

In particular, the Idio string, "\xa9 2021" will get a U+FFFD (REPLACEMENT CHARACTER) when it is used as a regular string and printed to a UTF-8 expectant terminal:

Idio> "\xa9 2021"
"� 2021"

because the Idio UTF-8 decoder was unable to decode 0xA9 when reading in the string. In other words this is a problem for Idio inputting what it thought was UTF-8 source.

So we need to force the interpretation of the “string” as raw bytes – most of which are likely to be perfectly valid UTF-8 encoded Unicode code points! I’ve mulled over % introducing formatted things (referencing printf(3)’s format specifier) and so a %P pathname format which doesn’t interpret the string as UTF-8 (although it probably will be mostly UTF-8 in the source code). Now I can say:

Idio> %P"\xa9 2021"
"© 2021"

Of interest, the reason you’re seeing the copyright symbol, there, is because the REPL has printed the value and printf will force a UTF-8 interpretation of 0xA9 which we can see with od:

Idio> printf %P"\xa9 2021\n"
© 2021
#<unspec>
Idio> printf %P"\xa9 2021\n" | od -t x1
0000000 c2 a9 20 32 30 32 31 0a
0000010
#t

Notice the c2 a9 (before the 20), the UTF-8 encoding of 0xA9 itself?

However, we can use a simpler (printing) interface:

Idio> puts %P"\xa9 2021\n"
� 2021
7

With U+FFFD (REPLACEMENT CHARACTER) indicating that 0xA9 isn’t valid UTF-8 input for this terminal and the 7 means that write(2) wrote 7 bytes. od shows what was output:

Idio> puts %P"\xa9 2021\n" | od -t x1
0000000 a9 20 32 30 32 31 0a
0000007
#t

This time we can see the raw a9 (before the 20). In other words this is a problem for the terminal inputting what it thought was UTF-8. It looks pretty similar to the earlier problem of Idio failing to decode a UTF-8 input stream, which makes verifying correctness more… fun when an input failure and an output failure are visually identical.

Compound Data Types

Scheme doesn’t come out of the box with any (other than the pair-derived list) but Idio is not Scheme.

Pair

OK, we will have pairs – and therefore lists – but let’s call a pair a pair and we’ll construct one with pair and get the head and tail with ph (pair head) and pt (pair tail) respectively.

p := pair 1 2
ph p                 ; 1
pt p                 ; 2

The Lispy cadr and friends become the Idioy pht etc.
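
For example, pht (pair head of tail) plays the role of Lisp’s cadr:

l := '(1 2 3)
pht l                ; 2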

I have deliberately departed from Lisps in that I don’t use . in the source/printed form. Largely because I wanted . for structure decomposition although the current choice of & isn’t my greatest decision:

ph '(1 & 2)

is a mental hiccup for people used to backgrounding commands in the shell. I fancy I will need to change my mind again.

One area I definitely want to change is varargs functions. So, based on the above, a varargs function is declared as:

define (foo a b & c) { ... }
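
A call supplying extra arguments would see them collected, Scheme-style, into a list bound to c – a sketch:

define (foo a b & c) {
  c
}

foo 1 2 3 4          ; (3 4)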

What I really want is to make that more in the style of EBNF where what we’re really saying is that c captures “the rest” of the arguments supplied to the function. In an EBNF-ish way, we might have written c* giving us:

define (foo a b c*) { ... }

This requires a little tweak to the evaluator to identify a right-hand odd-number of * symbols in the name of the last formal argument and if so quietly re-write the function’s form as an improper list.

You need to be reasonably careful as someone is bound to have a variable *foo* with matching *s (so it is just a regular symbol) but might be especially determined and use *foo** to mean varargs.

Arrays

Of course.

I’ve implemented dynamic arrays (partly because I wanted to use them internally for a stack) but note there is a subtlety in differentiating between the current size of the array in memory and how many elements are in use.

Broadly you can use any element up to the highest in-use element and you can push/pop and shift/unshift elements onto the array (affecting the number of elements in the array, obviously). When you create the array, the n slots you ask for are just a hint, there to avoid (possibly repeated) re-allocations of memory as the array grows.

You can access the array using negative indexes to index from the end.
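
For example:

a := #[ 1 2 3 ]
array-ref a -1       ; 3
array-ref a -3       ; 1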

Following the Scheme model we would create and access arrays along the following lines:

function make-array size [default]

  • size: initial array size (integer)

  • default: default element value (any); defaults to #f

a := make-array 2 "and"
array-ref a 1                        ; "and" - default value here, #f normally
array-set! a 0 "sue"
array-ref a 0                        ; "sue"

array-push! a "rita"
array-length a                       ; 3

array-ref a 99                       ; *error*

Naturally, armed with infix operators we can confuse ourselves with =+ and =- for push and pop and += and -= for shift and unshift. (Hint: think about whether the +/- is before or after the = – and they really should have an ! in them as they are modifying the underlying structure but there’s a limit). So I can say:

a =+ "too"

to push an element on the end and:

a += "bob"

to unshift (a Perl-ism?) onto the front.

Schemes allow conversions of an array to and from a list (list->array and array->list) which are neat enough ways to initialise an array. We’d probably like something more familiar such as:

a := #[ 1 2 3 ]

array-ref a 2                        ; 3

with #[ ... ] being an array constructor.

Hashes

Of course, these are de rigueur.

Not only will we use them for the usual… stuff … but they are the native representation of many structured interchange/file formats like:

  • JSON (from 2000?)

  • its more accommodating derivative, JSON5 (from 2012)

  • YAML (2001)

  • and the new pretender TOML (2013).

They’re going to work in a similar way to arrays:

h := make-hash #n #n 10

The #n arguments indicate that the default equivalence and hashing functions should be used (don’t worry for now!). The 10 is, again, a hint as to how much memory to allocate.

hash-set! h "a" "apple"
hash-ref h "a"                       ; "apple"

and a similar initialiser using pairs:

h := #{ ("a" & "apple") ("b" & "banana") }

hash-ref h "b"                       ; "banana"

Python might use a JSON-style construct to initialise a dictionary which I quite like.

Structures

These seem like a good idea and they’re probably going to be the basis of other things.

In one sense the implementation is quite simple. You have an underlying compound data type to store the actual fields, let’s say an array, and some mechanism to turn the request for a symbolically named field into a reference to the underlying actual field.

Suppose I define a structure type with:

define-struct bar x y

creating a structure type called bar with fields x and y. The declaration would create a number of structure manipulation functions allowing me to create an instance of one:

foo := make-bar 1 2

whereon I can access the elements with getters and setters:

bar-x foo            ; 1
set-bar-y! foo 10
bar-y foo            ; 10

That looks a little clumsy with a field reference being:

typename-fieldname instance

dot operator

It would be useful to have an infix operator come to our rescue for structures (and arrays and hashes and strings and …, for that matter) by expressing our intent and having something else figure out the detail. This is where I saw Perl’s Template::Toolkit and then Jinja show the way.

For our structure example:

foo.y                        ; 10

for arrays and hashes and strings you can do the sensible thing:

array.10
hash."a"

you can imagine these are reworked into

array-ref array 10
hash-ref hash "a"

We ought to be able to use variables too as:

i := 10
array.i

seems reasonable.

You can also assign to these constructs:

array.10 = "rita"
hash."c" = "carrot"

which will take us into the esoteric world of boxed variables later on.

I’ve also allowed for the index to be a (named) function where the (now mis-named) index is applied to the value allowing us to write:

str.split-string

which is trivially transformed into:

split-string str

Hardly rocket science albeit split-string is defaulting to using IFS as the delimiter. Here we might write a function to get the first letters of each word:

str := "hello world"

define (foo s) {
  map (function (w) {
         w.0
  }) s.split-string
}

printf "%s\n" str.foo                ; (#\h #\w)
printf "%s\n" str.foo.2              ; w

Note

The .2 in the str.foo.2 is accessing the second element of a list. Strings and arrays are indexed from 0 (zero) but lists from their first element.


This . syntactic sugar is much easier to read and understand (for us simple folk) although there is a cost. Idio is a dynamic language where only values have types and we can only reason with those types at runtime.

This means the dot operator has to plod through testing to see if its left hand side is an array, hash, string, structure, … and call the appropriate accessor function. That makes it relatively expensive.

So, write the underlying accessor function out in full by hand if speed is an issue.
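
For example, both of these return "apple" but the second skips the type dispatch:

h := make-hash #n #n 10
hash-set! h "a" "apple"

h."a"                ; convenient
hash-ref h "a"       ; quicker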

Maybe, if, possibly, we did some type inference work we could recognise that h is a hash and therefore re-write the dot operation h."a" as hash-ref h "a" directly.

Warning

The dot operator is far from a panacea. In particular, the fact that it will allow you to use variables to walk over an array, say, arr.i, where i is some loop variable, is great.

Until you want your i to be the symbolic name of a structure field, say, si.sym and you’ve only gone and defined a variable called sym somewhere else in your code.

Here you can force sym to be the symbol 'sym or use a type-appropriate accessor such as one of the field accessor methods for your structure’s type, say, st-sym si.
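
A sketch of the trap and the two escapes – assuming a structure type st with a field sym:

define-struct st sym
si := make-st "field value"

sym := "oops"        ; an unrelated variable elsewhere
si.sym               ; dot now sees the variable sym, not the field
si.'sym              ; force the symbol…
st-sym si            ; …or use the type’s own accessor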

Handles

I don’t like the name ports: I immediately thought of TCP/IP ports and was mildly confused.

Executive decision: I’m renaming them handles like Perl and others. So, file handles and string handles.

Templates

Macros in Lisps have had a bad press. If we rename them templates then we’re halfway to sorting the problem out, right?

Also I find the quasiquoting hard to read: ` (quasiquote), , (unquote) and ' (quote) combine to become a bit unintelligible (to non-Lispers). On top of which, I quite like $ as a sigil telling me I’m about to expand some expression, it’s much more shell-ish/Perl-ish.

Here’s my thinking. I’m going to replace quasiquote itself with my template which is not only going to be a bit weird for the reader but I need to indicate the start and end with, say, { and }, so something like:

#T{ ... $(h."a") ... }

Getting a little ahead of myself I was thinking about the problem of here-documents where the snippet may be targeted for a language which itself uses $ as a sigil. So what we need is the ability to indicate we want to use a different sigil for unquote – we could actually use (unquote ...), the function call, but nobody does that.

So # is required to tell the reader something weird is coming and T tells it to handle a template. Then { is there to mark the start of the “block” of quasiquoted code. Can we squeeze in some flags, maybe, to indicate a change of unquote sigil between the T and the {? Of course we can!

There’s more, though. We also have @, the extra sigil for unquote-splicing – which is fine as a default, but we should be able to change that. Hang on, ' is the sigil for quote’ing things (again) – which, again, is fine as a default, but we ought to be able to change that too. Finally, we could do with some means of escaping any of the above. By default, in Idio – it’s not a Scheme thing – that is \. Which is the, er, universal escape sigil.

The upshot of that is that we can let loose and say:

#T!%:;{ ... !%(a.3) $foo ... }

where !% is now our unquote-splicing sigil (assuming a.3 results in a list, of course!) and on we go with, say, (... 1 2 $foo ...) in our hands!

If you only wanted to change the escape sigil, say, you can use . for the others meaning use the default: #T...!{ ... }.

If you want to use . as your unquote sigil (or any other), tough!

Clearly there’s no immediate need to change any template sigils even if the snippet is for Scheme as the unquote sigil, $, doesn’t conflict with Scheme and we can embed one within the other with ease (probably, must try it!).

Pathname Templates

Pathname templates have been suggested as a means to isolate shell-globbing meta-characters from Idio’s hungry hungry hippo dot operator.

ls -l #P{ *.txt }

I guess, in a similar fashion, we should consider different meta-characters although it would require re-working internally as glob(3) won’t change its meta-characters.

Of interest is that the wildcard expression, like a prepared regular expression in Perl or Python, is not expanded until it is required.

You should be able to say:

the-txt-files := #P{ *.txt }

...add/delete .txt files...

ls -l the-txt-files

and only get the files extant at the time of invoking ls.

Work in progress.

Sorted Pathname Expansion

One thing we must be able to do is sort the result of pathname expansion. How many people have resorted to bespoke mechanisms to get the newest, largest, most whatever-est file from a set of files?

In Perl we can use the glob function to match filename patterns:

cmd (glob ("*.txt"));

and it’s not such a step to add an anonymous function to sort the results:

cmd (sort { ... } glob ("*.txt"));

That said, if we wanted to sort the files by modification time, say, we would want to hand the work off to a function that will glob the pattern, stat(2) each file (caching the results), sort the list by modification time from the stat results and return the sorted list:

cmd (do_sorting_stuff ("*.txt"));

I think we can do something similar in Idio. We need a little bit of help first in that we need an actual sorting function. GNU’s libc only supports qsort(3) but we can fall back on our Scheme friends and SRFI-95 gives us some useful sorting and merging functions, here:

function sort sequence less? [key]

  • sequence: the sequence to be sorted (array or list)

  • less?: comparison predicate (function)

  • key: accessor returning the value to be sorted by (function)

which will return a new sequence sorted according to less? (a comparator function) where the value to be compared for each element of sequence can be retrieved by calling the function key with the element.

In other words, sort is smart enough to let you sort by something that isn’t the actual element in sequence but rather a value you can derive from each element.

With that indirection in mind, to sort by size or modification time etc. we need to stat(2) all the files in question and then call sort with an appropriate accessor into our stat’ed data.

That, in turn, requires not just the ability to call stat(2) but also access to something that can compare two C size_ts (or time_ts or …).

There is a libc/stat function which returns a C/pointer structure where the fields are named after their struct stat equivalents and a C/< function for those two.
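
As a sketch – the filenames are placeholders – we can pull a field out with the dot operator and compare two of them:

sb1 := libc/stat "foo.txt"
sb2 := libc/stat "bar.txt"

C/< sb1.sb_size sb2.sb_size  ; #t if foo.txt is the smaller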

Here’s a snippet showing the code for sorting by sb_size.

lib/path.idio
sort-size := #n              ; global scope

{
 sort-stats := #n

 key-stat-size := function (p) {
   sort-stats.p.sb_size
 }

 sort-size = function (l) {
   sort-stats = make-hash #n #n ((length l) + 10)

   for-each (function (p) {
     sort-stats.p = libc/stat p
   }) l

   sort l C/< key-stat-size
 }
}

Using sort-size as an example: it is called with a list of pathnames. It points the private variable sort-stats at a new hash big enough for the list of pathnames passed in and then walks over that list assigning the results of the libc/stat call to an entry indexed by the pathname.

It can now call sort with the original list, the C comparator function and a function that knows to access the sb_size field from the table of libc/stat results.

A reversed version of this sort could repeat the function with C/> although a more Schemely way would be to have the reversed function simply call reverse on the results of sort-size:

sort-size-reversed = function (l) {
  reverse (sort-size l)
}

before we take the full Lispy experience and note that these are all derivative and that I really only need to know a pair, (size sb_size) and I can put the rest in a template.

To complete the picture, we need a dynamic variable, say, ~glob-sort~, to be pointed at sort-size (or sort-mtime etc.) and for the pathname matching code to call it.

~glob-sort~ = sort-mtime

files := #P{ *.tgz }

rm -f files.0

(maybe apply a bit more rigour to your algorithm for choosing which file to rm…)

String Templates

String templates would act as a form of here-document rather than a template per se which is for code generation. For a string template we are looking to generate a string, duh!, rather than some code.

To some degree, the expansion of expressions in the template is derivative of the work done to convert values into strings for the code that executes external commands.

Here documents are useful for creating templated output, in particular, for other commands to use as input. Ignoring that we’re using awk’s stdin for the script:

awk << EOT
/pattern/ { print $USER, \$1; }
EOT

then we see an issue in that we now have a mixture of shell variables, $USER, and the targeted command’s variables which now require escaping, \$1 to prevent it being interpreted as a shell variable.

What we really want to do is create a string representing the input for the targeted command and have a distinct interpolation sigil for our own variables, for example:

awk << EOT
/pattern/ { print %USER, $1; }
EOT

So, in the same style as we’ve seen for templates we might try:

awk << #S%{
/pattern/ { print %USER, $1; }
}

with #S telling the reader there’s a string template coming, % changing the unquote sigil from $ to % and { through to } delimiting the string. (We can debate if the template should honour or elide leading and trailing whitespace).

It assumes that we (in Idio-land) have a USER variable to evaluate – which we quite likely do as most people have it set as an environment variable.

In this particular case, trying to have awk use its stdin for both the script and the stream it is trying to filter is doomed. Clearly what we need is something like Process Substitution but for strings.

Expansion

Of the various forms of shell expansion that we said we’d keep:

Command Substitution

An old favourite:

now=$(date +%Y%m%d-%H%M%S)

(missing a time zone indication etc., yes, I know)

Remember String Ports?

out := (open-output-string)
date +%Y%m%d-%H%M%S > out
now := get-output-string out

Where the IO redirection operator, >, is happy to do the right thing when given a string (for a filename) or a handle (file or string).

I guess we could do with another bit of syntax sugar to eliminate the temporary variable, out. How about:

collect-output date +%Y%m%d-%H%M%S

with the value returned by collect-output being the string of the output from the command.

This should work for pipelines too!
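
For example – a sketch, assuming collect-output captures the output of the whole pipeline rather than just the first command:

now := collect-output date +%Y%m%d-%H%M%S

nprocs := collect-output ps -ef | wc -l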

The IO redirection infix operator > does something sensible with:

  • strings: date > "foo" – creating the file, foo

  • handles, both file and string: date > handle

I suggested before using >> to capture “output” in a string, something like:

date +%Y%m%d-%H%M%S >> now

where the now variable should be assigned the result of the get-output-string on the temporary string handle (not shown but obviously like out above) used to capture the output.

It will confuse many people, mind.

Process Substitution

I haven’t written this yet but what in the shell looks like:

... < <(cmd args)

diff <(pipeline using file1) <(pipeline using file2)

to become:

... < /dev/fd/M

diff /dev/fd/M /dev/fd/N

is all timed rather nicely as process substitution occurs before IO redirection hence ... < <(cmd args) becoming ... < /dev/fd/M and then < can (re-)open the file as input for ....

I was wondering if

... <| cmd args

with <| symbolising the use of a named pipe for the following command might work – I guess >| for the output equivalent.

But that doesn’t quite wash as I can’t use that twice in the same statement unless the behavioural code behind <| is smart enough to spot the second <| (and >|) and we sort of assume that cmd args extends to the end of the line.

We could simply keep the parentheses and then the command is correctly delimited and we can have several on the same line.

... <(cmd args)

The problem here, though, is we need to convince the reader to do the right thing. It will have seen both < and ( to determine it is the named pipe IO variant. It can then read a list up to the corresponding ) and then it has:

rhubarb <( cmd-list rhubarb

Where the <( operator behaviour needs to handle the named pipe business.

Some thought required.

In the case of the diff example, here some more programming oriented developers might suggest that we can avoid named pipes altogether if we write our own diff using file descriptors to the pipelines.

We’re of a Unix-toolchain heritage, though, where if someone has created a useful tool then we should be using it. We’re also a shell where the overhead of running an external program is not a factor in the grand scheme [sic] of things.

The only real problem with Unix toolchain pipelines is the final form isn’t always easy to consume as it is often destined for the terminal and a user to pass an eye over. How do we handle the output of diff programmatically?

Jumping back to the awk issue mentioned just before in String Templates, you feel that whatever implements <(/>( could do with doing something sensible if it is given a string rather than a list:

  • create a temporary file

  • write the string to the file

  • open the file for reading

  • generate the /dev/fd/N form

  • run the command

  • clean up the file descriptor

  • remove the temporary file

That isn’t quite everything as the provocation to read a script from a file is different for every command. Here we might have said:

awk -f <( #S%{ ... } ) file+

which, because of the way the reader will consume multiple lines when reading a block might look like:

awk -f <( #S%{
  awk-ish %this with $that
  awk-ish the other
}) file+

which takes a little getting used to – in the sense that a command line trails over multiple lines – but, I think, works OK.

(Apart from using % as a sigil in any scripting language that uses printf.)

Modules

Do we need modules? Need is a very loaded term. They’re certainly very useful in terms of encapsulating ideas and, indeed, avoiding some unfortunate name clashes. Not that unfortunate name clashes are consigned to the bin but they can be reduced enormously.

Of course everyone and his dog has their own ideas about modules including Scheme – or was it the dog?

R7RS Scheme has developed the idea of a library involving expressive (and therefore complex) mechanisms to include bodies of code.

It’s one of those things where it seems too much for us just starting out. I fancy something simpler which, like so many things, may become deprecated over time.

At a high enough level what we want is to allow a developer to Mash The Keyboard™ to produce Art from which they intend to export (ie. make visible to their adoring public) only some of the many many names they used in the implementation.

The gullible adoring masses can then import said Art and those exported names are now available for use locally.

Art.idio
module Art
export (
        beauty
        light
)

ugly := ...
beauty := not ugly

darkness := ...
light := not darkness

and then I can:

import Art

printf "shine a %s on %s\n" light darkness

and get an error about darkness being unbound. I haven’t defined darkness in my code and nothing I have imported has exported it. light on the other hand is just fine.

Of course we will still get name clashes. We are (read: I am) keen to poke about and get up to no good so I quite like the idea of getting access to read(2) for which I can write a primitive and export it from the libc module.

Hmm, Idio is quite likely to have a function called read in the reader. If I:

import libc

read ...

which read am I going to get?

The answer is… that depends. What the Idio engine will do is look to see if the symbol exists in the current module (which presupposes that we can change the current module) and use such a symbol if it exists. Next it will try each of the imported modules’ exported names lists looking for the symbol.

Note, here, that by default, every module imports the job-control and Idio modules. job-control because that handles | and various I/O redirection that everyone expects in a shell and Idio because that’s where most functions are defined.

So, the chances are you’re going to get the libc module’s read function as the reader’s read function is actually in the Idio module and would therefore only be found as a fallback if we hadn’t imported libc.

What if I want the other one? You can explicitly ask for it with:

import libc

module/read ...

so, Idio/read in this case.

You can’t ask for names that aren’t exported. That would be wrong.
