Reading Expressions¶
Reading individual expressions is where the action is.
Broadly, the kind of expression is determined by the first character we read, for example:
( a list up to the matching )
{ a block up to the matching }
" a string up to the next non-escaped "
# something weird
with a fallback of the accumulation of characters being a word (a symbol – but I prefer the shell-ish term, “word”).
In fact numbers fall into this word category as well (partly due to the funny number formats that Scheme accepts) and are differentiated from words because we can subsequently figure out it is a number.
(And there’s whitespace and all the rest of it.)
Reading of individual types is covered later in Simple Types and Compound Types; here we’re looking at the broader picture.
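As a rough illustration of dispatching on the first character, here is a minimal Python sketch. The function name `classify` and the returned labels are made up for this example and are not part of Idio's reader; the character-to-kind mapping follows the sections on lists, blocks, strings and `#` forms in this document.

```python
# Illustrative only: classify() and its labels are hypothetical, not Idio's reader.
def classify(first_char: str) -> str:
    """Guess the kind of expression from its first character."""
    kinds = {
        "(": "list",    # read up to the matching )
        "{": "block",   # read up to the matching }
        '"': "string",  # read up to the next non-escaped "
        "#": "weird",   # constants, number formats, comments, ...
    }
    if first_char.isspace():
        return "whitespace"
    # Anything else starts a word, which may later turn out to be a number.
    return kinds.get(first_char, "word")


for src in ["(+ 1 2)", "{ a b }", '"hello"', "#t", "3pi/4", "pi"]:
    print(src, "->", classify(src[0]))
```

The point of the fallback is that nothing needs to know about numbers at this stage: `3pi/4` and `3` both start life as words.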
Talking of whitespace and all-encompassing words, Idio likes a bit of whitespace as I’ve noted before. If you’re not using whitespace then you’re creating an interesting symbol.
This means that some words (from a reader perspective) are not a lean form of arithmetic but are actual symbols (and possibly variables):
Idio> 3pi/4
#i2.35619449019234491e+0
3pi/4 is a symbol and is, perhaps not surprisingly, 3 * pi / 4 but now as a handy variable.
Word Separators¶
The inverse of expressions, perhaps, but it’s useful to know what prevents a word from consuming the entire rest of the file.
Many of these are sanity-preserving!
Whitespace¶
An obvious word break is whitespace although it’s currently being implemented with questionable adherence to a formal definition of whitespace characters.
In the first instance, born in the ASCII/Latin-1 era, whitespace was in two parts:
SPACE and TAB for gaps between words
NEWLINE and CARRIAGE RETURN for ends of line – which should cover the myriad of Unix, Mac OS and Windows variants
However, even in those simplistic times it still didn’t honour ASCII’s VERTICAL TAB or FORM FEED.
Of those two, I think I’ve only seen FORM FEED in the wild and only then as ^L in Emacs Lisp files.
So, a poor start, and it remains in exactly the same poor position until I can make a decision about whether to go all in with Unicode category types (which would take us up to about 25 whitespace code points).
Bracketing Characters¶
The expressions in a list or an array might butt up against the closing bracket, and we can allow that if right parenthesis, ), and right bracket, ], delimit a word.
Their left counterparts, ( and [, are also word breaks meaning that you cannot have, say, foo(bar as a symbol. It’d be too confusing in
(this foo(bar)
Both left and right brace, { and }, are not allowed in words for the same sanity-retaining reasons.
Double Quote¶
Again, a " is a word break to avoid symbols like foo"bar.
Semicolon¶
This is the line-comment character: everything after it is discarded to the end of the line so, again, we can avoid symbols like foo;bar and thus not be confused by:
this that foo;bar ; the other
Dot¶
You can’t have . in a word. Unless it’s a number….
. is used for the value-index operator so we can say thing.field and value-index will figure out the right form of access of field within thing.
Escape¶
If you’re really determined you can escape any of the above (and the escape character itself) in the source: a\;b will create a symbol called a;b and a\\b will create a symbol a\b – although you’ll continually need to use a\\b in the source to manipulate it.
Not a Word Break¶
These are, maybe unexpectedly, not (currently) word breaks.
The pair separator, &. You must currently 1) be in a list and 2) separate the head and tail expressions from & with whitespace. That allows a&b as a valid symbol.
Interpolation characters, which have no function other than as the first character of a word, in which case they are handled separately anyway. a$b, a@b and a'b (assuming the default interpolation characters) are all valid symbols.
I suppose this allows you to have a sequence of a, a' and a'' if it pleases you so.
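The word-break rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Idio's actual reader: `read_word`, `WORD_BREAKS` and `ESCAPE` are hypothetical names, the whitespace set is the four ASCII characters mentioned earlier, and the "unless it's a number" exception for the dot is deliberately ignored.

```python
# Hypothetical sketch; the delimiter set is taken from the prose above.
WORD_BREAKS = set(" \t\n\r()[]{}\";.")
ESCAPE = "\\"


def read_word(src: str, pos: int = 0) -> tuple[str, int]:
    """Accumulate a word from src[pos:], returning (word, next position)."""
    chars = []
    while pos < len(src):
        c = src[pos]
        if c == ESCAPE and pos + 1 < len(src):
            # An escaped character is taken literally, even a word break.
            chars.append(src[pos + 1])
            pos += 2
            continue
        if c in WORD_BREAKS:
            break
        chars.append(c)
        pos += 1
    return "".join(chars), pos


# &, $, @ and ' are not in WORD_BREAKS, so they stay inside the word.
for src in ["foo(bar", "thing.field", "a&b", "a'b", r"a\;b"]:
    print(repr(src), "->", read_word(src))
```

Keeping the delimiters in one set makes the trade-off visible: every character added there is one fewer that can appear unescaped in a symbol.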
Lists¶
Ignoring the implied list in the overall line-oriented handling from above, reading a list expression is quite easy. So we’ll do a more complicated example.
The “…up to the matching…” part reflects the lists-within-lists nature of Lispy languages. When we read:
(+ (2 * 3) (6 / 2))
we’ll have:
identified this as a list because the first character is ( so we’ll have called the idio_read_list() function which reads “up to the matching” )
we’ll read the word +
we start to read the next expression which begins with ( so it’s another list and we simply recurse into idio_read_list() again
    we can now read the words 2 then * then 3
    then we hit ) and we can construct and return a list from our three words: (2 * 3) – I told you this becomes confusing!
the outer list gets a second expression, (2 * 3)
we can read the next expression which also begins with ( so off we go again: 6 and / and 2 gives (6 / 2)
the outer list gets a third expression, (6 / 2)
and finally we get our own ) and can create and return a list from our own three expressions: (+ (2 * 3) (6 / 2))
we get (+ (2 * 3) (6 / 2)) in our hands
The net result of which is a data structure in C whose printed representation is exactly what we read in from the source code…. However, it is now in C memory.
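The recursion in that walk-through is easier to see in code. The following Python sketch mirrors the steps above but is not the C implementation of idio_read_list(): the names `read_list` and `read` are hypothetical, words are kept as plain strings, and only parentheses and whitespace are treated as word breaks.

```python
# Hypothetical names throughout; this mirrors the recursion described above,
# not the actual C implementation of idio_read_list().
def read_list(src: str, pos: int) -> tuple[list, int]:
    """Read expressions after an opening ( until the matching )."""
    items = []
    while pos < len(src):
        c = src[pos]
        if c.isspace():
            pos += 1
        elif c == ")":
            return items, pos + 1  # consume our own closing parenthesis
        elif c == "(":
            inner, pos = read_list(src, pos + 1)  # recurse for a nested list
            items.append(inner)
        else:
            start = pos
            while pos < len(src) and not src[pos].isspace() and src[pos] not in "()":
                pos += 1
            items.append(src[start:pos])  # a word
    raise SyntaxError("EOF before the matching )")


def read(src: str) -> list:
    """Read a single list expression that starts at the first character."""
    if not src.startswith("("):
        raise SyntaxError("expected (")
    expr, _ = read_list(src, 1)
    return expr


print(read("(+ (2 * 3) (6 / 2))"))
```

Each call returns both the list it built and the position just past its own closing parenthesis, which is what lets the outer call carry on where the inner one stopped.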
Blocks¶
Blocks are very similar to lists in the nested sense but they differ in that a block is expecting one or more line-oriented statements whereas a list is expecting one or more (simple) expressions.
Broadly, a block is describing a sequence of shell-ish commands to be run and a list is describing how an individual command or expression within a command should be constructed.
Blocks obviously contain lists (implicitly, if nothing else) but lists can also contain blocks as in the consequent and alternate clauses of a conditional statement:
if { consequent clauses } { alternate clauses }
Weird Stuff¶
There’s quite a range of weird stuff, introduced by #, largely because it’s a handy place to lump stuff and then no-one has to think too hard – until we get conflicts for the semantic meaning of T, say.
By and large, # introduces some kind of a constant:
#t, #f and #n
#\ starts a literal, #\a, or named character, #\{space}
#[...] is a “constant” array definition
#{...} is a “constant” hash table definition
#B{...} is a “constant” bitset definition
#U+... Unicode code point
number formats:
    #b... binary
    #d... decimal
    #o... octal
    #x... hexadecimal
    #e... an exact number
    #i... an inexact number
#T{...} a template
#S{...} an interpolated string
#P"..." a pathname (broken <sigh>)
comments:
    #* a block comment through to the matching *#
    #| a semi-literate block comment (to be defined) through to the matching |#
    #; an expression comment
#< provokes a reader error
“constant” is quoted as the expression is constant and can’t be modified but your use of it, as in:
arr := #[ 1 2 3]
has an implied copy made of the constant array expression.
I’m already thinking of some % variants to #: string expansions, regular expressions, ….
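Of the # forms listed above, the radix-prefixed number formats are the simplest to illustrate. This Python sketch covers integers only and is an assumption-laden toy: `read_radix_number` and `RADIX` are hypothetical names, and whether Idio accepts signs, fractions or upper-case prefixes in these forms is not something this sketch claims to know.

```python
# Hypothetical helper: maps the prefix letter after # to a numeric base.
RADIX = {"b": 2, "o": 8, "d": 10, "x": 16}


def read_radix_number(token: str) -> int:
    """Parse integer tokens such as #b101 or #xff."""
    if len(token) < 3 or token[0] != "#" or token[1] not in RADIX:
        raise ValueError(f"not a radix-prefixed number: {token!r}")
    # int() raises ValueError itself if a digit is invalid for the base.
    return int(token[2:], RADIX[token[1]])


for tok in ["#b101", "#o17", "#d42", "#xff"]:
    print(tok, "->", read_radix_number(tok))
```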
Words and Numbers¶
This is following in the S9fES-style ([Hol14]) and seems to work quite well.
Everything not otherwise consumed as something specific (list, block, string, weird, …) is consumed as a word. So pi, 3pi, 3pi/4, 3.14, 314e-2 and 3 are all initially read in as words.
We then try to convert the word to a number, a bignum in particular. If our attempt to convert the word uses all of the characters of the word then the word becomes a number, otherwise it remains a word.
So, pi doesn’t have any hope of being interpreted as a number so remains a word. 3pi starts promisingly with 3 but fails on the pi bit and remains a word. Ditto, 3pi/4.
3.14 does satisfy the criteria for a number, no letters in this case, one decimal point (in the right sort of place), all good. A bignum in this case because it’s floating point.
314e-2 also becomes a bignum as the e is a valid exponent character and -2 is a valid exponent (i.e. is an integer).
3 is a bit more interesting as we can determine it’s a number but can also throw a couple of heuristics at it and see that it can be a fixnum so we’ll return one of them instead.
However, those heuristics err on the side of bignums, so 3e0, which has the value 3 and could therefore reasonably easily be a fixnum, remains a bignum because the user gave us a number in the “exponent-style” and was therefore showing intent that they didn’t want us to do any funny business under their feet.
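The "read a word, then see if all of it converts" rule can be sketched in Python. This is only an approximation of the behaviour described above: `word_to_value` and the tuple tags are hypothetical, Python's `Decimal` stands in for a bignum, the fixnum range used is an assumed placeholder, and none of Scheme's funnier number formats are handled.

```python
import re
from decimal import Decimal, InvalidOperation

INTEGER = re.compile(r"[+-]?\d+")


def word_to_value(word: str):
    """Return ("fixnum", int), ("bignum", Decimal) or ("word", str)."""
    if INTEGER.fullmatch(word):
        # Plain integer syntax: small enough values can be fixnums.
        n = int(word)
        if -(2**61) <= n < 2**61:  # assumed fixnum range, for illustration only
            return ("fixnum", n)
        return ("bignum", Decimal(n))
    try:
        # Decimal() must consume the whole word: 3.14 and 314e-2 convert,
        # 3pi/4 does not.
        d = Decimal(word)
    except InvalidOperation:
        return ("word", word)
    if not d.is_finite():
        # Decimal also parses "NaN" and "Infinity"; treat those as words here.
        return ("word", word)
    # Decimal point or exponent style: stay a bignum, even for 3e0.
    return ("bignum", d)


for w in ["pi", "3pi", "3pi/4", "3.14", "314e-2", "3", "3e0"]:
    print(w, "->", word_to_value(w)[0])
```

The integer check comes first so that only words written as plain integers are ever considered for fixnums; anything in exponent or decimal style skips that path, which is the "err on the side of bignums" choice.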
Last built at 2024-12-21T07:11:03Z+0000 from 463152b (dev)
Comments¶
Scheme, at least, comes with two or three comment types: line and (nestable) block comments and an expression comment.
Line Comments¶
I’ve kept the standard Lisp ; as a comment character even though that clashes with the statement terminator from the shell.
Without a statement terminator, one-liners (of which I’ve written “a few” myself) are impossible. That’ll make life interesting.
He says, avoiding looking.
On the other hand, wherever I’ve written a script version of my one liner I don’t recall a case where I’ve continued to use a one-liner in the script. I always flatten it out to the line-by-line mode that I would have written the script in in the first place.
I suspect that’s because mentally I am now committing myself to the permanence of the script and therefore I am pre-empting the inevitable addition of debugging and better support for edge cases that precludes the code being bunched up.
One-liners are transient hacks, right?
Block Comments¶
I like the idea of nested comments – meaning you aren’t annoyed by an inner comment that you’re enclosing causing the wider comment to end prematurely. However, I’m breaking with Scheme and using #* … *# for generic block comments and reserving the (mutually aware and equally nested) #| … |# for some as-yet-undocumented semi-literate programming style.
Expression Comment¶
I’m not sure if this is universal but it works very well for Schemes that do have it. Scheme, of course, has every expression in the shape of a form which makes life easy.
For Idio, it’s a little more tricky but the basic premise remains: we can comment out an individual expression with #; so that:
becomes
Neat.
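For the Scheme-style case, where everything is a parenthesised form, an expression comment amounts to "read the next expression and throw it away". The Python sketch below shows that idea only; it works on pre-split tokens, the names `read_one` and `read_all` are hypothetical, and it sidesteps the line-oriented complication that makes this trickier for Idio.

```python
# Hypothetical sketch of Scheme-style #; handling over pre-split tokens.
def read_one(tokens: list[str], i: int):
    """Return (expression, next index) for the expression starting at tokens[i]."""
    if tokens[i] != "(":
        return tokens[i], i + 1  # a plain word
    items, i = [], i + 1
    while tokens[i] != ")":
        if tokens[i] == "#;":
            _, i = read_one(tokens, i + 1)  # read the next expression, discard it
            continue
        item, i = read_one(tokens, i)
        items.append(item)
    return items, i + 1  # step past the closing parenthesis


def read_all(tokens: list[str]) -> list:
    """Read every top-level expression, honouring #; at the top level too."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "#;":
            _, i = read_one(tokens, i + 1)
            continue
        expr, i = read_one(tokens, i)
        out.append(expr)
    return out


print(read_all("( a #; ( b c ) d )".split()))
```

Because the commented expression is read with the same function as any other, a whole nested list disappears as easily as a single word.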