Idio Style¶
I can’t say I’m a huge fan of imposing style on other people but we should probably consider adhering to some conventions to make life easier for everyone approaching a piece of code and looking to make sense of it.
That said, I notice I am becoming (become?) a creature of habit resulting in moderately consistent look and feel. Much of which, I confess, is the result of either endemic casual errors or repeatedly having to go back and add debug statements such that I write the code expecting to have to come back to add the debug statement and so put the infrastructure in place first.
Some of this is a recognition of my own failings which will, of course, remain in the code until discovered.
Names¶
We have to create names in the code for which there are many
opportunities to sow confusion not because the name doesn’t describe
what the function is doing, say, but because the tense and
passive/active sense of the actors means that we can’t tell if someone
else has already solved this problem. Does array-ref
do the same
as ref-array
? Should we “ref” something or “get” it or something
else entirely? Does the absence of the one mean the functionality is
covered or that it needs to be written?
I’m going to propose that we have two headline rules (using imperative verbs):
verb - noun
Here we are performing some query or imperative action on a value – the whole value, not some sub-part of it:
make-array size [default] copy-array array
noun - verb or attribute or element
Here we are accessing some part or characteristic of a value – and in particular, not the whole value:
array-ref array index array-set! array index value array-fill! array value array-length array
array-ref
with the intention to return the entire array rather
than an element is semi-nonsensical as you would simply pass the array
itself. Similarly array-set!
with a whole array? What is the
intention, what are you trying to do?
By way of some possibly confusing examples, on the one hand, if we
want to set a particular attribute of a compound element then we use
the former variant, eg. set-ph!
because we are setting the
“entirety” of that attribute. On the other hand, from the
C API, the likes of struct-rlimit-set!
go the other way
because the function needs to be told which (of several) elements to
set (like setting an array element).
If we had defined an accessor for struct rlimit we might have said
set-struct-rlimit-rlim_cur!
, say. But we haven’t, so we can’t.
There are, of course, anomalies which could be fixed. Pairs/lists are
a case in point. The constructors are pair
& list
(rather
than make-pair
& make-list
) and the accessors, nominally,
pair-head
and pair-tail
, are truncated to ph
and pt
and their derivatives. These are used so often that if you didn’t
have some circumscribed name you would simply create it – hence,
phht
etc..
All of the predicates are exceptions though that’s because we can use a question mark in names:
array? array
On the subject of -ref
vs. -get
I fall into the former camp.
We are only making a reference to the underlying value. We are
not in any sense “getting” it and thus removing from the clutches
of another. There are any number of venerable texts on good English
grammar (whether you think such a thing is an oxymoron or a quixotic
cri de couer) which decry the use of “get” and “got”. Here it is
simply not accurate.
There are cases where it is (more) accurate as in
get-output-string
(to return the contents of a dynamic output
string). Here, the string you are given did not exist previously and,
indeed, if you call it again will return a different string (albeit
the content, the sequence of Unicode code points, is the same). You
might argue that it should be return-output-string
or
make-output-string
some other imperative verb-noun construction
but I think the intention is clear – and you are getting something
that no-one else will!
—
The Microsoft equivalent is the registry key
FullSecureChannelProtection=1
which
lives… somewhere.
I’m also a fan of meaningful user-facing names. I’m reading that the
magic flag in the Samba 4.x release for the Windows AD (actually
Netlogon protocol, hence Samba being affected) ZeroLogon bug is
server schannel = yes
where schannel
indicates a secure
netlogon channel. That’s really not very obvious, either the nearly
invisible leading s
or that it has anything to do with security or
netlogon. Text, like whitespace, is free.
Here, we could get into a mess about the semantics of configuration files, are we defining some set of parameters or making some (actionable) statement?
secure netlogon channel = yes
require secure netlogon channel
Either way, some clarity of intent makes life easier for everybody.
C Naming Convention¶
C should work under the same regulations with a couple of important additions which derive some some fanciful idea that Idio could be embedded in some larger program. What we need is for all Idio names to be unique in someone’s program.
There’s a simple fix for that: prefix everything with idio_
or
IDIO_
. Boom! Job done.
Actually, that requires a little clarification.
There’s a qualification to C macro names in that for the C API the transposed-from-C names are preserved, case-intact.
I’ve made all C preprocessor macros have names in upper case. That’s fairly common. There are a lot of quite involved macros (for me) and so if something is a macro it should stand out:
{
IDIO_ASSERT (o);
...
}
performs, as you might guess, some basic checks on the internal state
of the value o
. In fact what IDIO_ASSERT()
does varies
depending on how much debugging information is being compiled in.
For local macros you should make the macro name local using the file
name as an additional prefix plus any suitable functional prefix:
IDIO_READ_CHARACTER_SIMPLE
is a macro value in src/read.c
used as a flag in the code for inputting characters. In one sense,
CHARACTER_SIMPLE
is the useful macro part and IDIO_READ_
is a
distinguishing prefix.
All other C names with external linkage should be prefixed
with idio_
and it should be the case that all C static
names are too.
Hopefully, the only non-Idio-ified name in the code is
main
.
Lexical (scoped) names are a bit more problematic. As a standard lazy programmer I will use the shortest coherent name that makes sense in context. Of course, you only discover how effective that has been when you return to the code some time later to be left scratching your head as to what some variable represents.
I suspect I am a subconscious subscriber to initialism: l
for a list, p
for a pair, a
for an array, h
/ht
for a
hash (or hash table, depending on mood) etc.. I usually use r
for
the value to be returned (see below).
One variant on that is the surprisingly rare case of passing an Idio value into a function and extracting the underlying C value from it. You have two variables presenting the same value – just trussed up in different ways. This happens with integers but relatively rarely elsewhere (as C doesn’t support complex types).
Here, we should probably use some pseudo-Hungarian notation where
x_I
is the Idio variant and x_C
is the C
variant. I know, though, that I’ve used Ix
and Cx
and
elsewhere one variant doesn’t use a prefix/suffix and the other does.
This last variant is particularly prevalent where C functions are the primitive implementations of Idio functions. The C function will take arguments that should be named after the nominal Idio argument names. Here, then, the C variant gets the suffix.
I have used ci
for the C value of a “constant index” (or
index into the table of constants) and fci
for the Idio
fixnum variant.
Mea culpa.
Symbols¶
All of the standard Idio symbolic names/values are available
in C. They are all prefixed idio_S_
with the S
meaning “symbol”.
They aren’t all quite like-for-like but the majority are:
C |
Idio |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
There are a few other classes of Idio values that are used in C:
prefix |
usage |
---|---|
|
symbols – as above |
|
reader tokens |
|
intermediate code |
|
keywords |
You get the idea!
Idio Naming Convention¶
It would be nice to have some consistency in Idio names.
We’ve seen hints of a triplet of names associated with a type:
type-name?
, type-name-ref
and
type-name-set!
. That isn’t consistently applied as we saw
above with pair/list handling functions.
In the C API and for names in the libc
module I’ve tried to
use the C names where possible. This is partly from the
familiarity aspect in that the user doesn’t have to second guess how,
say, RLIMIT_NPROC
has been renamed and partly because, in many
cases, you can just cut’n’paste from the man page for the C
function.
Style¶
I say I’m not a fan of imposing style on other people but, being human and all, I do get annoyed when someone else’s code is formatted dramatically differently to mine. I wouldn’t say I find it unreadable but I have been known to re-indent code to understand it. Ludicrous!
However, some aspects of style represent completeness or thoroughness. Dot the i’s and cross the t’s.
Any condition or switch
statement should cover all
possibilities… including the “impossible”. It doesn’t take much
development to perturb the code and find yourself in the impossible
default
clause. You’ll be glad the code frames your issue in
uncompromising ways.
In other words, every if
should cover the “else” clause and
C’s switch
should have a default
clause. Or document
why they don’t.
C Indentation Style¶
My own style varies dependent on how enthusiastic I have been about
probing the depths of cc-mode.el
although these days I tend to
find a local minima between laziness, annoyance and satisfaction. My
.emacs
has been reduced to setting k&r
mode (whatever
that involves) and making the c-basic-offset
to be 4.
Although I’m still slightly annoyed.
{
¶
I’ve noted elsewhere that, other than functions, I usually have any
{
on the end of the line of the statement that introduces it:
if (...) {
...
}
I’m sure plenty of people dislike that C coding style but the one-liner nature of Idio means that trailing braces are de règle.
if¶
if
statements always use braces – even if it is simply
return
or break
.
That harks back to my debugging needs where inevitably you need to
identify which of several clauses caused the return so you need to
slip in a printf
statement or some other debug. I put the
infrastructure in first:
if (...) {
/* potential debug with r */
return r;
}
switch¶
A switch
statement should always have a default
clause with
some suitable admonition. This catches developer coding errors which
will get fixed and the default
clause can go back to wasting space
until the next developer comes along.
A rare exception is something like the printing of C types in
src/util.c
where, there, the outer switch only allows C
types into the clause and therefore there are fewer inside as the
clause is already guarded.
That said, I’ve managed to fall foul in forgetting to include one of
the C types further in… So, maybe a default
clause
everywhere?
case¶
I don’t like a case
statement without a break
or a return
– any “fall-through” code should be visually distinct.
switch {
case IDIO_A:
case IDIO_B:
...
break;
case IDIO_K:
case IDIO_L:
...
return;
default:
fprintf (stderr, ...);
break;
}
I think it would be beneficial, in general, if all possible case
alternatives are explicitly listed and in the order they are defined
in.
I fairly often have to introduce some lexical parameters on a
per-case
basis which also breaks my {
on the end of the line
rule. Which leads to the following:
switch {
case IDIO_A:
{
...
}
break;
default:
fprintf (stderr, ...);
break;
}
Notice the break
after the {
block.
structs¶
There’s only really one place where it is required but I have tried to
consistently use _s
for struct tag names and _t
for typedef’ed
names resulting in:
typedef struct idio_foo_s {
size_t size;
} idio_foo_t;
(The place it is required is defining compound types which need to
refer to the “Idio” type, struct idio_s *
, before it has been
typedef’d.)
Accessors¶
I have generally created macros to access the elements of the structures:
#define IDIO_FOO_SIZE(f) ((f) ... .size)
As I repeatedly change my mind about the contents of struct
idio_foo_s
and whether it is encompassed by or allocated in a wider
data structure and therefore whether the accessor is .size
or
->size
.
There should be very limited references to the elements of structures elsewhere in the code.
return¶
I discovered that I’m not very good at getting things right, either the first time or several subsequent times. Consequently almost all of my return clauses look something like:
IDIO r = idio_stuff ();
/* option to print debug with r */
return r;
I rely on the C compiler being able to optimise r
away.
Non-local returns¶
The C code will be invoking siglongjmp(3) but not everyone washing over the code will be able to spot that so especially where the code will never reach we need to clarify that.
There is a special idio_S_notreached
sentinel value for functions
returning an IDIO
value otherwise a similar comment and a
C sentinel value should be returned (unless the function is
void
, of course).
IDIO idio_foo (...)
{
if (...) {
idio_error_printf (...);
return idio_S_notreached;
}
}
or, for a function returning a regular C type, add a “notreached” comment and return a sentinel value:
int idio_bar (...)
{
if (...) {
idio_error_printf (...);
/* notreached */
return -1;
}
}
C Comparison Style¶
I have taken to making the left-hand-side of a comparison whichever value is a constant – the goal to avoid accidental assignments:
if (IDIO_TYPE_VALUE == s[i]) {
and I think it’s probably saved me a couple of times from the ignominy of:
if (s[i] = IDIO_TYPE_VALUE) {
It does require a bit of mental effort and it doesn’t scan as well: I
always have the sense of comparing something that I’ve just calculated
to something else which gives the variable == constant
thought process but is, of course, one character away from disaster.
For some comparisons, notably function calls, I revert to “normal”:
if (system_call (args) == -1) {
Idio Style¶
I’ve written a putative idio-mode.el
(in utils
) for Emacs
which broadly does the right thing.
The most prominent feature about being right is that the basic indent is two spaces.
Indented from what, is another question.
Two spaces, rather than four, because of the Idio/Scheme nature of more frequent use of sub-clauses wrapped by parentheses or braces.
Error Messages¶
Error messages should try to encode some context about the error. Hopefully, the error-raising code will provide some lexical information – and there are a couple of C macros to help developers – but the actual message itself may require some indication of the issue with a particular parameter.
Code Coverage¶
pole pole
I’m working through the C code looking at adding two classes of “tests”.
we want to be able to provoke each and every error generating clause.
These are generally the
test-X-errors
tests.we want to provoke complete code coverage
These are generally the
test-X-coverage
tests.“Complete” is impossible as gcov always marks the closing
}
of a function as “not run” because of thereturn
statement on the previous line.
These tests should be identified in the C code base together with any explanatory text.
For example, some errors are impossible to generate without breaking
the calling code as they fall in the default
clause of a
switch
statement and are in the “impossible” category. It should
be documented as “impossible” but the error raising code should
remain. A developer will fall into it.
Version Numbers¶
Version numbers are particularly vexing. I’m not sure I like any of the existing popular systems.
Of course, Python has PEP 440 – Version Identification and Dependency Specification which states that the canonical public version identifiers MUST comply with:
[N!]N(.N)*[{a|b|rc}N][.postN][.devN]
which, to be positive about it, only has the four optional parts.
The one thing I especially don’t like about the [{a|b|rc}N]
element is that it goes missing when the code is released, or,
possibly, the .rc3
is replaced with a .0
for some N(.N*)
prefix. (It’s not especially clear how they are meant to interact
before you even worry about alphabetics in a version number.)
That strikes me as something hard to rationalise, especially when you’re trying to sort version numbers. You could argue that sorting alphabetic version numbers isn’t a problem except for development teams by which you mean it is a problem just not for some people. The problem needs, *ahem*, sorting so why not go all in?
A numeric status scheme means that the alpha and beta stages
consume .0
and .1
and, via the release candidate, the
actual release gets a .3
(or later) suffix.
M.N.3
suggests a maturity the software probably doesn’t have
for a general access release.
If you’re going to step away from a pure Numeric status then why not
allow the full sequence a
, b
, c
… to give the developers
some flexibility before jumping to r
(for release candidate) and
settling on s
(for stable, say)? You can even have t
versions
and beyond should circumstances require.
This would have the (stable) release as, say, M.N.s.0
– a
reworking of the final release candidate, some M.N.r.3
, say.
I can see detractors declaiming that .s
serves no purpose as it is
(almost certainly) unchanging from here on in. Meh! It would appear
to be redundant to users but so is the trailing .n
as, in
the modern age, no user has any control over it. It will be updated
automatically by the operating system or an administrator.
In the rare occasions a user updates a piece of software themselves they’ll simply be picking the latest bits they can or, following advice, a specific version. No-one cares about the sub-version numbers.
Automated systems determining the latest version will simply sort the components.
In the meanwhile, developers are free to work their way through
M.N.a.n1
through M.N.c.n2
, say, before finally
suggesting there’s something worth the braver user experiencing with
M.N.r.n3
.
In the modern age I’d like to use α
(alpha), β
(beta), γ
(gamma) … through ρ
(rho) and σ
(sigma) with possibilities
for τ
etc. but I imagine that we’d hiccup over storing the version
numbers in filenames in some filesystems in a way that we could
reliably sort them.
*
OpenIndiana exercises a different scheme, this is SunOS in all its glory.
Exciting(?) alternatives might include the version numbers in Solaris Fault Management Resource Identifier such as the FMRI:
pkg://solaris/system/library@0.5.11,5.11-0.175.1.0.0.2.1:20120919T082311Z
which has the version number
0.5.11,5.11-0.175.1.0.0.2.1:20120919T082311Z
.
That is clearly and obviously:
0.5.11
, the component version – which might use (as this example does)uname -r
for operating system components or some dotted release number for non-OS-tied components5.11
, the build version,uname -r
(again)0.175.1.0.0.2.1
, the branch version which itself is:0.175
, the major release number with, here,0.175
meaning Oracle Solaris 111
, the update release number with0
being the first customer shipment0
, the Support Repository Update (SRU) number (bug fixes to paying customers)0
, reserved2
, the SRU build number (or re-spin number for a major release)1
, the nightly build number
20120919T082311Z
a timestamp, in the ISO-8601 style.
Is that too much?
*
CalVer, or Calendar Versioning, uses the date in some way. With a
sysadmin hat on you’d name accumulated (log, backup, whatever) files
with YYYYMMDD[-HHMMSS]
extensions so that some lexicographical
sorting by ls or *
in the shell gets you an easily
time-sorted list.
It also includes the Ubuntu-style release numbers 18.04
(2018,
April) which CentOS 7 picked up on.
I quite like a datestamp, possibly replacing a build or patch number,
although, in practice, you’d need a DNS-serial number-style revision
component YYYYMMDDREV
to handle multiple patches per day. After
all, if you can commit multiple changes per day your automated CI/CD
system will want automated release numbers per day.
Using a timestamp looks a bit more clumsy although, most of the time, it would avoid the revision component (but not guaranteed to do so).
Sorting¶
Sorting version numbers has been noted above and is a minefield. The easier you can make it, the better.
That said, one persistent problem is people using a lexicographical
sort on numeric fields. No, 1.12
is not “less than” 1.2
.
Stupid machine!
On the other hand, if you suggest a simple mechanism where you sort
fields numerically unless a field has an alphabetic then you sort
lexicographically then you bump up against the M.N.0
vs
M.N.rc1
problem. That’s partly what puts me in favour of always
having that alphabetic field, M.N.s.n
will always sort after
M.N.r.n
.
Marketing¶
A few people care about complete version numbers but most people are satisfied with the headline, “marketing” version number.
Python is mostly Python 2 or Python 3. I haven’t cared particularly whether it is Python 3.6 or Python 3.9 but more people will as the chances are there’s some useful functional changes between them where it matters which one you have.
Similarly with Perl. I dabbled a little with Perl 4 before Perl 5 became the norm. Perl 6 did appear on the scene but has been renamed Raku to avoid confusion. In the meanwhile, Perl 5 is Perl 5.10-something, probably. I don’t know, I don’t care.
The lack of concern on my part reflects that the minor changes between releases by and large don’t break changes in user code and that includes extensions using shared libraries.
Ubuntu’s CalVer numbering works well from a marketing point of view, particularly their LTS releases. People will describe 18.04 or 20.04 systems rather than the more exact 18.04.5 or 20.04.3.
A variation on the marketing theme is when the major version number is likely to be unchanging then it gets dropped. Solaris 11 is Solaris 5.11. Java 8 is Java 1.8. (Both these examples are Sun/Oracle – maybe not a consistent theme across the industry).
Microsoft did use tradition numbering for Windows 1.0 through 3.x, dropped that for the year oriented releases 95 (actually v4.0), 98 (v4.10) and 2000 (v5.0) then named releases ME (v4.90), XP (v5.1), Vista (v6.0) and NT (?) and then came back with numbers for 7 (v6.1), 8 (v6.2), 8.1 (v6.3) before resyncing with 10 (v10.0) and 11 (?).
*
Of course, what all of these marketing examples are showing is that the marketing name fixes the allowable internal features. No breaking user code! You can add more features but not break existing stuff.
That means the marketing name aspect of a version number, say,
M.N
, is fixed and developers have to work off that.
Does that mean a single M.N.x
before adding a s.n
? Or a
little more flexibility with M.N.x.y.s.n
?
The latter is getting a touch messy. Why do we have a marketing
number of M.N
anyway?
M.x.y.s.n
is still a mouthful but is in practice M.x.y
– which people are likely to call M.x
anyway – and supports a
fairly arbitrary development and release space with room for patching
with .n
.
Last built at 2024-12-21T07:11:06Z+0000 from 463152b (dev)