Pipelines and I/O Redirection
It’s worth a few words to describe the straight-forward task of implementing pipelines (hint: it’s not that straight-forward) and I/O redirection (not that straight-forward either).
| Operator
In the first instance, the | operator doesn’t behave like most other reader operators which “simply” re-arrange the elements of a list with a view to calling a different but related function with the results.
For reasons almost certainly involving unchecked enthusiasm and limited self-control, the | operator does the whole pipeline malarkey itself. I suppose, technically, you could regard that as simply re-arranging all the elements of the origin list in the reader but it’s a little bit more than that.
It’s all defined in lib/job-control.idio (with duplicated effort in src/job-control.c as I wrote it first and some of it is still used) and mostly from the directions in the info pages on Job Control (in libc or read the online Job Control pages) but there are a few twists and turns.
I’m going to assume that we’re all happy with the basics of running a sub-program in the context of job control which involves prepping a process’ I/O and fiddling with the controlling terminal and foreground job.
That’s a given, right? If not, read the info pages, realise that there’s nothing too surprising other than how straight-forward it is and then read the code for fork-command in lib/job-control.idio and find it’s all reasonably one-for-one.
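By way of illustration, here’s a minimal C sketch of that job control dance for a single foreground command, following the info pages (the names here are mine, not the ones in src/job-control.c):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* a sketch of launching one foreground command under job control */
static void
launch_foreground (char *argv[], int shell_terminal, pid_t shell_pgid)
{
    pid_t pid = fork ();

    if (pid == 0) {
        /* child: new process group which takes the controlling terminal */
        pid_t child = getpid ();
        setpgid (child, child);
        tcsetpgrp (shell_terminal, child);

        /* restore the job control signals the shell itself ignores */
        signal (SIGINT, SIG_DFL);
        signal (SIGQUIT, SIG_DFL);
        signal (SIGTSTP, SIG_DFL);
        signal (SIGTTIN, SIG_DFL);
        signal (SIGTTOU, SIG_DFL);

        execvp (argv[0], argv);
        perror ("execvp");
        exit (1);
    }

    /* parent: mirror the setpgid (avoiding a race), wait, then take the
       terminal back */
    setpgid (pid, pid);

    int status;
    waitpid (pid, &status, WUNTRACED);
    tcsetpgrp (shell_terminal, shell_pgid);
}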
That’s for a simple foreground or background command. When it comes to pipelines there is a long row and a lot of ducks.
Processes
There’s more than meets the eye, here.
From the perspective of the pipeline-generating code, being a template, it doesn’t know what the meaning of any of the sub-processes is:
rhubarb rhubarb | custard custard
It doesn’t know if any of that is non-external command Idio code, say, printf, or is an actual external command, say, echo, or a block of code that mixes external and non-external commands together. It’s just stuff.
All that the pipeline operator can do is arrange for the whole “sub-process joined with a pipe to sub-process” malarkey to be set up and have the appropriate sub-process command expanded in situ.
The only extra we’re going to add is that after the expansion of the command we’ll add an exit 0, otherwise the sub-process will carry on running what the main process was about to do. That will get messy very quickly!
- generate pipe 1
- fork 1
  - child process 1
    - input inherited from parent
    - output into pipe 1
    - expand command, rhubarb rhubarb
    - add exit 0
- fork 2
  - child process 2
    - input from pipe 1
    - output inherited from parent
    - expand command, custard custard
    - add exit 0
As all of that was codified in a template, when it is run the actual sub-process commands, rhubarb rhubarb and custard custard, will each appear in their own child process with their I/O redirected through a joining pipe(2).
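In C terms, what the expanded template boils down to is roughly this sketch (the expanded commands themselves are elided):

#include <stdlib.h>
#include <unistd.h>

/* a sketch of "rhubarb rhubarb | custard custard": one pipe, two forks */
static void
run_pipeline (void)
{
    int fds[2];
    pipe (fds);                         /* pipe 1 */

    if (fork () == 0) {                 /* child process 1 */
        dup2 (fds[1], STDOUT_FILENO);   /* output into pipe 1 */
        close (fds[0]);
        close (fds[1]);
        /* ... expanded command: rhubarb rhubarb ... */
        exit (0);                       /* the deliberately added exit 0 */
    }

    if (fork () == 0) {                 /* child process 2 */
        dup2 (fds[0], STDIN_FILENO);    /* input from pipe 1 */
        close (fds[0]);
        close (fds[1]);
        /* ... expanded command: custard custard ... */
        exit (0);
    }

    /* parent: close both ends and wait for the job as usual */
    close (fds[0]);
    close (fds[1]);
}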
Commands
What if rhubarb rhubarb or custard custard was an external command or a block that included an external command?
Here, when the VM sees a symbol (or string) in functional position it will call on the C code to run a child process, just as if you had typed ls -l at the command prompt.
The only difference is that this child process is a child of one of the child processes of the pipeline. Sub-process 1.1 or 2.1, say.
exit-on-error
All of the above is good apart from handling errors. If, say, rhubarb rhubarb was an external command and it fails then we really want the left hand side of the pipeline to fail (preferably in the same way) even though the process that has failed is a child of one of the sub-processes of the pipeline.
Hence we have the exit-on-error mechanism whereby the default behaviour is for Idio to exit in the same way that the errant sub-process did.
So, if rhubarb rhubarb exits non-zero, or is killed, then we want child process 1 to react to that by exiting in the same way.
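A sketch of that reaction in C, assuming the pipeline child has just waited on the external command it spawned (the function name is made up for illustration):

#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* leave in the same way that our (grand)child did */
static void
mirror_child_status (pid_t pid)
{
    int status;
    waitpid (pid, &status, 0);

    if (WIFSIGNALED (status)) {
        /* die by the same signal so our own parent sees the same thing */
        signal (WTERMSIG (status), SIG_DFL);
        kill (getpid (), WTERMSIG (status));
    }

    if (WIFEXITED (status) &&
        WEXITSTATUS (status) != 0) {
        exit (WEXITSTATUS (status));
    }

    /* otherwise fall through to the template's added "exit 0" */
}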
Winding back upwards, if child process 1, of the pipeline, exits non-zero, or is killed, then the parent Idio process, the one that launched the pipeline through a template, will be signalled and it can handle the situation appropriately.
If rhubarb rhubarb runs normally to completion and exits zero then we’ll hit the deliberately added exit 0 statement which stops the child process continuing back into the main code and, of course, means that the parent Idio process, that launched the pipeline, will see the left hand side exit zero and all is well.
There’ll be a small qualification, here, when we have to handle logical expressions.
pgrp-pipe
There’s a little process coordination trickery ported from Bash – noting that if the race condition occurs in a slick and fast pure C environment then our meandering and inefficient byte-compiled script has no hope!
We need to prevent the first process in the pipeline from starting until all the others are in place – otherwise you can have the first process, the process group leader, run and exit before you manage to start any of the others! That also leaves the small issue of trying to call setpgid(2) with a process group that no longer exists (and, worse, could have been replaced with a different process all together). That’s all bad for business.
The trick is to open another pipe and have the first process block reading from the pipe just before it starts. All the other processes in the pipeline simply close the pipe when they’re about to exec/start. When the last one has closed the pipe the first will have its blocking read(2) return (with zero bytes, ie EOF – we don’t care) and we’re good to go.
Incidentally, as we created the pipe in the main Idio process, it too holds the pipe open so we can be fairly confident that all the child processes are set up and running when the main Idio process closes the (last?) pgrp-pipe write-end.
Well, probably, there’s always a race condition between closing the pgrp-pipe and execve(2)ing or starting to process the Idio code in which the last process can be prevented from running before all the previous processes have been and gone. But we’ve done a decent job, what more can we do?
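For the record, the trick itself is only a few lines of C, something like this sketch (the names are mine):

#include <unistd.h>

int pgrp_pipe[2];               /* created with pipe() in the main Idio
                                   process before any child is forked */

/* the first child, just before it runs its command */
static void
pgrp_pipe_wait (void)
{
    char buf;

    close (pgrp_pipe[1]);       /* we mustn't hold the write end ourselves */
    read (pgrp_pipe[0], &buf, 1);       /* blocks until everyone else has
                                           closed the write end; returns 0,
                                           ie. EOF -- we don't care */
    close (pgrp_pipe[0]);
}

/* every other child, just before it execs or starts its Idio code */
static void
pgrp_pipe_release (void)
{
    close (pgrp_pipe[0]);
    close (pgrp_pipe[1]);
}

/* the main Idio process closes its own copies once the last child is
   forked -- the "(last?)" write-end mentioned above */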
Pipeline Bookkeeping
Most of the work in a pipeline is done with file descriptors whereas the caller is dealing with handles.
This is a reader operator, ie. a template and so our “variables” will be unquoted entities to keep the pipeline as hygienic as possible. Essentially, everything will have a $ in front of it when the action starts.
There is an overall $pipeline-r-fd and $pipeline-w-fd – which the first and last commands use (indirectly via the per-command input and output file descriptors $cmd-r-fd and $cmd-w-fd).
$cmd-pipe is the inter-cmd pipe, created by the left hand command to generate its own output file descriptor, $cmd-w-fd, and then to leave trailing around the loop an input file descriptor, $cmd-r-fd, for the next command in the pipeline.
The $cmd-r-fd is normally the pipe-reader of the previous command’s $cmd-pipe except, of course, the first command whose $cmd-r-fd is $pipeline-r-fd.
The $cmd-w-fd is normally the pipe-writer of $cmd-pipe except, of course, the last command whose $cmd-w-fd is $pipeline-w-fd.
The $pipeline-[rw]-fd, the overarching input and output file descriptors of the entire pipeline, could be:
- entities figured out by (stdin-fileno) and (stdout-fileno), ie. regular C STDIN_FILENO and STDOUT_FILENO
- whatever (current-input-handle) and (current-output-handle) actually are
  - either of which could be string-handles (necessitating temporary files and some post-pipeline content recovery)
- actual pipes!
- future stuff (sockets are the obvious missing contender)
Meta-Commands
We can extend this model with some meta-commands, that is we can prefix the pipeline with a symbol to indicate some more bespoke behaviour is required.
Prefixing the pipeline continues an overall style of cmd args where, in this case, args happens to be an entire pipeline.
You can use a number of meta-commands per pipeline although some conflict. The effectiveness depends on which is checked for first after the pipeline is successfully launched.
You have a similar concept to meta-commands in Bash in that a pipeline can be preceded by time [-p].
fg-job / bg-job
'fg-job is the default behaviour and is added for completeness.
'bg-job will run the pipeline in the background. Obviously, if the pipeline reads from its stdin or writes to its stdout while backgrounded then the process group will get an appropriate signal, SIGTTIN or SIGTTOU, respectively.
Note
These are not like, say, Bash’s fg and bg commands which look at the set of outstanding jobs and run the targeted one in the foreground or background.
bg-job is more closely aligned with the function of the shell’s & operator, as in:
cmd args &
where in Idio you might write:
bg-job cmd args
pipe-into / pipe-from
If the caller passed 'pipe-into or 'pipe-from as the first argument of the pipeline then $pipeline-r-fd or $pipeline-w-fd will be real pipes and the other ends of these are returned as pipe handles, $pipeline-w-phn or $pipeline-r-phn, for the caller to write into and read from the pipeline.
In practice, a $use-w-pipe or $use-r-pipe flag is set and we set $pipeline-[rw]-fd to the appropriate end of a pipe(2) and assign $pipeline-[wr]-phn to a pipe handle constructed from the other end to be returned to the caller.
These two meta-commands set the asynchronous command flag on the job.
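As a rough C analogue (not the actual Idio implementation), pipe-from amounts to the familiar popen(3)-style arrangement where the caller keeps the read end; pipe-into is the mirror image:

#include <unistd.h>

/* a sketch of pipe-from: the pipeline writes into the pipe, the caller
   gets the other end back (Idio wraps it in a pipe handle, here it is
   just a file descriptor) */
static int
pipe_from (void (*run_pipeline) (void))
{
    int fds[2];
    pipe (fds);

    if (fork () == 0) {
        dup2 (fds[1], STDOUT_FILENO);   /* $pipeline-w-fd */
        close (fds[0]);
        close (fds[1]);
        run_pipeline ();
        _exit (0);
    }

    close (fds[1]);
    return fds[0];                      /* becomes $pipeline-r-phn */
}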
A reasonable analogy would be something like "|cmd args" or "cmd args|" in Perl. The second case is anathema to us as the syntactic distinguisher, the trailing |, is at the end of the expression. That breaks our cmd args model.
I don’t have a particularly good answer for that. Maybe we need the reader to support:
- a regular pipe
- syntactic sugar for pipe-into
- syntactic sugar for pipe-from
- reserved
named-pipe-into / named-pipe-from
If the caller passed 'named-pipe-into or 'named-pipe-from as the first argument of the pipeline then $pipeline-r-fd or $pipeline-w-fd will be real pipes and the other ends of these are returned as pathnames, $pipeline-w-phn or $pipeline-r-phn, for the caller to open and write into and read from the pipeline.
In practice, the same $use-w-pipe or $use-r-pipe flag is set to 'named and we figure out a system dependent pathname to return.
See the section on Process Substitution for more details.
These two meta-commands set the asynchronous command flag on the job.
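I’m guessing at what “system dependent” means in practice but, in the same vein as Bash’s Process Substitution, it is presumably something along these lines (the helper and the fallback path are hypothetical):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* turn a pipe's file descriptor into a pathname the command can open */
static void
named_pipe_pathname (int fd, char *buf, size_t buflen)
{
    if (access ("/dev/fd", X_OK) == 0) {
        /* most systems: name the inherited fd directly */
        snprintf (buf, buflen, "/dev/fd/%d", fd);
    } else {
        /* hypothetical fallback: a FIFO the command opens by name */
        snprintf (buf, buflen, "/tmp/idio-np-%d-%d", getpid (), fd);
        mkfifo (buf, 0600);
    }
}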
Following the above reader forms, perhaps:
- syntactic sugar for named-pipe-into
- syntactic sugar for named-pipe-from
These reader forms clearly only differ by a leading | symbol from their non-named variants. Not much.
I don’t feel too bad about that, though. This form, the named form, implementing Process Substitution, is clearly a match for the Bash-style Process Substitution. The non-named variant has a visual mnemonic: |>, I feel, says pipe into.
collect-output
This is Command Substitution in that we’ll set $pipeline-w-fd to a temporary file to collect the output. We’ll use libc/make-tmp-fd which calls mkstemp(3) and removes the temporary file name before returning the file descriptor.
When we’re done with the pipeline, we’ll slurp the contents of the temporary file back into a string handle then return the string contents of the string handle back to the caller.
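A sketch of those two steps in C (the real make-tmp-fd lives in the libc module; the slurping here cheerfully ignores short reads and errors):

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* mkstemp(3) then unlink the name so only the file descriptor survives */
static int
make_tmp_fd (void)
{
    char name[] = "/tmp/idio-XXXXXX";
    int fd = mkstemp (name);

    unlink (name);
    return fd;
}

/* after the job: rewind and read the collected output back */
static char *
slurp_fd (int fd)
{
    off_t len = lseek (fd, 0, SEEK_END);
    char *s = malloc (len + 1);

    lseek (fd, 0, SEEK_SET);
    read (fd, s, len);
    s[len] = '\0';
    return s;
}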
collect-output is always run as a foreground command – we’re waiting for the results before we can continue, after all!
This meta-command sets the asynchronous command flag on the job.
time
All jobs have start and end statistics recorded including gettimeofday(2) and getrusage(2) for “self” and “children”.
The time meta-command sets a report-timing flag which is tested by wait-for-job which can report much like Bash’s time prefix.
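Something along these lines is presumably what gets recorded at either end of the job (the struct is purely illustrative, not Idio’s actual data structure):

#include <sys/resource.h>
#include <sys/time.h>

/* illustrative per-job timing data */
struct job_stats {
    struct timeval tv_start, tv_end;    /* tv-real is the difference */
    struct rusage ru_self, ru_children; /* "self" and "children" */
};

static void
job_started (struct job_stats *js)
{
    gettimeofday (&js->tv_start, NULL);
}

static void
job_ended (struct job_stats *js)
{
    gettimeofday (&js->tv_end, NULL);
    getrusage (RUSAGE_SELF, &js->ru_self);
    getrusage (RUSAGE_CHILDREN, &js->ru_children);      /* see the caveat below */
}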
I use time in Bash all the, er, time although it wasn’t until I added this meta-command that I came to realise that it’s not all it’s cracked up to be.
The problem is that getrusage(2)’s RUSAGE_CHILDREN represents all descendant processes that have terminated and been waited on. (It does not include any not yet terminated and waited on descendants.)
So, here at least, we assume it was the job we just ran but maybe it was also the other jobs that just finished and that wait-for-job has run for off the back of a SIGCHLD:
Return resource usage statistics for all children of the calling process that have terminated and been waited for.
—getrusage(2)
We’re the calling process and we’ve lots of children. Hmm.
I dunno. I guess if you’re pretty confident that you only had one thing running (and now completed) then this usage is probably fairly accurate.
If you’d fired off a dozen jobs in the background that completed whilst this job was running then you can be less confident (as can wait-for-job if it is reporting on those jobs).
On the plus side, tv-real (the elapsed real time) should be accurate. Erm, except that the end time is dependent on when wait-for-job wakes up.
So, use it as a hand-wavy “about this much time”.
Meta-Command Preferences
The meta-commands are utilised in the following order (in other words, if you have set two conflicting meta-commands then the first listed is used):
- named-pipe-into returning a pathname
- pipe-into returning an output pipe handle
- named-pipe-from returning a pathname
- pipe-from returning an input pipe handle
- collect-output which always runs the job in the foreground, returning the stripped collected output, discarding trailing newlines, U+000A (LINE FEED)
- fg-job returning the job status (#t or #f)
- bg-job returning #t
Further Meta-Commands
With these sorts of options to hand we can start to imagine some more exotic forms.
Suppose we wanted to return a named-pipe pathname to a command and yet retain control over what is written by also getting the other end of the pipe returned to us?
You can imagine a sort of:
diff file <( ... )
where, instead of ... being an asynchronous command, we somehow get an output pipe handle to give diff some input through the named pipe of that second argument.
Well, we can’t return two things and certainly not to two different people but maybe there’s a trick we can pull?
Suppose we passed a function expecting a single argument (the putative output pipe handle) as ...? Under those circumstances, the code could detect that the value passed was a function and call it with the appropriate end of the pipe.
The function would be expected to assign the value to a variable in scope, a trick with “private variables” we’ve played before.
The meta-command code can continue on and return the named pipe as an argument to the command as it would have done normally.
At this point however, we get a disturbance in the force.
diff wants to run in the foreground reading from /dev/fd/X or whatever as it would have done normally and we now want to run in the foreground in order to supply diff with some input.
One or other is going to have to be run in the background which is going to be problematic if whichever it is wants to write to the terminal. Let’s go with the command being run in the background – a complete volte-face from previous behaviour – but we are doing something exotic!
You can imagine something like:
oph := #f
define (set-oph ph) {
  oph = ph
}
diff file <( set-oph )
hprintf oph "hello\n"
close-handle oph
Now we have to wait for and get the exit status of diff – assuming it hasn’t been stopped for trying to write to the terminal.
This mechanism is not too far removed from the various “Coprocess” facilities in newer versions of Bash and Ksh although we are retaining a connection with a specific argument in the command’s arguments.
Coprocesses themselves should be equally feasible.
Asynchronous Command House-Keeping
There’s a tiny bit more annoyance with asynchronous commands, in particular with the pipe we create to pass data into or out of the asynchronous command.
This is going to get messy so let’s take an example and work it through:
oph := pipe-into {
  cat > file
  do something else
}
We obviously create the connectivity pipe in the parent Idio as it is that parent which is going to utilise the pipe (even if that is just giving the pipe (handle or name) to someone else, it, the parent, needs the one end). Here, the parent wants the write end of the connectivity pipe and will return an output pipe handle for the user to write into.
We now fork(2) a sub-Idio to run through the asynchronous command itself. It will dutifully remap its own stdin or stdout according to the asynchronous connectivity and close the corresponding read or write end of the pipe. So, here, the sub-Idio will dup2(2) the read end of the connectivity pipe to stdin for the asynchronous commands and will close(2) the original connectivity pipe file descriptor (10, or whatever, who cares?).
We now run the asynchronous commands. cat will read its stdin and print the contents to stdout which we happen to have redirected to file.
But wait, cat hangs. What’s going on? Well, cat reads from its stdin until it gets an EOF. If we strace it we see it happily blocked reading from file descriptor 0. “ENOEOF” – if we could make up new error codes.
The problem, here, is that cat’s stdin is a pipe and the problem with pipes is that they remain open, without an EOF indicator, whilst anyone still holds a write end open.
That’s where we’ve made our mistake. The parent Idio has the write end open – otherwise we’re not going to do much “piping into” – and then we forked. The sub-Idio also has the write end open. In fact, if you nose around carefully, cat has the write end open too as it was forked from the sub-Idio. D’oh!
So, our extra bit of house-keeping for asynchronous commands is to ensure that the sub-Idio closes down the other end of the connectivity pipe from the one it needs. Or, another way of describing it, is that the sub-Idio needs to close the end of the connectivity pipe that the parent Idio will be keeping open.
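In file descriptor terms the fix for the example above is one extra close(2) in the sub-Idio, something like this sketch:

#include <unistd.h>

/* the sub-Idio's side of a pipe-into connection */
static void
pipe_into_child (int fds[2])
{
    dup2 (fds[0], STDIN_FILENO);        /* read end becomes stdin */
    close (fds[0]);                     /* the original fd, 10 or whatever */
    close (fds[1]);                     /* the crucial bit: drop the write
                                           end the parent Idio keeps open,
                                           otherwise cat never sees EOF */
    /* ... run "cat > file", "do something else" ... */
}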
Simple Commands
Reducing the pipeline down to its simplest case causes us a problem. Consider:
Idio> pipe-from cat file
There is no | symbol so no | reader operator gets invoked. We will now try to find and execute the external command pipe-from which will probably(!) fail.
To this end we need to rustle up some single-instance meta-commands: fg-job, bg-job, pipe-into, pipe-from and collect-output, most of which are wrappers to a generic fork-command which runs job control much like the main pipeline operator does.
collect-output is slightly different in that it runs a combination of with-output-to (a variation on with-handle-redir, below) and fg-job to force the command into a sub-program.
Exiting
With our bookkeeping hats on we do have a minor problem when we exit in that there could be jobs lying around for which we haven’t called wait(2). This is particularly prevalent if they were run immediately before we exit.
What should we do with them? The info pages skip over this but we can take a look at what Bash does which is to terminate all stopped jobs: you can follow exit_shell() in shell.c to end_job_control() in jobs.c to terminate_stopped_jobs().
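In outline, that Bash behaviour amounts to something like this sketch: stopped jobs get a SIGTERM and a SIGCONT (so they can actually see the SIGTERM) while running background jobs are left alone.

#include <signal.h>
#include <sys/types.h>

/* a sketch of terminate_stopped_jobs(): pgids is the list of stopped
   jobs' process groups */
static void
terminate_stopped_jobs (pid_t *pgids, int n)
{
    for (int i = 0; i < n; i++) {
        kill (-pgids[i], SIGTERM);
        kill (-pgids[i], SIGCONT);
    }
}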
Of course that means that any backgrounded not-stopped jobs continue.
When I typed that, I’ll be honest, I had a mental panic thinking, “What a disaster! Are we not in control?” as I’d not thought about the functionality using that expression and couldn’t see the wood for the trees.
Of course, running commands in the background is the very essence of most initialisation scripts. Their only purpose is to prepare the ground for the daemon to be kicked off then exit themselves.
Of course, with our systemd hats on, there’s a preference for initialisation scripts to not run daemons in the background so that systemd can better keep an eye on them.
In addition the system will terminate any outstanding Asynchronous Commands.
HUPing
In a similar vein, if we are sent a SIGHUP signal then we should pass it on to all of our jobs and then quit.
I/O Redirection
Our problem with I/O redirection isn’t the redirection, mostly, but that we don’t, in general, use file descriptors. In fact, we happily deal with string handles which don’t have any file component whatsoever.
Note that:
- this only works for the standard I/O streams, stdin, stdout and stderr. The underlying problem being that I didn’t want to deal with, say, Bash’s arbitrary redirection syntax, [n]>[word], which has too much ambiguity for the Idio reader.
- this only works after the command and arguments, as in ls -l > osh. Again, Bash’s syntax is too flexible (for me).
We have two forms of I/O redirection:
- actual I/O redirection, as in >, which we deal with in with-handle-redir
- I/O duplication, as in >&, which we deal with in with-handle-dup
These two should probably be merged via some clever coding.
with-handle-redir
The actual I/O redirection operators, <, > and 2>, determine that they affect a type of 'input, 'output or 'error and take the first expression after as the target.
They then package the command before the redirection statement in the form of a thunk and pass it to with-handle-redir with the type and target arguments.
with-handle-redir then defines a bunch of accessor methods based on the type and then looks at the target:
- if target is the correct direction (input or output!) file descriptor handle then we can use it
  Note that all file handles are (by definition?) file descriptor handles. This also covers pipe handles which are also file descriptor handles.
- otherwise if target is a file descriptor handle then it must be in error (as it should have passed in the previous test)
- if target is the correct direction (input or output!) string handle then we can use it
- otherwise if target is a string handle then it must be in error (as it should have passed in the previous test)
- if target is a string then it is deemed a filename and opened (for input or output)
- if target is #n then it is a nickname for “/dev/null” which is opened (for input or output)
- otherwise, target is an invalid value
Assuming we are still going then we set the appropriate handle (set-input-handle!, set-output-handle! or set-error-handle!) and run the thunk.
Finally, set the handle back.
Notice that we only set the handle. We don’t deal in file descriptors, here. If (and only if?) we get round to trying to actually run an external command do we actually care about file descriptors.
If we had redirected the output of some expression into a string handle and the expression was some Idio code then there’s no need for any file descriptors to get involved. Only if the expression reduced down to an external command will we start considering file descriptors. In the particular case of a string handle, we need to indirect through a temporary file as well.
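For example, a hypothetical C helper for that string handle case: give the external command a real file descriptor by bouncing the string’s contents through an unlinked temporary file.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* hypothetical: turn an in-memory "string handle" into an fd that an
   external command can read as its stdin (error handling elided) */
static int
string_to_fd (const char *s)
{
    char name[] = "/tmp/idio-sh-XXXXXX";
    int fd = mkstemp (name);

    unlink (name);
    write (fd, s, strlen (s));
    lseek (fd, 0, SEEK_SET);
    return fd;                  /* dup2() this onto STDIN_FILENO before exec */
}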
with-handle-dup
The I/O duplication operators, <&, >& and 2>&, determine that they affect a type of 'input, 'output or 'error and take the first expression after as the target.
They then package the command before the redirection statement in the form of a thunk and pass it to with-handle-dup with the type and target arguments.
with-handle-dup then defines a bunch of accessor methods based on the type and then looks at the target:
- if target is the correct direction (input or output!) file descriptor handle then we can use it
  Note that all file handles are (by definition?) file descriptor handles. This also covers pipe handles which are also file descriptor handles.
- otherwise if target is a file descriptor handle then it must be in error (as it should have passed in the previous test)
- if target is the correct direction (input or output!) string handle then we can use it
- otherwise if target is a string handle then it must be in error (as it should have passed in the previous test)
- if target is a C/int or fixnum with a value of 0, 1 or 2 then the current input, output or error handle is used as the target handle
- otherwise, target is an invalid value
Assuming we are still going then we set the appropriate handle (set-input-handle!, set-output-handle! or set-error-handle!) and run the thunk.
Finally, set the handle back.
Again, we’re only dealing with handles, here, no file descriptors.