Pipelines and I/O Redirection

It’s worth a few words to describe the straight-forward task of implementing pipelines (hint: it’s not that straight-forward) and I/O redirection (not that straight-forward either).

| Operator

In the first instance, the | operator doesn’t behave like most other reader operators which “simply” re-arrange the elements of a list with a view to calling a different but related function with the results.

For reasons almost certainly involving unchecked enthusiasm and limited self-control, the | operator does the whole pipeline malarkey itself. I suppose, technically, you could regard that as simply re-arranging all the elements of the origin list in the reader but it’s a little bit more than that.

It’s all defined in lib/job-control.idio (with duplicated effort in src/job-control.c as I wrote that first and some of it is still used) and mostly from the directions in the info pages on Job Control (in libc or read the online Job Control pages) but there are a few twists and turns.

I’m going to assume that we’re all happy with the basics of running a sub-program in the context of job control which involves prepping a process’ I/O and fiddling with the controlling terminal and foreground job.

That’s a given, right? If not, read the info pages, realise that there’s nothing too surprising other than how straight-forward it is and then read the code for fork-command in lib/job-control.idio and find it’s all reasonably one-for-one.

That’s for a simple foreground or background command. When it comes to pipelines there is a long row and a lot of ducks.

Processes

There’s more than meets the eye, here.

From the perspective of the pipeline generating code, being a template, it doesn’t know what any of the sub-processes mean:

rhubarb rhubarb | custard custard

It doesn’t know if any of that is non-external command Idio code, say, printf, or is an actual external command, say, echo, or a block of code that mixes external and non-external commands together. It’s just stuff.

All that the pipeline operator can do is arrange for the whole “sub-process joined with a pipe to sub-process” malarkey to be set up and have the appropriate sub-process command expanded in situ.

The only extra we’re going to add is that after the expansion of the command we’ll add an exit 0 otherwise the sub-process will carry on running what the main process was about to do. That will get messy very quickly!

  • generate pipe 1

  • fork 1

    • child process 1

      • input inherited from parent

      • output into pipe 1

      • expand command, rhubarb rhubarb

      • add exit 0

  • fork 2

    • child process 2

      • input from pipe 1

      • output inherited from parent

      • expand command, custard custard

      • add exit 0

As all of that was codified in a template, when it is run the actual sub-process commands, rhubarb rhubarb and custard custard, will each appear in their own child process with their I/O redirected through a joining pipe(2).
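
As a rough illustration of the underlying POSIX mechanics, here is a minimal C sketch of a two-command pipeline. It assumes both sides exec external commands (echo and cat standing in for the rhubarb and the custard) rather than running expanded Idio code, and it skips all of the job control and error handling the real code has to do:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main (void)
{
    int fds[2];

    if (pipe (fds) == -1) {
        perror ("pipe");
        exit (1);
    }

    pid_t lhs = fork ();
    if (lhs == 0) {
        /* child process 1: input inherited, output into the pipe */
        dup2 (fds[1], STDOUT_FILENO);
        close (fds[0]);
        close (fds[1]);
        execlp ("echo", "echo", "rhubarb rhubarb", (char *) NULL);
        _exit (127);            /* cf. the deliberately added exit 0 */
    }

    pid_t rhs = fork ();
    if (rhs == 0) {
        /* child process 2: input from the pipe, output inherited */
        dup2 (fds[0], STDIN_FILENO);
        close (fds[0]);
        close (fds[1]);
        execlp ("cat", "cat", (char *) NULL);
        _exit (127);
    }

    /* the parent must close both ends or the reader never sees EOF */
    close (fds[0]);
    close (fds[1]);

    waitpid (lhs, NULL, 0);
    waitpid (rhs, NULL, 0);
    return 0;
}

Note the parent closing both pipe ends: forgetting that is exactly the sort of problem that turns up again under Asynchronous Command House-Keeping below.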

Commands

What if rhubarb rhubarb or custard custard were an external command or a block that included an external command?

Here, when the VM sees a symbol (or string) in functional position it will call on the C code to run a child process, just as if you had typed ls -l at the command prompt.

The only difference is that this child process is a child of one of the child processes of the pipeline. Sub-process 1.1 or 2.1, say.

exit-on-error

All of the above is good apart from handling errors. If, say, rhubarb rhubarb was an external command and it fails then we really want the left hand side of the pipeline to fail (preferably in the same way) even though the process that has failed is a child of one of the sub-processes of the pipeline.

Hence we have the exit-on-error mechanism whereby the default behaviour is for Idio to exit in the same way that the errant sub-process did.

So, if rhubarb rhubarb exits non-zero, or is killed, then we want child process 1 to react to that by exiting in the same way.

Winding back upwards, if child process 1, of the pipeline, exits non-zero, or is killed, then the parent Idio process, the one that launched the pipeline through a template, will be signalled and it can handle the situation appropriately.

If rhubarb rhubarb runs normally to completion and exits zero then we’ll hit the deliberately added exit 0 statement which stops the child process continuing back into the main code and, of course, means that the parent Idio process, which launched the pipeline, will see the left hand side exit zero and all is well.
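
As a hedged C sketch of that “exit in the same way” idea, here is roughly what a wrapper child could do after waiting for its own failed child; the function name is mine and the real signal handling in Idio is rather more involved:

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for a child and then mirror its fate: exit with the same status
 * or kill ourselves with the same signal. */
static void
mirror_child_status (pid_t pid)
{
    int status;

    if (waitpid (pid, &status, 0) == -1) {
        _exit (1);
    }

    if (WIFEXITED (status)) {
        _exit (WEXITSTATUS (status));
    } else if (WIFSIGNALED (status)) {
        int sig = WTERMSIG (status);

        signal (sig, SIG_DFL);  /* ensure the signal is fatal to us too */
        kill (getpid (), sig);
        _exit (128 + sig);      /* belt and braces if it somehow wasn't */
    }
}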

There’ll be a small qualification, here, when we have to handle logical expressions.

pgrp-pipe

There’s a little process coordination trickery ported from Bash – noting that if the race condition occurs in a slick and fast pure C environment then our meandering and inefficient byte-compiled script has no hope!

We need to prevent the first process in the pipeline from starting until all the others are in place – otherwise you can have the first process, the process group leader, run and exit before you manage to start any of the others! That also leaves the small issue of trying to call setpgid(2) with a process group that no longer exists (and, worse, could have been replaced with a different process altogether). That’s all bad for business.

The trick is to open another pipe and have the first process block reading from the pipe just before it starts. All the other processes in the pipeline simply close the pipe when they’re about to exec/start. When the last one has closed the pipe the first will have its blocking read(2) return (with zero bytes, ie EOF – we don’t care) and we’re good to go.

Incidentally, as we created the pipe in the main Idio process, it too holds the pipe open so we can be fairly confident that all the child processes are set up and running when the main Idio process closes the (last?) pgrp-pipe write-end.

Well, probably, there’s always a race condition between closing the pgrp-pipe and execve(2)ing or starting to process the Idio code in which the last process can be prevented from running before all the previous processes have been and gone. But we’ve done a decent job, what more can we do?
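
For flavour, a minimal C sketch of the trick; the function names are illustrative rather than lifted from the job-control code:

#include <unistd.h>

int pgrp_pipe[2];               /* created in the main process with pipe(pgrp_pipe) */

/* First child (the process group leader): block until everyone else is in place. */
static void
pgrp_pipe_wait (void)
{
    char buf;

    close (pgrp_pipe[1]);                       /* we will never write */
    while (read (pgrp_pipe[0], &buf, 1) > 0) {  /* returns 0 (EOF) once every writer has closed */
        ;
    }
    close (pgrp_pipe[0]);
}

/* Every other child (and, last of all, the parent): release the leader. */
static void
pgrp_pipe_release (void)
{
    close (pgrp_pipe[0]);
    close (pgrp_pipe[1]);
}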

Pipeline Bookkeeping

Most of the work in a pipeline is done with file descriptors whereas the caller is dealing with handles.

There is an overall $pipeline-r-fd and $pipeline-w-fd – which the first and last commands use (indirectly via the per-command input and output file descriptors $cmd-r-fd and $cmd-w-fd).

$cmd-pipe is the inter-cmd pipe, created by the left hand command to generate its own output file descriptor, $cmd-w-fd, and then to leave trailing around the loop an input file descriptor, $cmd-r-fd, for the next command in the pipeline.

The $cmd-r-fd is normally the pipe-reader of the previous command’s $cmd-pipe except, of course, the first command whose $cmd-r-fd is $pipeline-r-fd.

The $cmd-w-fd is normally the pipe-writer of $cmd-pipe except, of course, the last command whose $cmd-w-fd is $pipeline-w-fd.
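
In C terms the chaining discipline looks roughly like this; the variable names mirror the Idio ones above but the code is only a sketch of the bookkeeping, not the implementation:

#include <unistd.h>

/* Sketch: wire ncmds commands up between pipeline_r_fd and pipeline_w_fd. */
static void
wire_pipeline (int ncmds, int pipeline_r_fd, int pipeline_w_fd)
{
    int cmd_r_fd = pipeline_r_fd;       /* the first command reads the pipeline's input */

    for (int i = 0; i < ncmds; i++) {
        int cmd_pipe[2] = { -1, -1 };
        int cmd_w_fd;

        if (i < ncmds - 1) {
            pipe (cmd_pipe);            /* the inter-command pipe */
            cmd_w_fd = cmd_pipe[1];     /* this command writes into it... */
        } else {
            cmd_w_fd = pipeline_w_fd;   /* ...except the last, which writes the pipeline's output */
        }

        /* fork; in the child dup2(cmd_r_fd, 0) and dup2(cmd_w_fd, 1) then run the command */

        cmd_r_fd = cmd_pipe[0];         /* left trailing around the loop for the next command */
    }
}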

The $pipeline-[rw]-fd, the overarching input and output file descriptors of the entire pipeline, could be:

  1. entities figured out by (stdin-fileno) and (stdout-fileno), i.e.:

    • regular C STDIN_FILENO and STDOUT_FILENO

    • whatever (current-input-handle) and (current-output-handle) actually are

      • either of which could be string handles (necessitating temporary files and some post-pipeline content recovery)

  2. actual pipes!

  3. future stuff (sockets are the obvious missing contender)

Meta-Commands

We can extend this model with some meta-commands, that is we can prefix the pipeline with a symbol to indicate some more bespoke behaviour is required.

Prefixing the pipeline continues an overall style of cmd args where, in this case, args happens to be an entire pipeline.

You can use a number of meta-commands per pipeline although some conflict. The effectiveness depends on which is checked for first after the pipeline is successfully launched.

You have a similar concept to meta-commands in Bash in that a pipeline can be preceded by time [-p].

fg-job / bg-job

'fg-job is the default behaviour and is added for completeness.

'bg-job will run the pipeline in the background. Obviously, if the pipeline reads from its stdin or writes to its stdout while backgrounded then the process group will get an appropriate signal, SIGTTIN or SIGTTOU, respectively.

Note

These are not like, say, Bash’s fg and bg commands which look at the set of outstanding jobs and run the targeted one in the foreground or background.

bg-job is more closely aligned with the function of the shell’s & operator, as in:

cmd args &

where in Idio you might write:

bg-job cmd args

pipe-into / pipe-from

If the caller passed 'pipe-into or 'pipe-from as the first argument of the pipeline then $pipeline-r-fd or $pipeline-w-fd will be real pipes and the other ends of these are returned as pipe handles, $pipeline-w-phn or $pipeline-r-phn, for the caller to write into and read from the pipeline.

In practice, a $use-w-pipe or $use-r-pipe flag is set and we set $pipeline-[rw]-fd to the appropriate end of a pipe(2) and assign $pipeline-[wr]-phn to a pipe handle constructed from the other end to be returned to the caller.
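
Underneath, the shape is much the same as a hand-rolled popen(3). A hedged C sketch of the pipe-into case, with made-up names and a /bin/sh -c child standing in for the expanded pipeline:

#include <sys/types.h>
#include <unistd.h>

/* pipe-into, roughly: the job's stdin comes from a pipe whose write end
 * is handed back to the caller. */
static int
pipe_into (const char *cmd)
{
    int fds[2];

    if (pipe (fds) == -1) {
        return -1;
    }

    pid_t pid = fork ();
    if (pid == 0) {
        dup2 (fds[0], STDIN_FILENO);    /* in effect, $pipeline-r-fd */
        close (fds[0]);
        close (fds[1]);                 /* see Asynchronous Command House-Keeping below */
        execl ("/bin/sh", "sh", "-c", cmd, (char *) NULL);
        _exit (127);
    }

    close (fds[0]);                     /* the parent only writes */
    return fds[1];                      /* wrapped as $pipeline-w-phn for the caller */
}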

These two meta-commands set the asynchronous command flag on the job.

A reasonable analogy would be something like "|cmd args" or "cmd args|" in Perl. The second case is anathema to us as the syntactic distinguisher, the |, is at the end of the expression. That breaks our cmd args model.

I don’t have a particularly good answer for that. Maybe we need the reader to support:

|

a regular pipe

|>{...} |>(...)

syntactic sugar for pipe-into

|<{...} |<(...)

syntactic sugar for pipe-from

||

reserved

named-pipe-into / named-pipe-from

If the caller passed 'named-pipe-into or 'named-pipe-from as the first argument of the pipeline then $pipeline-r-fd or $pipeline-w-fd will be real pipes and the other ends of these are returned as pathnames, $pipeline-w-phn or $pipeline-r-phn, for the caller to open and write into and read from the pipeline.

In practice, the same $use-w-pipe or $use-r-pipe flag is set to 'named and we figure out a system dependent pathname to return. See the section on Process Substitution for more details.

These two meta-commands set the asynchronous command flag on the job.
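
One system dependent way of conjuring that pathname is the /dev/fd filesystem (a FIFO via mkfifo(3) being the usual fallback). A hedged C sketch of the named-pipe-from shape, with illustrative names rather than the actual implementation:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* named-pipe-from, roughly: run cmd with its stdout going into a pipe and
 * hand the caller a pathname, /dev/fd/N, naming the read end. */
static int
named_pipe_from (const char *cmd, char *path, size_t len)
{
    int fds[2];

    if (pipe (fds) == -1) {
        return -1;
    }

    pid_t pid = fork ();
    if (pid == 0) {
        dup2 (fds[1], STDOUT_FILENO);
        close (fds[0]);
        close (fds[1]);
        execl ("/bin/sh", "sh", "-c", cmd, (char *) NULL);
        _exit (127);
    }

    close (fds[1]);                             /* only the child writes */
    snprintf (path, len, "/dev/fd/%d", fds[0]); /* assumes a /dev/fd filesystem */
    return fds[0];
}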

Following the above reader forms, perhaps:

>{...} >(...)

syntactic sugar for named-pipe-into

<{...} <(...)

syntactic sugar for named-pipe-from

These reader forms clearly only differ by a leading | symbol from their non-named variants. Not much.

I don’t feel too bad about that, though. This form, the named form, implementing Process Substitution, is clearly a match for the Bash style of Process Substitution. The non-named variant has a visual mnemonic: |>, I feel, says pipe into.

collect-output

This is Command Substitution in that we’ll set $pipeline-w-fd to a temporary file to collect the output. We’ll use libc/make-tmp-fd which calls mkstemp(3) and removes the temporary file name before returning the file descriptor.

When we’re done with the pipeline, we’ll slurp the contents of the temporary file back into a string handle then return the string contents of the string handle back to the caller.
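
A rough C sketch of those two steps, the mkstemp(3)-and-unlink trick and the post-pipeline slurp; the helper names are mine and the template path is only illustrative:

#include <stdlib.h>
#include <unistd.h>

/* Roughly what libc/make-tmp-fd does: create a temporary file and remove
 * its name so only the file descriptor survives. */
static int
make_tmp_fd (void)
{
    char template[] = "/tmp/idio-XXXXXX";       /* illustrative template */
    int fd = mkstemp (template);

    if (fd != -1) {
        unlink (template);                      /* anonymous from here on */
    }
    return fd;
}

/* After the job completes: rewind and slurp the collected output back. */
static ssize_t
slurp_tmp_fd (int fd, char *buf, size_t len)
{
    lseek (fd, 0, SEEK_SET);
    return read (fd, buf, len);
}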

collect-output is always run as a foreground command – we’re waiting for the results before we can continue, after all!

This meta-command sets the asynchronous command flag on the job.

time

All jobs have start and end statistics recorded including gettimeofday(2) and getrusage(2) for “self” and “children”.

The time meta-command sets a report-timing flag which is tested by wait-for-job which can report much like Bash’s time prefix.
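
As a hedged C sketch of the sort of statistics involved (not the actual wait-for-job code), the wall-clock and children’s rusage deltas around a wait might be computed like this:

#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Sketch: wall-clock and children's rusage deltas around waiting for a job,
 * given the start values recorded when the job was launched. */
static void
report_timing (pid_t pid, struct timeval *t0, struct rusage *ru0)
{
    struct timeval t1;
    struct rusage ru1;
    int status;

    waitpid (pid, &status, 0);
    gettimeofday (&t1, NULL);
    getrusage (RUSAGE_CHILDREN, &ru1);  /* everything terminated and waited for, not just pid */

    double real = (t1.tv_sec - t0->tv_sec) + (t1.tv_usec - t0->tv_usec) / 1e6;
    double user = (ru1.ru_utime.tv_sec - ru0->ru_utime.tv_sec) +
                  (ru1.ru_utime.tv_usec - ru0->ru_utime.tv_usec) / 1e6;

    printf ("real %.3fs  user %.3fs\n", real, user);
}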

I use time in Bash all the, er, time although it wasn’t until I added this meta-command that I came to realise that it’s not all it’s cracked up to be.

The problem is that getrusage(2)’s RUSAGE_CHILDREN represents all descendent processes that have terminated and been waited on. (It does not include any not yet terminated and waited on descendents.)

So, here at least, we assume it was the job we just ran but maybe it was also the other jobs that just finished and wait-for-job has run off the back of a SIGCHLD:

Return resource usage statistics for all children of the calling process that have terminated and been waited for.

getrusage(2)

We’re the calling process and we’ve lots of children. Hmm.

I dunno. I guess if you’re pretty confident that you only had one thing running (and now completed) then this usage is probably fairly accurate.

If you’d fired off a dozen jobs in the background that completed whilst this job was running then you can be less confident (as can wait-for-job if it is reporting on those jobs).

On the plus side, tv-real (the elapsed real time) should be accurate. Erm, except that the end time is dependent on when wait-for-job wakes up.

So, use it as a hand-wavy “about this much time”.

Meta-Command Preferences

The meta-commands are utilised in the following order (in other words, if you have set two conflicting meta-commands then the first listed is used):

  1. named-pipe-into

    returning a pathname

  2. pipe-into

    returning an output pipe handle

  3. named-pipe-from

    returning a pathname

  4. pipe-from

    returning an input pipe handle

  5. collect-output

    which always runs the job in the foreground

    returning the stripped collected output discarding trailing newlines, U+000A (LINE FEED)

  6. fg-job

    returning the job status (#t or #f)

  7. bg-job

    returning #t

Further Meta-Commands

With these sorts of options to hand we can start to imagine some more exotic forms.

Suppose we wanted to return a named-pipe pathname to a command and yet retain control over what is written by also getting the other end of the pipe returned to us?

You can imagine a sort of:

diff file <( ... )

where, instead of ... being an asynchronous command, we somehow get an output pipe handle to give diff some input through the named pipe of that second argument.

Well, we can’t return two things and certainly not to two different people but maybe there’s a trick we can pull?

Suppose we passed a function expecting a single argument (the putative output pipe handle) as ...? Under those circumstances, the code could detect that the value passed was a function and call it with the appropriate end of the pipe.

The function would be expected to assign the value to a variable in scope, a trick with “private variables” we’ve played before.

The meta-command code can continue on and return the named pipe as an argument to the command as it would have done normally.

At this point however, we get a disturbance in the force. diff wants to run in the foreground reading from /dev/fd/X or whatever as it would have done normally and we now want to run in the foreground in order to supply diff with some input.

One or the other is going to have to be run in the background which is going to be problematic if whichever it is wants to write to the terminal. Let’s go with the command being run in the background – a complete volte-face from previous behaviour – but we are doing something exotic!

You can imagine something like:

oph := #f

define (set-oph ph) {
  oph = ph
}

diff file <( set-oph )

hprintf oph "hello\n"

close-handle oph

Now we have to wait for and get the exit status of diff – assuming it hasn’t been stopped for trying to write to the terminal.

This mechanism is not too far removed from the various "Coprocess" facilities in newer versions of Bash and Ksh although we are retaining a connection with a specific argument in the command’s arguments.

Coprocesses themselves should be equally feasible.

Asynchronous Command House-Keeping

There’s a tiny bit more annoyance with asynchronous commands, in particular with the pipe we create to pass data into or out of the asynchronous command.

This is going to get messy so let’s take an example and work it through:

oph := pipe-into {
  cat > file
  do something else
}

We obviously create the connectivity pipe in the parent Idio as it is that parent which is going to utilise the pipe (even if that is just giving the pipe (handle or name) to someone else, it, the parent, needs the one end). Here, the parent wants the write end of the connectivity pipe and will return an output pipe handle for the user to write into.

We now fork(2) a sub-Idio to run through the asynchronous command itself. It will dutifully remap its own stdin or stdout according to the asynchronous connectivity and close the corresponding read or write end of the pipe. So, here, the sub-Idio will dup2(2) the read end of the connectivity pipe to stdin for the asynchronous commands and will close(2) the original connectivity pipe file descriptor (10, or whatever, who cares?).

We now run the asynchronous commands. cat will read its stdin and print the contents to stdout which we happen to have redirected to file.

But wait, cat hangs. What’s going on? Well, cat reads from its stdin until it gets an EOF. If we strace it we see it happily blocked reading from file descriptor 0. “ENOEOF” – if we could make up new error codes.

The problem, here, is that cat’s stdin is a pipe and the problem with pipes is that they remain open, without an EOF indicator, whilst anyone still holds a write end open.

That’s where we’ve made our mistake. The parent Idio has the write end open – otherwise we’re not going to do much “piping into” – and then we forked. The sub-Idio also has the write end open. In fact, if you nose around carefully, cat has the write end open too as it was forked from the sub-Idio. D’oh!

So, our extra bit of house-keeping for asynchronous commands is to ensure that the sub-Idio closes down the other end of the connectivity pipe from the one it needs. Or, another way of describing it, is that the sub-Idio needs to close the end of the connectivity pipe that the parent Idio will be keeping open.
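
A hedged C sketch of that close discipline for the pipe-into case; the names are illustrative and the real code tracks which end the parent keeps:

#include <unistd.h>

int conn_pipe[2];       /* the connectivity pipe, created in the parent with pipe(conn_pipe) */

/* Sketch: the sub-Idio side of a pipe-into asynchronous command. */
static void
async_child_setup (void)
{
    dup2 (conn_pipe[0], STDIN_FILENO);  /* the asynchronous commands read from the pipe */
    close (conn_pipe[0]);               /* the original fd is no longer needed */
    close (conn_pipe[1]);               /* the parent's end: without this, cat never sees EOF */
}

/* The parent keeps conn_pipe[1] (wrapped as the output pipe handle) and
 * closes conn_pipe[0] which only the child needs. */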

Simple Commands

Reducing the pipeline down to its simplest case causes us a problem. Consider:

Idio> pipe-from cat file

There is no | symbol so no | reader operator gets invoked. We will now try to find and execute the external command pipe-from which will probably(!) fail.

To this end we need to rustle up some single-instance meta-commands: fg-job, bg-job, pipe-into, pipe-from and collect-output most of which are wrappers to a generic fork-command which runs job control much like the main pipeline operator does.

collect-output is slightly different in that it runs a combination of with-output-to (a variation on with-handle-redir, below) and fg-job to force the command into a sub-program.

Exiting

With our bookkeeping hats on we do have a minor problem when we exit in that there could be jobs lying around for which we haven’t called wait(2). This is particularly prevalent if they were run immediately before we exit.

What should we do with them? The info pages skip over this but we can take a look at what Bash does which is to terminate all stopped jobs: you can follow exit_shell() in shell.c to end_job_control() in jobs.c to terminate_stopped_jobs().

Of course that means that any backgrounded not-stopped jobs continue.

When I typed that, I’ll be honest, I had a mental panic thinking, “What a disaster! Are we not in control?” as I’d not thought about the functionality using that expression and couldn’t see the wood for the trees.

Of course, running commands in the background is the very essence of most initialisation scripts. Their only purpose is to prepare the ground for the daemon to be kicked off then exit themselves.

Of course, with our systemd hats on, there’s a preference for initialisation scripts to not run daemons in the background so that systemd can better keep an eye on them.

In addition the system will terminate any outstanding Asynchronous Commands.

HUPing

In a similar vein, if we are sent a SIGHUP signal then we should pass it on to all of our jobs and then quit.
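
A hedged C sketch of that forwarding, assuming each job records its process group ID; sending SIGCONT after the SIGHUP so that stopped jobs can act on it follows the usual shell convention rather than anything Idio-specific:

#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>

/* Sketch: on receiving SIGHUP, pass it on to every job's process group and quit. */
static void
hup_all_jobs (pid_t *job_pgids, int njobs)
{
    for (int i = 0; i < njobs; i++) {
        kill (-job_pgids[i], SIGHUP);   /* a negative pid targets the whole process group */
        kill (-job_pgids[i], SIGCONT);  /* so stopped jobs can actually act on the SIGHUP */
    }
    exit (0);
}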

I/O Redirection

Our problem with I/O redirection isn’t the redirection, mostly, but that we don’t, in general, use file descriptors. In fact, we happily deal with string handles which don’t have any file component whatsoever.

Note that:

  • this only works for the standard I/O streams, stdin, stdout and stderr

    The underlying problem being that I didn’t want to deal with, say, Bash’s arbitrary redirection syntax, [n]>[word], which has too much ambiguity for the Idio reader.

  • only works after the command and arguments, as in ls -l > osh

    Again, Bash’s syntax is too flexible (for me).

We have two forms of I/O redirection:

  1. actual I/O redirection, as in >, which we deal with in with-handle-redir

  2. I/O duplication, as in >&, which we deal with in with-handle-dup

These two should probably be merged via some clever coding.

with-handle-redir

The actual I/O redirection operators, <, > and 2>, determine that they affect a type of 'input, 'output or 'error and take the first expression after as the target.

They then package the command before the redirection statement in the form of a thunk and pass it to with-handle-redir with the type and target arguments.

with-handle-redir then defines a bunch of accessor methods based on the type and then looks at the target:

  • if target is the correct direction (input or output!) file descriptor handle then we can use it

    Note that all file handles are (by definition?) file descriptor handles.

    This also covers pipe handles which are also file descriptor handles.

  • otherwise if target is a file descriptor handle then it must be in error (as it should have passed in the previous test)

  • if target is the correct direction (input or output!) string handle then we can use it

  • otherwise if target is a string handle then it must be in error (as it should have passed in the previous test)

  • if target is a string then it is deemed a filename and opened (for input or output)

  • if target is #n then it is a nickname for “/dev/null” which is opened (for input or output)

  • otherwise, target is an invalid value

Assuming we’re still going then we set the appropriate handle (set-input-handle!, set-output-handle! or set-error-handle!) and run the thunk.

Finally, set the handle back.

Notice that we only set the handle. We don’t deal in file descriptors, here. It’s only if (and when) we get round to trying to actually run an external command that we actually care about file descriptors.

If we had redirected the output of some expression into a string handle and the expression was some Idio code then there’s no need for any file descriptors to get involved. Only if the expression reduced down to an external command will we start considering file descriptors. In the particular case of a string handle, we need to indirect through a temporary file as well.

with-handle-dup

The I/O duplication operators, <&, >& and 2>&, determine that they affect a type of 'input, 'output or 'error and take the first expression after as the target.

They then package the command before the redirection statement in the form of a thunk and pass it to with-handle-dup with the type and target arguments.

with-handle-dup then defines a bunch of accessor methods based on the type and then looks at the target:

  • if target is the correct direction (input or output!) file descriptor handle then we can use it

    Note that all file handles are (by definition?) file descriptor handles.

    This also covers pipe handles which are also file descriptor handles.

  • otherwise if target is a file descriptor handle then it must be in error (as it should have passed in the previous test)

  • if target is the correct direction (input or output!) string handle then we can use it

  • otherwise if target is a string handle then it must be in error (as it should have passed in the previous test)

  • if target is a C/int or fixnum with a value of 0, 1 or 2 then the current input, output or error handle is used as the target handle

  • otherwise, target is an invalid value

Assuming we’re still going then we set the appropriate handle (set-input-handle!, set-output-handle! or set-error-handle!) and run the thunk.

Finally, set the handle back.

Again, we’re only dealing with handles, here, no file descriptors.
