.. include:: ../../global.rst

.. _`job control`:

***********
Job Control
***********

As mentioned before (and probably will be again) we're largely following the directions in the :program:`info` pages on *Job Control* (in *libc* or read the `online Job Control`_ pages) but there's a bit more to it.

The *Job Control* documentation works pretty well for organising single *external* commands or pipelines of them.  However, :lname:`Idio` is a programming language and even :lname:`Bash` allows you a few twirls and twists.

Consider:

.. code-block:: console

   $ /usr/bin/echo hello | wc
         1       1       6

No real surprises (hopefully).  Now what about:

.. code-block:: console

   $ { sleep 1; echo hello; } | wc
         1       1       6

Hmm, that's a bit more interesting.  We have a shell *Group Command* as the first "external" process in the pipeline and I've subtly altered the ``echo`` statement.

What's happening there?  Well, for a start, the Group Command *is* in an external process as the pipeline forces each segment of the pipeline into a child process.

But then what happens?  Well, in the right-hand child process, as the child process is just a :manpage:`fork(2)` of the original shell, :lname:`Bash` is ready to process what it sees.  It sees the external command :program:`wc` but, importantly, it only sees the single command :program:`wc` so will :manpage:`execve(2)` it in place of itself.  It is running in the context of the pipeline, ie. its *stdin* is connected to the output of the previous part of the pipeline.

For the left-hand child process, again, a child :lname:`Bash` is ready to process what it sees, which is a Group Command.  The first command in the group is the external command :program:`sleep` which the child :lname:`Bash` process runs in the foreground -- just like it would have in the main shell.  That's important as running a job in the foreground also involves *fork*'ing and *exec*'ing but now it is a child of the child :lname:`Bash`.  :program:`sleep` will slow things down for a bit.
When it completes, :lname:`Bash` sees the next command, the *builtin* command :program:`echo`, which it runs itself -- no external command with *fork* and *exec* required.  This is important as it clearly requires that :lname:`Bash` be running to be able to, uh, run the builtin command.

Of course, the left-hand child process, a :lname:`Bash`, in the pipeline has its *stdout* connected to the input of the following part of the pipeline and, remembering that child processes inherit their parent's file descriptors, the ``hello\n`` from :program:`echo` winds its way through to :program:`wc`.

So, whilst providing no difference in output (though possibly a difference in timing!) we have something altogether less obvious happening.  Whilst the pipeline is ostensibly *fork*'ing and *exec*'ing like the *Job Control* pages suggest, in practice there's some extra *shell*'ing in the way.

If we're a little more inquisitive we can see the two variations:

.. code-block:: console

   $ ps -Ht $(tty) | cat -
     PID TTY          TIME CMD
   67896 pts/0    00:00:00 bash
   73717 pts/0    00:00:00   ps
   73718 pts/0    00:00:00   cat

   $ { sleep 1; ps -Ht $(tty); } | cat -
     PID TTY          TIME CMD
   67896 pts/0    00:00:00 bash
   73739 pts/0    00:00:00   bash
   73743 pts/0    00:00:00     ps
   73740 pts/0    00:00:00   cat

In the first case it is just two external commands *fork*'ed and *exec*'ed.  In the second, there is an extra :program:`bash` process parenting the :program:`ps` which would have also parented the :program:`sleep` but it has been and gone.

.. _`job control considerations`:

Considerations
==============

Broadly, you expect, it's a case of being a bit more careful with the book-keeping.  In our case there's also a problem with who is doing what (and when).

.. aside::

   Unbridled enthusiasm, I'm afraid.

We'll see in :ref:`pipelines and IO` that pipelines are implemented by a reader operator.  In other words, it's all handled in :lname:`Idio`-land.  Almost.
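For reference, the raw plumbing underneath both variants is just :manpage:`pipe(2)`, :manpage:`fork(2)` and :manpage:`execve(2)`, as the *Job Control* pages describe.  Here is a self-contained sketch of ``echo hello | wc`` -- illustrative only, with names of my own choosing, and not :lname:`Idio`'s (or :lname:`Bash`'s) actual code:

.. code-block:: c

   /*
    * pipeline.c: the fork/exec plumbing behind `echo hello | wc`
    * (a sketch, not Idio's implementation)
    */
   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/types.h>
   #include <sys/wait.h>
   #include <unistd.h>

   int main (void)
   {
       int fds[2];
       if (pipe (fds) < 0) { perror ("pipe"); exit (1); }

       pid_t left = fork ();
       if (left == 0) {
           /* left-hand child: stdout -> write end of the pipe */
           dup2 (fds[1], STDOUT_FILENO);
           close (fds[0]); close (fds[1]);
           execlp ("echo", "echo", "hello", (char *) NULL);
           perror ("execlp echo"); _exit (1);
       }

       pid_t right = fork ();
       if (right == 0) {
           /* right-hand child: stdin <- read end of the pipe */
           dup2 (fds[0], STDIN_FILENO);
           close (fds[0]); close (fds[1]);
           execlp ("wc", "wc", (char *) NULL);
           perror ("execlp wc"); _exit (1);
       }

       /* the parent must close both ends or wc never sees EOF */
       close (fds[0]); close (fds[1]);
       waitpid (left, NULL, 0);
       waitpid (right, NULL, 0);
       return 0;
   }

In the Group Command variant, the difference is simply that the left-hand child keeps running shell code -- only *exec*'ing (or running a builtin) for each command inside the group -- which is where that extra :program:`bash` in the :program:`ps` listing comes from.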
In practice, the ``|`` reader operator conspires to arrange the pipeline through a hefty but straight-forward template in which it embeds the original code, the individual command snippets in the pipeline.

Non-pipeline, ie. "simple", commands are handled in two ways.  If there's a specific kind of command, eg. :ref:`collect-output`, then, again, the code is arranged in :lname:`Idio`-land with the snippet embedded.

Any other external command *and*, therefore, those embedded snippets are identified by the VM trying to invoke a symbol, eg. the ``ls`` in ``ls -l`` (assuming ``ls`` isn't bound to some value).  It then asks the system to find an external command on :envvar:`PATH` by the name of ``ls`` and then will call on the original :lname:`C` implementation of *Job Control* to actually run the command.

So, ``echo "hello"`` will be run from the :lname:`C` implementation and ``echo "hello" | wc`` will run through the ``|`` reader operator creating two child :lname:`Idio` processes, each of which has a splodge of code in which are ``echo "hello"`` and ``wc`` respectively, each of which will be run by the :lname:`C` implementation.

That's mostly saying it all boils down to the :lname:`C` implementation to decide whether to *fork* and *exec* or just *exec*.

.. aside::

   I need a credible "yes" case!

Does it make a difference?  Hmm, mostly no except sometimes maybe yes.

How can we decide?  Well, we could decide by looking at the number of things we're about to run.  If there's more than one then we're probably looking to run a block of code like the Group Command example.

One problem here is that only the :lname:`Idio`-land code knows whether it is embedding more than one command so it would need to flag the decision.

Even if we're only embedding a single command from :lname:`Idio`, *it* doesn't know whether that is an external command (so can be directly *exec*'ed) or not.  If I've imported ``libc`` then I'll get the ``libc/sleep`` primitive when I call ``sleep 1``.
We can't *exec* primitives.

But back to that Group Command case.  Remember that the (external) commands in the Group Command became children of the sub-:lname:`Bash`.  So it's important (obvious but important) that we are making our decisions about *fork*'ing and/or *exec*'ing with respect to the "current" shell which might itself be a child of another shell.

What *that's* trying to say is that any measure or comparison with PIDs or PGIDs must maintain some concept of what is current.  It's not (always) going to be the values for the original shell.

Process Groups
--------------

On the subject of Process Groups, a naïve port of the Job Control code gets us into trouble.  This manifested itself, for me, as my editor disappearing and my being logged out of my development machine.  *Hmm, that's odd!*

I thought it was me for a while until I realised I was leaving a trail of :ref:`asynchronous commands` and the Linux *Out Of Memory* killer was `having a field day`_ with me.

The problem partly lies in the expectations of that Job Control implementation.  It is predicated on, and is expecting to be, running external commands whereas we've just seen that in practice we're only ever running more :lname:`Idio` (cf. :lname:`Bash`) expressions -- which may or may not include external commands.

What makes it worse are those asynchronous commands which, come the end of the shell's lifetime, need to be handled like stopped jobs.

The Job Control code only sets the process group ID if the shell is interactive.  The underlying reason, here, being that:

    A subshell that runs non-interactively cannot and should not support job control.  It must leave all processes it creates in the same process group as the shell itself; this allows the non-interactive shell and its child processes to be treated as a single job by the parent shell.

    --- *Initializing the Shell*

There are two problems here.
Firstly, we don't set the PGID at all if we're non-interactive and, secondly, we need to kill off any outstanding asynchronous commands when :lname:`Idio` shuts down.  For the latter, of course, we could do with a PGID.

Our not setting a PGID means the job runs with whatever PGID it inherits from its parent.  This brings an interesting shutdown question as, if we walk over our list of jobs and send the job's PGID a termination signal, *we'll* get that signal, in the middle of sending signals to jobs.  Ostensibly, that's a *who cares?* moment as that means the job is done, right?  Well, as we're about to suggest, perhaps not all jobs are going to be in the same process group.

The :lname:`Bash` authors have a long comment in ``process_substitute()`` in :file:`subst.c` which boils down to "asynchronous commands should be in their own process group."  It's a bit more complicated than that as :lname:`Bash` only does that if there has been a job control enabled instance somewhere in its process history.

Is this important?  Sort of.  Nominally, with Job Control, there is an expectation that a non-interactive shell has all of its children in the same process group so that its own parent can treat it like a single job.  As soon as we put jobs in separate process groups, for whatever reason, then we're breaking that mould.

By way of alternative, we could handle asynchronous jobs on termination by identifying that the PGID is 0 then walking over the individual processes in the job and sending them a SIGTERM individually.  Which is fine but we'll get a tirade of SIGCHLDs on the back of various exits and signals depending on what everyone was doing.

That's not a brilliant alternative, though, as not sending a SIGTERM to the entire process group (but merely the processes we know about) means we'll miss a proportion of the processes involved.  Not looking quite so clever.

It's all a bit... *meh!*

.. aside::

   To heck with convention!
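The mechanics of the "own process group" arrangement are just :manpage:`setpgid(2)` plus a negative PID to :manpage:`kill(2)`.  A standalone sketch -- nothing here is :lname:`Idio` code, and the pipe exists only to sequence parent and child:

.. code-block:: c

   /*
    * pgid.c: put an "asynchronous command" in its own process group
    * so the whole group can be SIGTERM'ed at shutdown (sketch)
    */
   #include <assert.h>
   #include <signal.h>
   #include <stdio.h>
   #include <sys/types.h>
   #include <sys/wait.h>
   #include <unistd.h>

   int main (void)
   {
       int sync[2];
       if (pipe (sync) < 0) return 1;

       pid_t pid = fork ();
       if (pid == 0) {
           /* child: become our own process group leader */
           setpgid (0, 0);
           close (sync[0]);
           close (sync[1]);     /* EOF tells the parent we're ready */
           pause ();            /* wait to be killed */
           _exit (0);
       }

       char c;
       close (sync[1]);
       (void) read (sync[0], &c, 1); /* returns 0 at EOF: setpgid is done */

       assert (getpgid (pid) == pid);        /* the child leads its own group */
       assert (getpgid (pid) != getpgrp ()); /* ...distinct from ours */

       /* shutdown time: one signal takes out the whole group */
       kill (-pid, SIGTERM);

       int status;
       waitpid (pid, &status, 0);
       if (WIFSIGNALED (status) && WTERMSIG (status) == SIGTERM)
           printf ("group leader terminated by SIGTERM\n");
       return 0;
   }

The ``kill (-pid, ...)`` is the point: it reaches every process in the group, including any grandchildren we never knew about, which is exactly what signalling individual PIDs can't do.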
Revisiting the algorithm, then, we'll set the process' PGID and the job structure's field if we are an asynchronous command.  If :lname:`Idio` is given to exit then, alongside any stopped jobs, we can also send a SIGTERM to any asynchronous commands.

Floating around here are another couple of bits of house-keeping:

* ``libc/fork`` disables interactivity in the child, ``job-control/%idio-interactive`` becomes ``#f``

  Only the original :lname:`Idio` can be interactive with the controlling *tty*, right?  Assuming it was an interactive session in the first place.

* we must set ``%idio-jobs`` to ``#n`` in the child :lname:`Idio`

  This is because the parent :lname:`Idio` accumulates a merry collection of asynchronous commands.  If we're not careful then any child :lname:`Idio` inherits this list meaning that it, and any other inheritor, has a nice list of asynchronous commands that it thinks it is in the position of sending a SIGTERM to when the inheritor comes to exit.

Interactivity
-------------

There's another, somewhat anomalous, bit of behaviour in the example code.  I might have missed something but in the ``launch_job()`` function it says:

.. code-block:: c

   if (!shell_is_interactive)
     wait_for_job (j);
   else if (foreground)
     put_job_in_foreground (j, 0);
   else
     put_job_in_background (j, 0);

which says to me that if we are not interactive then we will *always* call ``wait_for_job()``.  That's not looking so clever for any jobs you want to run in the background, like almost every single initialisation script for a daemon.  If you're not interactive, like a daemon initialisation script, then you don't get to run the job in the background and then exit.  You hang about for the job to complete (which, presumably, it won't).

``put_job_in_foreground()`` also calls ``wait_for_job()`` but in the middle of two bits of terminal handling.

.. aside::

   Maybe I've missed a huge trick?
It seems to me that ``put_job_in_foreground()`` should be testing for interactivity around the terminal handling and therefore satisfying the nominal "foreground" job for interactive and non-interactive cases, leaving a "background" job to be left to run in the background in the expected way.

Forking Hell
============

*[ Surely no-one, documenting a shell, can resist that title?  I know I can't! ]*

Most of the time we're merrily launching *jobs* and, by and large, we collect them all up again.  However, we have released :manpage:`fork(2)` upon our unsuspecting public and things can go subtly wrong.

Here's the problem.  If I, independently from :lname:`Idio`'s official *Job Control* mechanisms, fork and exec stuff then :lname:`Idio` doesn't know anything about it.  I might fork a sub-:lname:`Idio`, let it do its thing and then, in the parent :lname:`Idio`, call :manpage:`waitpid(2)` with the child PID I just forked.

Surprisingly (to me) this works without an issue rather a lot of the time but doesn't always.  Sometimes ``libc/waitpid`` fails because there is "no child PID."  *Spooky!*

The problem is that :lname:`Idio` gets a signal that a child has died and starts reaping PIDs.  All of them that have died.  Normally these are elements of a job and the job gets updated by ``mark-process-status``.  If those include your independent PID then... well, :lname:`Idio` doesn't know anything about it.  All it can do is shovel it into a table of "stray" PIDs.

We can now have ``libc/waitpid`` do some checking.  If the system call failed and the error was ECHILD then we'll take a look in the table of stray PIDs and pluck the stashed results out of there.  Otherwise we'll return the previous stock answer of ``(0, #n)``.

.. include:: ../../commit.rst
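That ECHILD situation is easy to reproduce in miniature: once a PID has been reaped by *anyone* in the process -- a SIGCHLD handler included -- a second :manpage:`waitpid(2)` on it fails.  A standalone sketch (my own names, nothing :lname:`Idio`-specific):

.. code-block:: c

   /*
    * echild.c: reap a PID twice and the second waitpid(2) fails
    * with ECHILD -- the "no child PID" situation described above
    */
   #include <assert.h>
   #include <errno.h>
   #include <stdio.h>
   #include <sys/types.h>
   #include <sys/wait.h>
   #include <unistd.h>

   int main (void)
   {
       pid_t pid = fork ();
       if (pid == 0) _exit (0);

       int status;
       /* the first reap works: this stands in for the SIGCHLD
          handler hoovering up every dead child */
       assert (waitpid (pid, &status, 0) == pid);

       /* the second reap: the kernel has forgotten this PID */
       assert (waitpid (pid, &status, 0) == -1);
       assert (errno == ECHILD);

       printf ("waitpid: ECHILD\n");
       return 0;
   }

Hence the fallback: when ``libc/waitpid`` gets ECHILD, the stray-PID table is the only place the stashed status can still be found.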