.. include:: ../global.rst

***************************
The :lname:`Idio` Allocator
***************************

Let's have a little digression into (memory) `allocators `_ rather
than garbage collectors *per se*.

This becomes a very complicated field.  We're partly here because I
thought that things could be done with an allocator to improve
*stuff*.  Which might be true but, it transpires, not obviously by
me.

However, having been through the process, a) we're no worse off
(always a good thing), b) it *is* interesting and c) I've left it
turned on by default.

Variations
==========

Hiding in the :lname:`Bash` codebase is :file:`lib/malloc/malloc.c`
which is derived from some :lname:`Emacs` improvements to
:ref-author:`Chris Kingsley`'s 1982 fast storage allocator.  In fact,
an Internet search for "malloc.c Chris Kingsley" will show
derivations appearing in many places.  Annoyingly, I can't find a
definitive source for the original so something like the FreeBSD
`libexec/rtld-elf/malloc.c `_ might have to do.  That version is
under 500 lines of code.

Another scheme is :ref-author:`Doug Lea`'s 1987 `dlmalloc `_ and the
code at http://gee.cs.oswego.edu/pub/misc/malloc.c -- the site itself
seems to use an old link.  :lname:`dlmalloc` is over 6000 lines of
code.  A "pthreads malloc" (ptmalloc) was derived from that and
GNU_'s :lname:`libc` malloc `derived from that `_ in turn.

.. aside::

   Wait!  *What?*

These can start getting a bit complicated with
:ref-author:`Jason Evans`' http://jemalloc.net powering FreeBSD and
Facebook.  It is almost 100k lines of code across 400 files.

I think we want something better than naïve, digestible and, sorry,
Jason, not doubling the size of the code base.

There's a comment in a :file:`vxmalloc.c` online from
:ref-author:`Hwa Jin Bae` which says:

    ...contains two different algorithms.  One is the BSD based
    Kingsley "bucket" allocator which has some unique fragmentation
    behavior.  The other is Doug Lea's well tested allocator that
    tries to minimize fragmentation while keeping the speed/space
    requirements.

That sounds like a fair compromise (not that I've read the code --
it seemed too long!).

There's another feature we need to be aware of which appears in that
:lname:`Emacs`/:lname:`Bash` variant: a POSIX requirement is that an
allocator be re-entrant.  In the absence of threading we *are* still
at risk of reentrancy as we support signal interrupts and we could
trip over ourselves that way -- although we shouldn't as we only
interrupt the :lname:`Idio` code (not the :lname:`C` code).
Besides, it would be nice to be ready for threading should we ever
get there.

Design
======

So :file:`src/malloc.c` is a :lname:`Bash`-inspired variation on the
Kingsley fast storage allocator.

The basic mechanism is that you assign allocation requests into
chains of buckets where a bucket (chain) handles allocations sized
up to 2\ :sup:`n`.

That said, the actual allocation is extended to include some
accounting overhead and some space for "range markers" which can be
used to verify that users haven't been poking around outside their
allotment.  Not that there's much you can do: by the time you check
(at re-allocation or freeing) the damage has already been done.
That's the :lname:`C` way!

In the FreeBSD variant the range marker is a duplicate of the range
magic number in the :samp:`{overhead}` structure.  The :lname:`Bash`
variant uses a slightly cleverer memory guard which encodes the
number of allocated bytes as the terminal marker.
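
As a rough illustration of the idea -- this is a sketch of my own,
not the :lname:`Bash` code, and the names (``guard_t``,
``stamp_guard``, ``check_guard``) are hypothetical -- a
size-encoding guard might be stamped after the user's bytes and
checked later like this:

.. code-block:: c

   /* sketch only: a 16-bit guard encoding the requested size,
    * written immediately after the user's allocation */
   #include <stdint.h>
   #include <string.h>

   typedef uint16_t guard_t;

   /* write the guard just past the user's size bytes */
   static void stamp_guard (char *user, size_t size)
   {
       guard_t g = (guard_t) size;      /* the size, truncated to 16 bits */

       /* memcpy as the address may not be suitably aligned */
       memcpy (user + size, &g, sizeof (guard_t));
   }

   /* return non-zero if the guard still matches the size */
   static int check_guard (char *user, size_t size)
   {
       guard_t g;

       memcpy (&g, user + size, sizeof (guard_t));

       return (g == (guard_t) size);    /* 0 => someone wrote past the end */
   }

Unlike a fixed magic number, the expected value differs per
allocation which makes an accidental match a little less likely.
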

It is only slightly cleverer, though: continuing to locate it at the
end of the user's allocation means it is just as likely to be
stomped over as anything else.

Of course, if you know how the :samp:`{overhead}` and memory guards
work and, recognising that they are immediately before and after
your allocated memory (broadly, ``a[-2]`` and ``a[n+1]``), you can
probably "adjust" both for fun and profit.  As a partial mitigation,
the :lname:`Bash` code supports a malloc "register" which remembers
the allocation separately from the allocation itself.

So if you request 20 bytes, say, you'll get passed into the 64 byte
bucket chain (your 20 bytes plus another 16-odd of overhead
etc. takes you beyond a 32 byte bucket); a 200 byte request goes
into the 256 byte bucket chain (as the added overhead bytes don't
bump us into the next bucket).  The overhead is a bit vague as it is
a fixed size header, a fixed size range marker and some padding.

One thing to note is that only one user allocation request goes into
each bucket.  We don't try to squeeze a small thing in on the end of
a quite large allocation that doesn't quite fill its bucket.  We've
made an effort to find the smallest bucket that is at least as big
as the request (plus overheads) and we leave it at that.

However, you can see that any bucket allocation is going to waste
increasingly large amounts of memory as any 2\ :sup:`n` request by
you is going to go into a 2\ :sup:`n+1` bucket.  There are some
potential positives, though: for example, if you
:manpage:`realloc(3)` and keep within the 2\ :sup:`n` (minus
overheads) limits, the code only needs to tweak the allocated size
field.

The buckets have to live somewhere and so they are tacked onto a
chain.  We need a chain per bucket size and so there's a
:samp:`{nextf}` ("next free"?) pointer per bucket size.  The obvious
implementation, then, is an array of pointers acting as the heads of
the chains.

Buckets
-------

The set of buckets is more interesting than it should be.  How many
should we have?  For example, with a 32-bit counter for our bucket
size, ``ovu_size``, we'd have, uh, 32 buckets, right?

Of course not.  I wouldn't be mentioning it if it was that easy.
There's a couple of things to think about.

In the first instance, our :samp:`{overhead}` is eight bytes, in
this case, so there's no point in having a bucket size less than
that.  In fact you might think that as the :samp:`{overhead}` has to
fit into the bucket as well then there's no point in having an eight
byte bucket either as it wouldn't leave any room for user-data.

However, we hit our second issue in that our algorithm, as it washes
about, references :samp:`{bucket} - 1` which means any
actually-in-use bucket chain needs to have a previous bucket to be
referenced, if not actually used.

Hence the buckets start with the 0\ :sup:`th` bucket chain being
size eight bytes -- albeit it is not used, it's just a placeholder.

The first usable bucket chain, #1, is for sixteen byte allocations
which, given our :samp:`{overhead}` and (two byte) range marker,
means it's for small single digit allocations!  That said, I see it
gets used about a thousand times during a test run so it's not to be
sniffed at!

At the other end of the scale, given we're starting at 2\ :sup:`3`
(eight bytes in chain #0) and we're limited to 32-bit allocations,
then we can only usefully use 29 more chains with the last, #29,
being allocations of 2\ :sup:`32` which...doesn't work on 32-bit
systems.  Your calculation of 2\ :sup:`32` might well be 0 anyway.
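
To make that last bit of arithmetic concrete -- this is a throwaway
sketch, not the :file:`src/malloc.c` code, and the names are made
up -- the bucket sizes run as follows and the last one is where the
trouble starts:

.. code-block:: c

   /* sketch: bucket sizes assuming chain #0 is 2^3 = 8 bytes and
    * each subsequent chain doubles, so chain #29 "should" be 2^32 */
   #include <stdio.h>

   #define NBUCKETS 30

   int main (void)
   {
       unsigned long size = 8;          /* chain #0: 2^3 bytes */

       for (int bucket = 0; bucket < NBUCKETS; bucket++) {
           printf ("bucket #%2d is %lu bytes\n", bucket, size);

           /* with a 32-bit unsigned long, doubling 2^31 wraps to 0,
            * which is why the last bucket's size needs special care */
           size *= 2;
       }

       return 0;
   }
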

Hmm, :lname:`Bash` uses a fixed array of ``binsizes`` with the last
bin size being 2\ :sup:`32` - 1.  I didn't like that fixed array and
so dynamically set these values which requires a small consideration
for bucket #29's size.  In fact that last bucket size is set as an
``unsigned long`` value but *used* as a ``long``, i.e. a
``signed long``.  This means the last bucket's "size" is -1 and is
used as a flag to denote failure.  That itself means the largest
allocatable size is 2\ :sup:`31` minus overheads.

Bootstrap
---------

The :lname:`FreeBSD` code requires genuine bootstrap in that the
caller has to initialise the ``pagesizes[]`` array and
``pagepool_start``/``pagepool_end`` pointers.  To what, for the
latter two, isn't especially clear.

The :lname:`Bash` code is a little more dynamic and allocates memory
on demand.  We look at the user's requested allocation and add in
the various overheads.  We then look for the smallest bucket chain
capable of holding this request.

If we've no free buckets in the chain then we grab a minimum of one
page of memory (using a *morecore* function -- this :lname:`Idio`
code uses :manpage:`mmap(2)`) and divvy it up into however many
2\ :sup:`n` buckets.  In *morecore* we only care enough about
overhead etc. to stamp a "free" flag and "next" pointer into each
bucket such that the chain will now have some free buckets on it.

Obviously, if the requested amount is more than one page then we
grab enough pages to cover the requested allocation (plus overheads)
rounded up to the 2\ :sup:`n` bucket size and a single bucket is
added to the chain.

Accounting
----------

The nominal :samp:`{overhead}` "structure" is actually a union with
the :samp:`{overhead}` structure a member:

.. code-block:: c
   :caption: :file:`src/malloc.c`

   union idio_malloc_overhead_u {
       uint64_t o_align;                /* 8 bytes */
       struct {
           uint8_t ovu_magic;           /* magic number         1 */
           uint8_t ovu_bucket;          /* bucket #             1 */
           uint16_t ovu_rmagic;         /* range magic number   2 */
           uint32_t ovu_size;           /* actual block size    4 */
       } ovu;
   };

which is word-aligned (for large words).

The interesting bit is the ``struct ovu``:

* the ``ovu_magic`` number records whether the block has been
  allocated or freed

* the ``ovu_bucket`` records the bucket (*duh!*) which gives us a
  quick way back to the chain of buckets for when the bucket is
  freed

* the ``ovu_rmagic`` is one end of the range marker

* the ``ovu_size`` is the requested allocation (by the user)

Of interest, here, is that

#. the whole union is 64-bit aligned

#. but only allows for 32-bit allocations

This means the :samp:`{overhead}` is 8 bytes or one pointer on a
64-bit system.  If we allowed for 64-bit allocations then the
block's size (``ovu_size``) would consume a word on its own and the
rest of the :samp:`{overhead}` would fill out to two words.
Matching regular :manpage:`malloc(3)`.

64-bit
^^^^^^

.. aside::

   *And don't call me Shirley!*

I went back and looked at this 32-bit allocation limit.  We can do
better, surely?

If we toggle based on the platform's word length (there must be a
better test than ``PTRDIFF_MAX`` -- or other integer) and
parameterise a few things:

.. code-block:: c
   :caption: :file:`src/malloc.c`

   #if PTRDIFF_MAX == 2147483647L
   #define idio_alloc_t            uint32_t
   #define IDIO_PRIa               PRIu32
   #define IDIO_MALLOC_NBUCKETS    30
   #define IDIO_MALLOC_FIRST_Po2   3
   #else
   #define idio_alloc_t            uint64_t
   #define IDIO_PRIa               PRIu64
   #define IDIO_MALLOC_NBUCKETS    62
   #define IDIO_MALLOC_FIRST_Po2   4
   #endif

where ``IDIO_MALLOC_FIRST_Po2`` ("Power-of-2") is used to set the
size of the first bucket, 2\ :sup:`3` or 2\ :sup:`4`, and is used
elsewhere in the stats gathering if ``IDIO_DEBUG`` is set.

We can then change ``ovu_size`` to the normalised ``idio_alloc_t``:

.. code-block:: c
   :caption: :file:`src/malloc.c`

   union idio_malloc_overhead_u {
       uint64_t o_align;                /* 8 bytes */
       struct {
           uint8_t ovu_magic;           /* magic number         1 */
           uint8_t ovu_bucket;          /* bucket #             1 */
           uint16_t ovu_rmagic;         /* range magic number   2 */
           idio_alloc_t ovu_size;       /* actual block size    4/8 */
       } ovu;
   };

and adjust any relevant size-referencing
variable/:manpage:`printf(3)` statement to use
``idio_alloc_t``/``IDIO_PRIa``, then we appear to cover our bases.

Testing
"""""""

In the modern age of host machines with a decent amount of RAM and
easily built VMs we should have easy enough access to a 64-bit
machine with more than 4GB of RAM and a 32-bit machine with 4GB of
RAM.

.. warning::

   If you're keen to play around then I advise you to do this test
   on machines with those quantities of **RAM**.  They should work
   with those amounts of virtual memory but all you will do is
   impress yourself that Unix-based operating systems can
   (eventually) recover from extended periods of *thrashing*.

   The problem is that in creating an :lname:`Idio` array, we *will*
   write a default value (``#f``) into every slot making every
   (virtual memory) page dirty.  If you don't have enough RAM the OS
   will swap those dirty pages out in order that it can do exactly
   the same with a new page.

To test this we'll create an :lname:`Idio` array.  We need to keep
ourselves aware of what's happening under the hood, though,
otherwise we'll have sad faces.

In the first instance, the :lname:`Idio` array is made from a
:lname:`C` array of pointers -- which are going to be four or eight
bytes each, thus bumping up our actual allocation.  On top of that
we have the tricksy problem that a 2\ :sup:`n` allocation will (have
to) go into a 2\ :sup:`n+1` bucket and it's the bucket that we need
allocated.

On a 64-bit machine pointers are eight bytes, 2\ :sup:`3`, so if we
are targeting a 16GB allocation, 2\ :sup:`34` bytes, on a machine
with 16GB of RAM we want to create an array of 2\ :sup:`31`
elements.  The 2\ :sup:`31` :lname:`Idio` elements become a
2\ :sup:`34` :lname:`C` byte allocation which will go into a
2\ :sup:`35` bucket.

Or, rather, it won't because that's more than the virtual memory
available to us.

.. sidebox::

   See :ref:`Address Space` for some notes on physically addressable
   memory.

Other processes are using memory, including us, so we'll aim a bit
lower.  Any number which generates an allocation over 4GB is good
enough as it demonstrates 64-bit handling and we're in no position
to properly test a full (nearly) 2\ :sup:`64` byte allocation.

We've a couple of options here.

#. We can reduce the number of elements in the array by a bit,
   keeping us below any next-bucket triggering:

   .. code-block:: idio-console

      Idio> make-array ((expt 2 31) - 1000000)
      Killed

   *Eh?*

   Oh, the Linux *Out Of Memory* killer has, er, "helped" us out.

#. We can go down a power of two and look to get bumped up into the
   next bucket...

   .. code-block:: idio-console

      Idio> make-array (expt 2 30)
      #[ #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ..[1073741804] #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ]

Cool.  Probably.  :socrates:`What have we done, exactly?`

We've created an array of a billion elements (the printer,
thankfully, omits some in the middle).  A billion elements means
eight billion bytes and therefore lands in the sixteen billion byte
bucket.

We should be able to get another 50% in there:

.. code-block:: idio-console

   Idio> make-array ((expt 2 30) + (expt 2 29))
   #[ #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ..[1610612716] #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ]

which looks very much like we've created (and immediately thrown
away) a 12GB array.  It looks plausible.

On a 32-bit machine, targeting the maximum 2\ :sup:`32` byte
allocation with four byte pointers, we need to play the same games
with an exception for the very trick we're trying to pull.

.. sidebox::

   In fact, I seem to get an issue with :manpage:`mmap(2)` and
   anything over half of memory.

On a 32-bit machine we *can* feasibly allocate all of memory so we
have to handle the largest cases carefully.  A 2\ :sup:`30` element
array will become a 4GB allocation and is too much.  A 2\ :sup:`29`
element array is "only" half of RAM although I hit the
:manpage:`mmap(2)` limit so we should aim for a 2\ :sup:`29` - 10
element array for just under half of RAM:

.. code-block:: idio-console

   Idio> make-array ((expt 2 29) - 10)
   #[ #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ..[536870882] #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ]

The :lname:`libc` allocator is different still.  On a 32-bit CentOS
6 VM I can ``make-array`` 2\ :sup:`29` + 2\ :sup:`27` but not
2\ :sup:`29` + 2\ :sup:`28`.  Maybe some
:manpage:`sbrk(2)`/:manpage:`mmap(2)` features.

next pointers
^^^^^^^^^^^^^

The :samp:`{overhead}` structure is potentially/partially
re-purposed, here, to be the "next" in the chain pointer.  This
creates its own problems as we need to ensure that a bucket in the
free list (stamping a pointer in the :samp:`{overhead}`) doesn't
conflict with a bucket in use with magic numbers and sizes etc..

In the :lname:`FreeBSD` code, there is a ``next`` pointer in the
overhead union which directly impacts the :samp:`{overhead}`
structure.  To mitigate this it plays a similar trick to Idio itself
and recognises that a pointer will always have the trailing two bits
as 0 so, as long as the corresponding :samp:`{overhead}` field
(``ovu_rmagic``) does not have trailing 0s there, we can
differentiate between the two.  The code assumes we have a
big-endian machine and we can set the range magic number to be an
odd number (and thus not ending in two zeros).

In the :lname:`Bash` code the ``CHAIN`` macro uses a pointer to a
``char *`` inside the block.  That ``sizeof (char *)`` offset means
we skip over the ``ovu_magic``, ``ovu_bucket`` and ``ovu_rmagic``
fields on all platforms.  Given the block is 64-bit aligned that
means on a 64-bit platform you're setting the ``next`` pointer in
the (nominally) user-allocated section and on a 32-bit platform
you're overwriting some part of the :samp:`{overhead}` structure --
it should be the ``ovu_size`` field.

It's not clear why the (nominally) user-allocated section isn't used
all the time for the ``next`` pointer.  Maybe it's some heinous
combination of the :lname:`C` macros on obscure systems.
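
For a sense of what chaining through the user-data portion looks
like -- again a sketch with my own names and layout, not the actual
``CHAIN`` macro or the :file:`src/malloc.c` code -- the free list
for a chain might be pushed and popped like this:

.. code-block:: c

   /* sketch: chaining free buckets through the word immediately
    * after the overhead union, i.e. the user-data portion */
   #include <stddef.h>
   #include <stdint.h>

   union overhead {
       uint64_t o_align;
       struct {
           uint8_t ovu_magic;
           uint8_t ovu_bucket;
           uint16_t ovu_rmagic;
           uint32_t ovu_size;
       } ovu;
   };

   /* the (nominally) user-allocated section starts just after the
    * overhead */
   #define USER_DATA(op)  ((void **) ((union overhead *) (op) + 1))

   /* push a freed bucket onto the head of its chain */
   static void push_free (union overhead *nextf[], int bucket, union overhead *op)
   {
       *USER_DATA (op) = nextf[bucket];
       nextf[bucket] = op;
   }

   /* pop a free bucket off the head of its chain, NULL if empty */
   static union overhead *pop_free (union overhead *nextf[], int bucket)
   {
       union overhead *op = nextf[bucket];

       if (op != NULL) {
           nextf[bucket] = *USER_DATA (op);
       }

       return op;
   }
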

The nearest I can come up with is that for bucket #0 (eight bytes)
there is no user-data section so we'd be in trouble.  But bucket #0
is a placeholder; we don't actually use it.

I've set the :lname:`Idio` code so it always uses the user-data
portion...and we're still here!

mmap/sbrk
^^^^^^^^^

I've only used :manpage:`mmap(2)` in this implementation but the
:lname:`Bash` code goes a bit further and uses :manpage:`sbrk(2)`
for allocations under a page size.

I don't know if that makes this implementation good or bad in
relation.  Obviously the :lname:`Bash` (née :lname:`Emacs`) code has
both some considerable history and wider (operating system)
distribution and so has to cover more bases.  All I can say is
that...it works for me!  I fancy there may be bridges to be crossed.

I also notice that :manpage:`sbrk(2)` is deprecated in Mac OS X/Big
Sur with the following commentary:

    **The brk and sbrk functions are historical curiosities left
    over from earlier days before the advent of virtual memory
    management.**

Implementation
==============

The implementation of ``malloc``, ``calloc``, ``realloc`` and
``free`` works pretty much as you might expect with having to
rummage around finding buckets etc..

If you have run a debug build with ``--vm-reports`` then the file
:file:`./vm-perf` will have some allocator stats for you.

This is where I thought there might be some improvements to be had.
If you look at the stats you can see that one or two of the chains
are far more popular than others.  Perhaps not surprisingly, the
smaller sized ones.

Pulling some numbers out:

.. csv-table::
   :header-rows: 1

   bucket:, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096*, 8192, ...
   free:, 0, 868K, 1447K, 41K, 142K, 2487, 1592, 181, 0, 0, ...
   used:, 0, 7679, 233K, 1644K, 49K, 185, 24, 11, 41, 0, ...
   peak:, 0, 876K, 1680K, 1686K, 192K, 2668, 1614, 192, 42, 6, ...
   mmap:, 0, 6846, 26K, 52K, 12K, 334, 404, 96, 99, 22, ...
   munmap:, 0, 0, 0, 0, 0, 0, 0, 0, 58, 22, ...

you can see from the *peak* numbers that a lot of 32, 64 and 128
byte buckets were allocated.  The actual number of those won't
change -- we can't change the user-code -- but what we might tweak
is the number of ``mmap`` calls, tens of thousands.

So I looked at asking for, say, 1000 pages at a time so that the
number of ``mmap``\ s would drop to the tens.

That makes almost no difference at all.  Perhaps as you'd expect, as
you're only reducing the number of system calls in the overall
process run from the millions (I would presume) by a few thousand.

.. include:: ../commit.rst