The Idio Allocator

Let’s have a little digression into (memory) allocators rather than garbage collectors per se. This quickly becomes a very complicated field.

We’re partly here because I thought that things could be done with an allocator to improve stuff. Which might be true but, it transpires, not obviously by me. However, having been through the process, a) we’re no worse off (always a good thing), b) it is interesting and c) I’ve left it turned on by default.

Variations

Hiding in the Bash codebase is lib/malloc/malloc.c which is derived from some Emacs improvements to Chris Kingsley’s 1982 fast storage allocator. In fact, an Internet search for “malloc.c Chris Kingsley” will show derivations appearing in many places. Annoyingly, I can’t find a definitive source for the original so something like the FreeBSD libexec/rtld-elf/malloc.c might have to do. That version is under 500 lines of code.

Another scheme is Doug Lea’s 1987 dlmalloc and the code at http://gee.cs.oswego.edu/pub/misc/malloc.c – the site itself seems to use an old link. dlmalloc is over 6000 lines of code.

A “pthreads malloc” (ptmalloc) was derived from that and GNU’s libc malloc was derived from that in turn.

These can start getting a bit complicated with Jason Evans’ jemalloc (http://jemalloc.net) powering FreeBSD and Facebook. It is almost 100k lines of code across 400 files.

I think we want something better than naïve, yet still digestible and, sorry, Jason, not something that doubles the size of the code base.

There’s a comment in a vxmalloc.c online from Hwa Jin Bae which says:

…contains two different algorithms. One is the BSD based Kingsley “bucket” allocator which has some unique fragmentation behavior. The other is Doug Lea’s well tested allocator that tries to minimize fragmentation while keeping the speed/space requirements.

That sounds like a fair compromise (not that I’ve read the code – it seemed too long!).

There’s another feature we need to be aware of, which appears in that Emacs/Bash variant: POSIX requires that an allocator be re-entrant. In the absence of threading we are still at risk of re-entrancy because we support signal interrupts and could trip over ourselves that way – although we shouldn’t, as we only interrupt the Idio code (not the C code). Besides, it would be nice to be ready for threading should we ever get there.

Design

So src/malloc.c is a Bash inspired variation on the Kingsley fast storage allocator.

The basic mechanism is that you assign allocation requests into chains of buckets where a bucket (chain) handles allocations sized up to 2^n.

That said, the actual allocation is extended to include some accounting overhead and some space for “range markers” which can be used to verify that users haven’t been poking around outside their allotment. Not that there’s much you can do: by the time you check (at re-allocation or freeing) the damage has already been done. That’s the C way!

In the FreeBSD variant the range marker is a duplicate of the range magic number in the overhead structure. The Bash variant uses a slightly cleverer memory guard which encodes the number of allocated bytes as the terminal marker. Slightly cleverer as continuing to locate it at the end of the user’s allocation means it is just as likely to be stomped over as anything else.
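
To make that concrete, here is a minimal, self-contained sketch of a size-encoding guard. The two byte guard width and all the names are illustrative, not taken from src/malloc.c:

/* Illustrative sketch only: encode the user's requested size as a
 * guard value just past the end of their allocation and verify it
 * when the block comes back to us. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void guard_set (unsigned char *user, size_t size)
{
    uint16_t guard = (uint16_t) (size & 0xffff);
    /* the guard sits immediately after the user's size bytes */
    memcpy (user + size, &guard, sizeof (guard));
}

static int guard_ok (unsigned char *user, size_t size)
{
    uint16_t guard;
    memcpy (user + size, &guard, sizeof (guard));
    return guard == (uint16_t) (size & 0xffff);
}

int main (void)
{
    unsigned char block[32];
    unsigned char *user = block;      /* pretend this came from the allocator */

    guard_set (user, 20);
    user[20] = 0xff;                  /* stomp just past the allocation... */
    printf ("guard intact? %d\n", guard_ok (user, 20));  /* ...and get caught */
    return 0;
}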

Of course, if you know how the overhead and memory guards work and, recognising that they are immediately before and after your allocated memory (broadly, a[-2] and a[n+1]), you can probably “adjust” both for fun and profit.

As a partial mitigation, the Bash code supports a malloc “register” which remembers the allocation separately from the allocation itself.

So if you request 20 bytes, say, you’ll be passed into the 64 byte bucket chain (your 20 bytes plus another 16-odd bytes of overhead etc. take you beyond a 32 byte bucket); a 200 byte request goes into the 256 byte bucket chain (as the added overhead bytes don’t bump us into the next bucket size).
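
As a rough sketch of that arithmetic (the 16 byte overhead and the eight byte first bucket are illustrative figures for the example, not the real constants):

#include <stddef.h>
#include <stdio.h>

#define OVERHEAD 16        /* illustrative: header + range marker + padding */

/* smallest bucket index whose size (8 << index) holds size + OVERHEAD */
static int bucket_for (size_t size)
{
    size_t need = size + OVERHEAD;
    int bucket = 0;
    size_t bsize = 8;      /* bucket #0 is 2^3 bytes in this sketch */

    while (bsize < need) {
        bsize <<= 1;
        bucket++;
    }
    return bucket;
}

int main (void)
{
    printf ("20 bytes  -> bucket #%d (%d bytes)\n", bucket_for (20), 8 << bucket_for (20));
    printf ("200 bytes -> bucket #%d (%d bytes)\n", bucket_for (200), 8 << bucket_for (200));
    return 0;
}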

The overhead is a bit vague: it is a fixed-size header, a fixed-size range marker and some padding.

One thing to note is that only one user allocation request goes into each bucket. We don’t try to squeeze a small allocation in on the end of a larger one that doesn’t quite fill its bucket. We’ve made an effort to find the smallest bucket that is at least as big as the request (plus overheads) and we leave it at that.

However, you can see that any bucket allocation is going to waste increasingly large amounts of memory as any 2^n request by you is going to go into a 2^(n+1) bucket. Although there are some potential positives, for example, if you realloc(3) and keep within the 2^n (minus overheads) limits, the code need only tweak the allocated size field.
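
A similarly hedged sketch of that realloc(3) short-cut, reusing the same illustrative constants:

#include <stddef.h>
#include <stdio.h>

#define OVERHEAD 16                      /* illustrative overhead, not the real figure */

/* usable capacity of bucket n: (8 << n) bytes minus the overhead */
static size_t bucket_capacity (int bucket)
{
    return ((size_t) 8 << bucket) - OVERHEAD;
}

int main (void)
{
    /* a 200 byte allocation sits in bucket #5 (256 bytes); growing it to
     * 230 bytes still fits, so only the recorded size need change */
    printf ("grow 200 -> 230 in place? %s\n",
            230 <= bucket_capacity (5) ? "yes" : "no");

    /* growing to 250 bytes does not fit (250 + 16 > 256) */
    printf ("grow 200 -> 250 in place? %s\n",
            250 <= bucket_capacity (5) ? "yes" : "no");
    return 0;
}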

The buckets have to live somewhere and so they are tacked onto a chain. We need a chain per bucket size and so there’s a nextf (“next free”?) pointer per bucket size. The obvious implementation, then, is an array of pointers acting as the heads of the chains.
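
In outline, the chains might look like this sketch, where a free block’s first word is reused as the link (the names are hypothetical, not those of src/malloc.c):

#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 30

/* one chain head per bucket size: the "next free" array */
static void *nextf[NBUCKETS];

/* push a free block onto its bucket's chain; the block's first word
 * holds the link to the next free block */
static void chain_push (int bucket, void *block)
{
    *(void **) block = nextf[bucket];
    nextf[bucket] = block;
}

/* pop a free block, or NULL if the chain is empty and morecore is needed */
static void *chain_pop (int bucket)
{
    void *block = nextf[bucket];
    if (block != NULL) {
        nextf[bucket] = *(void **) block;
    }
    return block;
}

int main (void)
{
    void *b = malloc (64);
    chain_push (3, b);                 /* pretend bucket #3 is the 64 byte chain */
    printf ("popped %p, chain now %p\n", chain_pop (3), nextf[3]);
    free (b);
    return 0;
}

Pushing to and popping from the head keeps both operations constant time, which is most of the appeal of the bucket scheme.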

Buckets

The set of buckets is more interesting than it should be. How many should we have? For example, with a 32-bit counter for our bucket size, ovu_size, we’d have, uh, 32 buckets, right?

Of course not. I wouldn’t be mentioning it if it was that easy. There are a couple of things to think about. In the first instance, our overhead is eight bytes, in this case, so there’s no point in having a bucket size less than that.

In fact you might think that as the overhead has to fit into the bucket as well then there’s no point in having an eight byte bucket either as it wouldn’t leave any room for user-data. However, we hit our second issue in that our algorithm for washing about references bucket - 1 which means any actually-in-use bucket chain needs to have a previous bucket to be referenced, if not actually used.

Hence the buckets start with the 0th bucket chain being eight bytes in size – albeit it is not used, it’s just a placeholder. The first usable bucket chain, #1, is for sixteen byte allocations which, given our overhead and (two byte) range marker, means it’s for small, single digit allocations!

That said, I see it gets used about a thousand times during a test run so it’s not to be sniffed at!

At the other end of the scale, given we’re starting at 2^3 (eight bytes in chain #0) and we’re limited to 32-bit allocations, then we can only usefully use 29 more chains with the last, #29, being allocations of 2^32 which…doesn’t work on 32 bit systems. Your calculation of 2^32 might well be 0 anyway.

Hmm, Bash uses a fixed array of binsizes with the last bin size being 2^32 - 1.

I didn’t like that fixed array and so set these values dynamically, which requires a small consideration for bucket #29’s size.

In fact that last bucket size is set as an unsigned long value but used as a long, i.e. a signed long. This means the last bucket’s “size” is -1 and is used as a flag to denote failure.

That itself means the largest allocatable size is 2^31 minus overheads.
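
The effect is loosely as follows. This is a sketch of the idea, not the actual src/malloc.c code, with the signed/unsigned dance spelled out in comments:

#include <stdio.h>

int main (void)
{
    /* hypothetically: the top bucket's size can't be represented so it is
     * left as an all-ones unsigned long value... */
    unsigned long last = (unsigned long) -1;

    /* ...which, read back through a (signed) long, is -1 and can be
     * treated as a "this bucket isn't really usable" failure flag,
     * capping real allocations at the previous bucket's size */
    long size = (long) last;

    if (size < 0) {
        printf ("last bucket flagged unusable (size reads as %ld)\n", size);
    }
    return 0;
}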

Bootstrap

The FreeBSD code requires genuine bootstrap in that the caller has to initialise the pagesizes[] array and pagepool_start/pagepool_end pointers. To what, for the latter two, isn’t especially clear.

The Bash code is a little more dynamic and allocates memory on demand.

We look at the user’s requested allocation and add in the various overheads. We then look for the smallest bucket chain capable of holding this request.

If we’ve no free buckets in the chain then we grab a minimum of one page of memory (using a morecore function – this Idio code uses mmap(2)) and divvy it up into however many 2^n buckets fit.

In morecore we only care enough about the overhead etc. to stamp a “free” flag and a “next” pointer into each bucket such that the chain will now have some free buckets on it.

Obviously, if the requested amount is more than one page then we grab enough pages to cover the requested allocation (plus overheads) rounded up to the 2^n bucket size and a single bucket is added to the chain.
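
A simplified sketch of that morecore shape, using mmap(2) directly, with entirely hypothetical names and the illustrative eight byte first bucket again:

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NBUCKETS 30
static void *nextf[NBUCKETS];

/* grab at least one page and divvy it up into 2^n sized blocks for
 * this bucket's free chain: a sketch, not the real morecore */
static int morecore (int bucket)
{
    size_t bsize = (size_t) 8 << bucket;          /* bucket #0 is 2^3 bytes here */
    size_t pagesize = (size_t) sysconf (_SC_PAGESIZE);

    /* if the bucket itself is bigger than a page, grab the whole
     * bucket's worth instead of a single page */
    size_t amount = (bsize < pagesize) ? pagesize : bsize;

    /* MAP_ANONYMOUS is spelled MAP_ANON on some systems */
    char *base = mmap (NULL, amount, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return -1;
    }

    /* thread every block onto the bucket's free chain: the first word
     * of each free block is the link */
    size_t nblocks = amount / bsize;
    for (size_t i = 0; i < nblocks; i++) {
        char *block = base + i * bsize;
        *(void **) block = nextf[bucket];
        nextf[bucket] = block;
    }
    return 0;
}

int main (void)
{
    if (morecore (3) == 0) {                      /* stock up the 64 byte chain */
        printf ("64 byte chain primed: head %p\n", nextf[3]);
    }
    return 0;
}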

Accounting

The nominal overhead “structure” is actually a union with the overhead structure a member:

src/malloc.c
union idio_malloc_overhead_u {
    uint64_t o_align;        /*                                      8 bytes */
    struct {
        uint8_t ovu_magic;   /* magic number                         1 */
        uint8_t ovu_bucket;  /* bucket #                             1 */
        uint16_t ovu_rmagic; /* range magic number                   2 */
        uint32_t ovu_size;   /* actual block size                    4 */
    } ovu;
};

which is word-aligned (for large words). The interesting bit is the struct ovu:

  • the ovu_magic number records whether the block has been allocated or freed

  • the ovu_bucket records the bucket (duh!) which gives us a quick way back to the chain of buckets for when the bucket is freed

  • the ovu_rmagic is one end of the range marker

  • the ovu_size is the requested allocation (by the user)

    Of interest, here, is that

    1. the whole union is 64-bit aligned

    2. but only allows for 32-bit allocations

    This means the overhead is 8 bytes or one pointer on a 64-bit system. If we allowed for 64-bit allocations then the block’s size (ovu_size) would consume a word on its own and the rest of the overhead would fill out to two words, matching regular malloc(3).
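
That one pointer of overhead is easy to sanity-check at compile time; a quick, standalone check reusing the union definition from above:

#include <assert.h>
#include <stdint.h>

union idio_malloc_overhead_u {
    uint64_t o_align;
    struct {
        uint8_t ovu_magic;
        uint8_t ovu_bucket;
        uint16_t ovu_rmagic;
        uint32_t ovu_size;
    } ovu;
};

/* the whole overhead packs into one 64-bit word: the size of a
 * pointer on a 64-bit system */
static_assert (sizeof (union idio_malloc_overhead_u) == 8,
               "overhead should be exactly one 64-bit word");

int main (void)
{
    return 0;
}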

64-bit

I went back and looked at this 32-bit allocation limit. We can do better, surely?

If we toggle based on the platform’s word length (there must be a better test than PTRDIFF_MAX – or other integer) and parameterise a few things:

src/malloc.c
#if PTRDIFF_MAX == 2147483647L
#define idio_alloc_t                 uint32_t
#define IDIO_PRIa                    PRIu32
#define IDIO_MALLOC_NBUCKETS         30
#define IDIO_MALLOC_FIRST_Po2        3
#else
#define idio_alloc_t                 uint64_t
#define IDIO_PRIa                    PRIu64
#define IDIO_MALLOC_NBUCKETS         62
#define IDIO_MALLOC_FIRST_Po2        4
#endif

where IDIO_MALLOC_FIRST_Po2 (“Power-of-2”) is used to set the size of the first bucket, 2^3 or 2^4, and is used elsewhere in the stats gathering if IDIO_DEBUG is set.
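
For illustration, the bucket sizes then presumably fall out of IDIO_MALLOC_FIRST_Po2 something like this (a sketch, not the actual table-building code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define IDIO_MALLOC_NBUCKETS  62
#define IDIO_MALLOC_FIRST_Po2 4

int main (void)
{
    /* bucket #i would be 2^(IDIO_MALLOC_FIRST_Po2 + i) bytes;
     * print the first few to see the shape */
    for (int i = 0; i < 5; i++) {
        printf ("bucket #%d: %" PRIu64 " bytes\n",
                i, (uint64_t) 1 << (IDIO_MALLOC_FIRST_Po2 + i));
    }
    return 0;
}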

We can then change ovu_size to the normalised idio_alloc_t:

src/malloc.c
union idio_malloc_overhead_u {
    uint64_t o_align;                /*                                      8 bytes */
    struct {
        uint8_t ovu_magic;           /* magic number                         1 */
        uint8_t ovu_bucket;          /* bucket #                             1 */
        uint16_t ovu_rmagic;         /* range magic number                   2 */
        idio_alloc_t ovu_size;       /* actual block size                    4/8 */
    } ovu;
};

and adjust any relevant size-referencing variables and printf(3) statements to use idio_alloc_t/IDIO_PRIa, then we appear to cover our bases.
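
As a trivial, illustrative example of the kind of adjustment meant here (the variable is made up; the macros mirror the 64-bit branch above):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* mirror the 64-bit branch of the toggle above, for illustration */
#define idio_alloc_t uint64_t
#define IDIO_PRIa    PRIu64

int main (void)
{
    idio_alloc_t size = 200;

    /* size-related printf(3) calls use IDIO_PRIa so the format string
     * matches whichever width idio_alloc_t turns out to be */
    printf ("allocated %" IDIO_PRIa " bytes\n", size);
    return 0;
}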

Testing

In the modern age of host machines with a decent amount of RAM and easily built VMs we should have easy enough access to a 64-bit machine with more than 4GB of RAM and a 32-bit machine with 4GB of RAM.

Warning

If you’re keen to play around then I advise you to do this test on machines with those quantities of RAM. The tests should work with just those amounts of virtual memory (rather than RAM) but all you will do is impress yourself that Unix-based operating systems can (eventually) recover from extended periods of thrashing.

The problem is that in creating an Idio array, we will write a default value (#f) into every slot making every (virtual memory) page dirty. If you don’t have enough RAM the OS will swap those dirty pages out in order that it can do exactly the same with a new page.

To test this we’ll create an Idio array. We need to keep ourselves aware of what’s happening under the hood, though, otherwise we’ll have sad faces.

In the first instance, the Idio array is made from a C array of pointers – which are going to be four or eight bytes each, thus bumping up our actual allocation.

On top of that we have the tricksy problem that a 2^n allocation will (have to) go into a 2^(n+1) bucket and it’s the bucket that we need allocated.

On a 64-bit machine pointers are eight bytes, 2^3, so if we are targeting a 16GB allocation, 2^34 bytes, on a machine with 16GB of RAM we want to create an array of 2^31 elements. The 2^31 Idio elements become a 2^34 byte C allocation which will go into a 2^35 bucket.

Or, rather, it won’t because that’s more than the virtual memory available to us.

Other processes are using memory, including us, so we’ll aim a bit lower. Any number which generates an allocation over 4GB is good enough as it demonstrates 64-bit handling and we’re in no position to properly test a full (nearly) 2^64 byte allocation.

We’ve a couple of options here.

  1. We can reduce the number of elements in the array by a bit, keeping us below any next-bucket triggering:

    Idio> make-array ((expt 2 31) - 1000000)
    Killed
    

    Eh?

    Oh, the Linux Out Of Memory killer has, er, “helped” us out.

  2. We can go down a power of two and look to get bumped up into the next bucket…

    Idio> make-array (expt 2 30)
    #[ #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ..[1073741804] #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ]
    

    Cool. Probably. What have we done, exactly? We’ve created an array of a billion elements (the printer, thankfully, omits some in the middle). A billion elements means eight billion bytes and therefore lands us in the sixteen billion byte bucket.

    We should be able to get another 50% in there:

    Idio> make-array ((expt 2 30) + (expt 2 29))
    #[ #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ..[1610612716] #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ]
    

    which looks very much like we’ve created (and immediately thrown away) a 12GB array.

    It looks plausible.

On a 32-bit machine, targeting the maximum 2^32 byte allocation with four byte pointers, we need to play the same games, with an exception for the very trick we’re trying to pull.

On a 32-bit machine we can feasibly allocate all of memory so we have to handle the largest cases carefully.

A 2^30 element array will become a 4GB allocation and is too much. A 2^29 element array is “only” half of RAM although I hit the mmap(2) limit so we should aim for a 2^29 - 10 element array for just under half of RAM:

Idio> make-array ((expt 2 29) - 10)
#[ #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ..[536870882] #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f ]

The libc allocator is different still. On a 32-bit CentOS 6 VM I can make-array 2^29 + 2^27 but not 2^29 + 2^28. Maybe that’s down to some sbrk(2)/mmap(2) feature.

next pointers

The overhead structure is potentially/partially re-purposed, here, to be the “next” in the chain pointer. This creates its own problems as we need to ensure that a bucket in the free list (stamping a pointer in the overhead) doesn’t conflict with a bucket in use with magic numbers and sizes etc..

In the FreeBSD code, there is a next pointer in the overhead union which directly impacts the overhead structure.

To mitigate this it plays a similar trick to Idio itself: a pointer will always have its trailing two bits as 0 so, as long as the corresponding overhead field (ovu_rmagic) does not have trailing 0s there, we can differentiate between the two. The code assumes we have a big-endian machine and we can set the range magic number to be an odd number (and thus not ending in two zero bits).

In the Bash code the CHAIN macro uses a pointer to a char * inside the block. That sizeof (char *) offset means we skip over the ovu_magic, ovu_bucket and ovu_rmagic fields on all platforms.

Given the block is 64-bit aligned that means on a 64-bit platform you’re setting the next pointer in the (nominally) user-allocated section and on a 32-bit platform you’re overwriting some part of the overhead structure – it should be the ovu_size field.

It’s not clear why the (nominally) user-allocated section isn’t used all the time for the next pointer. Maybe it’s some heinous combination of the C macros on obscure systems.

The nearest I can think of is that for bucket #0 (eight bytes) there is no user-data section so we’d be in trouble. But bucket #0 is a placeholder, we don’t actually use it.

I’ve set the Idio code so it always uses the user-data portion…and we’re still here!
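
Roughly, that arrangement amounts to the following sketch, reusing the overhead union from above (free_block_next is a hypothetical helper, not the real macro):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

union idio_malloc_overhead_u {
    uint64_t o_align;
    struct {
        uint8_t ovu_magic;
        uint8_t ovu_bucket;
        uint16_t ovu_rmagic;
        uint32_t ovu_size;
    } ovu;
};

/* the user-data portion starts immediately after the overhead union;
 * a free block's next pointer lives there, leaving the overhead alone
 * on both 32-bit and 64-bit platforms */
static void **free_block_next (union idio_malloc_overhead_u *op)
{
    return (void **) (op + 1);
}

int main (void)
{
    /* pretend this is a 16 byte bucket: 8 bytes of overhead plus 8 of user data */
    union idio_malloc_overhead_u *op = malloc (16);

    *free_block_next (op) = NULL;      /* chain terminator */
    printf ("next pointer lives at %p (block at %p)\n",
            (void *) free_block_next (op), (void *) op);

    free (op);
    return 0;
}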

mmap/sbrk

I’ve only used mmap(2) in this implementation but the Bash code goes a bit further and uses sbrk(2) for allocations under a page size.

I don’t know if that makes this implementation good or bad in relation. Obviously the Bash (née Emacs) code has both some considerable history and wider (operating system) distribution and so has to cover more bases.

All I can say is that…it works for me! I fancy there may be bridges to be crossed.

I also notice that sbrk(2) is deprecated in Mac OS X/Big Sur with the following commentary:

The brk and sbrk functions are historical curiosities left over from earlier days before the advent of virtual memory management.

Implementation

The implementation of malloc, calloc, realloc and free works pretty much as you might expect, given the need to rummage around finding buckets etc..

If you have run a debug build with --vm-reports then the file ./vm-perf will have some allocator stats for you.

This is where I thought there might be some improvements to be had. If you look at the stats you can see that one or two of the chains are far more popular than the others. Perhaps not surprisingly, the smaller sized ones.

Pulling some numbers out:

bucket:    16     32      64      128     256     512    1024   2048   4096*  8192
free:       0     868K    1447K   41K     142K    2487   1592   181    0      0
used:       0     7679    233K    1644K   49K     185    24     11     41     0
peak:       0     876K    1680K   1686K   192K    2668   1614   192    42     6
mmap:       0     6846    26K     52K     12K     334    404    96     99     22
munmap:     0     0       0       0       0       0      0      0      58     22

You can see from the peak numbers that a lot of 32, 64 and 128 byte buckets were allocated. The actual number of those won’t change – we can’t change the user-code – but what we might tweak is the number of mmap calls, currently in the tens of thousands.

So I looked at asking for, say, 1000 pages at a time so that the number of mmaps would drop to the tens.

That makes almost no difference at all. Perhaps that’s to be expected, as you’re only reducing the overall number of system calls in the process run – in the millions, I would presume – by a few thousand.
