celeritascelery 3 days ago

Interesting write-up. I am curious about the usage of this framework. One thing in reference counting’s favor is that it is hard to screw up: you just use the rc type and when it is no longer accessible it magically gets freed. That is how GC works in a high-level language too; you almost never need to think about it.

Let’s look at the example from the docs:

    SWL_GCArena gc = swl_gc_arena_new(<starting-capacity>);
    int* foo = swl_gc_alloc(&gc, 5000, alignof(int), NULL);
    int* bar = swl_gc_alloc(&gc, 16000, alignof(int), NULL);
    int* baz = swl_gc_alloc(&gc, 3000, alignof(int), NULL);
    // foo and baz are roots
    SWL_GC_COLLECT(&gc, SWL_ROOT(foo), SWL_ROOT(baz));
    // bar is gone, poof
If you try to access bar at the end you get a use after free. I am pretty sure if you tried to access foo again it would trigger the same problem, because it is a pointer directly to int and not an indirection. What happens if you pass a non-arena pointer to SWL_ROOT? What happens if you have multiple arenas and you mix and match them?

There are many ways you could trivially shoot yourself in the foot in a large program. It is a lot less “idiot-proof” than simple reference counting. The fact that you create each piece of memory and then have to manually pass it back to the root macro on each collection is very similar to the “malloc/free” dance we want to avoid.

The rooting issue is something I tried to tackle in my GC for Rust [1], essentially making low-level GC "idiot-proof".

[1] https://coredumped.dev/2022/04/11/implementing-a-safe-garbag...

  • sirwhinesalot 3 days ago

    The idea is that this would be used in a language backend, so the compiler would track the roots for you. Note that in C ref counting is also rather error prone in that you can increment or decrement a reference the wrong number of times.

    But I agree this is not the best thing to do manually. The main advantage is that you only need to call collect in one place in your whole app. A lot easier to get right than having calls to free all over the place.

    EDIT: placed your article on my reading list, thanks for sharing it!

  • forrestthewoods 3 days ago

    > One thing in reference counting’s favor is that it is hard to screw up.

    Well, circular references will cause your RC objects to never ever drop. So then you have to start using a bunch of Weak references. And it can very quickly become a confusing and hard to understand mess.

    The cost of a memory leak is significantly lower than the cost of a use after free. But disentangling the graph of refcounts can be super painful.

efitz 3 days ago

Despite the title “tracing garbage collection for arenas”, the article is not about analysis of sanitation practices at sports facilities.

  • xcskier56 3 days ago

    I was definitely convinced that the article was somehow going to pivot to this topic until it started getting into "arenas"

  • sirwhinesalot 3 days ago

    This got a healthy chuckle out of me, well done :)

  • shiroiushi 3 days ago

    Yep, this article's headline was very deceptive IMO.

  • colechristensen 3 days ago

    I'd totally watch a 5-30 minute documentary about waste management in large venues. Bonus points if there were questionable infographics about beer and hotdogs.

pjmlp 2 days ago

> Interestingly Swift’s predecessor Objective-C, older versions of Nim, and early versions of Rust all supported tracing garbage collection! Yet they all abandoned it.

Yes, because mixing a conservative GC with the underlying C semantics in Objective-C was a failure; applications would regularly crash at random due to pointer misuse.

The alternative was to have the compiler take care of Cocoa's retain/release message pattern, like VC++ can also do for COM AddRef/Release, while selling the pivot as a success, in good old Apple fashion.

Then Swift naturally had to use a similar approach; the alternative would have been a complex marshaling layer like the CCW/RCW that .NET uses to talk to COM, which was redone in .NET Core with COM Wrappers source generators.

Nowadays everyone who wasn't around for the party draws the wrong conclusion about why Objective-C moved away from a tracing collector.

  • sirwhinesalot 2 days ago

    The "underlying C semantics" is a very important point there though. Consider C++, adding a GC to it has the same problem. The GC in Rust was also a complete mess, they're kinda all in the same bucket.

    Swift, had it not needed backwards compatibility with Objective-C, could probably have gone with a tracing collector. It's a higher level language than the other ones.

    • pjmlp 2 days ago

      Yes, to the point that only two production-quality GCed C++ variants exist: C++/CLI and Unreal C++.

      C++/CLI is only considered safe as long as pure MSIL is being generated and specific C and C++ features are avoided; if you make use of them, the code is considered unsafe by the verifier and some surprises might happen, just like with Objective-C.

      Whereas Unreal C++ makes use of specific macros to mark the GC aware objects, and possible bad behaviour is covered by the fact that those objects are mainly used from Blueprints, not raw C++ code.

      • sirwhinesalot 2 days ago

        Yeah that makes perfect sense, I don't think you can do better than that.

        • whizzter 2 days ago

          Actually, once C++ has reflection it'll be possible to get rid of the ugly macros, and the "only" issue left is root pointers/shadow stacks. Googling just now to find the boost_pfr library, I noticed a newer one that would fit the bill ( https://github.com/boost-ext/reflect/ ).

          Makes me itch to rewrite a small GC I wrote a couple of years back using that, to see how "painless" it can be made without ugly MINIGC_AUTOMARK(..) macros ( https://github.com/whizzter/minigc )

          • sirwhinesalot 2 days ago

            Reflection will make it trivial to generate tracing code automatically, so roots are the only real issue. If you do like me and require that the roots be explicitly passed to collect, then the programmer can just make the shadow stack by hand.

            Though of course this will still be a much simpler beast than the sort of concurrent GC you see in managed languages. Supporting that sort of GC is a whole other ballgame, needing safepoints inserted into functions and things like that.

            YAGNI imo.

    • frou_dh 2 days ago

      > Swift, had it not needed backwards compatibility with Objective-C, could probably have gone with a tracing collector.

      Programs using tracing GC have a higher memory ceiling than the equivalent using RC, don't they? Because things hang around for a little (or a lot) longer than is strictly needed. Apple's M.O. has been to put as little RAM in the iPhone as they can get away with.

      • pjmlp 2 days ago

        Yes; on the other hand, the optimizations to make RC impact run-time execution as little as possible slowly turn it into a kind of tracing GC algorithm anyway.

        Especially when you add stuff like background cleanup threads to handle the domino effect when a count reaches zero and the object points into a graph structure full of pointers that will themselves reach zero.

        And then there is the whole issue of non-moving memory blocks and the related fragmentation, unless the pointers get turned into handle-like IDs.

      • sirwhinesalot 2 days ago

        That's also a good point, I remember hearing complaints about the Objective-C GC around the time relating to that.

      • kaba0 2 days ago

        Yes, at the price of taking time away from the mutator thread, while a tracing GC can do useful work in parallel. Arguably though, especially for the early mobile devices, RC might have been the better choice.

hayley-patton 3 days ago

> In a custom language, this can be implemented very easily following the approach outlined in this paper: Accurate garbage collection in an uncooperative environment.

Cheney's algorithm and conservative roots don't mix too easily, as you cannot move conservative roots: an integer which looks like a pointer shouldn't be changed by moving. https://dl.acm.org/doi/10.1145/2660193.2660198 has an overview of algorithms which don't move conservative roots (including never moving, as in Boehm-Demers-Weiser).

  • sirwhinesalot 2 days ago

    The paper I linked is not the Boehm one, hence the "Accurate" in the title (they got clever with it, didn't they?). You need to use a shadow stack to store the roots; you can't use conservative stack scanning for the reasons you stated.

    • hayley-patton 2 days ago

      Dang, it's the shadow stack paper, not the BDW one! Sorry, my comment doesn't apply to shadow stacks and thus doesn't apply to the article.

zozbot234 3 days ago

Interesting discussion. If you want to experiment with 'lightweight' traced GC for systems programming, you might like https://github.com/chc4/samsara (for Rust) or https://github.com/pebal/sgcl (for C++) both implementing a concurrent tracing algorithm which is a bit more involved than what's directly discussed in OP.

  • sirwhinesalot 3 days ago

    Hi, author here. I wasn't aware of those; I should probably link to some existing library implementations of GCs.

    My goal was really to make the simplest, dumbest thing possible that still works. It can't compete with more serious GCs like that, but hey, it's intentionally not meant to.

    • titzer 3 days ago

      Virgil uses a simple stop-the-world semispace copying collector. I did that because it's better to start with a moving collector rather than a nonmoving one. I think the hotly contested questions in systems programming are all complete red herrings. It's not about GC versus non-GC, it's about gracefully handling unsafe code, integrating JITed code, interfacing with the lowest-level APIs (e.g. kernel syscalls), implementing the runtime system, etc. Data formats like network packets, file formats, in-memory data structures, etc. That stuff is way more important than whether your goddamn lists of widgets are heap, stack, or region allocated.

      • sirwhinesalot 3 days ago

        Yup, 100% agreed. GC is a tool, we need to talk about specific GCs and their costs as to whether or not they're suitable for a particular task.

        For me the main issue has to do with FFI and needing to drag around a hefty runtime to make a library available to other languages.

        Not an issue with simpler GCs.

        • zozbot234 3 days ago

          FFI from GC'd to non-GC'd languages becomes workable if you can manually add FFI-referenced objects as roots (so that a collection cycle will not free them or anything they reference in turn) and ensure that the GC can call finalizers in order to ensure proper disposal (such as decreasing reference counts or freeing memory) whenever a GC object happens to control the lifecycle of non-GC'd resources.

          • neonsunset 3 days ago

            You pretty much described how FFI works in .NET :)

            When you are passing a pointer across FFI which points to an object's interior, you "pin" such an object with the `fixed` statement, which toggles a bit in the header of that object. Conveniently, that bit is also used by the GC during the mark phase, so your object is effectively pre-marked as live. Objects pinned this way cannot be moved, ensuring the pointer to their interiors remains valid.

            It's not as simple implementation-wise - .NET's GC has multiple strategies to minimize the impact of pinned objects on GC throughput and heap size. Additionally, for objects that are expected to be long-lived and pinned throughout their entire lifetime, there's a separate Pinned Object Heap to further improve the interoperability. In practical terms, this is not used that often because most of the time you pass a struct or a pointer to a struct on the stack that needs no marshalling or pinning. In the rare case the pointer needs to point to a long-lived pinned array, these are allocated with `GC.AllocateArray<T>(length, isPinned)`.

            .NET has another interesting feature that is important to FFI: byrefs, special pointers the GC is aware of that can point to arbitrary memory. You can receive an int*, len from C and wrap it into a Span<int> (which would be ref int + length), and the GC can efficiently disambiguate: if such a pointer does not point into the GC-owned heap it can be safely ignored, while if it points into object memory, that range needs to be marked and the byrefs pointing into it updated if it is moved. That Span<int> can then be consumed by most standard library APIs that offer span overloads next to the array ones (more and more accept just spans, since many types with contiguous memory can be treated as one).

            This works the other way too and there is a separate efficient mechanism for pinning byrefs when passing them as plain pointers to C/C++/Rust/etc.

          • sirwhinesalot 3 days ago

            I meant the other direction actually, non-GC'd calling GC'd. FFI from GC'd to non-GC'd has its issues but is thankfully much better explored.

            With a minimal ref counting "GC" on GC'd side, you just need 2 extra "runtime" functions (incref, decref), which happen to fit like a glove with the scope-based resource management you have in Rust/C++/Swift/C (the latter with GCC extensions).

            With a tracing GC, my hope was to prove that if you make it as "minimal" as your usual ref counting implementation, then it's also just a few functions (one to init a GC heap, one to allocate on that heap, one to collect that heap), which can hardly be considered a "runtime".

            Your typical tracing GC is a large, powerful and complex beast that doesn't play nice with anything besides the language/vm it was designed for.

      • diffxx 3 days ago

        I am confused by this comment. All the enumerated features are dependent directly or indirectly on the memory management approach that the system uses. If one chooses badly for the target workload, there are many things that will go wrong including excessive copying (which kills the performance of the system) and inscrutable crashes/logic errors as well as security vulnerabilities.

        Personally, I tend to think that if you nail the memory management technique all the other items will tend to work themselves out. And I think this is actually harder to get right _systemically_ than all of those other things.

        • zozbot234 3 days ago

          > I am confused by this comment. All the enumerated features are dependent directly or indirectly on the memory management approach that the system uses.

          The key feature of low-level systems programming is that there's no such thing as "the" one memory-management approach. GC with arenas can be used in a "pluggable" way for the limited case of objects referencing one another in a fully general graph (including possible cycles) while preserving other, more lightweight approaches (including refcounting and simple RAII/static ownership tracking) for the bulk of cases where the full generality of GC isn't always required.

        • titzer 3 days ago

          > if you nail the memory management technique all the other items will tend to work themselves out.

          From my experience, implementing a language runtime for a GC'd language in a non-GC'd language will lead you down one of two well-trodden roads full of footguns and booby traps: either conservative on-the-stack scanning or a huge PITA handle system with a bug tail measured in decades. So you want to implement a GC'd language in a GC'd language. But that's probably not what you meant.

          To the larger point, no, absolutely not. There are so many other considerations for systems programming that this whole "GC or not" debate is a waste of time. Read the comment again. System programming deals with constraints imposed by other hardware and software, as well as the need to look at the implementation guts of things, generate new code, etc. Solving the language's memory management problems is only tangentially related, meaning, if you do solve it, then you still have these problems left over to solve.

ctxcode 2 days ago

Interesting article. 16-24 bytes per allocation is pretty big and it sounds like it would be a bit too much overhead; an actual demo would be nice.

My approach with Valk is to allocate per type size and in incrementally sized blocks. These blocks work a lot like tiny bump allocators and can be reset with one line of code without losing the objects in them that are still used.

If the overall memory usage increases by a certain percentage, we trace the stack, clean up the blocks holding long-lived objects, and free blocks that are completely empty.

My benchmark performs 2x faster than Rust and 4x faster than Go. We use 8 bytes per allocation.

  • sirwhinesalot 2 days ago

    Other nice people on this comment thread pointed out that I can bring it down to 8; I had just missed some simple tricks (like overwriting the object data with the forwarding pointer, since that data is garbage anyway).

    So 8 bytes per allocation sounds about right.

    How do you deal with variable size types like arrays?

    • ctxcode 2 days ago

      We don't have static arrays for GC types. Creating multiple objects is always done one by one, and we would put them in a dynamic array. We do support non-GC structs, which allow you to create static arrays in a single allocation. I haven't spent much time on this problem yet. I might figure it out in the future, but it's kinda low-priority for me.

moonchild 3 days ago

greenspun's tenth rule:

> Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

moonchild 3 days ago

i would do a variable-size header: 4 or 8 bytes for most small objects (8 if you want to provide alignment, 4 if not, or if you can squish your pointers). one bit identifies whether the header is small or large; if small, the next few bits give the size of the object in words, and the remainder are a bitmap telling which words are pointers. the forwarding pointer stomps on the first word of the object

  • sirwhinesalot 2 days ago

    That's pretty clever, into the idea pile it goes, many thanks!

jkcxn 3 days ago

How do you move memory and have all the pointers update to the new position? I didn't understand that part.

  • tekknolagi 3 days ago

    The pointers that the programmer controls live on the shadow stack and that's how they get updated. Another term for this is a "GC handle"

  • sirwhinesalot 3 days ago

    Hi, author here.

    It's handled through the actual GC implementation, Cheney's Algorithm to be specific, it's mentioned in the blog post but I didn't write how it actually works there.

    My git repo link at the end has an actual C implementation if you'd like to have a look.

    The TL;DR is that the algorithm starts by going through the roots and copying each of them to a new Arena. Each time it does it updates the root pointer to the new location and writes out a "forwarding pointer" in the old allocation such that if 2 roots go to the same object, they can check if it's already been forwarded, and replace themselves with that pointer without copying.

    Once all the roots are copied, the algorithm iterates through every allocation in the new Arena, digging through their pointers and doing the same copy/update/forward dance into the same Arena, until it reaches a fixpoint (no new copies happen). Then it frees the old Arena.

    It's a really simple algorithm, I strongly suggest having a look at the wikipedia page :)

hencq 3 days ago

In the Zig/Odin example, I would imagine that any pointers from other (non-GC) arenas into the default (GC) arena, would need to be added to the roots for this to work right?

  • sirwhinesalot 2 days ago

    Yes, correct. I would avoid going in that direction if possible, it's best to go from the GC Arena to the other Arenas.

    • ngrilly 2 days ago

      And in a language like Zig, how would you track the roots, and update the pointers if/when memory is moved during GC?

      • sirwhinesalot 2 days ago

        If using Zig directly, then it would have to be in a similar manner as my C implementation, you have to compute the roots by hand.

        Zig has extensive metaprogramming capabilities, so it should be possible to automate quite a bit of it. Particularly generating the "tracing/moving" function using comptime is trivial to do in Zig, in C you have to write it out by hand.

        For a language like Zig (whose compiler you control), it's pretty easy. During typechecking you note which of a function's local variables are GC pointers. During codegen you store those variables in a function-local struct instead of directly as locals, and that struct gets pushed onto a shadow stack as the first instruction the function executes when called.

        The shadow stack is just a thread-local linked list to these call-stack-structs. When the function ends, it pops its struct from the shadow stack (no allocations needed, since everything is on the stack).

        When a function calls collect, it'll traverse the shadow stack as the set of roots. The approach is explained in one of the papers I link to in the article, Accurate Garbage Collection in an Uncooperative Environment.

        If you need to share data across threads, you'd need to use a separate shared heap for that data, which would be a lot more complicated, or just use some other memory management solution for that.

        A very important point is not allowing libraries to call collect directly on the global heap, that defeats the point of leaving it up to the application if and when the GC gets called.

        • ngrilly 2 days ago

          Thanks a lot. This is very clear.

forrestthewoods 3 days ago

> Can we make “the simplest form” of tracing garbage collection work for systems programming?

No. 24 bytes per allocation and having to perform a copy and point fixup is too expensive.

The problem with GC is that it’s fine until it’s not. And when it becomes not fine you need to radically change your entire architecture. There is no incremental fix.

That said, Unreal Engine does in fact use a mark-and-sweep GC for certain types of objects. Which could be something for the author to look into.

  • amelius 3 days ago

    Manual garbage collection can also be expensive, in some cases even more expensive than automatic garbage collection.

  • hayley-patton 3 days ago

    Where do you get 24 bytes of overhead from? A mark-sweep collector only needs a bitmap of 1 bit per alignment granule.

    edit: oh, that's what the article uses - the tracing function and size function can be per-heap and the forwarding pointers can overwrite the start of objects, to just have enough header to support tracing and sizing, which could be one word.

    • sirwhinesalot 2 days ago

      Hi Hayley, could you explain how the tracing function and size info can be packaged into one word?

      Since the GC Arena is meant as a default heap, the size should be allowed to be pretty big (> 32 bits). And doesn't the tracing function need to support a full pointer size? Since the functions are not allocated but rather stored in the executable, I guess they'll be placed in some predictable place in the address space, so I could take advantage of that.

      Is that what you're considering?

      • hayley-patton 2 days ago

        We're going to put just a type ID in the word - you could probably get away with much smaller than a 64-bit ID, but padding will round it up to 64 bits in any case. The heap has exactly one tracing function and size function for all objects, and the functions check the type ID to determine what type they're looking at.

        • sirwhinesalot 2 days ago

          Ah, I get what you mean now. Yeah, that would totally work. Though having a custom tracing function per type quite easily lets the host language (C in this case) also store its own types in the GC Heap if so desired, since all the programmer has to do is call gc_move(...) once per struct field in that function.

          Writing out a typeinfo struct of some sort by hand is not as simple.

          Maybe a good middle ground would be to register the tracing function separately with the GC Heap, which would return an ID for later use with gc_alloc.

          I could also make the size just be the return value of the same function. The question is then what's worse, an extra 8 bytes in the header of every allocation, or the time spent in the function calculating the size. Since the function has to be called anyway, I'm guessing there's no point in storing it.

          That would mean even the ID is not necessary, since an 8-byte header is about as good as you can get due to padding.

    • moonchild 3 days ago

      > alignment granule

      if you have metadata to identify object starts then you can do 1 bit per min object size (which can be bigger than the alignment granule)

  • sirwhinesalot 3 days ago

    Are 16/24 bytes that much worse than 8/16 (with weak refs) for ref counting?

    The copying can be too expensive yes, though hopefully the ability to use non-garbage collected arenas would mitigate that issue (just don't put huge live data in the default, GCed arena)

    • forrestthewoods 3 days ago

      > Are 16/24 bytes that much worse than 8/16 (with weak refs) for ref counting?

      No, but that's not the right question. Most allocations in languages like C/C++/Rust aren't using ref counts. In fact excessive use of std::shared_ptr is a well known footgun in C++!

      > the ability to use non-garbage collected arenas would mitigate that issue

      I agree that not using GC is a way to mitigate GC issues! But if you're only able to performantly use GC for a limited number of allocations you also aren't getting much benefit from GC.

      • sirwhinesalot 3 days ago

        Note that you can allocate as much temp data as you want to the GC heap without incurring any performance issues (in fact allocation will be faster than malloc), because copying collectors only ever care about live objects.

        So the question is, how much of an issue it is to "contort" the program to not allocate large, live objects to the GC heap, vs contorting it to have a tree-like structure for RAII-style memory management.

        Honestly, I have no idea, one of the reasons I really want to make a toy language to try it out and actually measure the runtime costs vs convenience etc.

        Note: For my particular C implementation, which is admittedly cobbled together with spit and duct tape, you can easily add support for individually managed allocations to the GC heap, since it already uses buckets anyway.

        If a bucket contains only 1 allocation, there's no need to copy, you can just keep that bucket somewhere in the linked list. This does require the programmer to explicitly request that kind of allocation though.

        For the "huge amount of small objects" case, GC and RAII/RC are complete opposites. If most of those small objects die, the GC will be a lot faster since it drops them by not copying them. But if most of the objects are live, then the GC will spend a brutal amount of time copying, while RAII/RC will barely do any work.

        Having memory management options is a good thing ;)

        • forrestthewoods 3 days ago

          I don't think it's fair to merely compare your GC Arena to malloc. You've also got to compare a GC arena to a vanilla arena as well.

          It's hard for me to see how all these GC variants make it easier to write a complex program.

          > Having memory management options is a good thing

          Having shipped Unity games I've spent a LOT of time waging a war with GC. GC's have caused me far far more pain than value over the past 20 years.

          I think your project is pretty cool and you should keep pushing. But, respectfully, I'm extremely skeptical and not convinced! :)

          • pjmlp 2 days ago

            Yes, and Unity is famously known for having a bad GC implementation from the Mono stone age, which they never updated due to license disagreements with Xamarin. They never invested in a GC of their own, going with HPC# instead, and are apparently still years away from adopting .NET Core.

            Capcom doesn't seem to suffer the same issues with their .NET-based engine for PlayStation, built on a .NET Core fork.

            Unity has been both a blessing and a curse for the use of C# in the games industry.

            • neonsunset 2 days ago

              While Unity's GC deserves the flak it gets, it was improved eventually by introducing incremental garbage collection mode that divides the work between the frames: https://docs.unity3d.com/2020.3/Documentation/Manual/perform...

              It is also important to acknowledge the work Unity has done with its Burst compiler and their homegrown SIMD/BLAS capabilities: https://docs.unity3d.com/Packages/com.unity.burst@1.4/manual...

              Burst is also capable of auto-vectorization. It is effectively using a subset of C# as a DSL, but it is nonetheless a tool that serves its purpose. It also has various assertions to ensure that performance does not regress, like failing the build if a code change breaks auto-vectorization, something I have not seen in C++ land (please do correct me if there is a comparable tool).

              With the ongoing move to CoreCLR, Unity's own intrinsics and SIMD/Math abstractions will be replaced by in many ways identical ones that are present in .NET's CoreLib. As for Burst, I suppose we'll just have to wait and see.

              I am most interested in what Unity will do in regards to GC implementation - .NET supports custom GCs and it is high time someone implemented one, particularly for consistent latency sensitive scenarios. The existing implementation is good, but games really do love determinism and relatively few developers can do what PPY did to achieve 1000hz game loop in OSU!.

              • pjmlp 2 days ago

                Yes, they have done a ton, but perceptions are also hard to change.

                I am looking forward to their ongoing efforts.

                Regarding custom GCs, although the CLR offers similar extension points to the JVM's, maybe due to the way Microsoft was traditionally seen, very few people ever bothered to play with custom GCs, pluggable debug interfaces, or JIT code rewriting.

                Lots of interesting features there.

          • sirwhinesalot 2 days ago

            Skepticism is good. You should not be convinced until I can provide some actual evidence. Sadly I don't know when I'll find the time.

            I don't think GC is a good fit for games, so there I wouldn't recommend using this or any other existing GC really. Maybe if I find a way to make collect incremental so you can call it only once per frame with a time budget, then it could work.