Work Packets: A New Abstraction for GC Software Engineering, Optimization, and Innovation
This program is tentative and subject to change.
Garbage collection (GC) implementations must meet efficiency and maintainability requirements, which are often perceived to be at odds. Moreover, the desire for efficiency is often at odds with agility, undermining rapid development and innovation, which harms longer-term performance aspirations. Prior GC implementations struggle to: i) maximize efficiency, parallelism, and hardware utilization, while ii) correctly and elegantly implementing complex algorithm-specific optimizations and scheduling constraints. This struggle is reflected in today’s monolithic implementations with simple phase-based synchronization.
This paper presents a new design for GC implementations that emphasizes agility and efficiency. The design simplifies and unifies all GC tasks into work packets which define: i) work items, ii) kernels that process them, and iii) scheduling constraints.
Our simple insights are that GC implementations are high-level algorithms that orchestrate vast numbers of performance-critical work items, and that execution is dominated by a few very small, heavily executed kernels. Work packets comprise groups of like work items, such as the scanning of a thread’s stack or the tracing of a single object in a multi-million object heap. The kernel attached to a packet specifies how to process items within the packet, such as how to scan a stack, or how to trace an object. The scheduling constraints express dependencies, e.g. all mutators must stop before copying any objects. Fully parallel activities, such as scanning roots and performing a transitive closure, proceed with little synchronization. The implementation of a GC algorithm reduces to declaring required work packets, their kernels, and dependencies. The execution model operates transparently of GC algorithms and work packet type. We broaden the scope of work-stealing, applying it to any type of GC work and introduce a novel two-tier work-stealing algorithm to further optimize parallelism at fine granularity.
We show the software engineering benefits of this design via eight collectors that use 23 common work packet implementations in the MMTk GC framework. We use the LXR collector to show that the work packet abstraction supports innovation and high performance, outperforming the latest (OpenJDK 24) instance of the highly-tuned, state-of-the-art G1 garbage collector. We thus demonstrate that work packets achieve high performance, while simplifying GC implementation, making them inherently easier to optimize and verify.