#1429 new planned

control effort for global buffer allocator

Reported by: Ichthyostega Owned by:
Priority: lesser Milestone: 2beta
Component: lumiera Keywords: render performance tuning design
Sub Tickets: #1396, #1426 Parent Tickets: #835, #966, #1315, #1387

Description

As Lumiera Architect,
I want to control the effort for managing buffer memory,
to strike a good balance between avoiding excess allocations and avoiding blocked render workers.

description of the problem

For the BufferProvider implementation used in the actual Render Engine, a set of thread-local storage pools is used. The intention is to calibrate this system in such a way that most of the buffer allocation requests can be serviced from a small pool of allocations already associated with the current worker thread. Yet those local memory managers should not perform actual allocations on their own, since doing so incurs the risk of global contention.

Requests for new allocations, as well as old allocations no longer needed in the local pool, are thus sent over a lock-free queue towards a central service, the Engine Buffer Manager. This setup relieves the worker threads of the effort of maintaining a global pool of buffer allocations, in the hope that they have something else to do meanwhile. Yet it creates a new problem of control and coordination, touching on various tricky questions:

  • a worker first announces the new memory requirements, based on a global pre-computation at the start of a render job
  • but how quickly are these allocations actually required on average? If an allocation does not arrive in time, the LocalBufferStore will issue a synchronous blocking request at the point where the memory is required. Since we cannot afford to keep track of each request individually, there is no way to cancel the asynchronous request at that point, so we end up with a duplicate allocation sitting in the local pool until that overhead is eventually detected at the next clean-up step.
  • furthermore, any excess allocation will be sent back from the local pool to the central EngineBufferMemory, creating yet further effort to place those allocations back into the appropriate pool and possibly even readjust the pool size.
  • all this additional work must be performed anyway, and doing so requires pushing aside the actual render processing of some worker; the difficult question is how this can be achieved
    • we could always perform this internal management work directly from the worker, which in fact implies not using an asynchronous hand-over, but rather a global lock.
    • we could create a background thread to wake up periodically and handle those tasks; such a setup is probably the simplest and clearest solution, but it implies that this management thread will regularly compete with some worker for CPU time.
    • we could schedule dedicated management jobs whenever a worker issues an allocation request; such management jobs would take precedence over the next render job, thereby avoiding the resource contention, since they would be handled at a point where some larger render job has been completed. The downside is that scheduling a job incurs some overhead, which might be larger than the actual time spent on memory management and clean-up.
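
To make the duplicate-allocation scenario from the list above more tangible, here is a minimal sketch of the announce / claim / fallback cycle. It is not the actual Lumiera code: `CentralManagerSketch`, `LocalPoolSketch` and all member names are invented for illustration, and a mutex-guarded `std::deque` stands in for the lock-free queue to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// All names in this sketch are hypothetical and chosen for illustration only.
using Buffer = std::vector<char>;
using BufferPtr = std::unique_ptr<Buffer>;

// Stand-in for the central manager. A mutex-guarded deque replaces the
// lock-free queue mentioned in the ticket, to keep the sketch short.
class CentralManagerSketch {
    mutable std::mutex mtx_;
    std::deque<std::size_t> announced_;  // sizes requested asynchronously
    std::size_t allocations_ = 0;        // total allocations performed

public:
    // Asynchronous path: the worker only records what it will need.
    void announce(std::size_t size) {
        std::lock_guard<std::mutex> guard(mtx_);
        announced_.push_back(size);
    }

    // Management work: turn all pending announcements into allocations.
    std::vector<BufferPtr> servePending() {
        std::lock_guard<std::mutex> guard(mtx_);
        std::vector<BufferPtr> batch;
        for (std::size_t size : announced_) {
            batch.push_back(std::make_unique<Buffer>(size));
            ++allocations_;
        }
        announced_.clear();
        return batch;
    }

    // Synchronous path: the worker blocks until it holds the memory.
    BufferPtr allocateBlocking(std::size_t size) {
        std::lock_guard<std::mutex> guard(mtx_);
        ++allocations_;
        return std::make_unique<Buffer>(size);
    }

    std::size_t allocations() const {
        std::lock_guard<std::mutex> guard(mtx_);
        return allocations_;
    }
};

// Stand-in for a thread-local pool; it never allocates on its own.
class LocalPoolSketch {
    CentralManagerSketch& central_;
    std::vector<BufferPtr> free_;        // buffers ready for immediate use
    std::size_t blockingFallbacks_ = 0;  // claims that had to block

public:
    explicit LocalPoolSketch(CentralManagerSketch& central) : central_(central) {}

    void announce(std::size_t size) { central_.announce(size); }

    // Take over buffers that the central manager has prepared.
    void deliver(std::vector<BufferPtr> batch) {
        for (auto& buf : batch) free_.push_back(std::move(buf));
    }

    // Serve from the local pool if possible, else fall back to blocking.
    BufferPtr claim(std::size_t size) {
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if ((*it)->size() >= size) {
                BufferPtr hit = std::move(*it);
                free_.erase(it);
                return hit;
            }
        }
        ++blockingFallbacks_;
        return central_.allocateBlocking(size);
    }

    std::size_t pooled() const { return free_.size(); }
    std::size_t blockingFallbacks() const { return blockingFallbacks_; }
};
```

If a worker calls `announce(1024)` and then `claim(1024)` before the result of `servePending()` has been delivered, the claim takes the blocking path. The later delivery then leaves one surplus buffer in the local pool, which is exactly the duplicate that a clean-up step would have to detect and send back.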

how to decide

It might not be possible to settle upon a single »right way« of dealing with those problems. So first and foremost, we need a way to find out about the overall effort spent on those tasks, and we need to see how to balance throughput against unused excess allocations. Depending on the results of these observations, we might conclude that the effort for memory management is negligible, or that it incurs such a substantial overhead that it has to be factored into the overall load management, or we might even end up implementing a dynamic control...
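
One conceivable way to observe the effort is a small set of counters that management code and render code both feed. The following is only a sketch under assumptions: `EffortStatsSketch`, `ScopedEffortTimer` and the definition of "share" as management time divided by total accounted time are made up here and do not describe an existing Lumiera facility.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical accounting record; all names are illustrative assumptions.
struct EffortStatsSketch {
    std::atomic<std::int64_t> managementNanos{0};    // time in buffer management
    std::atomic<std::int64_t> renderNanos{0};        // time in actual render work
    std::atomic<std::uint64_t> blockingFallbacks{0}; // synchronous requests issued
    std::atomic<std::uint64_t> excessReturned{0};    // buffers sent back unused

    // Fraction of the accounted time that went into management (0.0 if no data).
    double managementShare() const {
        const auto mgmt = managementNanos.load(std::memory_order_relaxed);
        const auto render = renderNanos.load(std::memory_order_relaxed);
        const auto total = mgmt + render;
        return total > 0 ? static_cast<double>(mgmt) / static_cast<double>(total)
                         : 0.0;
    }
};

// RAII helper: adds the lifetime of the object to the given counter.
class ScopedEffortTimer {
    std::atomic<std::int64_t>& target_;
    std::chrono::steady_clock::time_point start_;

public:
    explicit ScopedEffortTimer(std::atomic<std::int64_t>& target)
        : target_(target), start_(std::chrono::steady_clock::now()) {}

    ScopedEffortTimer(const ScopedEffortTimer&) = delete;
    ScopedEffortTimer& operator=(const ScopedEffortTimer&) = delete;

    ~ScopedEffortTimer() {
        const auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start_);
        target_.fetch_add(elapsed.count(), std::memory_order_relaxed);
    }
};
```

A management function would open a `ScopedEffortTimer` on `managementNanos` at its entry, and a render job would do the same on `renderNanos`. Reading `managementShare()` together with the two event counters then gives a first impression of whether the overhead is negligible or needs to enter the load management.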

Change history (1)

comment:1 by Ichthyostega, at 2026-05-13T00:37:17Z

how to do the Locking?

This is another, closely related aspect.
I can see two opposing approaches:

  • lock as short as possible, only for the actual alloc-work, with a non-recursive lock, to minimise interference between the threads
  • lock to cover each of the two relevant API entrances completely, to avoid frequent Kernel-calls when processing a series of requests

There are good arguments in favour of each of these opposing preferences, and a solid decision must be based on empirical findings. However, not being able to conduct such an investigation at the current stage, I decide based on gut feeling to favour the second approach, since I'm under the impression that our job sizes are comparatively large and that overall this environment exhibits little contention. It is thus preferable to have one single, focused processing step, and to arrange the locking such that its correctness is obvious.
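
The two locking styles can be contrasted in a few lines. This is an illustrative sketch only: `PoolSketch`, `CountingMutex` and the method names are hypothetical, the "allocation" is reduced to recording a size, and the number of lock acquisitions is used as a rough proxy for locking cost (whether an acquisition actually reaches the kernel depends on the platform and on contention).

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical helper: a mutex that counts how often it was acquired.
// It satisfies BasicLockable, so std::lock_guard can be used with it.
class CountingMutex {
    std::mutex mtx_;
    std::size_t acquisitions_ = 0;  // only modified while mtx_ is held

public:
    void lock() {
        mtx_.lock();
        ++acquisitions_;
    }
    void unlock() { mtx_.unlock(); }
    std::size_t acquisitions() const { return acquisitions_; }
};

// Hypothetical stand-in for the central pool; names are assumptions.
class PoolSketch {
    CountingMutex mtx_;
    std::vector<std::size_t> managed_;  // placeholder for real allocations

    // The actual alloc-work; the caller must hold mtx_.
    void allocOne(std::size_t size) { managed_.push_back(size); }

public:
    // First approach: lock as briefly as possible, once per request.
    void handleFineGrained(const std::vector<std::size_t>& requests) {
        for (std::size_t size : requests) {
            std::lock_guard<CountingMutex> guard(mtx_);
            allocOne(size);
        }
    }

    // Second approach: one lock covers the whole API entrance.
    void handleCoarse(const std::vector<std::size_t>& requests) {
        std::lock_guard<CountingMutex> guard(mtx_);
        for (std::size_t size : requests) {
            allocOne(size);
        }
    }

    // Diagnostic accessors, meant to be read while no other thread is active.
    std::size_t lockAcquisitions() const { return mtx_.acquisitions(); }
    std::size_t managed() const { return managed_.size(); }
};
```

For a series of four requests, the fine-grained variant acquires the lock four times and the coarse variant once. The coarse variant also makes it easy to see that every access to the managed state happens under the lock, at the price of holding other threads off for the full duration of the batch.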
