#1427 new planned

performance-impact of thread_local

Reported by: Ichthyostega Owned by:
Priority: nice Milestone:
Component: lumieraVault Keywords: platform performance research technology
Sub Tickets: #1396 Parent Tickets: #111, #966, #1006, #1033

Description

As Lumiera Architect,
I want potential performance issues relating to the use of thread_local to be investigated,
to strike a balance between low-level optimisation and code complexity.

potentially insidious problem

The C++ language offers convenient usage of thread local storage in the form of static variables declared with the storage-class thread_local. When building software for a massively concurrent environment, the utmost care must be taken to avoid any form of contention, since otherwise the potential gains from parallelisation are easily squandered. The best way to cope with this problem is through an architectural approach, and doing so often leads to using some thread local structures.

Unfortunately it seems that the current GNU toolchain relies on techniques that are sloppily implemented and can be drastically inefficient under some quite common conditions.

  • basically, a single indirect dispatch for any access cannot be avoided, since the actual address of the data to use must be resolved through a dispatch table associated with the thread. However, when done properly, this can be achieved with a single assembly instruction.
  • the matter is complicated, however, by C++'s guarantee of deterministic memory management, which translates into the requirement for the runtime system to invoke constructors and destructors at well-defined points, with absolute reliability.
  • furthermore, additional code can be loaded from shared objects, which implies that loading a dynamic library may introduce additional variables that must be constructed and destroyed, and that potentially might be thread-local.
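
To make the distinction in the points above concrete, here is a minimal sketch. It is not taken from the Lumiera code base; the names plainCounter, label, bumpPlain and labelLength are invented for illustration. It only shows which kind of declaration falls into which category; what the compiler actually emits depends on the toolchain, the TLS model and the build flags, and would have to be checked in the generated assembly.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Case 1: constant-initialised and trivially destructible.
// No code has to run when a thread starts or ends, so an access can,
// in principle, be resolved as a plain load relative to the thread pointer.
thread_local int plainCounter = 0;
static_assert(std::is_trivially_destructible_v<int>);

// Case 2: non-trivial constructor and destructor.
// The runtime must construct the object once per thread before first use
// and arrange for its destruction at thread exit; this typically requires
// extra code around the access.
thread_local std::string label{"unnamed"};
static_assert(!std::is_trivially_destructible_v<std::string>);

// Access to the simple variable.
int bumpPlain()
{
    return ++plainCounter;
}

// Access to the variable that needs per-thread construction.
std::size_t labelLength()
{
    return label.size();
}
```

Comparing the assembly generated for bumpPlain() and labelLength(), with and without -fPIC, is one way to see whether the complications described here apply to a given build.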

And while these additional requirements as such impose some problematic trade-offs, seemingly the topic was either not deemed relevant enough by the toolchain developers, or maybe a suitable solution was not pursued since that would require collaboration between the compiler developers and other toolchain- and platform components, which do not share the mindset and priorities related to the C++ language.

Thus it was observed, by inspection of the generated assembly, that GCC and the GNU runtime system insert drastically inefficient code constructs to deal with the aforementioned complications. Notably, the presence of a single thread_local variable with a non-trivial constructor may cause the compiler to switch from a single indirect assembly call to a convoluted sequence that invokes a platform function. And, to make matters worse, the implementation for -fPIC is needlessly convoluted, and is imposed even when the final linking step does not lead to a shared object. Furthermore, the thread_local storage is not managed at a single location, but scattered over several places, which causes multiple dispatches through further platform functions and in the worst case even checks of atomic flags (which kind-of defeats the whole purpose of thread_local).

These problems were noticed by several observers; bugs were even filed, which unfortunately went unaddressed for more than 8 years. See this Blog entry for a detailed analysis of the problems.

assessment

Based on this description of the problem, my rough guess is that a single access to a thread_local variable might easily impose an overhead of 500ns, while, in theory, the unavoidable cost could be in the low two-digit nanosecond range. Compounded over our usage for the render buffer memory management, these numbers could sum up to 250µs of unnecessary overhead per frame job. This is a rough guess, however, and without empirical research it seems impossible to judge whether this kind of overhead is even detectable in our use-case.

If this issue turns out to be a real concern, the detailed knowledge regarding our actual usage patterns could be exploited to build a custom solution for thread-local access that might approach the optimal performance. The reason is that an application development project, unlike a library solution, does not need to protect against all possible arcane corner cases, and the custom solution could be limited to a small number of hand-selected instances, while for all other usages the generic (yet suboptimal) library solution might be good enough.

A hand-made solution would work as follows:

  • use a single thread-local pointer (without constructor!) as access point
  • build a dispatch trampoline by metaprogramming + a static counter
  • fetch through the single thread-local pointer with an offset given by the counter
  • possibly add a custom storage for this dispatch table explicitly, and only to those threads that actually need the feature; this can be implemented as a wrapper to the thread-function
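
The steps listed above could be sketched roughly as follows. This is an illustration under assumptions, not an existing Lumiera facility: the names Cell, slotTable, slotCounter, Slot, local, withSlots, FrameTag, BufferTag and countInThread are all hypothetical, the payload is reduced to a plain integer cell, and the sketch assumes that all slot indices are assigned during static initialisation, before any worker thread starts. Whether the access to slotTable really compiles down to a single instruction depends on the TLS model in use and would need to be verified in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

using Cell = std::uint64_t;              // hypothetical payload stored per slot

// Single access point: a raw pointer is constant-initialised and trivially
// destructible, so no constructor or destructor must be run per thread.
thread_local Cell* slotTable = nullptr;

// Static counter handing out slot offsets (zero-initialised before any
// dynamic initialisation takes place).
std::size_t slotCounter = 0;

// Dispatch by metaprogramming: each distinct TAG type receives one offset.
template<class TAG>
struct Slot
{
    static inline const std::size_t index = slotCounter++;
};

// Fetch through the single thread-local pointer, at the offset for TAG.
template<class TAG>
Cell& local()
{
    assert(slotTable != nullptr);        // thread must be started via withSlots()
    return slotTable[Slot<TAG>::index];
}

// Wrapper for the thread function: provides the table storage explicitly,
// and only for those threads that actually need the feature.
template<class FUN>
auto withSlots(FUN fun)
{
    return [fun = std::move(fun)]
           {
               std::vector<Cell> storage(slotCounter, 0);  // owned by this thread
               slotTable = storage.data();
               fun();
               slotTable = nullptr;      // storage goes out of scope next
           };
}

struct FrameTag  { };                    // hypothetical hand-selected usages
struct BufferTag { };

// Usage: run a worker that counts in one slot and stores a value in another.
Cell countInThread(int n)
{
    Cell result = 0;
    std::thread worker{withSlots([&]
        {
            for (int i = 0; i < n; ++i)
                local<FrameTag>() = local<FrameTag>() + 1;
            local<BufferTag>() = 100;
            result = local<FrameTag>() + local<BufferTag>();
        })};
    worker.join();
    return result;
}
```

The design choice here is that the thread wrapper owns the table, so construction and clean-up happen at well-defined points in ordinary code and nothing has to be registered with the runtime system. A real solution would have to store typed objects instead of integer cells and would need a policy for slots registered after threads are already running.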

Change history (1)

comment:1 by Ichthyostega, at 2026-04-20T20:33:59Z

apparently less dramatic than was feared

As part of the LocalSlice_test I ran some microbenchmarks

  • the type in question does have a constructor — albeit a trivial one
  • the tests are built with -fPIC and linked into a shared library

There is a measurable overhead, yet only by a factor of 2 to 3; access times for a thread-local instance are in the low double-digit nanoseconds, which is acceptable for anything other than a tight inner loop.
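
For reference, a microbenchmark of this kind can be sketched as shown below. This is not the code of LocalSlice_test; the type Slice and the functions touchLocal, touchPlain and nanosPerCall are stand-ins invented for illustration. The sketch assumes that comparing a thread-local instance against an ordinary static instance is a reasonable baseline, and it uses a volatile member only to keep the optimiser from collapsing the loop.

```cpp
#include <chrono>
#include <cstdint>

struct Slice                             // hypothetical stand-in for the tested type
{
    volatile std::uint64_t hits;
    Slice() : hits{0} { }                // a constructor, albeit a trivial one
};

thread_local Slice localSlice;           // subject: thread-local instance
static       Slice plainSlice;           // baseline: ordinary static instance

void touchLocal() { localSlice.hits = localSlice.hits + 1; }
void touchPlain() { plainSlice.hits = plainSlice.hits + 1; }

// Invoke the given access function `rounds` times and
// return the average wall-clock time per call in nanoseconds.
template<class FUN>
double nanosPerCall(FUN&& access, std::uint64_t rounds)
{
    using Clock = std::chrono::steady_clock;
    auto start = Clock::now();
    for (std::uint64_t i = 0; i < rounds; ++i)
        access();
    std::chrono::duration<double, std::nano> elapsed = Clock::now() - start;
    return elapsed.count() / double(rounds);
}
```

Calling nanosPerCall(touchLocal, 1'000'000) and nanosPerCall(touchPlain, 1'000'000) and taking the ratio gives a figure comparable to the factor quoted above; the absolute numbers depend heavily on the machine, the compiler flags and whether the code is linked into a shared library.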


However, these measurements were done in a typical benchmark setup, where a small test subject is invoked a million times. Since memory management and concurrency are problematic topics, I'll leave that test open as a reminder for a later re-investigation, when a full-fledged Render Engine can be observed under real-world conditions.
