perf: move EG() and CG() in ZTS builds into __thread storage #22231
henderkes wants to merge 3 commits into
I meant to write comments explaining a few peculiar decisions, but got carried away in the evening. I will get to it and tag relevant reviewers in the morning.
static void ts_free_resources(tsrm_tls_entry *thread_resources)
{
	bool own_thread = thread_resources->thread_id == tsrm_thread_id();
Here it's possible that a thread id was recycled while the TLS still points to a now obsolete one.
}
- if (!resource_types_table[i].fast_offset) {
+ if (!resource_types_table[i].fast_offset && !resource_types_table[i].tls_addr) {
We can't manually free __thread storage, so entries backed by a tls_addr are skipped here.
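As a rough illustration of why the extra tls_addr check is needed, here is a minimal sketch. All names (sketch_resource_type, sketch_acquire, sketch_release) are hypothetical and not part of TSRM; it only assumes a compiler that supports __thread (GCC or Clang). Heap-backed blocks are freed, while blocks that live in native thread storage are left to the runtime.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical resource descriptor: tls_addr is non-NULL when the
 * per-thread block lives in native __thread storage. */
typedef struct {
	void *(*tls_addr)(void);
	size_t size;
} sketch_resource_type;

/* A native thread-local block; its lifetime is owned by the runtime. */
static __thread int sketch_native_block;

void *sketch_native_addr(void)
{
	return &sketch_native_block;
}

/* Returns the per-thread block: either the native one or a heap block. */
void *sketch_acquire(const sketch_resource_type *type)
{
	return type->tls_addr ? type->tls_addr() : calloc(1, type->size);
}

/* Frees only heap-backed blocks. Returns 1 if memory was freed, 0 if the
 * block is native thread storage and must not be passed to free(). */
int sketch_release(const sketch_resource_type *type, void *block)
{
	if (type->tls_addr) {
		return 0;
	}
	free(block);
	return 1;
}
```

Passing the address of a __thread variable to free() would be undefined behavior, which is what the added condition guards against.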
/* allocates a resource id whose per-thread storage is a native __thread block */
TSRM_API ts_rsrc_id ts_allocate_tls_id(ts_rsrc_id *rsrc_id, void *(*tls_addr)(void), size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor)
{/*{{{*/
	TSRM_ERROR((TSRM_ERROR_LEVEL_CORE, "Obtaining a new TLS resource id, %d bytes", size));
This function is largely copied from the one above. Looking at it now, I see that size_t should be printed as %zu rather than %d.
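For reference, a minimal sketch of the corrected format specifier using plain snprintf. The helper name sketch_format_tls_message is hypothetical, and whether TSRM_ERROR forwards its format string to a printf-style function that understands %zu is an assumption here.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats the debug message with %zu, the C99 conversion for size_t.
 * Using %d with a size_t argument is undefined behavior on platforms
 * where int and size_t differ in width (most 64-bit targets). */
int sketch_format_tls_message(char *buf, size_t buflen, size_t size)
{
	return snprintf(buf, buflen, "Obtaining a new TLS resource id, %zu bytes", size);
}
```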
# define TSRM_TLS_MODEL_ATTR
# define TSRM_TLS_MODEL_DEFAULT
- #elif __PIC__
+ #elif __PIC__ && !defined(__PIE__)
A PIE program can use local-exec if it's the main executable. Only shared libraries (embed, extensions) need to fall back to initial-exec.
This alone would already be a small speedup (one fewer instruction per access).
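A minimal sketch of the model selection described above, assuming GCC or Clang (both provide the tls_model variable attribute and define __PIC__/__PIE__). The macro and variable names are hypothetical and do not match the TSRM macros.

```c
/* Pick the cheapest TLS model that is valid for how this object is built:
 * - PIC without PIE: the code may end up in a shared library, so fall back
 *   to initial-exec.
 * - PIE or non-PIC: the code is part of the main executable, so local-exec
 *   is allowed. */
#if defined(__PIC__) && !defined(__PIE__)
# define SKETCH_TLS_MODEL __attribute__((tls_model("initial-exec")))
#else
# define SKETCH_TLS_MODEL __attribute__((tls_model("local-exec")))
#endif

/* Per-thread counter placed in native thread storage with the chosen model. */
static __thread int sketch_counter SKETCH_TLS_MODEL;

/* Increments and returns the calling thread's counter. */
int sketch_bump(void)
{
	return ++sketch_counter;
}
```

With local-exec the offset from the thread pointer is a link-time constant, while initial-exec loads it from the GOT first, which is where the saved instruction comes from.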
AS_VAR_APPEND([CFLAGS], [" -DZEND_EG_TLS"])
dnl -mtls-size=12 drops the dead high-bits offset add from TLS access,
dnl valid while the thread-local block stays under 4 KiB.
This would produce linker errors if the TLS size exceeded 4 KiB, but I did a test static build with 100 extensions statically compiled in (what a terrible idea) and the TLS size ended up at 3.7 KiB. I can't think of a way to test whether a link would succeed before compiling, so this is unconditional.
@arnaud-lb I also had a different idea on achieving the same thing by pinning resolved EG and CG to a register on gcc. It worked well, but
Perhaps you have a different idea here?
I was chasing what caused the much higher instruction count on aarch64 with clang compared to gcc and, while figuring that out, got sidetracked into finding that ZTS builds also execute a ton of extra instructions compared to NTS builds. It turned out this came mostly from constantly accessing the executor and compiler globals when running phpstan. Each access became pseudo assembly:
Here we're cutting one extra load and one add per access by moving EG and CG onto actual thread storage. The benchmarks run phpstan on the full laravel codebase (the full symfony codebase showed the same, but I added the LE (local-exec) optimization afterwards and don't want to rerun 6 hours of benchmarks) and phoronix-test-suite phpbench.
aarch64:
x86_64:
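To make the two access paths from the description concrete, here is a simplified sketch in C. It is not the TSRM implementation: every name is hypothetical, and the indirect path is reduced to "thread-local pointer to a per-thread block plus a runtime offset", which is only an approximation of what the existing ZTS lookup does.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the executor globals struct. */
typedef struct {
	int exit_status;
	long ticks;
} sketch_executor_globals;

/* Indirect path: a thread-local pointer to a heap block, plus an offset
 * that is only known at runtime. Each access loads the pointer, loads the
 * offset and adds them before touching the field. */
static __thread char *sketch_thread_block;
static size_t sketch_eg_offset;

int sketch_indirect_init(void)
{
	/* Pretend one other resource of the same size sits in front. */
	sketch_eg_offset = sizeof(sketch_executor_globals);
	sketch_thread_block = calloc(1, sketch_eg_offset + sizeof(sketch_executor_globals));
	return sketch_thread_block != NULL;
}

void sketch_indirect_shutdown(void)
{
	free(sketch_thread_block);
	sketch_thread_block = NULL;
}

#define SKETCH_EG_INDIRECT(field) \
	(((sketch_executor_globals *)(sketch_thread_block + sketch_eg_offset))->field)

/* Direct path: the struct itself lives in native thread storage, so the
 * field address is the thread pointer plus an offset the linker resolves. */
static __thread sketch_executor_globals sketch_eg;

#define SKETCH_EG_DIRECT(field) (sketch_eg.field)

long sketch_tick_indirect(void)
{
	return ++SKETCH_EG_INDIRECT(ticks);
}

long sketch_tick_direct(void)
{
	return ++SKETCH_EG_DIRECT(ticks);
}
```

Both functions do the same work; comparing their disassembly (for example with objdump -d) is one way to see the extra instructions on the indirect path, though the exact sequence depends on compiler, target and TLS model.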
LLM disclosure
claude