What hardware are you testing on?
Since you're using GCC, the `std::atomic` seq_cst store will be using a `mov` + slow `mfence`, instead of a somewhat less slow `xchg`-with-mem (which is also a full barrier, like all other x86 atomic RMW operations).
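For comparison, a minimal sketch of the two store flavours this answer contrasts (the `flag` variable and function names are just placeholders; exact codegen depends on compiler version and options):

```cpp
#include <atomic>

std::atomic<bool> flag{false};

void store_seq_cst(bool v) {
    // With this GCC: a plain mov store to flag, followed by mfence.
    // (Some compilers use an xchg-with-mem here instead.)
    flag.store(v, std::memory_order_seq_cst);
}

void store_release(bool v) {
    // On x86 a release store needs no barrier at all: just a plain mov store.
    flag.store(v, std::memory_order_release);
}
```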
Taking a mutex costs an atomic RMW (like `xchg`, not `mov` + `mfence`). And if you're lucky, releasing the mutex can be just a plain store (like `mo_release`). There is zero contention, so acquiring the lock always succeeds.
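To illustrate that cost model, here's a toy spinlock: the lock side is an atomic RMW, the unlock side is a plain release store. (This is only an illustration, not what pthread_mutex actually does; a real mutex handles contention via futex instead of spinning.)

```cpp
#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Atomic RMW (compiles to xchg / lock-prefixed op on x86): a full barrier.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin; a real lock would pause and eventually sleep in the kernel
        }
    }
    void unlock() {
        // Plain store with release ordering: just a mov on x86, no barrier needed.
        locked.store(false, std::memory_order_release);
    }
};
```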
It's certainly plausible that the code behind those mutex lock / unlock library functions is less expensive than `mfence`, especially on Skylake CPUs with updated microcode where `mfence` is a full barrier for out-of-order execution as well as memory. (See the bottom of this answer, and also *Does lock xchg have the same behavior as mfence?*)
Also, note that your mutex loop optimized the local `bool var` into a register and isn't actually updating it in memory inside the loop. (Your code on the Godbolt compiler explorer with gcc4.8.5).
```asm
# the main loop from testMutex
.L80:                                   # do {
        mov     rdi, rsp                # pointer to _mutex on the stack
        call    __gthrw_pthread_mutex_lock(pthread_mutex_t*)
        test    eax, eax
        jne     .L91                    # mutex error handling
        mov     rdi, rsp                # pointer to _mutex again
        call    __gthrw_pthread_mutex_unlock(pthread_mutex_t*)
        sub     rbx, 1
        jne     .L80                    # }while(--counter)
```
An `xor bl, 1` inside the loop would be irrelevant; out-of-order exec could overlap that with other work.
If a reference to `var` escaped the function so the compiler had to keep it in sync in memory before non-inline function calls (including to pthread library functions), we'd expect something like `xor byte ptr [rsp+8], 1`. That would also be pretty cheap, and perhaps mostly hidden by out-of-order exec, although the load/ALU/store could be something that a full barrier would have to wait for when draining the store buffer.
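A hypothetical way to provoke that: pass the address of `var` to an opaque function so the compiler must keep the in-memory copy up to date across the pthread calls. A sketch, with `escape` as a made-up helper defined in another translation unit:

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical helper, defined in another TU, so the compiler must assume
// it remembers the pointer and that any opaque call could read or write *p.
extern "C" void escape(bool* p);

void testMutexEscaped(size_t CYCLES) {
    std::mutex mtx;
    bool var = true;
    escape(&var);                       // the reference escapes the function
    for (size_t counter = 0; counter < CYCLES; counter++) {
        std::lock_guard<std::mutex> g(mtx);
        var = !var;                     // now expect a memory-destination xor in the loop
    }
}
```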
Speeding up your `std::atomic` code:
You're intentionally avoiding doing an atomic RMW, it seems, instead loading into a tmp var and doing a separate store. If you use just release instead of seq_cst, that lets the store compile to a plain store instruction on x86 (or to cheaper barriers on most other ISAs).
```cpp
bool tmp = _value.load(std::memory_order_relaxed);   // or acquire
_value.store(!tmp, std::memory_order_release);
```
This should run at about 6 cycles per inversion, just the latency of one ALU operation plus store-forwarding latency for the store/reload, vs. maybe 33 cycles per iteration for the best-case throughput of `mfence` (https://uops.info/).
Or, since this is a non-atomic modification, just store alternating values without re-reading the old value. You can usually only avoid an atomic RMW in cases where only one producer is writing a value and other threads are reading. So let the producer keep the value it's modifying in a register (a non-atomic local var), and store copies of it.
```cpp
bool var = true;
for (size_t counter = 0; counter < CYCLES; counter++) {
    var = !var;
    _value.store(var, std::memory_order_release);
}
```
Also, don't use leading underscores for your own variable names; such names are reserved for the implementation. (A single `_` followed by a lowercase letter is only reserved at file / global scope, but it's still bad practice.)
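If you want a visual marker for data members, spellings like these stay clear of the reserved-identifier rules (names here are just examples):

```cpp
#include <atomic>

struct Toggler {
    std::atomic<bool> value_{false};   // trailing underscore: never reserved (m_value also works)
    // _value is only reserved at namespace/global scope, but _Value (underscore
    // + uppercase) is reserved everywhere; easiest to avoid the pattern entirely.
};
```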