
Answer by Peter Cordes for Relative performance of std::atomic and std::mutex

What hardware are you testing on?

Since you're using GCC, the std::atomic seq_cst store will be using mov + a slow mfence, instead of a somewhat less slow xchg-with-mem (which is also a full barrier, like all other x86 atomic RMW operations).
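For example (my own sketch with made-up function names, not code from the question): an atomic exchange gets you a sequentially-consistent store without mfence, because xchg with a memory operand is itself a full barrier:

    #include <atomic>

    std::atomic<bool> flag;

    void store_seq_cst(bool v) {
        flag.store(v, std::memory_order_seq_cst);           // GCC: mov + mfence
    }

    void swap_seq_cst(bool v) {
        (void)flag.exchange(v, std::memory_order_seq_cst);  // single xchg: implicitly locked, full barrier
    }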

Taking a mutex costs an atomic RMW (like xchg, not mov + mfence). And if you're lucky, releasing the mutex can be just a plain store (like mo_release). There is zero contention, so acquiring the lock always succeeds.
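To make that concrete, here's a toy spinlock (a minimal sketch of the uncontended fast path; a real pthread mutex is futex-based and more complex): acquiring it takes an atomic RMW, but releasing it is just a release store:

    #include <atomic>

    std::atomic<bool> locked{false};

    void lock() {
        // Atomic RMW: compiles to xchg on x86, which is a full barrier.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin; with zero contention this body never runs
        }
    }

    void unlock() {
        locked.store(false, std::memory_order_release);  // plain mov store on x86
    }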

It's certainly plausible that the code behind those mutex lock / unlock library functions is less expensive than mfence, especially on Skylake CPUs with updated microcode, where mfence is a full barrier for out-of-order execution as well as for memory. (See the bottom of this answer, and also Does lock xchg have the same behavior as mfence?)


Also, note that your mutex loop optimizes the local bool var into a register and doesn't actually update it in memory inside the loop. (Your code on the Godbolt compiler explorer with gcc4.8.5).

    # the main loop from testMutex
    .L80:                                                 # do {
        mov     rdi, rsp                                  # pointer to _mutex on the stack
        call    __gthrw_pthread_mutex_lock(pthread_mutex_t*)
        test    eax, eax
        jne     .L91                                      # mutex error handling
        mov     rdi, rsp                                  # pointer to _mutex again
        call    __gthrw_pthread_mutex_unlock(pthread_mutex_t*)
        sub     rbx, 1
        jne     .L80                                      # }while(--counter)

An xor bl, 1 inside the loop would be irrelevant to the timing; out-of-order exec could overlap it with other work.

If a reference to var escaped the function, so the compiler had to keep it in sync in memory across non-inline function calls (including calls to pthread library functions), we'd expect something like xor byte ptr [rsp+8], 1. That would also be pretty cheap, and perhaps mostly hidden by out-of-order exec, although the load/ALU/store could be something a full barrier would have to wait for when draining the store buffer.
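For example (a sketch using a hypothetical opaque escape() function, not code from the question):

    #include <mutex>
    #include <cstddef>

    void escape(bool*);   // defined in another translation unit; hypothetical

    void testMutexEscaped(std::mutex &m, std::size_t CYCLES) {
        bool var = true;
        escape(&var);          // address escapes, so var has to live in memory
        for (std::size_t counter = 0; counter < CYCLES; counter++) {
            m.lock();
            var = !var;        // must hit memory: xor byte ptr [rsp+8], 1 or similar
            m.unlock();        // non-inline call; the store can't be sunk out of the loop
        }
    }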


Speeding up your std::atomic code:

You're intentionally avoiding an atomic RMW, it seems, instead loading into a tmp var and doing a separate store. If you use only release instead of seq_cst, that lets it compile to just a plain store instruction on x86 (or to cheaper barriers on most other ISAs).

    bool tmp = _value.load(std::memory_order_relaxed);   // or acquire
    _value.store(!tmp, std::memory_order_release);

This should run at about 6 cycles per inversion, just the latency of one ALU operation plus store-forwarding latency for the store/reload, vs. maybe 33 cycles per iteration for the best-case throughput of mfence (https://uops.info/).

Or, since this is a non-atomic modification, just store alternating values without re-reading the old value. You can usually only avoid an atomic RMW in cases where only one producer is writing a value and other threads are reading. So let the producer keep the value it's modifying in a register (a non-atomic local var), and store copies of it.

    bool var = true;
    for(size_t counter = 0; counter < CYCLES; counter++)
    {
        var = !var;
        _value.store(var, std::memory_order_release);
    }

Also, don't use leading underscores for your own variable names; such names are reserved for the implementation. (A single _ followed by a lowercase letter is only reserved at file / global scope, but it's still bad practice.)

