What hardware are you testing on?
Since you're using GCC, the `std::atomic` seq_cst store will be using a `mov` + slow `mfence`, instead of a somewhat less slow `xchg`-with-mem (which is also a full barrier, like all other x86 atomic RMW operations).
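For comparison, a minimal sketch of the two store flavours this answer contrasts (the `flag` variable and function names are just placeholders; exact codegen depends on compiler version and options):

```cpp
#include <atomic>

std::atomic<bool> flag{false};

void store_seq_cst(bool v) {
    // With this GCC: a plain mov store to flag, followed by mfence.
    // (Some compilers use an xchg-with-mem here instead.)
    flag.store(v, std::memory_order_seq_cst);
}

void store_release(bool v) {
    // On x86 a release store needs no barrier at all: just a plain mov store.
    flag.store(v, std::memory_order_release);
}
```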
Taking a mutex costs an atomic RMW (like `xchg`, not `mov` + `mfence`). And if you're lucky, releasing the mutex can be just a plain store (like `mo_release`). There is zero contention, so acquiring the lock always succeeds.
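To illustrate that cost model, here's a toy spinlock: the lock side is an atomic RMW, the unlock side is a plain release store. (This is only an illustration, not what pthread_mutex actually does; a real mutex handles contention via futex instead of spinning.)

```cpp
#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Atomic RMW (compiles to xchg / lock-prefixed op on x86): a full barrier.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin; a real lock would pause and eventually sleep in the kernel
        }
    }
    void unlock() {
        // Plain store with release ordering: just a mov on x86, no barrier needed.
        locked.store(false, std::memory_order_release);
    }
};
```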
It's certainly plausible that the code behind those mutex lock / unlock library functions is less expensive than `mfence`, especially on Skylake CPUs with updated microcode where `mfence` is a full barrier for out-of-order execution as well as memory. (See the bottom of this answer, and also *Does lock xchg have the same behavior as mfence?*)
Also, note that your mutex loop optimized the local `bool var` into a register and isn't actually updating it in memory inside the loop. (Your code on the Godbolt compiler explorer with gcc4.8.5).
```asm
# the main loop from testMutex
.L80:                                   # do {
        mov     rdi, rsp                # pointer to _mutex on the stack
        call    __gthrw_pthread_mutex_lock(pthread_mutex_t*)
        test    eax, eax
        jne     .L91                    # mutex error handling
        mov     rdi, rsp                # pointer to _mutex again
        call    __gthrw_pthread_mutex_unlock(pthread_mutex_t*)
        sub     rbx, 1
        jne     .L80                    # }while(--counter)
```
An `xor bl, 1` inside the loop would be irrelevant; out-of-order exec could overlap that with other work.
If a reference to `var` escaped the function so the compiler had to keep it in sync in memory before non-inline function calls (including to pthread library functions), we'd expect something like `xor byte ptr [rsp+8], 1`. That would also be pretty cheap, and perhaps mostly hidden by out-of-order exec, although the load/ALU/store could be something that a full barrier would have to wait for when draining the store buffer.
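A hypothetical way to provoke that: pass the address of `var` to an opaque function so the compiler must keep the in-memory copy up to date across the pthread calls. A sketch, with `escape` as a made-up helper defined in another translation unit:

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical helper, defined in another TU, so the compiler must assume
// it remembers the pointer and that any opaque call could read or write *p.
extern "C" void escape(bool* p);

void testMutexEscaped(size_t CYCLES) {
    std::mutex mtx;
    bool var = true;
    escape(&var);                       // the reference escapes the function
    for (size_t counter = 0; counter < CYCLES; counter++) {
        std::lock_guard<std::mutex> g(mtx);
        var = !var;                     // now expect a memory-destination xor in the loop
    }
}
```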
Speeding up your `std::atomic` code:
You're intentionally avoiding doing an atomic RMW, it seems, instead loading into a tmp var and doing a separate store. If you use just release instead of seq_cst, that lets the store compile to a plain store instruction on x86 (or to cheaper barriers on most other ISAs).
```cpp
bool tmp = _value.load(std::memory_order_relaxed);   // or acquire
_value.store(!tmp, std::memory_order_release);
```
This should run at about 6 cycles per inversion, just the latency of one ALU operation plus store-forwarding latency for the store/reload, vs. maybe 33 cycles per iteration for the best-case throughput of `mfence` (https://uops.info/).
Or, since this is a non-atomic modification, just store alternating values without re-reading the old value. You can usually only avoid an atomic RMW in cases where only one producer is writing a value and other threads are reading. So let the producer keep the value it's modifying in a register (a non-atomic local var), and store copies of it.
```cpp
bool var = true;
for (size_t counter = 0; counter < CYCLES; counter++) {
    var = !var;
    _value.store(var, std::memory_order_release);
}
```
Also, don't use leading underscores for your own variable names; such names are reserved for the implementation. (A single `_` followed by a lowercase letter is only reserved at file / global scope, but it's still bad practice.)
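If you want a visual marker for data members, spellings like these stay clear of the reserved-identifier rules (names here are just examples):

```cpp
#include <atomic>

struct Toggler {
    std::atomic<bool> value_{false};   // trailing underscore: never reserved (m_value also works)
    // _value is only reserved at namespace/global scope, but _Value (underscore
    // + uppercase) is reserved everywhere; easiest to avoid the pattern entirely.
};
```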