MSVC memory barrier issue.

Nicolas Noble nicolas at grumpycoder.net
Sat Jan 4 20:02:00 CET 2014


Hmmm, it seems to me that there's some kind of misunderstanding. So please
bear with me if I sound a bit condescending, I intend no offence if the
following is just the result of vocabulary mismatch, and the general
multithreading code of libev seems to me that the difference is actually
well understood and used. But in the event there's an actual misusage of
volatiles, I'd rather be sounding condescending than being sorry.

In a nutshell, what I want to say is that to POSIX - and Visual C's default
is that mode - volatile is merely an un-optimization request from the
programmer to the compiler. It in no means affects the internal state of
the CPU. And that was the crux of the problem here: internal CPU caches. I
will re-iter saying this several time over the course of that e-mail,
because it's really important in my opinion. Compiler-barrier and
CPU-barriers, while sounding very similar, aren't being issued the same
way. C's volatiles, or Visual C's _ReadWriteBarrier() are only
compiler-barriers. C11's _Atomics, or Visual C's MemoryBarrier() are
CPU-barriers.

Now I'll go over what's being discussed previously, in details.

The article
http://msdn.microsoft.com/en-us/library/12a04hfd%28v=vs.90%29.aspx says
that volatiles work as memory barriers, which is true, but they work as
compiler-memory barriers, not as CPU-memory barriers. While it asks the
compiler to avoid optimizing around and to not re-order or cache reads and
writes to that variable, it won't affect the CPU's internal caches, and
reading or writing a volatile variable will not issue CPU synchronization
instructions.

In the example, they implement a spin lock semaphore using a volatile
boolean. Thread1 is busy-waiting on the boolean variable to become true,
and Thread2 will eventually set it to true. The trick however is that it's
busy looping, meaning that it'll eventually get around the CPU caches in an
SMP environment. But in that example, the volatile only has two roles:
ensuring that Thread1's generated assembly code doesn't cache the boolean
into a register, which would otherwise end up in an infinite loop in
Thread1, and ensuring that Thread2 will accomplish what it needed to do
before flipping the boolean to true, which would otherwise defeat the
purpose of the critical section. And that's it. However, when CPU2 that
holds Thread2 finishes writing the boolean in memory, the CPU1 that holds
Thread1 may still have the boolean to false in its cache for a little
longer. And that's how this code works: it just busy waits until the CPU
caches are properly propagated. And since the CPU should not be re-ordering
memory access either, the cache in CPU1 will properly be synchronized in
the order CPU2 wrote to it.

What I am saying is that volatiles are never enough to provide SMP
synchronization. That'll be of course enough in a hardware configuration
where you have only one single-core CPU, but if you have a modern,
multi-core CPU, each with an individual cache, extra work is needed to
ensure proper synchronization.

In the case of libev, here is what was happening. First, let us not forget
about the fact ECB_MEMORY_FENCE* were compiler-barriers only when using
Visual C's _ReadWriteBarrier(). Meaning that they wouldn't flush any CPU
cache. So they only ask the compiler not to re-order reads and writes
around, and not to synchronize CPUs.

In Thread1, CPU1, we would start executing ev_run. The ev_run's code would
first set pipe_write_wanted to 1, and then, because of the volatility of
the variables, and because of the memory fence, would start reading various
other variables, including pipe_write_skipped, for the purpose of the "if"
condition there. Again: I insist, because of the nature of the memory fence
used, no extra code is generated to synchronize CPUs.

Then, in Thread2, CPU2, we would start executing evpipe_write. We first set
pipe_write_skipped to 1, and head over a few lines below to read
pipe_write_wanted, in that order, because of the volatility and the
compiler-barrier. Still without CPU synchronization.

The end result for the actual race condition relies on the CPU caches: when
CPU1 and CPU2 both execute their code in parallel, they both have
pipe_write_skipped and pipe_write_wanted in cache, probably because of
previous iterations. When CPU1 writes pipe_write_wanted and CPU2 writes
pipe_write_skipped, they both go on to reading the other variable at the
same time from their respective caches. And they both return 0, resulting
in the race condition, that'd be a hard deadlock, if not for libev's
MAX_BLOCKTIME.

Here's disassembly dumps from Visual Studio, with the previous macro
definition, and with the newer one:

Without MemoryBarrier():

--- d:\igor\balau\libev\ev.c
---------------------------------------------------

inline_speed void
evpipe_write (EV_P_ EV_ATOMIC_T *flag)
{
003BFF20  push        ebp
003BFF21  mov         ebp,esp
003BFF23  sub         esp,10h
003BFF26  mov         eax,dword ptr ds:[0079A000h]
003BFF2B  xor         eax,ebp
003BFF2D  mov         dword ptr [ebp-4],eax
003BFF30  push        ebx
003BFF31  mov         ebx,ecx
  ECB_MEMORY_FENCE; /* push out the write before this function was called,
acquire flag */

  if (expect_true (*flag))
003BFF33  cmp         dword ptr [edx],0
003BFF36  jne         evpipe_write+80h (03BFFA0h)
    return;

  *flag = 1;
003BFF38  mov         dword ptr [edx],1
  ECB_MEMORY_FENCE_RELEASE; /* make sure flag is visible before the wakeup
*/

  pipe_write_skipped = 1;
003BFF3E  mov         dword ptr [ebx+0E4h],1

  ECB_MEMORY_FENCE; /* make sure pipe_write_skipped is visible before we
check pipe_write_wanted */

  if (pipe_write_wanted)
003BFF48  cmp         dword ptr [ebx+0E0h],0
003BFF4F  je          evpipe_write+80h (03BFFA0h)


It's pretty visible here that the ECB_MEMORY_FENCE have emitted absolutely
no instruction whatsoever, and the writes and reads on pipe_write_skipped
and pipe_write_wanted are simple movs and cmps, without any memory bus lock.

With MemoryBarrier():

--- d:\igor\balau\libev\ev.c
---------------------------------------------------

inline_speed void
evpipe_write (EV_P_ EV_ATOMIC_T *flag)
{
0132FF20  push        ebp
0132FF21  mov         ebp,esp
0132FF23  sub         esp,20h
0132FF26  mov         eax,dword ptr ds:[0170A000h]
0132FF2B  xor         eax,ebp
0132FF2D  mov         dword ptr [ebp-4],eax
0132FF30  push        ebx
0132FF31  mov         ebx,ecx
  ECB_MEMORY_FENCE; /* push out the write before this function was called,
acquire flag */
0132FF33  xor         ecx,ecx
0132FF35  lea         eax,[ebp-18h]
0132FF38  lock or     dword ptr [eax],ecx

  if (expect_true (*flag))
0132FF3B  cmp         dword ptr [edx],ecx
0132FF3D  jne         evpipe_write+94h (0132FFB4h)
    return;

  *flag = 1;
0132FF3F  mov         dword ptr [edx],1
  ECB_MEMORY_FENCE_RELEASE; /* make sure flag is visible before the wakeup
*/
0132FF45  lea         eax,[ebp-10h]
0132FF48  lock or     dword ptr [eax],ecx

  pipe_write_skipped = 1;
0132FF4B  mov         dword ptr [ebx+0E4h],1

  ECB_MEMORY_FENCE; /* make sure pipe_write_skipped is visible before we
check pipe_write_wanted */
0132FF55  lea         eax,[ebp-8]
0132FF58  lock or     dword ptr [eax],ecx

  if (pipe_write_wanted)
0132FF5B  cmp         dword ptr [ebx+0E0h],ecx
0132FF61  je          evpipe_write+94h (0132FFB4h)

Here, the code goes a little bit different. Visual C decided to optimize
"zero" by clearing the ecx register, and subsequently uses it to issue
"lock or" on various locations on the stack, to force a CPU cache
synchronization. The write and read to pipe_write_skipped and
pipe_write_wanted are still done using simple movs and cmps, but are
preceded with bus locking operations, making sure the various cores all
have the same caches - since Intel CPUs are pretty scorched-earth-method
when it comes to cache-lines purging, a simple lock anywhere will do.


So, again, I want to emphasis that volatiles aren't atomics. They don't
guarantee CPUs caches synchronization, and merely is a contract between the
programmer and the compiler to not re-order operations as a result of
optimizations. Which is why C11 and C++11 introduced <stdatomic.h> and
<atomic> respectively. And Visual C's _ReadWriteBarrier() is the same as a
volatile compiler hint. Since the 2008 release, it no longer issues CPU
cache flushes, it simply is a broader volatile. Since Visual Studio 2008,
MemoryBarrier() is what should be seen as the equivalent of
gcc's __sync_synchronize().

As I said in the preamble, sorry if I sound condescending, but I rather
wanted to make sure we were on the same page than leaving any doubt around.

  -- Nicolas Noble



On Sat, Jan 4, 2014 at 1:45 AM, Marc Lehmann <schmorp at schmorp.de> wrote:

> On Sat, Jan 04, 2014 at 01:35:44AM -0800, Nicolas Noble <
> nicolas at grumpycoder.net> wrote:
> >  Basically, if you refer to the MSDN, (from this article, more precisely:
> > http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.90).aspx ) the
> > _ReadWriteBarrier() intrinsic is now only a compiler intrinsic that only
>
> This should be fine, as the required variables are marked as volatile, and
> according to this:
>
> http://msdn.microsoft.com/en-us/library/12a04hfd%28v=vs.90%29.aspx
>
> volatile variables, as a compiler extension, already act as memory
> barries (and the article explicitly says using it to implement locks in
> multithreaded apps etc. is ok).
>
> >   For more details, if needed, I can provide a fully-featured
> reproduction
> > case tomorrow - it's 1:30am right now, and I'm only glad I finally found
> > this one - but here is some pseudo-code that exacerbates and demonstrates
> > the problem:
>
> This would then either be a compiler bug (as the required vars are
> volatile), or another bug unrelated to memory barriers (e.g. in libev
> itself, which would be scary, but looks very unlikely at this point).
>
> >   Another workaround I first found, that confirms the problem, is that
> I've
> > changed pipe_write_wanted to always be 1. Same exact code using the old
> > barrier code, just writing 1 in pipe_write_wanted instead of 0, in order
> to
> > force writing into the pipe. That should demonstrate that
> pipe_write_wanted
> > is getting desynchronised between the thread that runs evpipe_write and
> the
> > main thread that runs ev_run.
>
> Which it shouldn't be, as pipe_write_wanted is also marked as volatile,
> and gets acquire/release semantics.
>
> >   Let me know if you have further questions or concerns about this, or if
> > you really want a short reproduction code to demonstrate this - right now
> > my code is a full-blown application that wouldn't be very easy to use as
> a
> > test case.
>
> I don't really mind making libev slower for visual C to work around bugs
> in the compiler, so I'll gladly just apply your patch, whether its needed
> in a future version of the compiler or not :)
>
> Thanks for finding this bug, and providing a workaround.
>
> --
>                 The choice of a       Deliantra, the free code+content
> MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp at schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schmorp.de/pipermail/libev/attachments/20140104/593b47dc/attachment.html>


More information about the libev mailing list