Bug Report: SIGPIPE during call to ev_async_send
Benjamin Mahler
benjamin.mahler at gmail.com
Fri Oct 9 20:55:17 CEST 2015
Hi Marc,
First of all, thanks for the great library, the Mesos project has been
using heavily in production for many years. Recently, we've encountered a
rare issue in which SIGPIPE occurs during a call to ev_async_send.
The environment is CentOS 5 (no eventfd, and epolll backend). The call to
write() against evpipe[1] in evpipe_write triggers a SIGPIPE, indicating
that evpipe[0] is closed.
The bug is rare (on the order of 10 occurrences per day on production
clusters of tens of thousands of servers). We initially suspected that
there was a rogue close in the non-libev code (e.g. double close on an fd)
that was accidentally closing evpipe[0] due to fd reuse.
However, having ruled out bad close calls within our code, I started
digging into libev and noticed what appears to be the bug. Consider the
following situation:
Thread1 (in ev_run): backend_poll -> epoll_poll -> sets postfork = 1
Thread1 (in ev_run): return from backend_poll, but before setting
pipe_write_wanted = 0
Thread2 (in ev_async_send): evpipe_write -> sees pipe_write_wanted = 1 and
enters the if block
Thread1 (in ev_run): comes back to top of the loop and checks postfork,
calls loop_fork and closes evpipe[0] but does not yet call evpipe_init
Thread2 (in ev_async_send): evpipe_write continues and calls write() ->
SIGPIPE
To test this theory, we inserted usleep calls in the three locations to
increase the likelyhood of the race occurring during our test suite. Here
is a diff against ev.c revision 1.477:
$ diff ev.c ev.c.sleeps
2311a2312,2313
> usleep(1000);
>
2930a2933,2934
> usleep(10000);
>
3523a3528,3529
> usleep(10);
>
If easier, here is a gist in which you can find the injected sleeps by
looking for "usleep":
https://gist.github.com/bmahler/0dd5ade8083dbd9f94f7
Once these are injected, our test suite regularly trips the SIGPIPE:
W0100 00:00:00.000000 41361 main.cpp:38] RAW: Received signal SIGPIPE;
escalating to SIGABRT
*** Aborted at 1444415504 (unix time) try "date -d @1444415504" if you are
using GNU date ***
PC: @ 0x7fb365a75b6d raise
*** SIGABRT (@0xa191) received by PID 41361 (TID 0x7fb367984780) from PID
41361; stack trace: ***
@ 0x7fb365a75ca0 (unknown)
@ 0x7fb365a75b6d raise
@ 0x4e51d9 handler()
@ 0x7fb365a75ca0 (unknown)
@ 0x7fb365a74a2b __libc_write
@ 0x40ad3b evpipe_write.part.5
@ 0x77ce9d process::run_in_event_loop<>()
This seems to confirm the issue! With eventfd enabled the write will
silently fail rather than trigger SIGPIPE. This also seems problematic, yes?
Looking forward to hearing your thoughts Marc!
Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schmorp.de/pipermail/libev/attachments/20151009/eff3a2b3/attachment.html>
More information about the libev
mailing list