Feature request: ability to use libeio with multiple event loops

Yaroslav yarosla at gmail.com
Mon Jan 2 10:29:39 CET 2012

Hi everybody,

I've been following this discussion from the very beginning because I'm also
trying to learn. The topic seems very interesting.

What I've learned so far is that there are extra performance costs
associated with threads:
(1) many libc calls (malloc in particular) use locks (mutexes) when
threading is enabled, which makes them much slower
(2) on multiple cpu cores each core's cache (L1/L2) needs to be explicitly
kept coherent, because it is not automatically shared
(3) the MMU cannot be used to store process-specific data (state), which
leads to extra indirection when accessing that data

Point (1) is perfectly clear to me.

About point (2) I have questions:
(2.1) At what moments exactly do these synchronizations occur? On every
assembler instruction, on every write to memory (i.e. on most variable
assignments, all memcpy's, etc.), or only when two threads simultaneously
work on the same memory area (and how narrow is the definition of that
area)? Is there some hardware mechanism that detects that two cores have
the same memory area loaded into their caches?
(2.2) Are there ways to avoid unnecessary synchronizations (apart from
switching to processes)? In real life only a few variables (or buffers)
really need to be shared between threads; I don't want all memory caches
re-synchronized after every assembler instruction.

Point (3) is completely unclear to me. What kind of process data is this
all about? How often does this data need to be accessed?

In addition to the above I have some questions about thread-local storage
(TLS), in particular gcc's __thread storage class as used on Linux on
x86-64:
(4.1) Can TLS be used as a means of _unsharing_ thread memory, so that
there are no synchronization costs between cpu cores?
(4.2) Does TLS impose extra overhead (a performance cost) compared to
regular memory storage? Is it recommended for performance-critical code?

For example, I have a program that has several threads, and each thread has
some data bound to it, e.g. event loop structures. Solution one: pass a
reference to this structure as a parameter to every function call. Solution
two: store the reference in a TLS variable and access it from every
function that needs the loop. Which solution will work faster?

I'd like to thank everyone, and Marc in particular, for this very
interesting discussion.

Yaroslav Stavnichiy

On Sat, Dec 31, 2011 at 3:42 PM, Marc Lehmann <schmorp at schmorp.de> wrote:

> On Thu, Dec 22, 2011 at 02:53:52PM +0100, Hongli Lai <hongli at phusion.nl>
> wrote:
> > I know that, but as you can read from my very first email I was planning
> on
> > running I threads, with I=number of cores, where each thread has 1 event
> > loop. My question now has got nothing to do with the threads vs events
> > debate. Marc is claiming that running I *processes* instead of I threads
> is
> > faster thanks to MMU stuff and I'm asking for clarification.
> I thought I explained this earlier, and I am not sure I can make it any
> clearer.
> Just try, mentally, to imagine what happens to your cache when you access
> a mutex, or mmap/munmap some memory (e.g. as a result of free), in the
> presence of concurrently executing threads.
> Now imagine you have far-away cpus, where it is beneficial to have per-cpu
> memory pools, e.g. in systems with a higher number of cores or good old
> multi-cpu systems.
> Your cache lines bounce around, and memory is slow, or there will be
> IPIs (inter-processor interrupts).
> Maybe the dthreads paper mentioned earlier explains this better, as they
> also have real-world data where unsharing memory and joining it later can
> have substantial performance benefits.
> Maybe it is just too obvious to me: memory isn't shared between cores at
> the hardware level, where memory means cache, and main memory is some
> distant, slow storage device with complex and slow coherency protocols to
> give you the illusion of shared memory.
> It's a bit like using (physical) disk files to exchange data instead of
> using memory. It is going to be slower, and vastly more complex to keep
> synchronised.
> I think the problem is vice versa - whoever claims that threads are as
> fast as processes on *different* cores or cpus has to explain how this
> can be possible - for every design using threads I think I can give a
> faster design using processes, because processes can also share memory
> (but I wished it was easier).
> --
>                The choice of a       Deliantra, the free code+content MORPG
>      -----==-     _GNU_              http://www.deliantra.net
>      ----==-- _       generation
>      ---==---(_)__  __ ____  __      Marc Lehmann
>      --==---/ / _ \/ // /\ \/ /      schmorp at schmorp.de
>      -=====/_/_//_/\_,_/ /_/\_\
> _______________________________________________
> libev mailing list
> libev at lists.schmorp.de
> http://lists.schmorp.de/cgi-bin/mailman/listinfo/libev