Feature request: ability to use libeio with multiple event loops

Colin McCabe cmccabe at alumni.cmu.edu
Mon Jan 2 23:49:19 CET 2012


On Mon, Jan 2, 2012 at 3:20 AM, Yaroslav <yarosla at gmail.com> wrote:
>> As far as I know (correct me if I'm wrong), when you execute a CPU
>> instruction that writes to a memory location that's also cached by
>> another CPU core, the system will execute its cache coherence
>> protocol. This protocol will invalidate the cache lines in other CPU
>> cores for this memory location.
>
>
>> When I say "memory locations" I actually mean cache lines. Memory is
>> cached in blocks of usually 64 bytes. Google "false sharing".
>
>
> If this is right, then processes should have no advantage over threads
> with regard to point (2). Cache invalidations will only occur for the data
> that really needs to be shared, not for all of the program's data. Whether
> you use easy thread memory sharing or not-so-easy inter-process memory
> sharing, in either case there will be cache collisions among the cores
> whenever you write to shared variables.

The problem is that there's no way for the programmer to separate
"the data that really needs to be shared" from the data that shouldn't
be shared between threads: the hardware shares whole cache lines, not
individual variables.  Even in C/C++, all you can do is insert padding
and hope for the best, and you'll never know how much to insert,
because the line size is architecture-specific.
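
For instance, the best you can do looks something like this (just a
sketch; the 64 below is a common line size on current x86, but nothing
in the language guarantees it):

    /* Two counters, each written by a different thread.  Without the
     * padding they would usually land on the same cache line, and
     * every write by one thread would invalidate the line in the
     * other thread's cache. */
    struct counters {
        unsigned long thread_a_hits;
        char pad[64 - sizeof(unsigned long)];
        unsigned long thread_b_hits;
    };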

malloc doesn't allow you to specify which threads will be accessing
the data.  It's quite possible that the memory you get back from
malloc will be on the same cache line as another allocation that a
different thread got back.  malloc implementations that keep per-thread
caches, like tcmalloc, can help a little here.  But even then,
some allocations will be accessed by multiple threads, and they may
experience false sharing with allocations that are intended to be used
by only one thread.
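
One partial workaround, sketched below, is to over-align allocations
that you know will be written by a single thread, for example with
posix_memalign.  It wastes memory, and it still depends on a guessed
line size:

    #include <stdlib.h>

    /* Hand out memory that starts on a (guessed) cache line boundary
     * and occupies a whole number of lines, so it can't share a line
     * with a neighbouring allocation.  The 64 is, again, just a guess. */
    static void *alloc_unshared(size_t size)
    {
        const size_t line = 64;
        size_t rounded = (size + line - 1) / line * line;
        void *p = NULL;

        if (posix_memalign(&p, line, rounded) != 0)
            return NULL;
        return p;
    }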

> If there are no cache invalidations when two threads _read_ from the same
> memory location, then threads even offer the advantage of a smaller memory
> footprint, since read-only data is not duplicated (if I understand it right).

Threads do have an advantage in that you won't be loading the ELF
binary twice, which will save some memory.  However, even when using
multiple processes, shared libraries will only be loaded once.

For most programmers, this discussion is academic because their
platform forces their choices.  Java programmers pretty much have to
use threads, because loading multiple instances of the JVM is very
resource-intensive.  Windows programmers have to deal with the fact
that the overhead of using multiple processes on Windows is
unacceptably high.  On the other hand, some languages, like Python,
offer only limited multi-threading in the first place (the global
interpreter lock allows only one thread to execute Python code at a
time), which leads to multi-process solutions even when performance is
not an issue.

Colin


> On Mon, Jan 2, 2012 at 2:05 PM, Hongli Lai <hongli at phusion.nl> wrote:
>>
>> On Mon, Jan 2, 2012 at 10:29 AM, Yaroslav <yarosla at gmail.com> wrote:
>> > (2.1) At what moments exactly do these synchronizations occur? Is it on
>> > every assembler instruction, or on every write to memory (i.e. on most
>> > variable assignments, all memcpy's, etc.), or does it only happen when
>> > two threads simultaneously work on the same memory area (and how narrow
>> > is the definition of the area)? Perhaps there are some hardware sensors
>> > indicating that two cores have the same memory area loaded into their
>> > caches?
>>
>> As far as I know (correct me if I'm wrong), when you execute a CPU
>> instruction that writes to a memory location that's also cached by
>> another CPU core, the system will execute its cache coherence
>> protocol. This protocol will invalidate the cache lines in other CPU
>> cores for this memory location. MESI is a protocol that's in wide use.
>> See http://www.akkadia.org/drepper/cpumemory.pdf
>>
>> You can see your multi-core system as a network of computers. You can
>> see each CPU cache as a computer's memory, and the main memory as a
>> big NFS server.
>>
>>
>> > (2.2) Are there ways to avoid unnecessary synchronizations (apart from
>> > switching to processes)? Because in real life there are only a few
>> > variables (or buffers) that really need to be shared between threads. I
>> > don't want all memory caches re-synchronized after every assembler
>> > instruction.
>>
>> If other CPU cores' caches do not have this memory location cached,
>> then the CPU does not need to do work to ensure the other caches are
>> coherent. In other words, just make sure that you don't read or write
>> the same memory locations from multiple threads or processes. Or, if
>> you do read the same memory locations from multiple threads, make sure
>> there is at most one writer if you want things to be fast. See
>> "Single Writer Principle":
>>
>> http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html
>>
>> When I say "memory locations" I actually mean cache lines. Memory is
>> cached in blocks of usually 64 bytes. Google "false sharing".
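>>
>> A rough sketch of the single-writer pattern in C (one padded slot per
>> thread, each slot with exactly one writer; the 64 is just the usual
>> x86 line size, not a portable constant):
>>
>>     enum { MAX_THREADS = 8 };
>>
>>     /* slots[i] is written only by thread i; any thread may read it.
>>      * The padding keeps each slot on its own (guessed) cache line,
>>      * so the writers don't invalidate each other's lines. */
>>     struct slot {
>>         unsigned long value;
>>         char pad[64 - sizeof(unsigned long)];
>>     };
>>
>>     static struct slot slots[MAX_THREADS];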
>>
>> That said, locks are still necessary. Locks are usually implemented
>> with atomic instructions, which are already fairly expensive and will
>> only become more expensive as CPU vendors add more cores.
>> The JVM implements what it calls "biased locking", which avoids atomic
>> instructions when there's little contention on a lock:
>>
>> http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html
>> The benchmark in the above post turned out to be wrong; he fixed it
>> here:
>>
>> http://mechanical-sympathy.blogspot.com/2011/11/biased-locking-osr-and-benchmarking-fun.html
>> As you can see, biased locking results in a huge performance boost.
>> Unfortunately, I don't know of any pthread library that implements
>> biased locking.
>>
>>
>> > Point (3) is completely unclear to me. What kind of process data is
>> > this all about? How often does this data need to be accessed?
>>
>> This depends on your application. In
>> http://lists.schmorp.de/pipermail/libev/2011q4/001663.html I presented
>> two applications, one using threads and one using child processes. The
>> one using child processes can store its working data in global
>> variables. These global variables have a constant address, so
>> accessing them only takes one step. In the application that uses
>> threads, each thread has to figure out where its working data is by
>> first dereferencing the 'data' pointer. That's two steps. You can't
>> safely use writable global variables in multithreaded applications
>> without locking them (which makes things slow). In multiprocess
>> software you don't
>> have to lock because processes don't share memory.
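>>
>> In code, the difference is roughly this (only a sketch, with made-up
>> names):
>>
>>     struct conn_table { int active; /* ... */ };
>>
>>     /* Multi-process version: after fork() each child has its own
>>      * copy of the global, so no locking is needed and an access is
>>      * a single load/store at a fixed address. */
>>     static struct conn_table connections;
>>
>>     static void handle_request_process(void)
>>     {
>>         connections.active++;           /* one step */
>>     }
>>
>>     /* Multi-threaded version: each thread reaches its own state
>>      * through a pointer, so every access pays the extra
>>      * dereference. */
>>     static void handle_request_thread(struct conn_table *data)
>>     {
>>         data->active++;                 /* dereference, then access */
>>     }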
>>
>> However, I don't think this really makes that much of a difference in
>> practice. Using global variables is widely considered bad practice
>> anyway. I tend to avoid global variables these days, even when
>> writing non-multithreaded software, because not relying on global
>> variables automatically makes my code reentrant. That makes it easy to
>> reuse the code in multithreaded software later, and makes it easier to
>> unit test and to maintain.
>>
>>
>> > (4.1) Can TLS be used as a means of _unsharing_ thread memory, so there
>> > are no synchronization costs between CPU cores?
>>
>> Yes. Though of course you can still do strange things, such as passing
>> a pointer to a TLS variable to another thread and having the other
>> thread read from or write to it. Just don't do that.
>>
>> You don't necessarily need to use __thread for that. The 'void *'
>> argument that pthread_create passes to your thread's start routine is
>> also (conceptually) thread-local storage. You can store your
>> thread-local data in there.
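>>
>> For example (just a sketch):
>>
>>     #include <pthread.h>
>>     #include <stdlib.h>
>>
>>     struct worker_state {
>>         int id;
>>         /* ... whatever working data this thread needs ... */
>>     };
>>
>>     static void *worker(void *arg)
>>     {
>>         /* Only this thread ever touches 'state', so it is
>>          * thread-local in effect, without using __thread at all. */
>>         struct worker_state *state = arg;
>>         /* ... run this thread's event loop using 'state' ... */
>>         return NULL;
>>     }
>>
>>     static int spawn_worker(int id)
>>     {
>>         struct worker_state *state = calloc(1, sizeof *state);
>>         pthread_t tid;
>>
>>         if (state == NULL)
>>             return -1;
>>         state->id = id;
>>         return pthread_create(&tid, NULL, worker, state);
>>     }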
>>
>>
>> > (4.1) Does TLS impose extra overhead (a performance cost) compared to
>> > regular memory storage? Is it recommended for use in
>> > performance-critical code?
>>
>> It requires an extra indirection. The app has to:
>> 1. Figure out which thread it's currently running on.
>> 2. Look up the location of the requested data based on the current
>>    thread ID.
>> How fast this is and whether there's any locking involved depends on
>> the implementation. I've benchmarked __thread in the past and
>> glibc/NPTL's implementation seems pretty fast. It was fast enough for
>> the things I wanted to do with it, though I didn't benchmark how fast
>> it is compared to a regular pointer access.
>>
>>
>> > For example, I have a program that has several threads, and each thread
>> > has some data bound to that thread, e.g. event loop structures. Solution
>> > number one: pass a reference to this structure as a parameter to every
>> > function call. Solution number two: store the reference in a TLS variable
>> > and access it from every function that needs to reference the loop. Which
>> > solution will work faster?
>>
>> This really depends on the TLS implementation. You should benchmark it.
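>>
>> For what it's worth, the __thread variant would look something like
>> this (__thread is a GCC/glibc extension, C11 spells it _Thread_local;
>> the helper names here are made up):
>>
>>     struct ev_loop;   /* libev's opaque loop type */
>>
>>     /* Every thread gets its own copy of this pointer.  Behind the
>>      * scenes the compiler emits a TLS lookup instead of a plain
>>      * load from a fixed address, which is the extra indirection
>>      * discussed above. */
>>     static __thread struct ev_loop *my_loop;
>>
>>     void set_thread_loop(struct ev_loop *loop) { my_loop = loop; }
>>     struct ev_loop *get_thread_loop(void)      { return my_loop; }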
>>
>> --
>> Phusion | Ruby & Rails deployment, scaling and tuning solutions
>>
>> Web: http://www.phusion.nl/
>> E-mail: info at phusion.nl
>> Chamber of commerce no: 08173483 (The Netherlands)


