Feature request: ability to use libeio with multiple event loops

Hongli Lai hongli at phusion.nl
Mon Jan 2 11:05:57 CET 2012


On Mon, Jan 2, 2012 at 10:29 AM, Yaroslav <yarosla at gmail.com> wrote:
> (2.1) At what moments exactly do these synchronizations occur? Is it on
> every assembler instruction, or on every write to memory (i.e. on most
> variable assignments, all memcpy's, etc.), or does it only happen when two
> threads simultaneously work on the same memory area (and how narrow is the
> definition of that area)? Perhaps there are some hardware sensors indicating
> that two cores have the same memory area loaded into their caches?

As far as I know (correct me if I'm wrong), executing a CPU instruction
that writes to a memory location which is also cached by another CPU
core causes the system to run its cache coherence protocol. This
protocol invalidates that memory location's cache lines in the other
CPU cores. MESI is a protocol that's in wide use.
See http://www.akkadia.org/drepper/cpumemory.pdf

You can see your multi-core system as a network of computers: each CPU
cache is a computer's local memory, and the main memory is a big NFS
server.


> (2.2) Are there ways to avoid unnecessary synchronizations (apart from
> switching to processes)? Because in real life there are only a few variables
> (or buffers) that really need to be shared between threads. I don't want all
> memory caches re-synchronized after every assembler instruction.

If the other CPU cores' caches do not have this memory location cached,
then the CPU does not need to do any work to keep the other caches
coherent. In other words, just make sure that you don't read or write
the same memory locations from multiple threads or processes. Or, if
you do read the same memory locations from several threads, you should
have at most one writer if you want things to be fast. See the
"Single Writer Principle":
http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html
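To illustrate (this is just a sketch I made up, not code from any of the
posts above): with a single writer, a shared counter in C11 needs no lock
and no atomic read-modify-write instruction at all.

#include <stdatomic.h>
#include <stdint.h>

/* Made-up example: one writer thread publishes a counter, any number of
 * reader threads poll it. Because there is only one writer, a plain
 * load + store pair is enough. */
static _Atomic uint64_t events_processed;

/* Called from the single writer thread only. */
void writer_record_event(void)
{
    uint64_t n = atomic_load_explicit(&events_processed,
                                      memory_order_relaxed);
    atomic_store_explicit(&events_processed, n + 1, memory_order_release);
}

/* May be called from any reader thread. */
uint64_t reader_snapshot(void)
{
    return atomic_load_explicit(&events_processed, memory_order_acquire);
}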

When I say "memory locations" I actually mean cache lines. Memory is
cached in blocks of usually 64 bytes. Google "false sharing".
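For example (a made-up sketch; the 64-byte line size and the _Alignas(64)
are assumptions about your platform):

/* Two counters, each written by a different thread. Packed together they
 * share one cache line, so every increment invalidates the other core's
 * copy even though the threads never touch each other's counter. */
struct counters_shared_line {
    unsigned long a;            /* written by thread 1 */
    unsigned long b;            /* written by thread 2, same cache line */
};

/* Aligning each counter to its own (assumed) 64-byte cache line avoids
 * the false sharing. */
struct counters_padded {
    _Alignas(64) unsigned long a;
    _Alignas(64) unsigned long b;
};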

That said, locks are still necessary. Locks are usually implemented
with atomic instructions, which are already fairly expensive and
will only become more expensive as CPU vendors add more cores.
The JVM implements what they call "biased locking" which avoids atomic
instructions if there's little contention on locks:
http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html
The benchmark in the above post turned out to be flawed; he fixed the
benchmark here:
http://mechanical-sympathy.blogspot.com/2011/11/biased-locking-osr-and-benchmarking-fun.html
As you can see, biased locking results in a huge performance boost.
Unfortunately I don't know any pthread library that implements biased
locking.
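Just to make "locks are built on atomic instructions" concrete, here is a
toy spinlock in C11 (a sketch only; use pthread mutexes in real code):

#include <stdatomic.h>

/* Initialize with: spinlock_t lock = { ATOMIC_FLAG_INIT }; */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* Atomic test-and-set: this is the expensive instruction, and under
     * contention it also bounces the cache line between cores. */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;                       /* busy-wait */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}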


> Point (3) is completely unclear to me. What kind of process data is this all
> about? How often does this data need to be accessed?

This depends on your application. In
http://lists.schmorp.de/pipermail/libev/2011q4/001663.html I presented
two applications, one using threads and one using child processes. The
one using child processes can store its working data in global
variables. These global variables have a constant address, so
accessing them takes only one step. In the application that uses
threads, each thread has to figure out where its working data is by
first dereferencing the 'data' pointer. That's two steps. You can't
use global variables in multithreaded applications without locking
them (which makes things slow). In multiprocess software you don't
have to lock because processes don't share memory.
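Roughly, the difference looks like this (made-up names, just to show the
number of steps involved):

#include <stddef.h>

struct conn_table { int fds[1024]; size_t used; };  /* made-up working data */

/* Multi-process model: after fork() each child has its own copy of this
 * global. Its address is a link-time constant, so access is one step,
 * and no locking is needed because the memory isn't shared. */
static struct conn_table global_conns;

void process_worker(void)
{
    global_conns.used = 0;               /* one step: direct access */
}

/* Multi-thread model: each thread reaches its working data through the
 * 'data' pointer it was started with, which costs an extra dereference. */
struct worker_state { struct conn_table conns; };

void *thread_worker(void *data)
{
    struct worker_state *state = data;   /* step 1: load the pointer */
    state->conns.used = 0;               /* step 2: dereference it */
    return NULL;
}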

However, I don't think this makes much of a difference in practice.
Many people consider global variables bad practice anyway. I tend to
avoid them these days, even when writing single-threaded software,
because not relying on global variables automatically makes my code
reentrant. That makes it easy to reuse the code in multithreaded
software later, and it makes the code easier to unit-test and to
maintain.


> (4.1) Can TLS be used as a means of _unsharing_ thread memory, so that there
> are no synchronization costs between CPU cores?

Yes. Though of course you can still do strange things, such as passing
a pointer to a TLS variable to another thread and having that thread
read or write through it. Just don't do that.

You don't necessarily need to use __thread for that. The 'void *data'
argument in pthread_create's callback is also (conceptually)
thread-local storage. You can store your thread-local data in there.
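For example (a sketch with made-up names), you can copy the pointer from
the 'data' argument into a __thread variable once, and every function in
that thread can then reach it without extra parameters:

#include <pthread.h>
#include <stdlib.h>

struct thread_ctx { long requests; /* e.g. an event loop, buffers, ... */ };

/* One slot per thread; each thread only ever sees its own copy. */
static __thread struct thread_ctx *current_ctx;

static void do_work(void)
{
    current_ctx->requests++;   /* no parameter needed within this thread */
}

static void *thread_main(void *data)
{
    current_ctx = data;        /* the 'data' argument acts as the TLS seed */
    do_work();
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct thread_ctx *ctx = calloc(1, sizeof *ctx);

    pthread_create(&t, NULL, thread_main, ctx);
    pthread_join(t, NULL);
    free(ctx);
    return 0;
}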


> (4.2) Does TLS impose extra overhead (performance cost) compared to regular
> memory storage? Is it recommended for use in performance-critical code?

It requires an extra indirection. The app has to:
1. Figure out which thread it's currently running on.
2. Look up the location of the requested data based on the current thread ID.
How fast this is, and whether any locking is involved, depends on the
implementation. I've benchmarked __thread in the past and glibc/NPTL's
implementation seems pretty fast. It was fast enough for the things I
wanted to do with it, though I didn't benchmark how it compares to a
regular pointer access.


> For example, I have a program that has several threads, and each thread has
> some data bound to that thread, e.g. event loop structures. Solution number
> one: pass a reference to this structure as a parameter to every function
> call; solution number two: store such a reference in a TLS variable and
> access it from every function that needs to reference the loop. Which
> solution will work faster?

This really depends on the TLS implementation. You should benchmark it.
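A skeleton like the following could serve as a starting point (entirely
made up; loop counts are arbitrary, and a compiler may optimize such tight
loops away, so treat the numbers with care):

#include <stdio.h>
#include <time.h>

struct loop { long counter; };

static __thread struct loop *tls_loop;                    /* solution two */

static void work_param(struct loop *l) { l->counter++; }  /* solution one */
static void work_tls(void)             { tls_loop->counter++; }

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    struct loop l = { 0 };
    long i;
    double t0, t1, t2;

    tls_loop = &l;

    t0 = now();
    for (i = 0; i < 100000000; i++) work_param(&l);
    t1 = now();
    for (i = 0; i < 100000000; i++) work_tls();
    t2 = now();

    printf("parameter: %.3fs  __thread: %.3fs  (counter=%ld)\n",
           t1 - t0, t2 - t1, l.counter);
    return 0;
}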

-- 
Phusion | Ruby & Rails deployment, scaling and tuning solutions

Web: http://www.phusion.nl/
E-mail: info at phusion.nl
Chamber of commerce no: 08173483 (The Netherlands)


