Optimal multithread model

James Mansion james at mansionfamily.plus.com
Tue Mar 16 07:15:41 CET 2010

Brandon Black wrote:
> However, the thread model as typically used scales poorly across
> multiple CPUs as compared to distinct processes, especially as one
> scales up from simple SMP to the ccNUMA style we're seeing with
This is just not true.
> large-core-count Opteron and Xeon -based machines these days.  This is
> mostly because of memory access and data caching issues, not because
> of context switching.  The threads thrash on caching memory that
> they're both writing to (and/or content on locks, it's related), and
> some of the threads are running on a different NUMA node than where
> the data is (in some cases this is very pathological, especially if
> you haven't had each thread allocate its own memory with a smart
> malloc).
This affects processes too.  The key aspect is how much data is actively 
shared between
the different threads *and is changing*. The heap is a key point of 
contention but thread
caching heap managers are a big help - and the process model only works 
when you don't
need to share the data.

Large core count Opteron and Xeon chips are *reducing* the NUMA affects with
modest numbers of threads because the number of physical sockets is 
going down
and the integration on a socket is higher, and when you communicate 
between your
coprocesses you'll have the same data copying - its not free.

Sure - scaling with threads on a 48 core system is a challenge. Scaling 
on a glueless
8 core system (or on a share of a host with more cores) is more relevant 
to what most
of us do though.

Clock speed isn't going anywhere at the moment, but core count is - and 
so is the
available LAN performance. I think
> In the case that you can either (a) just use threads instead of event
> loops, but you still want one process per core and several threads
No, I think one process per core is not necessarily a smart way to 
partition things - one
per NUMA area, sure. Once you've gone to one per core you can no longer 
compute across multiple cores (even to do some crypto while going back 
to IO, which
is a loss.

> You can make threads scale up well anyways by simply designing your
> multi-threaded software to not contend on pthread mutexes and not
> having multiple threads writing to the same shared blocks of memory,
> but then you're effectively describing the behavior of processes, and
> you've implemented a multi-process model by using threads but not
No. I start with 'its shared' and design for parallel execution without 
contention, and I can
pass arbitrary data structures around and freely reference complex 
arbitrary structures that
are static. That's a long way different from sharing only a small flat 
memory resource and
some pipe IPC. Its more convenient and can be a lot faster.
> using most of the defining features of threads.  You may as well save
> yourself some sanity and use processes at that point, and have any
No, I do it to stay sane. I let the CPUs handle the infrequent on-demand 
of the shared data structures instead of having to marshall and stream 
them myself,
and the real memory usage resulting is much smaller, and I have much 
easier rendezvous
semantics where I have delegated work to compute tasks.
> shared read-only data either in memory pre-fork (copy-on-write, and
> write never happens to these blocks), or via mmap(MAP_SHARED), or some
> other data-sharing mechanism.  So If you've got software that's
> scaling well by adding threads as you add CPU cores, you've probably
> got software that could have just as efficiently been written as
> processes instead of threads, and been less error-prone to boot.
Subprocesses have their place and they do decouple things well. I'm not 
going to
dispute that a bunch of completely independant threads in process 
containers with
no shared state at all can run with zero contention (there's no shared 
state after all)
and be faster.  But faster at what? We're talking about scaling a single 
'thing' and that implies that the boss process is handing out work in a 
way that
streams all the requests and results *and* all the shared information 
needed by
the broken out tasks. Sometimes you can do that, but often its rather 
hard and the
complexity hurts development and the copying hurts runtime.

You are trying to deny-away that threading scales remarkably well if 
done properly.
Perhaps you think that no Java or .Net processes scale well on 4- or 8-core
systems? I'm not going to argue that 'Java is as fast as C' but its 
close at some things and performance of Netty-based systems can be good, and
Jetty is pretty handy.

Let me ask you - how do you think memcached should have scaled past their
original single-thread performance? libevent isn't massively slow with 
numbers of connections, and theer's a lot of shared state. And that's 
even before
you consider running a secure or authenticated connection.


More information about the libev mailing list