Optimal multithread model

Brandon Black blblack at gmail.com
Tue Mar 16 14:47:49 CET 2010

On Tue, Mar 16, 2010 at 12:15 AM, James Mansion
<james at mansionfamily.plus.com> wrote:
> Brandon Black wrote:
>> However, the thread model as typically used scales poorly across
>> multiple CPUs as compared to distinct processes, especially as one
>> scales up from simple SMP to the ccNUMA style we're seeing with
> This is just not true.

I think it is, "as typically used".  Again, you can write threaded
software that doesn't use mutexes and doesn't write to the same memory
from two threads, but then you're not really using much of threads.
One of my major projects right now is structured that way actually, so
I certainly see your argument and use it :).  However, using threads
this way isn't much different from using processes.  The advantage to
using threads over processes in that case is that you don't have to go
around making SysV shm or mmap(MAP_SHARED) areas for the data you
intend to share, and certain other aspects of the OS interface (like
how you cleanly start and stop the groups of processes, how you handle
signals, etc) are simplified.

With either model you can still take advantage of pthread mutexes and
condition variables and so on (see pthread_mutexattr_setpshared(),
etc) once you've established your explicitly-shared memory regions, if
those mechanisms make the most sense for you.  That's really the crux:
explicit sharing where necessary, versus implicitly sharing everything
and relying on programmer discipline to avoid Bad Things.
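
A minimal sketch of that combination, assuming Linux (anonymous
MAP_SHARED mappings and process-shared mutexes are available there).
The names struct shared_state and run_shared_counter() are made up for
illustration; only the mmap() and pthread calls are real APIs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout: everything the processes share lives here. */
struct shared_state {
    pthread_mutex_t lock;
    long counter;
};

/* Fork nprocs children that each add `iters` to a shared counter.
 * Returns the final counter value, or -1 on error. */
long run_shared_counter(int nprocs, long iters)
{
    assert(nprocs > 0 && iters >= 0);

    /* The explicitly shared region: it stays shared across fork(). */
    struct shared_state *st = mmap(NULL, sizeof *st, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (st == MAP_FAILED)
        return -1;

    /* Mark the mutex as usable from more than one process. */
    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(st, sizeof *st);
        return -1;
    }
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutex_init(&st->lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        munmap(st, sizeof *st);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);
    st->counter = 0;

    int started = 0;
    for (int i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            /* Child: same lock/unlock pattern a thread would use. */
            for (long j = 0; j < iters; j++) {
                pthread_mutex_lock(&st->lock);
                st->counter++;
                pthread_mutex_unlock(&st->lock);
            }
            _exit(0);
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        wait(NULL);

    long result = (started == nprocs) ? st->counter : -1;
    pthread_mutex_destroy(&st->lock);
    munmap(st, sizeof *st);
    return result;
}
```

Calling run_shared_counter(4, 10000) should return 40000 if every
child started; build with -pthread.  Nothing outside struct
shared_state is visible to more than one process.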

>> large-core-count Opteron and Xeon -based machines these days.  This is
>> mostly because of memory access and data caching issues, not because
>> of context switching.  The threads thrash on caching memory that
>> they're both writing to (and/or contend on locks, it's related), and
>> some of the threads are running on a different NUMA node than where
>> the data is (in some cases this is very pathological, especially if
>> you haven't had each thread allocate its own memory with a smart
>> malloc).
> This affects processes too.  The key aspect is how much data is actively
> shared between the different threads *and is changing*. The heap is a key
> point of contention but thread caching heap managers are a big help - and
> the process model only works when you don't need to share the data.

We agree on this point up until your last statement.  The process
model shares data just as well as the thread model, you just have to
explicitly state what is shared.
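
To make "explicitly state what is shared" concrete, here is a small
sketch, again assuming Linux for MAP_ANONYMOUS.  The function name
fork_visibility() is hypothetical.  A child writes to an ordinary heap
allocation and to a MAP_SHARED mapping; the parent then reports which
write it can see.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A forked child writes 42 into a heap int and into a MAP_SHARED int.
 * On return, *heap_seen and *shared_seen hold what the parent observes.
 * Returns 0 on success, -1 on error. */
int fork_visibility(int *heap_seen, int *shared_seen)
{
    int *heap_val = malloc(sizeof *heap_val);
    if (heap_val == NULL)
        return -1;

    int *shared_val = mmap(NULL, sizeof *shared_val, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared_val == MAP_FAILED) {
        free(heap_val);
        return -1;
    }
    *heap_val = 1;
    *shared_val = 1;

    int rc = -1;
    pid_t pid = fork();
    if (pid == 0) {
        *heap_val = 42;    /* private copy-on-write page: parent never sees it */
        *shared_val = 42;  /* explicitly shared page: parent sees it */
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid) {
        *heap_seen = *heap_val;
        *shared_seen = *shared_val;
        rc = 0;
    }

    free(heap_val);
    munmap(shared_val, sizeof *shared_val);
    return rc;
}
```

After a successful call the heap value is still 1 in the parent while
the shared value is 42.  With threads, both writes would have been
visible.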

> Large core count Opteron and Xeon chips are *reducing* the NUMA effects with
> modest numbers of threads because the number of physical sockets is going
> down and the integration on a socket is higher, and when you communicate
> between your coprocesses you'll have the same data copying - it's not free.
> Sure - scaling with threads on a 48 core system is a challenge. Scaling on a
> glueless 8 core system (or on a share of a host with more cores) is more
> relevant to what most of us do though.
> Clock speed isn't going anywhere at the moment, but core count is - and so
> is the available LAN performance. I think

I see this very differently.  We agree that clock speeds are going
nowhere, leading to higher core counts per die.  Where we're at now is
that a cheap-ish 1U server may well have 2 NUMA nodes with 2-4 cores
each, and both the core count per NUMA node and the total number of
NUMA nodes are likely to increase going forward.  NUMA is going to
become a single-server scaling problem, even for small servers.  It's
the only way a small server that costs a fixed $X is going to continue
to scale up in performance over the years now.  Either they're going
to both increase core count slightly and increase the count of
(smaller) dies in a small system, having each die be a NUMA node, or
stuff 16+ cores in a die and try to stick with just 2 dies in the
system, in which case they put some NUMA hierarchy inside each die.
But either way, there's going to be more NUMA involved as we add more
cores to a single system.  We went down the 16-way (or more)
uniform-memory-access SMP road years ago; it simply doesn't scale.

[... skipped some of the rest, relatively minor points and this debate
is getting long ...]
> are static. That's a long way different from sharing only a small flat
> memory resource and some pipe IPC. It's more convenient and can be a lot
> faster.

Processes *can* do everything threads do.  Even the exact same forms
of memory sharing and IPC, with the same efficiencies.  The pthread
API itself is usable across process boundaries, or you can do other
equivalent things.

[ ...]
> Let me ask you - how do you think memcached should have scaled past their
> original single-thread performance? libevent isn't massively slow with
> modest numbers of connections, and there's a lot of shared state. And
> that's even before you consider running a secure or authenticated
> connection.

There are two levels of performance problem we're talking about in
general.  One is scaling up on a single core, which means avoiding
blockage.  The other is scaling a good single-CPU solution over
multiple cores.  Threads and event libraries are two ways of
accomplishing the first, and threads and processes are two ways of
accomplishing the second.  Therefore on a multi-CPU system, your
aggregate choices are:

1) process-per-core with an eventloop per process: arguably the most efficient
2) process-per-core with threads within each process: slightly less efficient
3) thread-per-core with an eventloop per thread: can be nearly as
efficient as (1) if very carefully designed such that the threads
greatly resemble the behavior of processes, but is more prone to
difficult-to-debug bugs and race conditions.
4) All threads.  At this point thread count is pretty much orthogonal
to core count as you're using them to solve both problems.
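
Option (1) can be sketched roughly as below.  This is an illustration
under assumptions, not a real server: a plain poll() loop stands in for
an event library such as libev, each worker's "connection" is just a
pipe from the parent, and CPU pinning is left out.  The names
worker_loop(), run_workers() and MAX_WORKERS are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 64

/* Worker event loop: block in poll() until fd is readable, then consume
 * whatever arrived.  Returns bytes handled once the writer closes. */
static long worker_loop(int fd)
{
    long handled = 0;
    char buf[256];
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0)             /* EOF or error: leave the loop */
            break;
        handled += n;
    }
    return handled;
}

/* Fork nworkers processes, send each one `msgs` one-byte requests, and
 * return how many workers handled all of theirs (-1 on setup error). */
int run_workers(int nworkers, int msgs)
{
    int wfd[MAX_WORKERS];
    pid_t pids[MAX_WORKERS];
    int failed = 0;

    if (nworkers < 1 || nworkers > MAX_WORKERS)
        return -1;

    for (int i = 0; i < nworkers; i++) {
        int p[2];
        if (pipe(p) < 0) { failed = 1; nworkers = i; break; }
        pid_t pid = fork();
        if (pid < 0) {
            close(p[0]); close(p[1]);
            failed = 1; nworkers = i; break;
        }
        if (pid == 0) {
            close(p[1]);
            for (int k = 0; k < i; k++)
                close(wfd[k]);  /* drop write ends inherited from parent */
            _exit(worker_loop(p[0]) == msgs ? 0 : 1);
        }
        close(p[0]);
        wfd[i] = p[1];
        pids[i] = pid;
    }

    /* Parent: feed each worker, then close its pipe so it sees EOF. */
    for (int i = 0; i < nworkers; i++) {
        for (int m = 0; !failed && m < msgs; m++)
            if (write(wfd[i], "x", 1) != 1)
                break;
        close(wfd[i]);
    }

    int ok = 0;
    for (int i = 0; i < nworkers; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return failed ? -1 : ok;
}
```

A caller could size the pool with something like
run_workers((int)sysconf(_SC_NPROCESSORS_ONLN), 100), where
_SC_NPROCESSORS_ONLN is a common but non-POSIX sysconf name.  Each
worker owns its loop and its private heap, so nothing is shared unless
a region is mapped for that purpose.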

So if we're talking about scaling up memcached (or anything of similar
architecture) performance on a single CPU core, an event library like
libev would have been best.  The next best thing on a single core
would've been threads.  Over multiple cores, you clearly need multiple
threads of execution.  Those threads of execution could be either
processes or threads.  With threads, everything would be shared by
default, and it would be very easy for the developers to make mistakes
that led to race conditions, rare bugs, and poor performance.  With
processes, you would have explicitly shared the pool of memory
containing the memcached data only, and the chances for
threads-related bugs/slowdowns related to all other heap data in
memcached would be significantly reduced.  Within the process-per-core
you've created, you'd use either events (preferred) or threads to
avoid blocking up on that one core.  Either can perform just as well
and have the same capabilities for shared data and synchronization.

-- Brandon
