Optimal multithread model
james at mansionfamily.plus.com
Tue Mar 16 07:15:41 CET 2010
Brandon Black wrote:
> However, the thread model as typically used scales poorly across
> multiple CPUs as compared to distinct processes, especially as one
> scales up from simple SMP to the ccNUMA style we're seeing with
This is just not true.
> large-core-count Opteron and Xeon -based machines these days. This is
> mostly because of memory access and data caching issues, not because
> of context switching. The threads thrash on caching memory that
> they're both writing to (and/or content on locks, it's related), and
> some of the threads are running on a different NUMA node than where
> the data is (in some cases this is very pathological, especially if
> you haven't had each thread allocate its own memory with a smart
This affects processes too. The key aspect is how much data is actively
shared between the different threads *and is changing*. The heap is a key
point of contention, but thread-caching heap managers are a big help - and
the process model only works when you don't need to share the data.
Large core count Opteron and Xeon chips are *reducing* the NUMA effects
with modest numbers of threads, because the number of physical sockets is
lower and the integration on a socket is higher - and when you communicate
between coprocesses you'll have the same data copying; it's not free.
Sure - scaling with threads on a 48-core system is a challenge. Scaling
on a glueless 8-core system (or on a share of a host with more cores) is
more relevant to what most of us do though.
Clock speed isn't going anywhere at the moment, but core count is - and
so is the available LAN performance, I think.
> In the case that you can either (a) just use threads instead of event
> loops, but you still want one process per core and several threads
No, I think one process per core is not necessarily a smart way to
partition things - one per NUMA area, sure. Once you've gone to one per
core you can no longer compute across multiple cores (even to do some
crypto while going back to IO), which is a loss.
> You can make threads scale up well anyways by simply designing your
> multi-threaded software to not contend on pthread mutexes and not
> having multiple threads writing to the same shared blocks of memory,
> but then you're effectively describing the behavior of processes, and
> you've implemented a multi-process model by using threads but not
No. I start with 'it's shared' and design for parallel execution without
contention, and I can pass arbitrary data structures around and freely
reference complex arbitrary structures that are static. That's a long way
different from sharing only a small flat memory resource and some pipe
IPC. It's more convenient and can be a lot faster.
> using most of the defining features of threads. You may as well save
> yourself some sanity and use processes at that point, and have any
No, I do it to stay sane. I let the CPUs handle the infrequent on-demand
sharing of the shared data structures instead of having to marshall and
stream them, and the real memory usage resulting is much smaller, and I
have much cleaner semantics where I have delegated work to compute tasks.
> shared read-only data either in memory pre-fork (copy-on-write, and
> write never happens to these blocks), or via mmap(MAP_SHARED), or some
> other data-sharing mechanism. So If you've got software that's
> scaling well by adding threads as you add CPU cores, you've probably
> got software that could have just as efficiently been written as
> processes instead of threads, and been less error-prone to boot.
Subprocesses have their place and they do decouple things well. I'm not
going to dispute that a bunch of completely independent threads in a
process with no shared state at all can run with zero contention (there's
no shared state, after all) and be faster. But faster at what? We're
talking about scaling a single 'thing', and that implies that the boss
process is handing out work in a way that streams all the requests and
results *and* all the shared information to the broken-out tasks.
Sometimes you can do that, but often it's rather hard, and the complexity
hurts development and the copying hurts runtime.
You are trying to deny-away that threading scales remarkably well if you
design for it.
Perhaps you think that no Java or .Net processes scale well on 4- or
8-core systems? I'm not going to argue that 'Java is as fast as C', but
it's close at some things, the performance of Netty-based systems can be
good, and Jetty is pretty handy.
Let me ask you - how do you think memcached should have scaled past its
original single-thread performance? libevent isn't massively slow with
large numbers of connections, and there's a lot of shared state. And
that's before you consider running a secure or authenticated connection.