Optimal multithread model

Marc Lehmann schmorp at schmorp.de
Thu Mar 25 20:42:43 CET 2010

On Thu, Mar 25, 2010 at 10:18:39AM +0100, Gabriel Kerneis <kerneis at pps.jussieu.fr> wrote:
> Temporal graphs were indeed fascinating for thttpd. Compare for
> instance:
> http://www.pps.jussieu.fr/%7Ekerneis/software/files/benchmark/graphs/thttpd-time-1000.png
> http://www.pps.jussieu.fr/%7Ekerneis/software/files/benchmark/graphs/st-time-1000.png

Hmm, batching at work. Thanks, looks interesting.

> > that it uses some rather clever data structures (e.g. heaps for storing
> > timeouts)".
> And that benchmark didn't use timeouts anyway, so this sentence was a
> plain wrong way of acknowledging our lack of understanding ;-)

Right - but a lot of high-profile event libraries (glib, for example) use
unordered lists for everything, so it's still pretty clever compared to the average.

> > I am looking for alternatives and improvements (and will implement some
> > in one of the next releases) to (binary) heaps in libev, and it is a bit
> > shocking to read that this is a clever data structure - I always thought
> > of it as being the lowest acceptable minimum...
> Well, if every timeout has the same value (which is not uncommon on
> server applications), it is more efficient to use an ordered list, with
> a direct pointer to the head and the tail.

Indeed, that's why the libev manual recommends this for constant timeouts.

> The problem is that it is very efficient as long as it's usable, and
> suddenly freezes your app if you have too many elements in it, using
> several timeout values.  So this is not acceptable for a general purpose
> library.

Unfortunately, yes.

The algorithm I use in an event loop written in pure perl (where heaps are
rather slow) is to store changed timers in an unsorted array, and merge this
array into the sorted timer array by doing a full sort (I only need to
remember the minimum time for the next timer in the unsorted list).

This turned out to be quite efficient: it can be quite cache-efficient
(better than heaps), and since the sorting algorithm used by perl works
well with partially-sorted data, it is also quite fast (better than the
O(n log n) of a binary heap).

The idea would be to use this algorithm for libev as well, maybe still using
the heap for sorting, to ensure sensible runtime bounds.

But I am still wondering whether a straight sort could be made to perform
better in general.

(This data structure retains all advantages of using arrays, everything
can be done in-place and so on, no pointers needed).

I'd like to have some nice real-time timer benchmarks... (or timer data
:), because it's always nice to see whether the complication actually saves
something - but I think it can be made so it doesn't get slower, except for
some low constant overhead per operation.

Building the heap in one operation as opposed to doing single heap
operations on each add/delete is also more cache-efficient.

> >    Thus, event-driven programming avoids the localised cost associated
> >    to thread contexts, but pays a price that is spread throughout the
> >    program and hence difficult to quantify.
> > 
> > That is true for threads, too - C doesn't have registers, so it has to
> I am not sure I understand what you mean here (admittedly because I do
> not know much about performance issues with threads - I didn't write
> that part of the paper).  I'll think about it, but if you could rephrase
> it, maybe it would enlighten me.

Well, the quote kind of says that storing everything on the heap is
expensive - each time a callback is invoked, it first needs to fetch its
working values from the heap into the CPU registers, while with threads,
they are already inside the CPU.

But threaded code usually looks like this:

   struct connection {
      /* many members */
   };

   static void send_header (struct connection *c)
   {
      if (c->headercnt ...
   }

   ...
   send_header (connection);

That is, state usually lives on the heap as well, as registers are scarce and
usually are not kept across function calls (ignoring inlining).

So the same overhead exists for threads as well. The question is therefore
which method suffers more (or less :), which doesn't affect what the paper
says elsewhere.

(For example, threads can be more cache-efficient, but if you want that,
you need one thread per request again, which in turn is less cache-efficient,
and so on).

> > It's still giving me great input, and made me plan to look at those clever
> > optimisations inside ST - there is almost always something to learn in
> > good software.
> If you find out what makes ST really fast, please let me know, I'm still
> in the dark here.

From what I saw, it uses relatively simple data structures, so its
advantage might come from simply being, eh, simple (like binary heaps
beating more scalable data structures at low enough member counts because
they can use arrays and so on).

libev uses this methodology extensively - almost everything is a
dynamically-resized array (with some fixed-size hash tables), which saves
memory and makes it fast. More scalable solutions are possible, but if they
only become better at 200000 sockets, they are not very relevant.

This does not answer what makes st so fast, however. Maybe it happens to
have the right data structure sizes and distribution to fit your cache
lines? Maybe its scheduler uses a weird reverse-bouncing method that
happens to work with your benchmark? It's really hard nowadays to see this
without thoroughly redoing the benchmarks together with tiresome code
analysis :)

> I'll try to set up a dedicated web page to provide the scripts I used
> and the method (someone requested it off-list) but won't have time to
> do so before two weeks.

Sounds cool :)

                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp at schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

More information about the libev mailing list