Java Performance Services
Training, Seminars, Benchmarking, Tuning

Java Performance Tuning Course


Chania Crete
May 22rd-25th 2012

Sydney Australia
Feb 6-9, 2012

Montreal Canada
Feb 27-March 2, 2012

San Francisco CA
March 20-23 2012

Paris France
March 27-March 2, 2012

Request an in-house seminar

Anti-if

I have joined Anti-IF Campaign

Calendar

««May 2012»»
SMTWTFS
   12345
6789101112
13141516171819
20212223242526
2728293031

Performance Anti-Patterns

My Top Tags

                                       

Mailing List

My RSS Feeds








Non-blocking almost FIFO

posted Wednesday, 15 April 2009

I'll start with a warning that entry maybe a bit rambling as I'm using it to help sort out my thoughts more than anything else. As a little side project, I decided to implement a non-blocking almost FIFO. Some of you may point out that Doug Lea already has such an implementation and more over, it’s wait-free, a stricter form of non-blocking algorithms. And, the implementation is strict FIFO. So why would one want an almost FIFO non-blocking but not wait free implementation. The only point of weakness in Doug’s implementation is that it neither stripes reads nor writes.

The first round of concurrent collections to show up in the JDK relied on a technique known as lock stripping. Lock stripping is where you take one lock and replace it with many. For example, ConcurrentHashMap makes use of 32 locks in place of the 1 lock used by HashTable. WIth a proper hash function, ConcurrentHashMap will scale all the way out to 32 CPUs before it’s performance curve starts flattening off. The only way to do better is to get rid of the locks altogether. Though this capability is present in modern multi-core processors, it has not been exposed in Java with a standard API.

The current mechanism to eliminate locks is known as Compare and Swap (CAS). The CAS instruction requires that you know the value in memory and pass that as a parameter along with the new value. If you don’t know the value in memory, you will experience a CAS failure and the memory cell will not be mutated. If you get the value correct then in just about ever case the CAS will succeed. There are some spurious cases where what should be a successful CAS will fail. However, this are supposedly rare events.

CAS operations are exposed in the class sun.misc.Unsafe. This class is appropriately  named in that it exposes pointer manipulation capabilities in addition to the CAS functionality, If you are going to CAS a field in a class, you’re going to have to find it’s address in memory. Use of Unsafe exposes your application to C/C++ like pointer bugs. You want to tread lightly. Because of the potential problems with CAS, Doug Lea and others are working to produce a standard safer version of CAS.

The CAS instruction wrapped in AtomicReference.getAndSet() (the one I objected to in an early blog posting) is essential to how Doug Lea’s linked list (FIFO Queue) implementation works. In that implementation, the first to arrive will use getAndSet to slam down a pointer to the next node in the linked list. Thus you have strict FIFO. Or do you?

The reason I ask is, what does strict FIFO mean in a multi-core world? With more than 1 thread making forward progress, there is in effect a race to the head of the queue which is somewhat non-deterministic. In other words, in complex systems, it’s crap shoot as to which thread makes it to the head of any queue prior to any other thread. More over, most people don’t really care about first in first out. What they do care about is an ordering of when things get done. I often call this requirement, first in, first done. While that may have made sense in batch/single core, it makes no sense in multi-core. Again, this is a situation where many threads all making simultaneous forward progress are in a race for either the head or tail of the queue. When this happens, the best or most performant solution is to use application semantics to enforce ordering. This is possibly best understood with the assistance of an example.

The TCP network protocol gives us the capability to create a connection between two processes. On one end of the connection we can write out data that magically appears at the other end. The stream appears on the back end of the connection in the same order that we wrote it out on the front end. We can think of TCP as a strict FIFO queue in that first byte in will be the first byte out. But what of the underling implementation. TCP is built on top of IP. IP has no sense of connection and certainly no sense of order. It simply sends packets of bytes over a channel. Those packets may or may not arrive in the order they were sent. In fact, they may not arrive at all! There is no guarantees at that level of the protocol. IP isn’t even close to being FIFO. It is up to application (TCP in this case) semantics to provide a sense of order and completeness, not the underlying technology.

I have two more comments about the CAS’ing in Doug Lea’s linked list. I need to get an exact reference for this but in one of Cliff Click’s talks, he commented on the scalability of repeated writing to (and surprisingly) reading the same memory location (a.k.a. volatile). It’s not bad but it’s a potential limit to scalability. I can attest to this from my work with Cray’s. In Cray computers, bank cool off times were set to 4 clock ticks. This meant that before you could restrobe a memory location, you had to wait 4 clock ticks. thus memory was arranged that successive locations where found in different memory banks. Striding though memory by 1 or some number that was relatively prime to 2 was essential to good performance. In this case we could say that the hardware striped the writes for you.

The other comment is about CAS failures. If you fail, you have to do something to continue to make forward progress. The AtomicReference.getAndSet() method attempts to slam down a pointer in the next bucket when the CAS fails. By inspection this behavior would send to encourage a lot of CAS failures when working in a highly multi-threaded/multi-core environment. High numbers of CAS failures is also not desirable. But the only reason for this behavior is to enforce a strict FIFO behavior. If you’ve even bought have of the argument that strict FIFO doesn’t make sense at this level, then a means of reducing CAS failures becomes quite apparent. All we need to do is randomly stripe the reads and writes in the same way that HashMap’s hashcode method combined with the insert logic stripes it’s writes. If we do this, we clearly lose the FIFO characteristic of the queue. This is also not desirable.

Cliff Click talks about a style of coding that lends it’s self to non-blocking implementations. After some discussion with him about a non-blocking queue, he came up with the idea that a thread could be randomly assigned a slot in an array. The thread would CAS down the value and if that failed, would move sequentially onto the next slot.

This seems to work reasonably well until you start to think about resizing. WIthout getting into all of the details, the FIFO property of the queue is diluted as the queue gets bigger. Thus is seemed desirable to work out a resizing strategy that did not dilute the FIFO property as the queue got bigger. After a single implementations it became apparent that there was a natural separation between writers and readers and that they should be given different pointers to work with. Thus we could start with an array of a fixed size and then instead of increase that size, just keep adding new arrays to the data structure. As soon as the writers fill the current array, one will attempt a resize. The resize thread will simply create a new array and expose it to the other writers. Thus all writers should automatically move from the primary array to an overflow structure. This leaves the readers to clean out the primary array before promoting an overflow structure and starting to work with it. Thus the natural flow of readers chasing writers is maintained. Data in the queue will most likely not come out in the order that it was inserted but it should be fairly close or “almost FIFO”.

 

 
Block diagram of NBFIFO data structures 
 
Figure 1. Block Diagram of Supporting Pointers and Data Structures
 
 

What makes this non-blocking and not wait-free is that writers must wait for the overflow structure to be constructed and hooked up. This is achieved with a wait (as apposed to a lock) If the resize thread happens to get shot down, the waiting threads will wake up and one of them will attempt to resize. If we were to block, all threads could be live locked waiting for the resize to complete.

Another implementation detail to come out is the importance of order and cardinality of critical shared resources. I’ve not worked out a way to expose more than 1 critical pointer without putting the data structure into a precarious state. In my implementation I had to adjust things so that only one pointer for the readers and one pointer for the writers was critical. Since readers and writers are treated separately, it was permissible to have two critical shared resources. Cliff avoided this situation by embedding a control structure into the zeroth element of the underlying array. I chose to use a crippled link list implementation to manage overflow. The reader pointed to the element in the head while the writers pointed to the element in the tail.

The crippled implementation was necessary in order to get the pointer update ordering correct while avoiding extra levels of indirection. This maybe seem a bit confusing but it should be clear once I release the source.

In order to be compatible with the JDK specification for collections, I had to implement a number of methods that just were impossible. The most seemingly trivial was size(). I used Cliff’s non-blocking concurrent counter implementation as the alternative was even less palatable, walking the data structure counting non-null slots. That operation would not only be O(N), it would most likely be completely irrelevant by the time it finished. I’m not even going to discuss Iterator or toArray(). In fact, I’ve yet to implement toArray() as I fail to see it’s usefulness in cases where this implementation could be useful.

What is the goal of this exercise? Primarily it’s a very small baby step on the way to loosening the pessimistic locking model that Java has grown up with. This pessimistic model is akin to programming by exception. I might have a conflict so I have to take the most expensive protective action possible.

Programming by exception is a style of coding where our thoughts are dominated by exceptional cases instead of the intended function. Take the classic money transfer from one account to another. We obtain a lock on both accounts, take from one an put into another, release both locks. We do this just in case two threads both try to update one or other account while we are executing our transfer. We need to lock against the possibility of data corruption. That said, when we observe the behavior of running systems we find that contention is a exceptionally rare event. Yet we are forced to use locks because it can happen even though the probability is either zero or very close to zero. Database vendor learns decades ago that this style of coding is fragile and doesn’t scale.

So while I’ll make no grand claims about the usefulness or importance of the implementation (it is neither) that I’m about to release, I do hope that there are some lessons contained in the code that someone will find useful.

tags:          




1. Cliff Click left...
Thursday, 16 April 2009 4:23 pm :: http://blogs.azulsystems.com/cliff

Nice! I've pondered this very solution (just fill up an array and move on to the next when full) to a scalable near-FIFO queue. I think this solution only "fails" when the units of work are less than the amortized GC costs to allocate a single array element (i.e., really really tiny units of work).

Couple of suggestions:

Embed the links to the next array in the zero'th element and avoid the indirection.

Hash your initial probe into the array, then linear scan (with wrapping) to avoid cache-line contention when everybody hits the same spot. This is true for both readers AND writers, since readers must CAS-to-NULL elements they have pulled from the array.

Balance array size with scanning costs. Larger arrays allow more non-FIFO-ness and thus handle contention better, smaller arrays allow faster scans for a proper spot to insert & remove stuff. The new overflow arrays don't have to be the same size as the old ones; if you notice contention then you can make the new array larger (and vice-versa).

Cliff


2. Kirk Pepperdine left...
Thursday, 16 April 2009 4:44 pm

Hey Cliff, thanks for the comments. when I'm a wee bit more confident about the fencing I wanted to shoot a copy over to you for review.

I thought of embedding the zero'th element to avoid the indirection. But then the link list seemed more understandable. Plus the reader/writer references the array directly and not via the linked list which is why I called it a hacked up linked list.

humm, I'm using pure random but that could easily be replaced with hash. I do a linear scan and if I get back to the same point, time to advance to the overflow. readers do CAS to null, writers CAS from null.

I was concerned about the scanning cost which has now lead me to believe that fixed size overflows maintains a closer to FIFO behavior just as you've pointed out. So I think I've already taken your advice on this one but it's good to have a decision reaffirmed. I'm going to take your comment about adjusting overflow sizing and add it as a comment for the moment. There are some other tweaks in the implementation that I need to make prior to getting fancy ;-)

Kirk


3. Cliff Click left...
Friday, 17 April 2009 3:57 pm :: http://blogs.azulsystems/cliff

System.hashCode is cheaper than Random.

Actually, in the absence of contention there's no point in having an array larger than size 1. Since a cache-line is as cheap as a single word, you might as well go for size 2 or 4. For a very high *throughput* queue, you're still missing a trick or to. I'll have to find the paper but there's a way to avoid the cache-miss every time you pass a task-ptr between threads.

Cliff