I'll start with a warning that entry maybe a bit rambling as I'm using it to help sort out my thoughts more than anything else. As a little side project, I decided to implement a non-blocking almost FIFO. Some of you may point out that Doug Lea already has such an implementation and more over, it’s wait-free, a stricter form of non-blocking algorithms. And, the implementation is strict FIFO. So why would one want an almost FIFO non-blocking but not wait free implementation. The only point of weakness in Doug’s implementation is that it neither stripes reads nor writes.
The first round of concurrent collections to show up in the JDK relied on a technique known as lock stripping. Lock stripping is where you take one lock and replace it with many. For example, ConcurrentHashMap makes use of 32 locks in place of the 1 lock used by HashTable. WIth a proper hash function, ConcurrentHashMap will scale all the way out to 32 CPUs before it’s performance curve starts flattening off. The only way to do better is to get rid of the locks altogether. Though this capability is present in modern multi-core processors, it has not been exposed in Java with a standard API.
The current mechanism to eliminate locks is known as Compare and Swap (CAS). The CAS instruction requires that you know the value in memory and pass that as a parameter along with the new value. If you don’t know the value in memory, you will experience a CAS failure and the memory cell will not be mutated. If you get the value correct then in just about ever case the CAS will succeed. There are some spurious cases where what should be a successful CAS will fail. However, this are supposedly rare events.
CAS operations are exposed in the class sun.misc.Unsafe. This class is appropriately named in that it exposes pointer manipulation capabilities in addition to the CAS functionality, If you are going to CAS a field in a class, you’re going to have to find it’s address in memory. Use of Unsafe exposes your application to C/C++ like pointer bugs. You want to tread lightly. Because of the potential problems with CAS, Doug Lea and others are working to produce a standard safer version of CAS.
The CAS instruction wrapped in AtomicReference.getAndSet() (the one I objected to in an early blog posting) is essential to how Doug Lea’s linked list (FIFO Queue) implementation works. In that implementation, the first to arrive will use getAndSet to slam down a pointer to the next node in the linked list. Thus you have strict FIFO. Or do you?
The reason I ask is, what does strict FIFO mean in a multi-core world? With more than 1 thread making forward progress, there is in effect a race to the head of the queue which is somewhat non-deterministic. In other words, in complex systems, it’s crap shoot as to which thread makes it to the head of any queue prior to any other thread. More over, most people don’t really care about first in first out. What they do care about is an ordering of when things get done. I often call this requirement, first in, first done. While that may have made sense in batch/single core, it makes no sense in multi-core. Again, this is a situation where many threads all making simultaneous forward progress are in a race for either the head or tail of the queue. When this happens, the best or most performant solution is to use application semantics to enforce ordering. This is possibly best understood with the assistance of an example.
The TCP network protocol gives us the capability to create a connection between two processes. On one end of the connection we can write out data that magically appears at the other end. The stream appears on the back end of the connection in the same order that we wrote it out on the front end. We can think of TCP as a strict FIFO queue in that first byte in will be the first byte out. But what of the underling implementation. TCP is built on top of IP. IP has no sense of connection and certainly no sense of order. It simply sends packets of bytes over a channel. Those packets may or may not arrive in the order they were sent. In fact, they may not arrive at all! There is no guarantees at that level of the protocol. IP isn’t even close to being FIFO. It is up to application (TCP in this case) semantics to provide a sense of order and completeness, not the underlying technology.
I have two more comments about the CAS’ing in Doug Lea’s linked list. I need to get an exact reference for this but in one of Cliff Click’s talks, he commented on the scalability of repeated writing to (and surprisingly) reading the same memory location (a.k.a. volatile). It’s not bad but it’s a potential limit to scalability. I can attest to this from my work with Cray’s. In Cray computers, bank cool off times were set to 4 clock ticks. This meant that before you could restrobe a memory location, you had to wait 4 clock ticks. thus memory was arranged that successive locations where found in different memory banks. Striding though memory by 1 or some number that was relatively prime to 2 was essential to good performance. In this case we could say that the hardware striped the writes for you.
The other comment is about CAS failures. If you fail, you have to do something to continue to make forward progress. The AtomicReference.getAndSet() method attempts to slam down a pointer in the next bucket when the CAS fails. By inspection this behavior would send to encourage a lot of CAS failures when working in a highly multi-threaded/multi-core environment. High numbers of CAS failures is also not desirable. But the only reason for this behavior is to enforce a strict FIFO behavior. If you’ve even bought have of the argument that strict FIFO doesn’t make sense at this level, then a means of reducing CAS failures becomes quite apparent. All we need to do is randomly stripe the reads and writes in the same way that HashMap’s hashcode method combined with the insert logic stripes it’s writes. If we do this, we clearly lose the FIFO characteristic of the queue. This is also not desirable.
Cliff Click talks about a style of coding that lends it’s self to non-blocking implementations. After some discussion with him about a non-blocking queue, he came up with the idea that a thread could be randomly assigned a slot in an array. The thread would CAS down the value and if that failed, would move sequentially onto the next slot.
This seems to work reasonably well until you start to think about resizing. WIthout getting into all of the details, the FIFO property of the queue is diluted as the queue gets bigger. Thus is seemed desirable to work out a resizing strategy that did not dilute the FIFO property as the queue got bigger. After a single implementations it became apparent that there was a natural separation between writers and readers and that they should be given different pointers to work with. Thus we could start with an array of a fixed size and then instead of increase that size, just keep adding new arrays to the data structure. As soon as the writers fill the current array, one will attempt a resize. The resize thread will simply create a new array and expose it to the other writers. Thus all writers should automatically move from the primary array to an overflow structure. This leaves the readers to clean out the primary array before promoting an overflow structure and starting to work with it. Thus the natural flow of readers chasing writers is maintained. Data in the queue will most likely not come out in the order that it was inserted but it should be fairly close or “almost FIFO”.
What makes this non-blocking and not wait-free is that writers must wait for the overflow structure to be constructed and hooked up. This is achieved with a wait (as apposed to a lock) If the resize thread happens to get shot down, the waiting threads will wake up and one of them will attempt to resize. If we were to block, all threads could be live locked waiting for the resize to complete.
Another implementation detail to come out is the importance of order and cardinality of critical shared resources. I’ve not worked out a way to expose more than 1 critical pointer without putting the data structure into a precarious state. In my implementation I had to adjust things so that only one pointer for the readers and one pointer for the writers was critical. Since readers and writers are treated separately, it was permissible to have two critical shared resources. Cliff avoided this situation by embedding a control structure into the zeroth element of the underlying array. I chose to use a crippled link list implementation to manage overflow. The reader pointed to the element in the head while the writers pointed to the element in the tail.
The crippled implementation was necessary in order to get the pointer update ordering correct while avoiding extra levels of indirection. This maybe seem a bit confusing but it should be clear once I release the source.
In order to be compatible with the JDK specification for collections, I had to implement a number of methods that just were impossible. The most seemingly trivial was size(). I used Cliff’s non-blocking concurrent counter implementation as the alternative was even less palatable, walking the data structure counting non-null slots. That operation would not only be O(N), it would most likely be completely irrelevant by the time it finished. I’m not even going to discuss Iterator or toArray(). In fact, I’ve yet to implement toArray() as I fail to see it’s usefulness in cases where this implementation could be useful.
What is the goal of this exercise? Primarily it’s a very small baby step on the way to loosening the pessimistic locking model that Java has grown up with. This pessimistic model is akin to programming by exception. I might have a conflict so I have to take the most expensive protective action possible.
Programming by exception is a style of coding where our thoughts are dominated by exceptional cases instead of the intended function. Take the classic money transfer from one account to another. We obtain a lock on both accounts, take from one an put into another, release both locks. We do this just in case two threads both try to update one or other account while we are executing our transfer. We need to lock against the possibility of data corruption. That said, when we observe the behavior of running systems we find that contention is a exceptionally rare event. Yet we are forced to use locks because it can happen even though the probability is either zero or very close to zero. Database vendor learns decades ago that this style of coding is fragile and doesn’t scale.
So while I’ll make no grand claims about the usefulness or importance of the implementation (it is neither) that I’m about to release, I do hope that there are some lessons contained in the code that someone will find useful.
Nice! I've pondered this very solution (just fill up an array and move on
to the next when full) to a scalable near-FIFO queue. I think this
solution only "fails" when the units of work are less than the amortized GC
costs to allocate a single array element (i.e., really really tiny units of
work).
Hey Cliff, thanks for the comments. when I'm a wee bit more confident about
the fencing I wanted to shoot a copy over to you for review.
System.hashCode is cheaper than Random.