(out of order) Effective Concurrency: Writing Lock-Free Code — A Corrected Queue

Oops, I just noticed that I forgot to blog about one recent Effective Concurrency column: “Writing Lock-Free Code: A Corrected Queue” which also appeared in the October 2008 print issue of Dr. Dobb’s Journal.

From the article:

As we saw last month [1], lock-free coding is hard even for experts. There, I dissected a published lock-free queue implementation [2] and examined why the code was quite broken. This month, let’s see how to do it right.

Here is the complete-as-of-this-writing set of links to the published Effective Concurrency columns. As always, the months reflect the magazine print issue dates; they usually hit the web a bit sooner:

August 2007: The Pillars of Concurrency

September 2007: How Much Scalability Do You Have or Need?

October 2007: Use Critical Sections (Preferably Locks) to Eliminate Races

November 2007: Apply Critical Sections Consistently

December 2007: Avoid Calling Unknown Code While Inside a Critical Section

January 2008: Use Lock Hierarchies to Avoid Deadlock

February 2008: Break Amdahl’s Law!

March 2008: Going Superlinear

April 2008: Super Linearity and the Bigger Machine

May 2008: Interrupt Politely

June 2008: Maximize Locality, Minimize Contention

July 2008: Choose Concurrency-Friendly Data Structures

August 2008: The Many Faces of Deadlock

September 2008: Lock-Free Code: A False Sense of Security

October 2008: Writing Lock-Free Code: A Corrected Queue

November 2008: Writing a Generalized Concurrent Queue

December 2008: Understanding Parallel Performance

10 thoughts on “(out of order) Effective Concurrency: Writing Lock-Free Code — A Corrected Queue”

  1. Hi Herb,

    Thank you for a great article series! It’s been a very interesting read, especially since one of our challenges at work is to make our application more parallel.

    I’ve been looking into replacing our mutex-based queues with something like your example concurrent queue. Two things that are not obvious to me are queue size counts and blocking when the queue is full.

    It seems to me that getting exact sizes out of the queue will require locking both ends, and this will kill the concurrency. Another option would be to count queued and dequeued items. Each count is incremented in the critical section of the produce and consume functions. Finally, the size function would subtract the consumed count from the produced count to get an approximate queue size. This would be sufficient for metrics etc. (a sketch of this, and of the blocking question below, follows at the end of this comment).

    The other related problem is that we have a pipeline of queues in our app. The queue at the end of the pipeline exerts backpressure towards the front of the pipeline by blocking producers when the queue is ‘full’. We can use the approximate size with a high watermark when deciding to block. But how to block? It’s easy enough to busy-wait in the critical section, testing whether the queue size has fallen below a low watermark. But is there a way to ‘upgrade’ the spin-lock to a condition variable that wakes the blocked producer, without negating the performance improvements achieved by the concurrent design? (This, too, is sketched below.)

    To be honest, the app is hosed if the pipeline ever blocks, but the current design relies on this functionality, so it would be less work to simply support it.

    — Lauri
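
    A minimal sketch of both ideas, in C++11 notation (which postdates these columns; all names here, such as CountedQueue and the watermark constants, are illustrative rather than taken from the articles). Each counter has a single writer, and the mutex and condition variable are touched only on the slow path:

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    // Approximate size: produced_ has one writer (the producer side) and
    // consumed_ has one writer (the consumer side).
    class CountedQueue {
        std::atomic<long> produced_;
        std::atomic<long> consumed_;
    public:
        CountedQueue() : produced_(0), consumed_(0) {}
        void OnProduce() { produced_.fetch_add(1, std::memory_order_relaxed); }
        void OnConsume() { consumed_.fetch_add(1, std::memory_order_relaxed); }
        // The two loads are not taken together atomically, so the result may
        // be slightly stale -- fine for metrics and watermark heuristics.
        long ApproxSize() const {
            return produced_.load(std::memory_order_relaxed)
                 - consumed_.load(std::memory_order_relaxed);
        }
    };

    // Watermark backpressure: producers pay only an ApproxSize() check on
    // the fast path; blocking uses an ordinary condition variable.
    class Backpressure {
        std::mutex m_;
        std::condition_variable cv_;
        static const long kHigh = 1024;   // block producers at this size
        static const long kLow  = 256;    // wake them again at this size
    public:
        void WaitIfFull(const CountedQueue& q) {      // producer, before enqueue
            if (q.ApproxSize() < kHigh) return;       // fast path: no lock taken
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [&]{ return q.ApproxSize() <= kLow; });
        }
        void NotifyIfDrained(const CountedQueue& q) { // consumer, after dequeue
            if (q.ApproxSize() <= kLow) {
                std::lock_guard<std::mutex> g(m_);    // serializes with the waiter's
                cv_.notify_all();                     // predicate check: no lost wakeup
            }
        }
    };

    Taking m_ before notifying is what prevents a wakeup from being lost between the waiter’s predicate check and its going to sleep; the common case never touches the mutex at all.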

  2. I am missing the published sources of your lock-free queue. Where do I find them? Or is this a rather experimental approach?

  3. Herb, I’ve been following your articles with great interest. These on concurrent queues have been most interesting. I’ve done lots of work over the past few decades on embedded OS and multiprocessor projects that used such queues. I’m trying to understand the full details of recent language upgrades and their impact on getting concurrent code to work right.

    Based on my understanding, I think that your proposed Concurrent Queue implementation may have some bugs. Also, since the use of such queues is typically central to system operation, their efficiency has to be optimized. I wonder if a change I’ll suggest below to your consume algorithm wouldn’t improve performance.

    First, a question about potential bugs. Your earlier “Lock-Free Code: A False Sense of Security” article showed all the potential pitfalls you can run into in terms of reordering and false writes if the code isn’t written right. In your queue implementation you only made the two lock variables and the “next” pointer in the node atomic. You didn’t make the critical “first” and “last” pointers atomic.

    What assures that the updates to those pointers made inside the critical sections are completed before the lock on the critical section is released? Does the release of the atomic lock variable perform an implicit memory barrier that assures all prior operations have completed? If that is the case, then I can understand why first and last don’t need to be atomic, since they are always updated inside a critical section. However, if that is the case, why does the next pointer have to be atomic, since it is also always updated in a critical section? I would think that first and last also really need to be atomic. (A sketch after this comment illustrates the release/acquire point.)

    As to efficiency, I have a question about your consume implementation. A central part of your implementation (starting after the lock is acquired) is shown below:

    Node* theFirst = first;
    Node* theNext = first->Next;
    if( theNext != nullptr ) {          // queue is nonempty
        T* val = theNext->value;
        theNext->value = nullptr;       // theNext becomes the new dummy node
        first = theNext;
        consumerLock = false;           // exit the critical section
        result = *val;                  // copy and clean up outside the lock
        delete val;
        delete theFirst;                // retire the old dummy node
        return true;
    }

    Wouldn’t an alternative implementation that uses a dedicated dummy node that never changes be more efficient? Such an implementation (which is what I typically use) certainly requires fewer operations, both inside and outside the critical section, to correctly update the queue. I don’t think this change has negative cache implications. The alternative implementation of the same code section would be:

    Node* theNext = first->Next;
    if( theNext != nullptr ) {          // queue is nonempty
        first->Next = theNext->Next;    // unlink behind the fixed dummy
        consumerLock = false;           // exit the critical section
        result = *(theNext->value);
        delete theNext->value;
        delete theNext;
        return true;
    }

    Am I missing something?

    Thanks for all the great articles on tricky subjects that are becoming more important with the increasing application of the type of multiprocessing that used to be found only in specialized systems.

    — Fred
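
    On the memory-barrier question above: under C++11 semantics (which postdate these columns), releasing an atomic lock flag with a release store does publish every write made inside the critical section to the next thread that acquires the flag. A minimal sketch, with illustrative names rather than the article’s exact code:

    #include <atomic>

    struct Node { Node* Next; };

    std::atomic<bool> consumerLock(false);
    Node* first = nullptr;   // plain pointer: only touched while holding the lock

    void ConsumeSketch() {
        // Acquire: a successful exchange synchronizes-with the prior release.
        while (consumerLock.exchange(true, std::memory_order_acquire))
            ;                // spin until we own the consumer end
        // ... plain (non-atomic) reads and writes of first go here ...
        // Release: publishes the writes above to the next acquirer, which is
        // why pointers guarded by a single lock need not themselves be atomic.
        consumerLock.store(false, std::memory_order_release);
    }

    What would still single out the next pointer in a two-lock queue is that it can be read under the consumer lock while being written under the producer lock, i.e., it is the one field shared across the two critical sections; that would explain why it alone is atomic.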

  4. I think I answered my own question. I re-read the article; I obviously hadn’t paid attention to the details.

    There is no need to use a lock, because we only have one producer. An atomic (or volatile) variable works here because the writes are done by a single thread, and each individual write is atomic.

    In other words, atomic variables in C++ (and volatile in C#) don’t have built-in synchronization, but they can be used in scenarios like one producer to many consumers.
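
    A minimal sketch of that one-producer pattern, in C++11 notation (the Mailbox name and its members are illustrative, not from the articles): the single producer publishes with a release store, and consumers claim the item with an atomic exchange, so at most one consumer wins and no lock is needed.

    #include <atomic>

    template<typename T>
    class Mailbox {
        std::atomic<T*> slot_;
    public:
        Mailbox() : slot_(nullptr) {}
        // Producer (exactly one thread): publish an item if the slot is free.
        bool TryPut(T* item) {
            if (slot_.load(std::memory_order_relaxed) != nullptr)
                return false;                              // still unconsumed
            slot_.store(item, std::memory_order_release);  // publish the item
            return true;
        }
        // Consumers (any number): atomically claim the item, or get nullptr.
        T* TryTake() {
            return slot_.exchange(nullptr, std::memory_order_acquire);
        }
    };

    The release/acquire pair is what makes the pointed-to object’s contents visible to whichever consumer claims it; the atomicity of exchange, not a lock, is what arbitrates among the consumers.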

  5. Herb,

    I have a question regarding your article “Apply Critical Sections Consistently” from last year (sorry, but I am still catching up with my reading).

    I read the article, and in Example 2, “One Producer, Many Consumers, Using Lock-Free Mailboxes,” I got the idea that you were using atomic values for synchronization. I am no expert in C++ synchronization, so I don’t know if atomic variables provide ‘built-in’ synchronization. My understanding in C# is that volatile only prevents reordering and caching of the variable in read/write scenarios; there is no implicit synchronization.

    So, what I am asking is: do atomic variables in C++ (and volatile in C#) have ‘built-in’ synchronization, meaning that you don’t need to use locks to synchronize?

    Thanks,
    Alex.

  6. David asked: “is there any chance of seeing atomic<T> implemented in a C compiler?”

    The C++0x atomics library deliberately has a C-compatible mode, where for example you can spell atomic<int> instead (and equivalently) as atomic_int, and use plain old functions instead of member functions.
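
    For concreteness, here is how the two spellings line up, shown as eventually standardized in C++11 (the column was written against the C++0x draft; C11 later adopted compatible atomics in <stdatomic.h>):

    #include <atomic>

    std::atomic<int> a(0);   // the C++ template spelling
    std::atomic_int  b(0);   // the C-compatible typedef for std::atomic<int>

    void Example() {
        a.store(1);                    // member-function style
        std::atomic_store(&b, 2);      // plain-old-function style
        int x = std::atomic_load(&b);  // reads the 2 just stored
        (void)x;
    }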

  7. For those of us stuck using C in legacy systems, is there any chance of seeing atomic<T> implemented in a C compiler? Or is our only choice to upgrade to a C++ compiler that supports our target?

  8. The plan is that in the fullness of time there’ll be a book with updated and expanded versions of these columns plus some new material.

  9. Hey Mr. Sutter, have you ever thought of making a PDF of all the “Effective Concurrency” columns so we could read them all in one go?
