Race conditions and deadlocks are two of the most common problems encountered in multi-threaded systems. They are often difficult to reproduce, and therefore hard to find and fix.

I’ve had success finding and fixing these problems in our team’s code using a relatively simple technique.

I recently had to track down a race condition in some code we were developing. The code in question collects data from a device and makes it available to an outside application. Due to the nature of the device, the data must be collected in discrete samples. Once a sample is written, a “sample ready” flag is set to transfer ownership of the sample to the other application.

Over time, as is typical, new requirements were added:

  • We needed to collect more of the available data during the sampling period. That is, we needed to get the discrete samples closer together in time. Since it took some time to write the samples, we decided to spin off some background writing threads. That way, the main thread could be accumulating the next sample while previous samples were still being written.

  • It was possible for a sample to get interrupted before it was completely written. We needed to clean out these partial samples from time to time.

We were later asked to be more aggressive about cleaning out partial samples, so I modified the code to clean them out at the end of each sampling period. When I deployed that version of the code, I started getting frequent exceptions.

After looking at the stack traces, it was pretty clear that samples were disappearing while they were still being written. I was pretty sure that the external application wasn’t deleting them, so the problem had to be in our code.

The code in question was spread out between several classes and methods, so I started by making a linear list of the operations performed by each of the threads in the application. I usually do this in my head, but sometimes I’ll write it down on paper or a whiteboard.

At a high level, the code boiled down to the following:

# Main thread

While sampling is active
  Collect a sample from the device
  Spawn a thread to write the sample
  Repeat
Clean up partial samples

# Sample-writing thread(s)

Write the sample
Set the "sample ready" flag
Exit
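
To make this concrete, here is a minimal, self-contained Python sketch of the same shape. The names, the Sample class, and the simulated device are mine, not the real code’s, and writing is faked with a short sleep:

import random
import threading
import time

samples = []                    # visible to the cleanup step (and, in the real system, to the outside application)

class Sample:
    def __init__(self, data):
        self.data = data
        self.ready = False      # the "sample ready" flag

def write_sample(sample):
    time.sleep(random.uniform(0.01, 0.05))   # writing the sample takes some time
    sample.ready = True                      # transfer ownership to the outside application
    # the writing thread exits here

def clean_up_partial_samples():
    # a partial sample is one whose "sample ready" flag was never set
    for sample in list(samples):
        if not sample.ready:
            samples.remove(sample)

def sampling_loop(sample_count=10):
    for i in range(sample_count):            # stands in for "while sampling is active"
        sample = Sample(i)                   # collect a sample from the device
        samples.append(sample)
        threading.Thread(target=write_sample, args=(sample,)).start()   # spawn a thread to write the sample
    clean_up_partial_samples()               # runs as soon as the sampling loop exits

sampling_loop()

The details differ from the real code, which was spread across several classes, but the shape is the same: a main loop spawning writer threads, a ready-flag handoff, and a cleanup pass at the end.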

I then experimented with interleaving the lines of code in various ways to see if I could spot a race condition. There are a lot of permutations, and going through them all would take some time, but in this case I knew that samples were disappearing on me. That led me to suspect the “Clean up partial samples” line immediately, so I looked specifically at interleavings involving that code. Consider this order of statement execution:

  Spawn a thread to write the sample
                  Write the sample
                  Set the "sample ready" flag
                  Exit
Clean up partial samples

With this statement ordering, there is no problem. The sample is completely written before any cleanup is done.

Next:

  Spawn a thread to write the sample
                  Write the sample
                  Set the "sample ready" flag
Clean up partial samples
                  Exit

Still no problem. The “sample ready” flag is set before the cleanup operation starts.

But consider this order:

  Spawn a thread to write the sample
                  Write the sample
Clean up partial samples
                  Set the "sample ready" flag
                  Exit

In this order, the cleanup step starts running while the sample is still being written. Because the in-flight sample looks like a partial sample to the cleanup step, it gets deleted, causing the exceptions I was seeing.

The solution was to add a wait on the main thread when coming out of the sampling loop:

While sampling is active
  Collect a sample from the device
  Spawn a thread to write the sample
  Repeat
Wait for all sample-writing threads to finish   # NEW!
Clean up partial samples
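
In the Python sketch from earlier, the equivalent change is to keep references to the writer threads and join them before cleaning up (the waiting mechanism is my choice; anything that blocks until the writers are done would do):

def sampling_loop(sample_count=10):
    writers = []
    for i in range(sample_count):
        sample = Sample(i)
        samples.append(sample)
        t = threading.Thread(target=write_sample, args=(sample,))
        t.start()
        writers.append(t)
    for t in writers:
        t.join()                             # NEW: wait for all sample-writing threads to finish
    clean_up_partial_samples()               # now runs only after every sample has been fully written

If you would rather not track the threads by hand, a pool such as concurrent.futures.ThreadPoolExecutor gives you the same wait-for-completion behavior with less bookkeeping.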

By analyzing different potential statement execution orders, it is often possible to find race conditions and deadlocks. This analysis can be done at several levels of detail. The right level of detail depends on how the code is written, the programming language, the runtime environment, the operating system, and the hardware.

Often, as in this example, you can stay at a pretty high level and find a solution. However, sometimes you have to go down to an even finer grain than statements in your programming language. Consider the following:

x += 5

This seems like a harmless operation, but under the hood, the following low-level operations are being performed:

read the current value of x
                  # What happens if x is changed here?
add 5 to that value
                  # Or here?
store the new value back to x

If another thread happens to change the value of x some time between the read and the store, incorrect behavior results. Many platforms provide atomic operations to protect against this kind of situation.
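
Here is a hedged illustration in Python (the shared counter, thread count, and lock are mine, not from the original code). Python’s standard library doesn’t offer an atomic integer, so a plain lock stands in for one; the point is that the read-add-store sequence must not be interruptible:

import threading

counter = 0
lock = threading.Lock()

def unsafe_add(times):
    global counter
    for _ in range(times):
        counter += 5             # read, add, store; another thread can run in between

def safe_add(times):
    global counter
    for _ in range(times):
        with lock:               # only one thread at a time performs the read-modify-write
            counter += 5

threads = [threading.Thread(target=unsafe_add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # may come out below 2_000_000; whether the loss shows up depends on the interpreter and timing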

If analysis at a high level doesn’t help you find the problem, then try breaking things down into finer-grained operations. Over time, you will start to develop a sense of where the problems lie, and you’ll be able to go through this analysis more quickly.

I often find that just going through the code to make the linear list of operations is enough to give me a clue about what’s going on. This is especially true if I didn’t write the code initially, or I wrote it long enough ago that I’ve forgotten some of the details.

I hope this provides you with a simple but valuable tool you can use when you run into threading issues in the future.