Why Qualifier Volatile Is Almost Not What You Want in C

On multithreaded systems the programmer has to take care that accesses to a shared resource are done in a well defined manner. This is a non-trivial task and if not done properly it may lead to race conditions or data races. There are many myths about the qualifier volatile of the language C in conjunction with multithreaded programs. In the following we will have a look at the qualifier, what is defined by the C11 standard, in which cases the qualifier is suitable and in which not. The improper use is not limited to multiprocessor systems but even applies to uniprocessor systems with multiple threads where volatile might not do the job as expected.

This article contains no breaking news, i.e., everything written here is known for decades. Still I never made myself familiar with the qualifier and how one should use it properly. I saw a lot of code in which volatile was used to solve problems it was not intended for. Even worse sometimes the qualifier did not solve the problem entirely but made it visible only in rare and hard to debug circumstances. The Linux people even got almost rid of the volatile qualifier in kernel code which they suspect that people clearly don’t know how it works, and incorrectly expect to just fix bugs. I would count myself among those who misused the qualifier. This is my attempt to finally get it right :-)

Here we make use of the word thread as a synonym for any concurrent, i.e., non-sequential, unit including a process. Furthermore, I won’t make a distinction between a CPU and a core, i.e., here we will use them interchangeably and assume only that a CPU/core has a private cache (hierarchy).

tl;dr The qualifier volatile is only intended for

Before we dive into the topic let us first have a look at the C11 standard. The qualifier is defined in section 6.7.3 and mentioned in section 7.13.2.1 where the longjmp function is defined and in section 7.14 where signal handling is defined.

C11 § 6.7.3 Type qualifiers ¶ 7

An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously.134) What constitutes an access to an object that has volatile-qualified type is implementation-defined.

134) A volatile declaration may be used to describe an object corresponding to a memory-mapped input/output port or an object accessed by an asynchronously interrupting function. Actions on objects so declared shall not be “optimized out” by an implementation or reordered except as permitted by the rules for evaluating expressions.

An important point is made in this section. Any expression involving volatile-qualified objects must be evaluated strictly according to the rules of the abstract machine. Therefore, any optimization must respect the abstract machine. In section 5.1.2.3 the abstract machine is defined and comments for volatile-qualified objects are scattered all over the place. I recommend to read the whole section although you do not need to in order to understand the remaining parts. Paragraph 4 is of particular importance.

C11 § 5.1.2.3 Program execution ¶ 4

In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).

According to the paragraph an expression may be optimized which involves volatile-qualified objects. However, the range of such an optimization is limited by section 6.7.3 paragraph 7. An optimization may only happen between sequence points but not across a sequence point—what constitutes a sequence point is defined in Annex C of the C11 standard. For example, assume that variable x is initialized before read in the following code:

int volatile x;
/* ... */
int y = x + x;
int volatile x;
/* ... */
int y = 2 * x;

An optimized version of the code of the left-hand side is given by the right-hand side which complies with the C standard. The second access to variable x may be omitted whenever the compiler can deduce that no needed side effect is produced by a read access. Note that even a read access may trigger a side effect on certain architectures. Therefore, such an optimization may only be done in very limited cases.

While compiling the code from above with various compilers (GCC, Clang, ICC) the result was always two memory loads of variable x—although in the experiment it was guaranteed that no side effect occurred by a read access. So far I could not find a compiler and an example where the resulting object code involving volatile objects was optimized in any way. The only exception to the rule is if a volatile variable is declared, not initialized, and never assigned to it. In that case GCC and Clang do not even allocate the variable. Only between volatile and non-volatile objects I found a compiler which tries to optimize object code which we investigate later on. I assume that compiler vendors are very conservative about optimizations and accesses to volatile variables and rather decide not to do any optimizations which may break code. If you are aware of an example and compiler where optimized code is omitted, then I would love to hear about.

Let’s get back and continue examining the C standard. A further important statement is given in section 6.7.3 paragraph 7:

What constitutes an access to an object that has volatile-qualified type is implementation-defined.

For example, assignments have an rvalue and yet according to the GCC specification no read access occurs in the following code:

int obj;
volatile int vobj;
vobj = something;
obj = vobj = something;
obj ? vobj = onething : vobj = anotherthing;
obj = (something, vobj = anotherthing);

Such behavior is even mentioned explicitly in C11 section 6.5.16 footnote 111.

Furthermore, all objects with static storage duration are initialized prior program start (C11 § 5.1.2). Thus, any side-effect which would occur while initializing and therefore writing a volatile static object is not guaranteed to materialize.

So Far So Good

Let’s get this together. According to the C standard

Especially the latter is sometimes mistaken. In the following we will have a look at typical patterns used in multithreaded code and how they behave in conjunction with the qualifier volatile.

False Sharing

Consider the following program with two global shared variables x and y which are volatile qualified, and two functions which are assigned to two different threads, respectively.

char volatile x;
char volatile y;

void thread1(void) {
    x = 1;
}

void thread2(void) {
    y = 1;
}

After both threads execute their corresponding function it may be the case that x equals 1 and y equals 0, or vice versa, x equals 0 and y equals 1. Shouldn’t it be always the case that x and y equal 1?

The compiler has the freedom to lay out both variables next to each other in memory. This is not prevented by the qualifier volatile. Since both variables consume one byte of memory it is quite likely that they share the same cache line. In case the cache is not coherent, then as soon as one variable is written, the other variable is also written implicitly, although it may be outdated.

The problem that two distinct objects share the same cache line and are modified by different cores is well known under the term false sharing. If the cache is coherent, then this results only in a performance loss. However, for non-coherent caches this may result in nondeterministic behavior.

Solution

Switch to a cache coherent architecture ;-) If this is not an option, then you might align the variables manually

char _Alignas(CACHE_LINE_SIZE) x;
char _Alignas(CACHE_LINE_SIZE) y;

where CACHE_LINE_SIZE must be defined to match the size of a cache line. Hence, this code is not portable. The alignment specifier _Alignas was introduced in C11. In case you are using an older version of C, then you might want to use a compiler-specific extension. For example, GCC and LLVM/Clang provide __attribute__(( aligned(42) )) or Microsoft __declspec( align(42) ).

Cache Coherence

Assume we are about to run the following code:

bool volatile done = false;

void thread1(void) {
  while (!done) { /* busy wait */ }

  /* effectively dead code ... */
}

void thread2(void) {
  done = true;
}

Functions thread1 and thread2 are executed by different threads running on different cores, respectively. The basic idea is that thread 1 waits as long as variable done is not true. As soon as the value changes to true the remaining code of the function should be executed.

For architectures without cache coherency it is not guaranteed that the code below the loop will ever be reached since the loop could result in an infinite loop. This can happen in the following two cases. Assume that thread 2 writes value true into variable done.

Case 1
Core of thread 2 decides to write the value into the cache but it never writes through into main memory. In that case thread 1 does not have any chance to fetch the updated value.
Case 2
Core of thread 2 decides to write the value through into main memory, however, the core on which thread 1 runs, does not invalidate its cache line which corresponds to variable done. This results in reading a cached value in every loop iteration of thread 1.

In both cases the loop ends up in an infinite loop rendering all remaining code effectively into dead code.

The qualifier volatile prohibits the compiler to cache values in registers, but it does not prohibit the hardware to cache values.
Solution

If your architecture does not support cache coherency, then you have to work around it. For example the PowerPC architecture provides the instructions cli (Cache Line Invalidate) and dcbf (Data Cache Block Flush). Both could be used before a read/write of variable done in order to make the assignment globally visible.

Note, the memory model of the C11 standard assumes that caches are coherent. Therefore, in all further examples we assume that our architecture provides us with proper cache coherence.

Atomicity

Another problem may arise in the following example.

long volatile x;

void thread1(void) {
    x = -1;
}

void thread2(void) {
    x = 0;
}

It is not guaranteed that after executing both functions by different threads that x equals either -1 or 0.

An assignment to variable x may need several instructions and is therefore not guaranteed to be atomic. This may be the case if

Again, the punchline is that the qualifier volatile is not useful in such situations.

For a further discussion have a look at the excellent paper from Hans-J. Boehm where also adjacent data of the form

struct foo { int x:17; int y:15 };

is taken into account.

Solution

Declare variable x as an atomic object.

long _Atomic x;

Then a write x = -1; or x = 0; is performed atomically in the sense that no thread may ever read an indeterminate value. On assembler/machine instruction level several instructions may be required, however, the machine makes the computation appear to have been carried out atomically. Note, the only guarantee we get by this is that at the end of the execution of both threads variable x contains either 0 or -1, i.e., the order of the assignments is not defined.

The default memory model in C11 is sequential consistency. A less strict memory model could be used in our example since we do not care about memory ordering:

atomic_store_explicit(x, -1, memory_order_relaxed);

Unwanted Optimizations

According to the abstract machine defined by the C standard accesses between volatile variables may not be reordered. However, accesses between volatile and non-volatile variables may be reordered. Consider the following example.

bool volatile done;
int x, y;

void thread1(void) {
    x = y + 42;
    done = true;
}

void thread2(void) {
    while (!done)
        ;
    assert(x == y + 42);    // may fire
}

GCC compiles function thread1 of the left-hand side into code which is almost equivalent to the code of the right-hand side.

void thread1(void) {
    x = y + 42;
    done = true;
}
void thread1(void) {
    done = true;
    x = y + 42;
}

The actual emitted assembler code is:

thread1():
  mov eax, DWORD PTR y[rip]
  mov BYTE PTR done[rip], 1
  add eax, 42
  mov DWORD PTR x[rip], eax
  ret

The optimization introduced by GCC is sound w. r. t. the C standard and the emitted code is also faster since load latency is reduced. However, due to the optimization the invariant candidate x == y + 42 does not necessarily hold at the end of function thread2. Therefore, the assertion may fail.

When the Hardware Gets Into Your Way

The first thought to solve the problem may be to declare variable x also as volatile. The disadvantage of this would be that optimizations involving x would not be allowed and therefore less efficient code would be omitted. Even worse this would not solve the problem in general. The compiler may not reorder instructions but the hardware might. Architectures like x86-64 use a strong memory model and therefore will not reorder instructions (strictly speaking a StoreLoad reordering might still happen which is neglectable in this example). However, architectures like PowerPC use a weak memory model and might very well reorder instructions. Hence, the punchline is again that the qualifier volatile does not guarantee us what we actually need.

Solution

The short answer is: Make use of barriers/fences. In the following I assume that you are familiar with barriers/fences and strong/weak memory models and acquire/release semantics since this is something which is out of scope of this article.

A barrier can be inserted manually and in a non-portable way. For example, in order to insert a software barrier via GCC use asm volatile ("" ::: "memory");. A hardware barrier for the PowerPC architecture could be implemented via the instructions isync and lwsync in order to achieve acquire/release semantics.

Since C11 atomic operations including fences have been standardized. This enables us to come up with a portable version of the example:

#include <stdatomic.h>

atomic_bool done;
int x, y;

void thread1(void) {
    x = y + 42;
    atomic_store_explicit(&done, true, memory_order_release);
}

void thread2(void) {
    while (!atomic_load_explicit(&done, memory_order_acquire))
        ;
    assert(x == y + 42);    // will never fire
}

The given atomic load and store instructions imply an aquire and release barrier, respectively. Note on architectures with a strong memory model as e.g. x86-64, the load and store instructions basically only result in a software barrier.

Wanted Optimizations

For any volatile qualified object we have that at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine according to section 6.7.3 paragraph 7. This leaves very little room for any optimization. Let us have a look at two very basic examples where an optimization is prevented by the qualifier volatile.

int volatile x;

x = 1;
x = 2;
int volatile x;


x = 2;

The first assignment is a dead variable assignment which could be removed if x would not be volatile qualified. However, a volatile declared object could very well correspond to a memory mapped location and therefore an access could trigger a needed side-effect. Therefore, according to the abstract machine such an access may not be removed.

Now consider another example:

int volatile x;
int y;

x = 1;
y = x;
int volatile x;
int y;

x = 1;
y = 1;

Since it must be assumed that the value of variable x changes in ways unknown to the implementation an optimization as, e.g., constant propagation may not be applied.

Solution

Assume that the qualifier volatile was misused in order to indicate that a variable is shared between threads. In that case we could change the type of variable x to an atomic type which is not volatile: int _Atomic x. Variables of such types are intended to be used as shared memory between threads. A compiler is then allowed to perform both optimizations from above which is justified as follows.

Since in both examples no explicit synchronization is done between the variable assignments of x and y, we have the freedom to think of a scheduler which never performs a thread switch between the assignments. Hence, in the first example the state in which x equals 1 is never visible to any other thread which renders the assignment x = 1; effectively into a dead variable assignment. Likewise, for the second example we choose a scheduler such that between the assignments no other thread is scheduled and therefore the value of variable x cannot change. Hence it is sound to perform constant propagation.

N.B. Someone pointed out that such optimizations should not be sound since the observable behavior is different between the unoptimized and optimized version. In other words, not only one scheduler should be taken into account but all. An answer to this objection is that an optimization may not introduce new runs which are not possible in the unoptimized version of the program, however, an optimization may drop runs of the program which are not guaranteed to happen. That means the set of possible runs may be reduced—as long as they are not guaranteed—but not increased. In our example no guarantee is given which renders the optimization sound.

Wrap-up

The first two examples might seem a bit contrived at first. However, cache coherency is not something which can always be expected. Especially, in embedded systems you still find architectures without any cache coherency protocols. Solutions for such problems are typically not portable and special care has to be taken.

The other three examples can be solved in a portable way as long as a C11 compliant setup is used. Since C11 a memory model is defined which specifies a memory ordering which enforces that effects are globally visible. In order to realize the latter cache coherency is required. Therefore, the proposed solutions to these three problems only fit if your architecture is cache coherent. Otherwise, you still have to manually make sure that caches are coherent.

The main takeaway is that in all examples the qualifier volatile is misused and does not solve the actual problem. In the worst case the problem is shifted such that it occurs in rare circumstances and is not found by any meaningful test.

If you are interested in further reading I can recommend these:

All experiments were done with GCC 8.1.1, Clang 6.0.0, and ICC 18.0.0 using options -std=c11 -O3 on an x86_64 Linux distribution.

 

Addendum
C11 Annex C Sequence points

The following are the sequence points described in 5.1.2.3: