Writing portable C code is not an easy task, even though the language already abstracts away many machine-specific details. One obstacle left to the developer is that machines may have different endianness, i.e., the byte order of a word depends on the machine. For example, consider the following code snippet:

int i = 0;
unsigned char *pc = (unsigned char *)&i;
*pc = 1;


On a little-endian machine pc points to the least significant byte, whereas on a big-endian machine pc points to the most significant byte of object i. Therefore, the value of i depends on the machine’s endianness, which renders the code non-portable.

Note that the code snippet is perfectly valid w.r.t. the C standard, since a pointer of type unsigned char * may alias any object (see this post for more details about type punning). Nonetheless, it is not portable.

In the following we have a look at the principal ideas behind the tool gchecker, which aims to detect endian-unsafe accesses and emit warnings accordingly. For example, running gchecker on the code above results in:

$ gchecker test.c --
test.c:3:1: warning: Possible endian unsafe access of object 'i' of type 'int' through lvalue of type 'unsigned char'

*pc = 1;
^
1 warning generated.

## Principal Idea

One of the biggest challenges is to determine when type punning occurs. Currently we only consider cases where an object is accessed through an lvalue of incompatible type, which rules out, e.g., type punning via memcpy. In order to detect such cases we need to keep track of pointers. This is somewhat similar to an alias analysis but still distinct. An alias analysis tries to solve the problem of whether two pointers point to the same object. That means it is not necessarily possible to derive from an alias analysis which object a pointer refers to, or even the type of the object pointed to. However, in order to determine type punning it is crucial to know which object a pointer refers to. This becomes even more evident when pointers to pointers are considered, where it is possible that one pointer is changed indirectly. More about this later on. All this leads us to the following three core requirements:

• We need to know the type of the object a pointer refers to.
In order to decide whether an object is accessed through an lvalue of incompatible type we obviously need to know the type of the object itself.
• We do not care about pointer arithmetic, up to some degree.
For example, let i be of type int; then (unsigned char *)&i + sizeof(int) - 1 evaluates to a pointer into i. This is all we need to know, i.e., we are not interested in the precise pointer value but rather in which object it points into.
• We distinguish between pointers into array and non-array objects.
Since we cannot always statically decide what the index of an array will be at runtime, we overapproximate arrays in the sense that we treat an array as a single object and perform non-destructive updates only. This is sometimes also referred to as a structure-insensitive analysis. However, this means that we have to treat pointers into arrays differently than pointers into non-array objects.
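To illustrate the array requirement, consider a store through a runtime-dependent index. A minimal example:

```c
#include <assert.h>

/* The analysis cannot know n at compile time, so all elements of a
   are conflated into a single abstract object, and a store into a is
   treated as a non-destructive (weak) update: it may affect any
   element, and previously known facts about a are kept alongside. */
static int a[4] = { 0, 0, 0, 0 };

static void store(int n, int v)  /* n is only known at runtime */
{
    a[n] = v;  /* statically: could target any element of a */
}
```

Since the analysis cannot tell which element is written, it must keep the old abstract contents of a as well, which is exactly the non-destructive update mentioned above.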

For the moment assume we have an algorithm to check for type punning. What is then left is to decide for each case whether it is an endian-independent access or not. In practice it has become generally accepted to wrap endian-dependent parts of a program inside C preprocessor macros where implementations for different endiannesses are provided. This holds true especially for the pointer and pointer-arithmetic dependent computations in which we are interested (other portable methods like conversion of host to network byte order via hton{s,l} do not deal with pointers explicitly). For example, consider the following code snippet.

int i = 0;
unsigned char *pc;
#if __BYTE_ORDER == __BIG_ENDIAN
pc = (unsigned char *)&i;
#elif __BYTE_ORDER == __LITTLE_ENDIAN
pc = (unsigned char *)&i + sizeof(int) - 1;
#else
# error Esoteric byte-orders are not supported ;-)
#endif
*pc = 1;


Here the endian-dependent pointer arithmetic is wrapped inside C preprocessor macros.

The basic idea of gchecker is to detect such code ranges and mark any data-flow value of our analysis which “flies” through such a range as untainted. Once type punning is detected, a warning is only emitted if the corresponding data-flow value is tainted.
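As a rough illustration of the taint idea, consider the following hypothetical sketch (all names are made up; gchecker's actual data structures differ): an abstract pointer value carries a flag which is cleared when the value is computed inside an endian-guarded region, and a warning is raised only for tainted values.

```c
#include <assert.h>

/* Hypothetical abstract pointer value with a taint flag. */
typedef struct {
    unsigned targets;  /* bitset of variables the pointer may refer to */
    int      tainted;  /* 1 unless the value "flew" through a guarded range */
} abs_ptr;

/* Called for values produced inside a #if __BYTE_ORDER guarded region. */
static abs_ptr untaint(abs_ptr p)
{
    p.tainted = 0;
    return p;
}

/* A diagnostic is emitted only if type punning is detected AND the
   involved value is still tainted. */
static int should_warn(abs_ptr p, int is_type_punning)
{
    return is_type_punning && p.tainted;
}
```

For the snippet above, the pointer computed inside the #if __BYTE_ORDER range would be marked untainted, so the subsequent *pc = 1 would not be reported.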

## Monotone Analysis Framework

In this section we give a short overview of data-flow analysis, or at least of the core principles gchecker is based on, namely monotone data-flow analysis frameworks. If you already feel familiar with such frameworks, feel free to skip this section.

Data-flow analysis is a research topic of its own, and a serious introduction is out of the scope of this article. Still, in order to better understand gchecker and its implementation it might be helpful to have a rudimentary intuition of why things are implemented as they are. Therefore, let us have a look at the main steps we have to take:

Derive a Control-Flow Graph
In the following we traverse a control-flow graph (CFG, for short) derived from the input program where we denote by $$(u,e,v)$$ an edge of the CFG from point $$u$$ to $$v$$ with label $$e$$. The label denotes a statement or expression.
Define Data-Flow Values
This is the data type on which the analysis primarily operates. Let us denote this type by $$\D$$ for which we require a relation $${\sqsupseteq} \subseteq \D \times \D$$ such that $$(\D, \sqsupseteq)$$ forms a complete lattice.
Define Transfer Functions
For each statement/expression define a monotone transfer function $$\sem{\cdot}\colon \D \to \D$$ over complete lattice $$(\D,\sqsupseteq)$$. Intuitively, the transfer functions model the effects each statement/expression has.
Setup Constraint System
For the start point of our CFG we assume a value $$d_0 \in \D$$ which is $$\top$$ for our very first implementation (in future work this will be refined). Furthermore, each edge of the control-flow graph gives rise to one constraint. In total we have: $\begin{array}{rcl@{\qquad}l} G[v] &\sqsupseteq& d_0 & \text{for start point } v \\ G[v] &\sqsupseteq& \sem{e} \, G[u] & \text{for edge } (u,e,v) \end{array}$
Solve Constraint System
The basic idea is that a constraint system $$x_i \sqsupseteq f_i(\vec{x})$$ can be encoded as a function $$\vec{y} = F(\vec{x})$$, and each fixed point $$\vec{x} = F(\vec{x})$$ of the function is a solution to the constraint system. However, such a function may have multiple fixed points and even more solutions. For example, a trivial solution is the constant assignment $$\lambda x{.}\top$$ which maps anything to $$\top$$. Obviously, we are not interested in trivial solutions, which leads us to the question whether there also exists a least fixed point. This is answered by the Knaster-Tarski fixed-point theorem, which says that if $$(\D,\sqsupseteq)$$ is a complete lattice and $$f \colon \D \to \D$$ is a monotone function, then $$f$$ has a least fixed point given by $\mathsf{lfp}(f) \quad=\quad \bigsqcap \{ x \in \D \mid x \sqsupseteq f(x) \}$

which is the greatest lower bound of all prefixed points. However, this characterization is not constructive, which leads us to the question whether fixed-point iteration computes the least fixed point. That means, does the supremum of the directed $$\omega$$-chain

$\bot, f(\bot), f(f(\bot)), \dotsc$

equal the least fixed point? This is answered by the Kleene fixed-point theorem which says that if $$(\D,\sqsupseteq)$$ is a complete lattice and $$f \colon \D \to \D$$ is a monotone function, then

$\bigsqcup \{ f^i(\bot) \mid i \in \mathbb{N} \} \quad\sqsubseteq\quad \mathsf{lfp}(f)$

Thus the least fixed point is an upper bound of every $$f^i(\bot)$$. Therefore, if the iteration becomes stationary, i.e., $$f^i(\bot) = f^{i+1}(\bot)$$ for some $$i$$, it has reached a fixed point which, by the inequality above, must be the least one. Note, in case the complete lattice $$\D$$ contains infinite strictly ascending chains, we may require transfinite iterations.

Meaning of Solution
The interested reader might ask what the relation between a solution to the constraint system and the actual program is. This was answered by Kam and Ullman, who showed for any solution $$G^*$$, and therefore also for the least solution of the constraint system, that $G^{*}[u] \quad\sqsupseteq\quad \bigsqcup \{ \sem{\pi} \, d_0 \mid \pi \colon \text{start} \to^* u \} \qquad \text{for any } u$

where $$\pi$$ denotes any path from program start to program point $$u$$. Thus, in general a solution $$G^*$$ is only an overapproximation of the merge over all paths. Only in case the transfer functions $$\sem{\cdot}$$ are distributive, i.e., distribute over the least upper bound, does equality hold (see Kildall 1973). Still, the good news is that any solution to the constraint system, and therefore also the least solution computed by fixed-point iteration, is a safe upper bound of the merge over all paths.
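To make the iteration tangible, here is a tiny toy solver (a made-up sketch, not gchecker code): data-flow values are bitsets ordered by inclusion, each edge's transfer function only adds bits and is therefore monotone, and round-robin iteration from $$\bot$$ stops as soon as all constraints $$G[v] \sqsupseteq \sem{e}\,G[u]$$ hold.

```c
#include <assert.h>

enum { NODES = 3 };

/* An edge (u,e,v) whose transfer function is f(d) = d | add;
   such functions are monotone w.r.t. set inclusion. */
typedef struct { int u, v; unsigned add; } edge;

/* Round-robin fixed-point iteration: start at bottom (empty sets),
   repeatedly enforce G[v] >= f(G[u]) for each edge until nothing
   changes. On a finite lattice this terminates and yields the
   least solution of the constraint system. */
static void solve(const edge *es, int n_edges, unsigned g[NODES], unsigned d0)
{
    for (int v = 0; v < NODES; ++v)
        g[v] = 0u;              /* bottom */
    g[0] = d0;                  /* constraint for the start point */

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n_edges; ++i) {
            unsigned out = g[es[i].u] | es[i].add;  /* f(G[u]) */
            if ((g[es[i].v] & out) != out) {        /* constraint violated? */
                g[es[i].v] |= out;
                changed = 1;
            }
        }
    }
}
```

For example, on a CFG 0 → 1 → 2 with a back edge 2 → 1 the solver stabilizes after a few rounds; the loop forces repeated propagation until a fixed point is reached.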

This was a very quick introduction to the fundamental principles of monotone data-flow analysis frameworks. Of course, we left out a lot of details, but it should at least give you an intuition of what, why, and how things are done, or at least a starting point to dive deeper into this topic.

All this theory is nice and good, but now let us start looking at gchecker’s implementation. Luckily, gchecker is based on Clang, which provides a control-flow graph for free. Therefore, we can go straight ahead and define our data-flow values in the upcoming section.

## Data-Flow Values

In some cases we cannot decide which object a pointer refers to. For example, assume it is undecidable whether unknown evaluates to true or false; then after the assignment p = unknown ? &i : &j it is undecidable whether p points to object i or j. Typically, what is done is an under- or overapproximation of all possible pointer values. In order to support the latter, the possible values of our pointer analysis form the powerset of all variables. However, there are cases where we cannot say anything at all about a pointer, for which we introduce an artificial value $$\top$$. Thus in total we have:

$\textsf{Values} \quad:\quad \left(2^{\Vars}\right)^\top$

This leads us to our core mapping between variables and possible pointed variables:

$\textsf{State} \quad:\quad \Vars \to \textsf{Values}$

During the analysis we have to traverse each statement and expression. In C and C++ an expression may have side-effects and at the same time a value. Therefore, our final data-flow value is a tuple where in the first element we keep track of side-effects, i.e., variable mappings, whereas in the second element we keep track of the value of an expression which we are about to traverse:

$\D \quad:\quad \left(\textsf{State} \times \textsf{Values}\right)_\bot$

Let us sharpen the intuition of why a tuple is required by two examples. Assume we are about to traverse the expression (pc=&i, 42) + 7 where pc and i are variables. The side-effect of the assignment is captured in the first component of our data-flow value. Since the expression evaluates to a non-pointer value, the result equals $$\top$$, which is encoded in the second component. Thus, if the data-flow value was $$\val{S,V}$$ prior to the expression, then the value is $$\val{S \oplus \Set{pc \mapsto \{i\}}, \top}$$ afterwards. Now consider the expression pc + sizeof(int) - 1 where pc is a pointer. Since the expression has no side-effect, the first component, i.e., the state, is not altered. However, the expression evaluates to a pointer. Therefore, if we are interested in the rvalue, then the data-flow value after the expression equals $$\val{S, S[pc]}$$.

Finally, in order to form a complete lattice we add an artificial element $$\bot$$. I leave it up to the reader to show that $$(\D,\sqsupseteq)$$ indeed forms a complete lattice where $$\sqsupseteq$$ is basically pointwise-extended subset inclusion over variables.

As a start value $$d_0 \in \D$$ for the constraint system we may use the constant mapping $$\lambda x{.}\top$$ as a safe overapproximation. In total we have

$d_0 \quad=\quad \val{\lambda x{.}\top, \top}$

as our start value.
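A possible C encoding of this domain and its start value might look as follows. This is a sketch under the assumption of a fixed, small variable set; the type name dfv and the TOP sentinel are made-up illustrations, not gchecker's actual types.

```c
#include <assert.h>

/* Sketch of D = (State x Values)_bottom for at most 16 variables:
   Values = (2^Vars)^top is a 16-bit set plus a TOP sentinel,
   State maps every variable to such a value, and bottom is an
   explicit flag. */
enum { MAX_VARS = 16 };
#define TOP 0xFFFFFFFFu  /* top: nothing known about the pointer */

typedef struct {
    int      is_bottom;          /* the artificial bottom element   */
    unsigned state[MAX_VARS];    /* State : Vars -> Values          */
    unsigned value;              /* value of the current expression */
} dfv;

/* The start value d0 = (lambda x. top, top). */
static dfv make_d0(void)
{
    dfv d;
    d.is_bottom = 0;
    for (int x = 0; x < MAX_VARS; ++x)
        d.state[x] = TOP;
    d.value = TOP;
    return d;
}
```

Encoding $$\bot$$ as a separate flag keeps the lattice operations simple: joins and comparisons first check the flag and only then fall back to the pointwise set operations.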

## Transfer Functions

While traversing an expression we need to distinguish between lvalues and rvalues. Therefore, in the following we define transfer functions $$\semL{\cdot} \colon \D \to \D$$ and $$\semR{\cdot} \colon \D \to \D$$ where the former is not total since not every expression has an lvalue. Let us start with a single variable where we have:

$\begin{array}{r@{\quad}c@{\quad}l@{\qquad}l} \semL{x} \, \val{S,\_} &=& \val{S, \{x\}} & \text{if } x \in \Vars \\ \semR{x} \, \val{S,\_} &=& \val{S, S[x]} & \text{if } x \in \Vars \end{array}$

We denote by $$S[x]$$ the lookup of variable $$x$$ in mapping $$S$$. Let us now consider the transfer function for the unary address-of operator.

$\begin{array}{r@{\quad}c@{\quad}l@{\qquad}l} \semL{\&e} \, \val{S,\_} &=& \text{undef} \\ \semR{\&e} \, \val{S,\_} &=& \semL{e} \, \val{S,\_} \end{array}$

The transfer function is recursively defined. First of all, an expression $$\&e$$ has no lvalue and therefore $$\semL{\&e}$$ is undefined. The rvalue equals the lvalue of expression $$e$$. The dual of the unary address-of operator is the unary dereference operator

$\begin{array}{r@{\quad}c@{\quad}l@{\qquad}l} \semL{ {*}e} \, \val{S,\_} &=& \semR{e} \, \val{S,\_} \\ \semR{ {*}e} \, \val{S,\_} &=& \val{S', \bigcup \{ S'[x] \mid x \in V' \}} \end{array}$

where $$\val{S',V'} = \semL{ {*}e} \, \val{S,V}$$. The lvalue equals the rvalue of the subexpression $$e$$. In order to compute the rvalue we have to look up all possible variables the expression $$e$$ might point to; in case of overapproximation these might be several. This is done by first computing the lvalue of expression $${*}e$$ and then looking up each variable in the resulting state.
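As an illustration, using a hypothetical bitset encoding of points-to sets (a sketch, not gchecker's implementation), the rvalue of $${*}e$$ amounts to a join over all lookups:

```c
#include <assert.h>

enum { MAX_VARS = 8 };
#define TOP 0xFFFFFFFFu  /* top sentinel, distinct from any 8-bit set */

/* Rvalue of *e: given the set vp of variables the subexpression e
   may point to (the lvalue of *e), the result is the union of the
   points-to sets of all those variables. */
static unsigned deref_rvalue(const unsigned state[MAX_VARS], unsigned vp)
{
    if (vp == TOP)
        return TOP;          /* unknown pointer: unknown value */
    unsigned r = 0u;
    for (int x = 0; x < MAX_VARS; ++x)
        if (vp & (1u << x)) {
            if (state[x] == TOP)
                return TOP;  /* joining with top yields top */
            r |= state[x];   /* union over all S'[x] with x in V' */
        }
    return r;
}
```
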

Now consider an assignment of variable $$x \in \Vars$$ for which we have

$\begin{array}{r@{\quad}c@{\quad}l@{\qquad}l} \semL{x = e} \, \val{S,V} &=& \text{undef} \\ \semR{x = e} \, \val{S,V} &=& \val{S' \oplus \{x \mapsto V'\}, V'} \end{array}$

where $$\val{S',V'} = \semR{e} \, \val{S,V}$$. Again, the expression has no lvalue, only an rvalue. The result of the expression is an updated variable assignment which consists of all previous side-effects, all side-effects of the right-hand side, and of course the assignment itself. The general case of an assignment is similar but distinct in the sense that we have to deal with imprecision whenever we do not know the precise variable on the left-hand side. We have

$\begin{array}{r@{\quad}c@{\quad}l@{\qquad}l} \semL{e' = e} \, \val{S,V} &=& \text{undef} \\ \semR{e' = e} \, \val{S,V} &=& \begin{cases} \val{S'' \oplus \Set{ x \mapsto V' }, V'} & \text{if } V'' = \{x\} \\ \val{S'' \oplus \Set{x \mapsto S''[x] \sqcup V' \mid x \in V''}, V'} & \text{if } V'' \supset \{x\} \\ \val{\lambda x{.}\top,V'} & \text{if } V'' = \top \end{cases} \end{array}$

where $$\val{S',V'} = \semR{e} \, \val{S,V}$$ and $$\val{S'',V''} = \semL{e'} \, \val{S',V'}$$. In case expression $$e'$$ evaluates to a singleton, the assignment is the same as the former one. Otherwise, if the expression evaluates to a set of multiple variables, then in order to be precise we would have to come up with a disjunction tracking each possible assignment separately. Of course, this is not feasible. Therefore, we introduce some overapproximation by adding, to each variable $$e'$$ may point to, the value to which expression $$e$$ evaluates. In the last case, if expression $$e'$$ evaluates to an unknown value, then we must assume that the assignment may overwrite any variable. Thus, in order to be correct, we return the state which maps anything to $$\top$$. This could be relaxed similarly to the previous case by performing a non-destructive update of all variable bindings. However, for our first implementation we go with this solution as it is straightforward and easy to implement.
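The three cases can be sketched in C using a hypothetical bitset encoding of points-to sets (again an illustration, not gchecker's actual code): a singleton target gets a strong update, multiple targets get a weak update, and an unknown target destroys all knowledge.

```c
#include <assert.h>

enum { MAX_VARS = 8 };
#define TOP 0xFFFFFFFFu  /* top sentinel, distinct from any 8-bit set */

static int is_singleton(unsigned v)
{
    return v != 0u && (v & (v - 1u)) == 0u;
}

/* The three cases of the general assignment e' = e: vpp plays the
   role of V'' (possible targets of the left-hand side) and vp the
   role of V' (value of the right-hand side). */
static void assign(unsigned state[MAX_VARS], unsigned vpp, unsigned vp)
{
    if (vpp == TOP) {
        /* Unknown target: any variable may have been overwritten. */
        for (int x = 0; x < MAX_VARS; ++x)
            state[x] = TOP;
    } else if (is_singleton(vpp)) {
        /* Exactly one target: strong (destructive) update. */
        for (int x = 0; x < MAX_VARS; ++x)
            if (vpp & (1u << x))
                state[x] = vp;
    } else {
        /* Several possible targets: weak update, i.e., join the new
           value with each old binding. */
        for (int x = 0; x < MAX_VARS; ++x)
            if (vpp & (1u << x))
                state[x] = (state[x] == TOP || vp == TOP)
                               ? TOP
                               : (state[x] | vp);
    }
}
```
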

As one of the last examples we have a look at the binary operators plus and minus. Whenever an integer is added to or subtracted from a pointer, the resulting pointer is again a pointer into the very same object (see this post for more details about pointer arithmetic). This leads us to the following definition

$\begin{array}{r@{\quad}c@{\quad}l@{\qquad}l} \semL{e_1 \pm e_2} \, \val{S,V} &=& \text{undef} \\ \semR{e_1 \pm e_2} \, \val{S,V} &=& \begin{cases} \semR{e_1} \, \semR{e_2} \, \val{S,V} & \text{if } e_1 \text{ has pointer type and } e_2 \text{ does not} \\ \semR{e_2} \, \semR{e_1} \, \val{S,V} & \text{if } e_2 \text{ has pointer type and } e_1 \text{ does not} \\ \val{S',\top} & \text{otherwise} \end{cases} \end{array}$

where $$\val{S',\_} = \semR{e_1} \, \semR{e_2} \, \val{S,V}$$. The comma operator has a similar semantics, where we are interested in the side-effects of both operands but only in the value of the second operand:

$\begin{array}{r@{\quad}c@{\quad}l@{\qquad}l} \semL{e_1, e_2} \, \val{S,V} &=& \text{undef} \\ \semR{e_1, e_2} \, \val{S,V} &=& \semR{e_2} \, \semR{e_1} \, \val{S,V} \end{array}$

For all other unary and binary operators $$\square$$, the expressions $$\square\,e$$ and $$e \mathbin{\square} e'$$ do not have an lvalue but only an rvalue. The transfer functions for rvalues are then given as follows

$\begin{array}{r@{\quad}c@{\quad}l@{\qquad}l} \semR{\square\,e} \, \val{S,V} &=& \val{S',\top} \\ \semR{e \mathbin{\square} e'} \, \val{S,V} &=& \val{S'',\top} \end{array}$

where $$\val{S',V'} = \semR{e} \, \val{S,V}$$ and $$\val{S'',\_} = \semR{e'} \, \val{S',V'}$$. Since those expressions do not evaluate to pointer values, we are just interested in accumulating possible side-effects.

## Soundness vs Completeness

Determining type punning is in general undecidable, since the halting problem can be reduced to it. Thus a sound and complete algorithm does not exist, and we have to sacrifice one or the other. A trivially complete algorithm would be to emit a warning for every dereference. Likewise, a trivially sound algorithm would be to never emit a warning. Obviously, neither algorithm is helpful at all. Typically, what is done is to come up with an under- or overapproximation. Performing underapproximation has the advantage that no false positive is emitted, since here the question answered is whether type punning happens over all paths. In contrast, overapproximation answers the question whether there exists a path with type punning. That means overapproximation is a relaxation of underapproximation which allows emitting more diagnostics, with the possibility of emitting false positives. This is demonstrated by the following code snippet.

int i = 0;
unsigned char c;
unsigned char *pc;

if (unknown)
pc = (unsigned char *)&i;
else
pc = &c;

*pc = 1;


Assume it is undecidable whether unknown evaluates to true or false. With underapproximation no diagnostic would be emitted, since there exists one path for which pc points to an object of compatible type. In contrast, with overapproximation a diagnostic would be emitted, since there exists one path where type punning happens. However, this bears the risk that at runtime unknown always evaluates to false, which would render the diagnostic a false positive.

### Reducing the Set of Paths

In case of overapproximation, reducing the set of possible paths may of course help to reduce the number of false positives. Hence, it is tempting to make use of other analyses from Clang such as value propagation, dead-code elimination, and so forth in order to exclude paths. However, those analyses and transformations may themselves operate on endian-dependent values. In other words, a decision may be made to remove or rewrite code which itself depends on the machine’s endianness. Consider the following example where pc points either to the least or to the most significant byte of an integer object.

int foo() {
int i = 1;
unsigned char *pc = (unsigned char *)&i;
return *pc;
}


Compiling the example with Clang (version 10, option -O1) results in optimized code as one would expect from a modern compiler.

; on little-endian machines
define i32 @foo() {
ret i32 1
}

; on big-endian machines
define i32 @foo() {
ret i32 0
}


Any analysis which runs after such a transformation is not able to detect type punning anymore.

In total, if a transformation is made upon endian-independent values, then it might help to reduce the set of possible paths and therefore also false positives, whereas if a transformation is made upon endian-dependent values, then it is possible that some cases of type punning are not detected anymore. This is not inherently bad, but it shows the pretty pickle we are in when considering soundness and completeness.

## Wrap-up

Type punning may lead to non-portable code if the machine’s endianness is not taken into account. In C there exist three ways to perform type punning, namely via memcpy, unions, and pointer casts. If done through pointer casts, then it is more or less well established that a portable program wraps multiple endian-dependent implementations inside C preprocessor macros. Therefore, emitting a diagnostic depending on whether a pointer value “flew” through such a range or not seems a reasonable approach. At the time of writing this post, gchecker is still missing some core features before a decent evaluation by running it on non-trivial code can be done. Some of the missing features are:

• Approximate global variables more precisely; currently those are abstracted by $$\top$$
• Support aggregate and union types
• Lift to an interprocedural analysis

There is also the question of how to benefit from other implementations. For example, an integration into the Clang Static Analyzer could help to support some of the missing features, like an interprocedural analysis.