Arrays Are Second-class Citizens in C

Caveat: The following post should be seen more as a hodgepodge of individual contributions than a self-contained contribution.

In general an array is a contiguously allocated non-empty collection of objects at equally spaced addresses. Each element can be accessed by its corresponding index tuple.

Array Types

In C an element of an array may be accessed in multiple ways. Assume that we are given an array int a[10];. Are the following expressions all the same and point to the first element of array a?

The simple answer is yes and no. They all point to the first element of a. However, their types are all different.

The type of expression

Although all expressions have a different type, they all point to the first element of array a. Hence, all of the following conditionals evaluate to true.

(void *) a == (void *)&a
(void *) a == (void *)&a[0]
(void *)&a == (void *)&a[0]

Strange, isn’t it? Let’s have a look at this piece by piece.

An object of type int [10] is converted implicitly to type int* whenever required. For example, int *p = a; is fine and the conversion to pointer-type is done implicitly. The identifiers p and a can then be used interchangeably, i.e., p[i] == a[i] for 0 <= i <= 9 (see also C11 section 6.5.2.1 paragraph 2 for more details). This is defined in section 6.3.2.1.

C11 § 6.3.2.1 Lvalues, arrays, and function designators ¶ 3
[…] an expression that has type “array of type” is converted to an expression with type “pointer to type” that points to the initial element of the array object and is not an lvalue. […]
Except when it is the operand of the sizeof operator, the _Alignof operator, or the unary & operator, or is a string literal used to initialize an array, an expression that has type “array of type” is converted to an expression with type “pointer to type” that points to the initial element of the array object and is not an lvalue. If the array object has register storage class, the behavior is undefined.

Whereas for &a and &a[0] we have that the unary operator & is involved and the semantics for such expressions is defined in section 6.5.3.2.

C11 § 6.5.3.2 Address and indirection operators ¶ 3

The unary & operator yields the address of its operand. If the operand has type “type”, the result has type “pointer to type”. If the operand is the result of a unary * operator, neither that operator nor the & operator is evaluated and the result is as if both were omitted, except that the constraints on the operators still apply and the result is not an lvalue. Similarly, if the operand is the result of a [] operator, neither the & operator nor the unary * that is implied by the [] is evaluated and the result is as if the & operator were removed and the [] operator were changed to a + operator. Otherwise, the result is a pointer to the object or function designated by its operand.

Since &p is of type pointer-to-array we have to declare p as follows int (*p)[10] = &a;. Note the parentheses around *p. This is required because of operator precedence of *—similarly to function pointers where parentheses are required. We then have *p[i] == a[i] for 0 <= i <= 9.

The strange thing about the type pointer-to-array is that we have to dereference which is not the case for the other two expressions, although all three expressions point to the same memory location. So far I have not seen the use of the type pointer-to-array in practice and would not recommend it.

Multidimensional Array Types

Every dimension of an array is defined by appending a pair of squared brackets enclosing an expression to an array identifier. For example, int a[10][20][30]; defines an array with size 10 in its first dimension, size 20 in its second dimension, and size 30 in its third dimension. Hence the array consists of 10 ⋅ 20 ⋅ 30 = 6000 integers in total.

The type of expression

Like for one-dimensional arrays a multidimensional array is adjusted into a pointer, too. The type of expression &a[0] is a pointer-type by losing the information of the first dimension. However, the number of remaining dimensions and their corresponding lengths are preserved.

This conversion has a drawback. From a types perspective you might not distinguish arrays of different types. The following code snippet is well typed.

int x[3];
int y[2][3];
&x == &y[0];

The type of the expressions &x and &y[0] is int (*)[3] although they point to the first elements of arrays of different size and number of dimensions. On the other hand, if one makes us of x or &x[0] instead of &x, then the conditional expression would not be well typed anymore and a compiler could warn about a potential misuse. From a types point of view everything is fine since both expressions &x and &y[0] point to a memory location of at least three integers.

Dimension Two, Three, Four, …

The first dimension while referring to an array is somewhat special in the sense that the precise number of elements need not be defined by the type but for all remaining dimensions it must be. For example, int [][20][30] and int (*)[20][30] or even int [10][20][30] are valid types for any three-dimensional array where the second and third dimensions have precisely 20 and 30 elements, respectively. No matter what the size of the first dimension of an array actually is. In contrast to any other dimension where the number of elements must match precisely.

Let us illustrate this by an example.

int x[10][20][30];
int (*y)[2][30] = x;   // warning

The number of elements of the second dimension of x and y do not match and LLVM/clang outputs the following warning:

warning: incompatible pointer types initializing 'int (*)[2][30]' with an expression of type 'int [10][20][30]' [-Wincompatible-pointer-types]
    int (*y)[2][30] = x;
          ^           ~

In order to understand the warning and the problem behind it, it is crucial to know the C memory model of an array. Any array of any number of dimensions is flattened into one block of continues memory. This is accomplished by allocating one row after another consecutively, i.e., the memory model of an array in C is in row-major format. For example an array of two dimensions where the first dimension has length N+1 and the second dimension has length M+1 has the following memory layout:

  [0,0] [0,M] [1,0] [1,M] [N,M]  

One row after another one is allocated in memory. The general case is quite similar. Consider an array with \(d \geq 2\) dimensions \(\text{int } a[N_1][N_2] \cdots [N_d];\) where \(N_i, 1 \leq i \leq d\) defines the number of elements of each dimension, respectively. If we are about to access the element \(a[n_1][n_2] \cdots [n_d]\) then the offset from the base address \(a\) is the sum of all indices multiplied with the remaining dimensions sizes

\[ ( n_1 N_2 N_3 \cdots N_d ) + ( n_2 N_3 N_4 \cdots N_d ) + \cdots + ( n_{d-1} N_d ) + n_d \]

which is equivalent to

\[ \sum_{i=1}^d \left( \prod_{j=i+1}^d N_j \right) n_i \]

Note that we do not make use of the size of the first dimension \(N_1\) but of all other sizes \(N_i, 2 \leq i \leq d\). Therefore, in order to compute the address of a multidimensional index the C standard requires that all dimensions sizes must be given but the first one which is optional.

This recursive subscripting is defined in C11 section 6.5.2.1 for arrays with two or more dimensions.

C11 § 6.5.2.1 Array subscripting ¶ 3

Successive subscript operators designate an element of a multidimensional array object. If E is an \(n\)-dimensional array (\(n \geq 2\)) with dimensions \(i \times j \times \cdots \times k\), then E (used as other than an lvalue) is converted to a pointer to an (\(n - 1\))-dimensional array with dimensions \(j \times \cdots \times k\). If the unary * operator is applied to this pointer explicitly, or implicitly as a result of subscripting, the result is the referenced (\(n - 1\))-dimensional array, which itself is converted into a pointer if used as other than an lvalue. It follows from this that arrays are stored in row-major order (last subscript varies fastest).

Initializing Arrays

An array is initialized either via a compound literal, an initializer list, or a string literal. A compound literal is described in C11 section 6.5.2.5 as “a parenthesized type name followed by a brace-enclosed list of initializers”. An initializer list is in turn described in C11 section 6.7.9 as “a brace-enclosed list of initializers for the [array] elements”. The third variant is an initialization via a string literal.

int  x[]  = (int[]){ 1,2,3 };  // compound literal
int  y[]  = { 1,2,3 };         // initializer list
char z[]  = "123";             // string literal

A compound literal is beneficial if you want to create an unnamed array:

void *p = (int[10]){ 1,2,3 };

An array of ten integers is defined where the first three elements are initialized to 1, 2, 3, and the remaining integers to 0. After the assignment pointer p points to the first element of the unnamed array.

Re-initializing an array is, however, not allowed in contrast to a variable:

char x[] = "123";
x = "456";        // error

char *y = "123";
y = "456";        // ok

Since C99 a distinction is made between an lvalue and a modifiable lvalue.

C11 § 6.3.2.1 Lvalues, arrays, and function designators ¶ 1
[…] A modifiable lvalue is an lvalue that does not have array type […]
An lvalue is an expression (with an object type other than void) that potentially designates an object;64) if an lvalue does not designate an object when it is evaluated, the behavior is undefined. When an object is said to have a particular type, the type is specified by the lvalue used to designate the object. A modifiable lvalue is an lvalue that does not have array type, does not have an incomplete type, does not have a const-qualified type, and if it is a structure or union, does not have any member (including, recursively, any member or element of all contained aggregates or unions) with a const- qualified type.

64) The name “lvalue” comes originally from the assignment expression E1 = E2, in which the left operand E1 is required to be a (modifiable) lvalue. It is perhaps better considered as representing an object “locator value”. What is sometimes called “rvalue” is in this International Standard described as the “value of an expression”. An obvious example of an lvalue is an identifier of an object. As a further example, if E is a unary expression that is a pointer to an object, *E is an lvalue that designates the object to which E points.

Although an array is a lvalue it is not a modifiable lvalue. Therefore, any re-initialization fails.

Sparse Array Initialization

Assume you want to initialize an array and only few elements should have a non-default value. In such a case you might want to make use of a designator denoted by [ and ].

#define N 100
int a[N] = { 0,1,2, [10]=10,11,12, [N/3-3]=30,31,32, [N-3]=97,98,99 };

For array a we have a[i] == i for all i in {0,1,2,10,11,12,30,31,32,97,98,99} and for every other index i we have a[i] == 0, i.e., a[i] is initialized with the default integer value.

The expression enclosed by the designator must be a constant integer expression.

Initializing Character Arrays

Arrays of characters are somewhat special in the sense that they may be initialized in multiple ways. Before we go on recap that any string is terminated with the null character '\0', i.e., a byte with all bits set to 0. Now have a look at the following (incomplete) list of array initializations:

char a0[]   = "123";
char a1[3]  = "123";
char a2[4]  = "123";
char a3[99] = "123";
char a4[]   = { '1', '2', '3' };
char a5[]   = { '1', '2', '3', '\0' };
char a6[]   = { "123" };

We have that each array contains as the first three elements the characters '1', '2', and '3'. Beside that some contain the null character and some do not. The arrays a1 and a4 do not contain the null character and their size is exactly three. Whereas every other array has as its fourth element the null character and therefore represent a proper string. The array a4 has even 96 times the null character. This is defined in C11 section 6.7.9.

C11 § 6.7.9 Initialization ¶ 14

An array of character type may be initialized by a character string literal or UTF-8 string literal, optionally enclosed in braces. Successive bytes of the string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array.

C11 § 6.7.9 Initialization ¶ 21

If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate, or fewer characters in a string literal used to initialize an array of known size than there are elements in the array, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration.

Non-modifiable Character Arrays

We have already seen that arrays are not pointers. For character arrays, however, it is even more subtle then for arrays of other types:

char x[] = "123";
char *y  = "123";

The first assignment initializes an array with automatic or static storage duration (depends on whether x is declared in file or function scope) which is modifiable. Whereas the second assignment initializes a pointer to an array with static storage duration which is not necessarily modifiable.

Non-constant Array Initializer

Since C99 an initializer list may contain non-constant expressions.

int a[] = { foo(), bar() << 5, x = 42 };

However, care has to be taken if the expressions involve side effects, since the order of evaluation is not defined.

Arrays as Parameters

An array cannot be passed by value opposed to scalar values as, e.g., an integer or a double. Instead a pointer to the first element of an array is passed.

C11 § 6.7.6.3 Function declarators (including prototypes) ¶ 7

A declaration of a parameter as “array of type” shall be adjusted to “qualified pointer to type”, where the type qualifiers (if any) are those specified within the [ and ] of the array type derivation. If the keyword static also appears within the [ and ] of the array type derivation, then for each call to the function, the value of the corresponding actual argument shall provide access to the first element of an array with at least as many elements as specified by the size expression.

The paragraph defines several properties for parameters of array-type. Let us have a look at an example.

void foo(int x[3]) { }

foo((int[]){ 1,2,3,4,5 });

The first sentence of paragraph 7 defines that every array-type is adjusted to a pointer type. For our example this means that the declaration void foo(int *x); is compatible with the function definition, i.e., the array-parameter decays into a pointer. Since the parameter of the function call is adjusted to pointer type int*, too, the program is well-typed—although the parameter has more elements than x.

In the second part of paragraph 7 a subtle statement is made. If the keyword static occurs inside squared brackets of an array declaration, then the size of the passed array must be greater or equal than the size of the declared parameter. In order to grasp the idea of this let us have a look at an example.

void foo(int a[10][20][30]);
void foo(int (*a)[20][30]);
void foo(int a[][20][30]);
void foo(int a[100][20][30]);

void foo(int a[10][20][30]) { }

void bar(int a[static 10][20][30]) { }

void baz(void) {
    int x[1][20][30];
    int y[1000][20][30];

    foo(x); // ok
    foo(y); // ok

    bar(x); // fail
    bar(y); // ok
}

All given function prototypes for foo are compatible and they are also compatible with the corresponding function definition. Function bar is defined with exactly one parameter of type array which contains the keyword static. In function baz two arrays x and y are defined with different sizes in their first dimension. Both function calls to foo, which reference the arrays x and y respectively, are valid although the size of the first dimension of the passed array does not match the size of the first dimension of the array at the side of the function definition. In contrast to the first function call of baz where at least a warning should be raised. Since x has fewer elements in its first dimension than it is required according to the array-parameter of the function definition in bar, a promise is violated.

The static keyword in an array-declaration has also another side effect:

The keyword static in an array-declaration implies that the function parameter must not be a null-pointer.
The Hacky Way

If you insist on passing an array as a parameter you might want to do it by wrapping an array inside a structure.

struct foo { int a[10][20][30]; };

void f(struct foo x) { }

f((struct foo){ });

A structure is passed by value in contrast to an array which is passed by reference. Beware that this means you have to copy a lot of data if the array is huge.

Size of an Array

Passing arrays by reference and not by value has another side effect. Consider the following code:

bool foo(int x[10]) {
    int y[10];

    return sizeof(x) == sizeof(y);
}

Since parameter x is adjusted to pointer type we have that sizeof(x) == sizeof(int*) whereas for y we have that sizeof(y) == sizeof(int)*10 holds. Hence as long as on your architecture a pointer has not ten times the size of an integer, the function always returns false ;-) This is defined in section 6.5.3.4.

C11 § 6.5.3.4 The sizeof and _Alignof operators ¶ 4
When sizeof is applied […] to an operand that has array type, the result is the total number of bytes in the array.103) […]
When sizeof is applied to an operand that has type char, unsigned char, or signed char, (or a qualified version thereof) the result is 1. When applied to an operand that has array type, the result is the total number of bytes in the array.103) When applied to an operand that has structure or union type, the result is the total number of bytes in such an object, including internal and trailing padding.

103) When applied to a parameter declared to have array or function type, the sizeof operator yields the size of the adjusted (pointer) type (see 6.9.1).

Variable Length Arrays

Since C99 an array may have a size which is not statically known. Lets have a look at an example.

size_t foo(int n) {
    int a[n+42];
    return sizeof(a);
}

The size of array a is not statically known and therefore stack space has to be allocated at run-time. A side effect of this is that the sizeof operator cannot be evaluated at compile-time; unlike for any other type.

Once a variable length array is allocated its size does not change during its lifetime.

void foo(int n) {
    char a[n];
    n = 42;
    printf("%zu\n", sizeof(a));
}

The size of array a depends on the value of n at allocation time. A later change of variable n does not change the size of a. For example, a call foo(5) prints 5.

A somewhat similar property holds for a variable length array type denoted by a typedef.

int n = 10;
typedef char T[n];
T x;
n = 20;
T y;

Arrays x and y have 10 elements, respectively.

Note that the size of a dimension of a variable length array may be given as an arbitrary expression which might have side effects.

int a[x <<= y];
int b[x <<= y][x];   // undefined behavior

The order of evaluation is not defined. Hence the second array declaration imposes undefined behavior.

Opposed to statically allocated arrays, variable length arrays may not be initialized via initializer lists. Assume that n is not statically known in the following code fragment:

int a[n] = { 1,2,3 }; // error

Hence, an initialization has to be done by hand-written code.

Variable length arrays are a conditional feature that implementations need not support. In contrast to C++17 which does not support variable length arrays at all.

Flexible Array Members

A flexible array member is defined in section 6.7.2.1:

C11 § 6.7.2.1 Structure and union specifiers ¶ 18
[…] the last element of a structure […] may have an incomplete array type; this is called a flexible array member. […]
As a special case, the last element of a structure with more than one named member may have an incomplete array type; this is called a flexible array member. In most situations, the flexible array member is ignored. In particular, the size of the structure is as if the flexible array member were omitted except that it may have more trailing padding than the omission would imply. However, when a . (or ->) operator has a left operand that is (a pointer to) a structure with a flexible array member and the right operand names that member, it behaves as if that member were replaced with the longest array (with the same element type) that would not make the structure larger than the object being accessed; the offset of the array shall remain that of the flexible array member, even if this would differ from that of the replacement array. If this array would have no elements, it behaves as if it had one element but the behavior is undefined if any attempt is made to access that element or to generate a pointer one past it.

This allows us to write code as e.g.

struct vector {
    size_t n;
    int data[];
};

struct vector *v = malloc(sizeof(struct vector)
                          + sizeof(int [k]));

where k may be a non-statically known integer.

The size of a structure with a flexible array member is the same as the structure without a flexible array member except that it may have additional padding bytes.

struct s1 { char x; };
struct s2 { char x; double y[]; };

On my x86-64 machine the size of structure s1 is 1 byte whereas of s2 is 8 byte since a double must be 64 bit aligned. Note, a flexible array member has incomplete type. Therefore, the sizeof operator cannot be applied to such a structure member itself.

In C99 flexible array members are introduced. In contrast to C++17 which does not support flexible array members.

Food for Thought

Find the mistake! A program consisting of two compilation units is given. In the first compilation unit only a character array with 1024 elements is allocated.

char buf[1024];

In the second compilation unit the string “Hello World!” is copied into the buffer of the first compilation unit. Of course, the buffer is big enough for the string, i.e., that is not the mistake.

#include <stdio.h>
#include <string.h>

extern char *buf;

int main(void) {
    strcpy(buf, "Hello World!");
    printf("%s\n", buf);
    return 0;
}

No warning or error is raised during compilation. If we execute the program we run into a segmentation fault. What went wrong?