What is the rationale for making subtraction of two pointers not related to the same array undefined behavior?

Speaking more academically: pointers are not numbers. They are pointers.

It is true that a pointer on your system is implemented as a numerical representation of an address-like location in some abstract kind of memory (probably a virtual, per-process memory space).

But C++ doesn't care about that. C++ wants you to think of pointers as post-its, as bookmarks, to specific objects. The numerical address values are just a side-effect. The only arithmetic that makes sense on a pointer is forwards and backwards through an array of objects; nothing else is philosophically meaningful.
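
A minimal sketch of that distinction (the array names a and b here are purely illustrative):

#include <cstddef>

int main() {
    int a[8] = {};
    int b[8] = {};

    std::ptrdiff_t ok  = &a[5] - &a[2]; // well defined: both pointers walk the same array, result is 3
    std::ptrdiff_t bad = &b[0] - &a[0]; // undefined behaviour: the operands bookmark two unrelated objects
    (void)ok; (void)bad;
}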

This may seem pretty arcane and useless, but it's actually deliberate and useful. C++ doesn't want to constrain implementations by attaching further meaning to practical, low-level computer properties that it cannot control. And, since there is no reason for it to do so (why would you want to do this?), it just says that the result is undefined.

In practice you may find that your subtraction works. However, compilers are extremely complicated and make great use of the standard's rules in order to generate the fastest code possible; that can and often will result in your program appearing to do strange things when you break the rules. Don't be too surprised if your pointer arithmetic operation is mangled when the compiler assumes that both the originating value and the result refer to the same array — an assumption that you violated.


As noted by some in the comments, unless the resulting value has some meaning or is usable in some way, there is no point in making the behavior defined.

There has been a study done for the C language to answer questions related to Pointer Provenance (with the intention of proposing wording changes to the C specification), and one of the questions was:

Can one make a usable offset between two separately allocated objects by inter-object subtraction (using either pointer or integer arithmetic), to make a usable pointer to the second by adding the offset to the first? (source)

The conclusions of the authors of the study were published in a paper titled Exploring C Semantics and Pointer Provenance, and with respect to this particular question, the answer was:

Inter-object pointer arithmetic

The first example in this section relied on guessing (and then checking) the offset between two allocations. What if one instead calculates the offset, with pointer subtraction; should that let one move between objects, as below?

// pointer_offset_from_ptr_subtraction_global_xy.c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

int x=1, y=2;
int main() {
    int *p = &x;
    int *q = &y;
    ptrdiff_t offset = q - p;
    int *r = p + offset;
    if (memcmp(&r, &q, sizeof(r)) == 0) {
        *r = 11; // is this free of UB?
        printf("y=%d *q=%d *r=%d\n",y,*q,*r);
    }
}

In ISO C11, the q-p is UB (as a pointer subtraction between pointers to different objects, which in some abstract-machine executions are not one-past-related). In a variant semantics that allows construction of more-than-one-past pointers, one would have to choose whether the *r=11 access is UB or not. The basic provenance semantics will forbid it, because r will retain the provenance of the x allocation, but its address is not in bounds for that. This is probably the most desirable semantics: we have found very few example idioms that intentionally use inter-object pointer arithmetic, and the freedom that forbidding it gives to alias analysis and optimisation seems significant.

This study was picked up by the C++ community, summarized, and sent to WG21 (the C++ Standards Committee) for feedback.

Relevant point of the Summary:

Pointer difference is only defined for pointers with the same provenance and within the same array.

So, they have decided to keep it undefined for now.

Note that there is a study group SG12 within the C++ Standards Committee for studying Undefined Behavior & Vulnerabilities. This group conducts a systematic review to catalog cases of vulnerabilities and undefined/unspecified behavior in the standard, and recommend a coherent set of changes to define and/or specify the behavior. You can keep track of the proceedings of this group to see if there are going to be any changes in the future to the behaviors that are currently undefined or unspecified.


First see this question mentioned in the comments for why it isn't well defined. The concise answer given there is that arbitrary pointer arithmetic is not possible in segmented memory models used by some (now archaic?) systems.

What is the rationale to make such behavior undefined instead of, for instance, implementation defined?

Whenever the standard specifies something as undefined behaviour, it usually could be specified merely as implementation defined instead. So, why specify anything as undefined?

Well, undefined behaviour is more lenient. In particular, being allowed to assume that there is no undefined behaviour, a compiler may perform optimisations that would break the program if the assumptions weren't correct. So, a reason to specify undefined behaviour is optimisation.

Let's consider a function fun(int* arr1, int* arr2) that takes two pointers as arguments. Those pointers could point into the same array, or not. Let's say the function iterates through one of the pointed-to arrays (arr1 + n) and must compare each position to the other pointer for equality ((arr1 + n) != arr2) in each iteration, for example to ensure that the pointed-to object is not overwritten.

Let's say that we call the function like this: fun(array1, array2). The compiler knows that (array1 + n) != array2, because otherwise the behaviour is undefined. Therefore, if the function call is expanded inline, the compiler can remove the redundant check (arr1 + n) != arr2, which is always true. If pointer arithmetic across array boundaries were well (or even implementation) defined, then (array1 + n) == array2 could be true for some n, and this optimisation would be impossible, unless the compiler could prove that (array1 + n) != array2 holds for all possible values of n, which can sometimes be much more difficult.
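
A hedged sketch of such a function and call (the length parameter n and the body are illustrative, not taken from any real code):

#include <cstddef>

// Zero out n elements of arr1, but skip any position that happens to be arr2,
// so that the object arr2 points to is never overwritten.
void fun(int* arr1, int* arr2, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if ((arr1 + i) != arr2)   // the check discussed above
            arr1[i] = 0;
    }
}

int main() {
    int array1[4] = {1, 2, 3, 4};
    int array2[1] = {42};
    // After inlining, the compiler may reason that (array1 + i) can never equal
    // array2 here, since pointer arithmetic on array1 that left its bounds would
    // already be UB, and drop the check inside the loop as always true.
    fun(array1, array2, 4);
}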


Pointer arithmetic across the members of a class could be implemented even in segmented memory models. The same goes for iterating over the boundaries of a subarray. There are use cases where these could be quite useful, but they are technically UB (two illustrative sketches follow below).
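
Both of the sketches below typically compile and appear to "work", yet both are technically UB:

#include <cstddef>

struct Pixel {
    int r, g, b;   // three consecutive members of the same type
};

int main() {
    Pixel p{10, 20, 30};
    // Pointer arithmetic across class members: walking from r to g to b as if
    // they formed an array. Technically UB, since &p.r points to a single int.
    int* it = &p.r;
    for (int i = 0; i < 3; ++i)
        *(it + i) = 0;

    int grid[2][3] = {{1, 2, 3}, {4, 5, 6}};
    // Iterating over the boundary of a subarray: treating the 2x3 array as a
    // flat array of 6 ints steps past the end of the subarray grid[0].
    int* q = &grid[0][0];
    int sum = 0;
    for (int i = 0; i < 6; ++i)
        sum += q[i];
    (void)sum;
}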

An argument for UB in these cases is that it opens up more possibilities for optimisation. You don't necessarily need to agree that this is a sufficient argument.