Which is faster, glUniform4f or glUniform4fv, taking all kinds of optimization into account?

While the *v variant mainly exists for setting uniforms that are of array type, the OpenGL specification explicitly allows you to use the array variants also for setting scalar values by using a count of one.

Let me quote the OpenGL spec (emphasis added by me):

The commands glUniform{1|2|3|4}{f|i}v can be used to modify a single uniform variable or a uniform variable array. These commands pass a count and a pointer to the values to be loaded into a uniform variable or a uniform variable array. A count of 1 should be used if modifying the value of a single uniform variable, and a count of 1 or greater can be used to modify an entire array or part of an array.

This is from the OpenGL 2.1 spec; the OpenGL 4.2 spec reads the same.

Actually the other way round is allowed, too. Assume you have a uniform of type vec3 v[2] and you query its location using glGetUniformLocation(); it may return 6. That means that 6 is actually the location of v[0], so a non-array call such as glUniform3f with that location sets just v[0].
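To make both directions concrete without needing a GL context, here is a minimal sketch using made-up stand-ins. mock_store, mock_uniform3f and mock_uniform3fv are hypothetical names, not OpenGL functions, and the sketch assumes that array elements sit in consecutive slots starting at the array's location.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a program's vec3 uniform storage; NOT OpenGL.
   Assumption: array elements occupy consecutive slots starting at the
   array's location. */
enum { MOCK_SLOTS = 8 };
static float mock_store[MOCK_SLOTS][3];

/* Stand-in for glUniform3fv: copies `count` vec3 values starting at `location`. */
static void mock_uniform3fv(int location, int count, const float *value)
{
    assert(value != NULL && location >= 0 && count >= 0);
    assert(location + count <= MOCK_SLOTS);
    memcpy(mock_store[location], value, (size_t)count * sizeof mock_store[0]);
}

/* Stand-in for glUniform3f: writes exactly one vec3 at `location`. */
static void mock_uniform3f(int location, float x, float y, float z)
{
    assert(location >= 0 && location < MOCK_SLOTS);
    mock_store[location][0] = x;
    mock_store[location][1] = y;
    mock_store[location][2] = z;
}
```

With v at location 6, mock_uniform3fv(6, 2, data) fills v[0] and v[1], mock_uniform3f(6, x, y, z) overwrites only v[0], and mock_uniform3fv(0, 1, data) sets a single non-array vec3 at location 0 using a count of one.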


Now back to the initial question: Which variant is faster?

It's impossible to tell. They might be equally fast, or one might be faster than the other; this is very implementation dependent. Actually I would expect most implementations to implement one of the two on top of the other.

E.g. consider the following code:

void glUniform1f ( GLint location, GLfloat v0 ) {
    glUniform1fv(location, 1, &v0);
}

In that case the array variant would be faster. However, the following variant is possible as well:

void glUniform1fv ( GLint location, GLsizei count, const GLfloat * value ) {
    int i;

    for (i = 0; i < count; i++) {
        glUniform1f(location, *value);
        value++;
        location++;
    }
}

In that case the non-array variant would be faster.

Personally I would guess (and this is solely my opinion) that early OpenGL implementations built the array variants on top of the non-array variants, since that is the simpler implementation and needs little to no change in the rest of the OpenGL library. It is also the much slower implementation, because it involves a loop that is most likely unnecessary with modern graphics adapters. Modern implementations therefore most likely implement the non-array variants on top of the array variants.

The array variants have other advantages. Consider the following function:

struct v3 {
     GLfloat x;
     GLfloat y;
     GLfloat z;
};

void setUniform ( GLint location, struct v3 * vPtr ) {
    glUniform3f(location, vPtr->x, vPtr->y, vPtr->z);
}

Dereferencing vPtr three times just to call a non-array function is rather stupid and hardly ever faster than the following implementation:

void setUniform ( GLint location, struct v3 * vPtr ) {
    glUniform3fv(location, 1, (const GLfloat *)vPtr);
}

Also, all array variants always have exactly three parameters, while the other variants can have up to five. The more arguments you pass to a function, the slower the call itself gets when those arguments are passed on the stack instead of in registers. And the more arguments a function call has, the less likely it is that they can all be passed in registers on architectures with a hybrid calling convention. So going by pure function call overhead on common CPUs, a call to a function with few arguments is quite often faster than a call to a function with many arguments, though this difference only matters if you perform several thousand calls a second, which is usually not the case for uniform values.


Ask yourself why a *v version exists in the first place. To understand the answer, you need to know how CPUs talk to GPUs.

Modern computers communicate with the GPU via Direct Memory Access (DMA). The DMA controller is another piece of hardware that bulk-moves memory from one device to another. You say, move everything from void* a to void* b, and it goes away and does it (without the CPU).
However, when you call an OpenGL function, what it actually does is write a command and its data into a special pre-allocated block of memory, called a command list. At some point later it then tells the DMA controller to move that block. It works like this because:
a) You may change the memory contents before the DMA transfer ends. Copying the memory prevents any potential race conditions.
b) Virtual memory means that what looks like a contiguous block of memory to a computer program might actually be scattered (in pages) across real physical memory. Command lists are guaranteed to be a contiguous block in physical memory.
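The copy-into-a-command-list idea can be sketched roughly as follows. This is only an illustration under stated assumptions: CommandList, command_list_append, the 256-byte capacity and the 4-byte opcode are all invented for the example and do not describe any real driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command list: one pre-allocated, contiguous block. */
typedef struct {
    uint8_t bytes[256];
    size_t  used;
} CommandList;

/* Appends a made-up 4-byte opcode followed by a COPY of the payload, so the
   caller may change its own memory right after the call returns.
   Returns 0 on success, -1 if the command does not fit. */
static int command_list_append(CommandList *list, uint32_t opcode,
                               const void *payload, size_t payload_size)
{
    assert(list != NULL && (payload != NULL || payload_size == 0));
    size_t remaining = sizeof list->bytes - list->used;
    if (payload_size > remaining || sizeof opcode > remaining - payload_size)
        return -1;
    memcpy(list->bytes + list->used, &opcode, sizeof opcode);
    list->used += sizeof opcode;
    if (payload_size > 0)
        memcpy(list->bytes + list->used, payload, payload_size);
    list->used += payload_size;
    return 0;
}
```

Because the payload is copied at call time, a later write to the caller's array does not affect what the (imagined) DMA transfer will eventually move.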

For most implementations, one call should map to one command in the command list. If so, and if a command is 4 bytes and you are setting one float at a time, calling a function per element means the command buffer is 50% commands and 50% data. Calling a vector function means there is only one command for the whole vector, and the command buffer is almost entirely data.
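The arithmetic behind that 50% figure can be written out as a small sketch. It uses only the simplified model from this answer (a 4-byte command, 4-byte floats, one command per call); the function names are made up, and a real command would likely carry extra fields such as a location and a count.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes taken from the simplified model above; real hardware may differ. */
enum { COMMAND_BYTES = 4, FLOAT_BYTES = 4 };

/* Share of the command buffer that is data when every float gets its own command. */
static double data_fraction_per_element(size_t n_floats)
{
    assert(n_floats > 0);
    size_t data = n_floats * FLOAT_BYTES;
    size_t commands = n_floats * COMMAND_BYTES;
    return (double)data / (double)(data + commands);
}

/* Share of the command buffer that is data when one command carries all floats. */
static double data_fraction_vector(size_t n_floats)
{
    assert(n_floats > 0);
    size_t data = n_floats * FLOAT_BYTES;
    return (double)data / (double)(data + COMMAND_BYTES);
}
```

Under this model the per-element fraction stays at one half regardless of how many floats are sent, while the vector fraction is n / (n + 1), for example 4/5 for a single vec4 and 16/17 for four of them.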

So the glUniform*v functions exist because the non-vector versions are as low as 50% efficient on the majority of implementations. This is not a surprise. If an API supplies a function, it is usually because it is either not possible or prohibitively expensive to achieve the same thing any other way.


They aren't comparable. The former sets a uniform variable which is a scalar (by which I mean a single 4-vector), the latter sets one which is an array (of 4-vectors).

It might be possible to treat an array of length 1 as a scalar, and vice versa, but it would be perverse. Thus, you are never in a situation where you have a choice between the two.

If you're really talking about the decision between using a single array variable and several scalars, then as long as you access all of them homogeneously (i.e. do the same kinds of computations with them), I would imagine an array would be faster, because if you use multiple scalars, you have to do all the array arithmetic yourself, and it's likely the hardware will be better at that than you are.
