PyTorch: what are the gradient arguments?

I can no longer find the original code on the PyTorch website.

gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)

The problem with the code above is that no function is given from which the gradients are calculated, so we don't know how many parameters the function takes or what dimensions those parameters have.

To fully understand this I created an example close to the original:

Example 1:

a = torch.tensor([1.0, 2.0, 3.0], requires_grad = True)
b = torch.tensor([3.0, 4.0, 5.0], requires_grad = True)
c = torch.tensor([6.0, 7.0, 8.0], requires_grad = True)

y=3*a + 2*b*b + torch.log(c)    
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients,retain_graph=True)    

print(a.grad) # tensor([3.0000e-01, 3.0000e+00, 3.0000e-04])
print(b.grad) # tensor([1.2000e+00, 1.6000e+01, 2.0000e-03])
print(c.grad) # tensor([1.6667e-02, 1.4286e-01, 1.2500e-05])

I assumed our function is y=3*a + 2*b*b + torch.log(c) and that the parameters are tensors with three elements each.

You can think of gradients = torch.FloatTensor([0.1, 1.0, 0.0001]) as a vector of weights: each element scales the gradient contribution of the corresponding element of y.

As you may have heard, the computation PyTorch's autograd system performs is equivalent to a vector-Jacobian product.

Jacobian

If you have a function, like the one we used:

y=3*a + 2*b*b + torch.log(c)

The Jacobian would be [3, 4*b, 1/c]. However, PyTorch never materializes this Jacobian; it computes the gradients at a given point directly.
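To make this concrete, here is a hand-written version of that product for Example 1 (the name v is my own; it plays the role of the gradients argument):

import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = torch.tensor([3.0, 4.0, 5.0], requires_grad=True)
c = torch.tensor([6.0, 7.0, 8.0], requires_grad=True)
y = 3*a + 2*b*b + torch.log(c)

v = torch.tensor([0.1, 1.0, 0.0001])  # plays the role of gradients
y.backward(v)

# Hand-written partial derivatives, evaluated elementwise:
# dy/da = 3, dy/db = 4*b, dy/dc = 1/c
print(torch.allclose(a.grad, 3 * v))               # True
print(torch.allclose(b.grad, 4 * b.detach() * v))  # True
print(torch.allclose(c.grad, v / c.detach()))      # True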

PyTorch uses the forward pass and reverse-mode automatic differentiation (AD) in tandem.

There is no symbolic math involved and no numerical differentiation.

Numerical differentiation would be calculating δy/δb by evaluating y at b=1 and at b=1+ε, where ε is small.
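For illustration only, here is a finite-difference sketch of that idea (the helper f, the fixed values of a and c, and the step eps are my own choices; this is not what autograd does):

import torch

def f(b):
    # the same function, with a and c held fixed
    a, c = torch.tensor(0.1), torch.tensor(0.1)
    return 3*a + 2*b*b + torch.log(c)

eps = 1e-4
b = torch.tensor(1.0)
print((f(b + eps) - f(b)) / eps)  # roughly 4.0, the analytic dy/db = 4*b at b=1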

If you don't pass gradients to y.backward() (allowed only when y is a scalar, i.e. a single-element tensor):

Example 2

a = torch.tensor(0.1, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(0.1, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)

y.backward()

print(a.grad) # tensor(3.)
print(b.grad) # tensor(4.)
print(c.grad) # tensor(10.)

You will simply get the gradients evaluated at a point, determined by how you initialized your a, b, c tensors.

Be careful how you initialize your a, b, c:

Example 3:

a = torch.empty(1, requires_grad = True, pin_memory=True)
b = torch.empty(1, requires_grad = True, pin_memory=True)
c = torch.empty(1, requires_grad = True, pin_memory=True)

y=3*a + 2*b*b + torch.log(c)   # y has a single element here

gradients = torch.FloatTensor([1.0])   # so the gradient argument must too
y.backward(gradients)

print(a.grad) # tensor([3.]) -- dy/da = 3 no matter what a holds
print(b.grad) # e.g. tensor([0.]) -- 4*b for whatever garbage b holds
print(c.grad) # e.g. tensor([inf]) -- 1/c, inf if c happens to hold 0

If you use torch.empty(), the tensors are filled with whatever happened to be in memory, so you may get different results each time you run this (pin_memory=True only requests page-locked host memory; it does not make the values deterministic).
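If you want predictable values, a safer sketch is to initialize explicitly (the value 1.0 below is my own choice, just for illustration):

a = torch.full((1,), 1.0, requires_grad=True)
b = torch.full((1,), 1.0, requires_grad=True)
c = torch.full((1,), 1.0, requires_grad=True)

y = 3*a + 2*b*b + torch.log(c)
y.backward(torch.FloatTensor([1.0]))

print(a.grad) # tensor([3.]) -- dy/da = 3
print(b.grad) # tensor([4.]) -- dy/db = 4*b = 4
print(c.grad) # tensor([1.]) -- dy/dc = 1/c = 1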

Also, note that gradients accumulate in the .grad fields, so zero them when needed; Example 4 shows the accumulation, and the sketch after it shows the reset.

Example 4:

a = torch.tensor(1.0, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(1.0, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)

y.backward(retain_graph=True)
y.backward()

print(a.grad) # tensor(6.)
print(b.grad) # tensor(8.)
print(c.grad) # tensor(2.)
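If that accumulation is not what you want, reset the .grad fields between calls; a minimal variation of Example 4:

a = torch.tensor(1.0, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(1.0, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)

y.backward(retain_graph=True)
a.grad.zero_(); b.grad.zero_(); c.grad.zero_()  # reset the accumulators
y.backward()

print(a.grad) # tensor(3.) instead of tensor(6.)
print(b.grad) # tensor(4.) instead of tensor(8.)
print(c.grad) # tensor(1.) instead of tensor(2.)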

Lastly, a few tips on the terms PyTorch uses:

PyTorch builds a dynamic computational graph while running the forward pass. This graph looks much like a tree.

So you will often hear that the leaves of this tree are the input tensors and the root is the output tensor.

Gradients are calculated by tracing the graph from the root to the leaves and multiplying every local gradient along the way using the chain rule. This multiplication happens in the backward pass.
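You can actually peek at this graph: every non-leaf tensor carries a grad_fn node linking back toward the leaves. A small sketch (the tensors here are my own toy example):

import torch

x = torch.tensor(2.0, requires_grad=True)  # a leaf
y = x * x                                  # intermediate node
z = y + 1                                  # the root

print(x.is_leaf, x.grad_fn)      # True None -- leaves have no grad_fn
print(z.grad_fn)                 # <AddBackward0 ...>
print(z.grad_fn.next_functions)  # links back toward the leaf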

Some time ago I created a PyTorch Automatic Differentiation tutorial that you may find interesting; it explains all the tiny details of AD.


Here, the output of forward(), i.e. y, is a 3-vector.

The three values are the gradients at the output of the network. They are usually set to 1.0 if y is the final output, but can have other values as well, especially if y is part of a bigger network.

For example, if x is the input and y = [y1, y2, y3] is an intermediate output used to compute the final output z, then

dz/dx = dz/dy1 * dy1/dx + dz/dy2 * dy2/dx + dz/dy3 * dy3/dx

So here, the three values passed to backward are

[dz/dy1, dz/dy2, dz/dy3]

and backward() then computes dz/dx.
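A quick sketch of this (the functions y1 = x, y2 = x**2, y3 = x**3 and the weights in z are my own toy choices):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.stack([x, x**2, x**3])  # intermediate y = [y1, y2, y3]

# Suppose the final scalar were z = y1 + 2*y2 + 3*y3, so dz/dy = [1, 2, 3].
# Feeding exactly that vector to y.backward() yields dz/dx in x.grad:
y.backward(torch.tensor([1.0, 2.0, 3.0]))

# Chain rule by hand: dz/dx = 1*1 + 2*2x + 3*3x**2 = 1 + 4x + 9x**2 = 45 at x=2
print(x.grad)  # tensor(45.)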


Explanation

For neural networks, we usually use a loss to assess how well the network has learned to classify the input image (or other tasks). The loss is usually a scalar value. In order to update the parameters of the network, we need to calculate the gradients of the loss w.r.t. the parameters, which are actually the leaf nodes in the computation graph (by the way, these parameters are mostly the weights and biases of layers such as Convolution, Linear and so on).

According to the chain rule, in order to calculate the gradient of the loss w.r.t. a leaf node, we can compute the derivative of the loss w.r.t. some intermediate variable and the gradients of that intermediate variable w.r.t. the leaf variable, then take the dot product and sum everything up.

The gradient argument of a Variable's backward() method is used to calculate a weighted sum of each element of the Variable w.r.t. the leaf Variable. These weights are just the derivatives of the final loss w.r.t. each element of the intermediate variable. (Variable is the old pre-0.4 API; in current PyTorch, a plain tensor with requires_grad=True behaves the same way.)

A concrete example

Let's take a concrete and simple example to understand this.

import torch
from torch.autograd import Variable

x = Variable(torch.FloatTensor([[1, 2, 3, 4]]), requires_grad=True)
z = 2*x
loss = z.sum(dim=1)

# do backward for first element of z
z.backward(torch.FloatTensor([[1, 0, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_() # reset x.grad, or gradients will be accumulated

# do backward for second element of z
z.backward(torch.FloatTensor([[0, 1, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()

# do backward for all elements of z, with weight equal to the derivative of
# loss w.r.t z_1, z_2, z_3 and z_4
z.backward(torch.FloatTensor([[1, 1, 1, 1]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()

# or we can directly backprop using loss
loss.backward() # equivalent to loss.backward(torch.FloatTensor([1.0]))
print(x.grad.data)    

In the above example, the outcome of the first print is

2 0 0 0
[torch.FloatTensor of size 1x4]

which is exactly the derivative of z_1 w.r.t. x.

The outcome of the second print is:

0 2 0 0
[torch.FloatTensor of size 1x4]

which is the derivative of z_2 w.r.t. x.

Now if we use a weight of [1, 1, 1, 1] to calculate the derivative of z w.r.t. x, the outcome is 1*dz_1/dx + 1*dz_2/dx + 1*dz_3/dx + 1*dz_4/dx. So, not surprisingly, the output of the 3rd print is:

2 2 2 2
[torch.FloatTensor of size 1x4]

It should be noted that the weight vector [1, 1, 1, 1] is exactly the derivative of loss w.r.t. z_1, z_2, z_3 and z_4. The derivative of loss w.r.t. x is then calculated as:

d(loss)/dx = d(loss)/dz_1 * dz_1/dx + d(loss)/dz_2 * dz_2/dx + d(loss)/dz_3 * dz_3/dx + d(loss)/dz_4 * dz_4/dx

So the output of the 4th print is the same as that of the 3rd:

2 2 2 2
[torch.FloatTensor of size 1x4]


Typically, your computational graph has one scalar output, say loss. Then you can compute the gradient of loss w.r.t. the weights (w) by loss.backward(). The default argument of backward() is 1.0.

If your output has multiple values (e.g. loss=[loss1, loss2, loss3]), you can compute the gradients of loss w.r.t. the weights by loss.backward(torch.FloatTensor([1.0, 1.0, 1.0])).

Furthermore, if you want to add weights or importances to different losses, you can use loss.backward(torch.FloatTensor([-0.1, 1.0, 0.0001])).

This computes the weighted sum -0.1*d(loss1)/dw + 1.0*d(loss2)/dw + 0.0001*d(loss3)/dw in a single backward pass, accumulating the result into w.grad.
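A small sketch to check that (the scalar parameter w and the three toy losses are my own assumptions):

import torch

w = torch.tensor(2.0, requires_grad=True)
loss = torch.stack([w, w*w, w**3])  # loss1, loss2, loss3
loss.backward(torch.tensor([-0.1, 1.0, 0.0001]))

# d(loss1)/dw = 1, d(loss2)/dw = 2*w = 4, d(loss3)/dw = 3*w**2 = 12
# weighted sum: -0.1*1 + 1.0*4 + 0.0001*12 = 3.9012
print(w.grad)  # tensor(3.9012)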