# Why do we call .detach() before calling .numpy() on a Pytorch Tensor?

I think the most crucial point to understand here is the *difference* between a `torch.Tensor` and an `np.ndarray`:

While both objects are used to store n-dimensional matrices (aka "Tensors"), a `torch.Tensor` has an additional "layer" that stores the computational graph leading to the associated n-dimensional matrix.

So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, `np.ndarray` and `torch.Tensor` can be used interchangeably.

However, `torch.Tensor`s are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values but also (and more importantly) the computational graph leading to these values. This computational graph is then used (via the chain rule of derivatives) to compute the derivative of the loss function w.r.t. each of the independent variables used to compute the loss.

As mentioned before, an `np.ndarray` object does not have this extra "computational graph" layer. Therefore, when converting a `torch.Tensor` to an `np.ndarray`, you must *explicitly* remove the computational graph of the tensor using the `detach()` method.
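As a minimal sketch of what this means in practice (the variable names `t`, `arr` and `direct_call_failed` are made up for illustration), converting a tensor that requires grad fails unless it is detached first:

```python
import numpy as np
import torch

t = torch.ones(3, requires_grad=True)

# Calling .numpy() directly on a tensor that requires grad raises an error.
try:
    t.numpy()
    direct_call_failed = False
except RuntimeError:
    direct_call_failed = True

# Detaching first yields a tensor without graph information, which converts fine.
arr = t.detach().numpy()
print(direct_call_failed, type(arr))
```

Tensors that do not require grad can be converted with `.numpy()` directly, so the `detach()` call only matters when a graph is attached.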

**Computational Graph**

From your comments it seems like this concept is a bit vague. I'll try and illustrate it with a simple example.

Consider a simple function of two (vector) variables, `x` and `w`:

```
import torch

x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)
y = x @ w   # inner product of x and w
z = y ** 2  # square the inner product
```

If we are only interested in the value of `z`, we need not worry about any graphs: we simply move *forward* from the inputs, `x` and `w`, to compute `y` and then `z`.

However, what would happen if we do not care so much about the value of `z`, but rather want to ask the question *"what is w that minimizes z for a given x?"*

To answer that question, we need to compute the *derivative* of `z` w.r.t. `w`. How can we do that?

Using the chain rule we know that `dz/dw = dz/dy * dy/dw`. That is, to compute the gradient of `z` w.r.t. `w` we need to move *backward* from `z` to `w`, computing the *gradient* of the operation at each step as we trace *back* our steps. This "path" we trace back is the *computational graph* of `z`, and it tells us how to compute the derivative of `z` w.r.t. the inputs leading to `z`:

```
z.backward() # ask pytorch to trace back the computation of z
```

We can now inspect `w.grad`, the resulting gradient of `z` w.r.t. `w`:

`tensor([0.8010, 1.9746, 1.5904, 1.0408])`

Note that this is exactly equal to `2*y*x`:

`tensor([0.8010, 1.9746, 1.5904, 1.0408], grad_fn=<MulBackward0>)`

since `dz/dy = 2*y` and `dy/dw = x`.
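The same check can be run end to end. This is a small sketch rather than part of the original answer: the seed is arbitrary and only there to make the run repeatable, and `manual` and `grads_match` are illustrative names.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable
x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)
y = x @ w
z = y ** 2
z.backward()  # fills x.grad and w.grad

# Chain rule by hand: dz/dw = dz/dy * dy/dw = 2*y * x
manual = (2 * y * x).detach()
grads_match = torch.allclose(w.grad, manual)
print(grads_match)
```

By symmetry, `x.grad` should match `2*y*w` in the same way.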

Each tensor along the path stores its "contribution" to the computation. Evaluating `z` gives

`tensor(1.4061, grad_fn=<PowBackward0>)`

and evaluating `y` gives

`tensor(1.1858, grad_fn=<DotBackward>)`

As you can see, `y` and `z` store not only the "forward" value of `<x, w>` or `y**2` but also the *computational graph*: the `grad_fn` that is needed to compute the derivatives (using the chain rule) when tracing back the gradients from `z` (output) to `w` (inputs).

These `grad_fn`s are essential components of `torch.Tensor`s, and without them one cannot compute derivatives of complicated functions. `np.ndarray`s, however, do not have this capability at all and do not store this information.
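To make the `grad_fn` idea concrete, here is a short sketch (names such as `parent_node` and `z_detached` are illustrative) that inspects which tensors carry a `grad_fn` and what `detach()` does to it:

```python
import torch

x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)
y = x @ w
z = y ** 2

# Leaf tensors created by the user have no grad_fn; results of operations do.
print(x.grad_fn)   # None
print(z.grad_fn)   # a PowBackward node (exact name depends on the version)

# z's grad_fn links back to the node that produced y.
parent_node = z.grad_fn.next_functions[0][0]
print(type(parent_node).__name__)

# Detaching returns a tensor with the same value but no graph attached.
z_detached = z.detach()
print(z_detached.grad_fn, z_detached.requires_grad)
```

The detached tensor is the kind of "plain values only" object that NumPy can represent.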

Please see this answer for more information on tracing back the derivative using the `backward()` function.

Since both `np.ndarray` and `torch.Tensor` have a common "layer" storing an n-d array of numbers, pytorch uses the same storage to save memory:

`numpy() → numpy.ndarray`

Returns `self` tensor as a NumPy ndarray. This tensor and the returned ndarray share the same underlying storage. Changes to self tensor will be reflected in the ndarray and vice versa.

The other direction works in the same way as well:

`torch.from_numpy(ndarray) → Tensor`

Creates a Tensor from a numpy.ndarray. The returned tensor and ndarray share the same memory. Modifications to the tensor will be reflected in the ndarray and vice versa.

Thus, when creating an `np.ndarray` from a `torch.Tensor` or vice versa, both objects *reference* the same underlying storage in memory. Since `np.ndarray` does not store/represent the computational graph associated with the array, this graph should be *explicitly* removed using `detach()` when both numpy and torch are to reference the same tensor.
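The shared storage described above can be observed directly. In this sketch (variable names are illustrative), a write through one object shows up in the other, in both directions:

```python
import numpy as np
import torch

t = torch.zeros(3, requires_grad=True)
arr = t.detach().numpy()      # shares storage with t

arr[0] = 7.0                  # write through the NumPy array
shared_with_tensor = t[0].item() == 7.0

src = np.zeros(3, dtype=np.float32)
t2 = torch.from_numpy(src)    # shares storage with src
t2[1] = 5.0                   # write through the tensor
shared_with_array = bool(src[1] == 5.0)

print(shared_with_tensor, shared_with_array)
```

Note that `detach()` itself does not copy the data either; if an independent copy is wanted, something like `t.detach().clone()` or `arr.copy()` would be needed.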

Note that if you wish, for some reason, to use pytorch only for mathematical operations without back-propagation, you can use the `with torch.no_grad()` context manager, in which case computational graphs are not created and `torch.Tensor`s and `np.ndarray`s can be used interchangeably.

```
import numpy as np
import torch

with torch.no_grad():
    x_t = torch.rand(3, 4)
    y_np = np.ones((4, 2), dtype=np.float32)
    x_t @ torch.from_numpy(y_np)  # dot product in torch
    np.dot(x_t.numpy(), y_np)     # the same dot product in numpy
```

I asked, **Why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?**

Yes, the new tensor will not be connected to the old tensor through a `grad_fn`, and so any operations on the new tensor will not carry gradients back to the old tensor.

Writing `my_tensor.detach().numpy()` is simply saying, "I'm going to do some non-tracked computations based on the value of this tensor in a numpy array."
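A small sketch of that difference, with made-up names (`tracked`, `untracked`): the same squaring is done once inside PyTorch and once via a detour through NumPy, and only the first stays connected to `a`.

```python
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a * 2

tracked = b ** 2                                        # stays in the graph
untracked = torch.from_numpy(b.detach().numpy() ** 2)   # computed in NumPy

print(tracked.grad_fn is not None)   # True: connected back to a
print(untracked.grad_fn)             # None: no path back to a

# Gradients flow only through the tracked branch: d/da sum((2a)^2) = 8a
tracked.sum().backward()
print(a.grad)
```

Both branches hold the same values; only the bookkeeping needed for back-propagation differs.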

The Dive into Deep Learning (d2l) textbook has a nice section describing the detach() method, although it doesn't talk about why a detach makes sense before converting to a numpy array.

Thanks to jodag for helping to answer this question. As he said, Variables are obsolete, so we can ignore that comment.

I think the best answer I can find so far is in jodag's doc link:

To stop a tensor from tracking history, you can call .detach() to detach it from the computation history, and to prevent future computation from being tracked.

and in albanD's remarks that I quoted in the question:

If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.

In other words, the `detach` method means "I don't want gradients," and it is impossible to track gradients through `numpy` operations (after all, that is what PyTorch tensors are for!)