Why does Python copy NumPy arrays when the dimensions are all the same length?

In [1]: a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]), 
   ...:      np.array([0.0, 0.2, 0.4, 0.6, 0.8]), 
   ...:      np.array([0.0, 0.2, 0.4, 0.6, 0.8])]                               
In [2]:                                                                         
In [2]: a                                                                       
Out[2]: 
[array([0. , 0.2, 0.4, 0.6, 0.8]),
 array([0. , 0.2, 0.4, 0.6, 0.8]),
 array([0. , 0.2, 0.4, 0.6, 0.8])]

a is a list of arrays. b is a 2d array.

In [3]: b = np.array(a)                                                         
In [4]: b                                                                       
Out[4]: 
array([[0. , 0.2, 0.4, 0.6, 0.8],
       [0. , 0.2, 0.4, 0.6, 0.8],
       [0. , 0.2, 0.4, 0.6, 0.8]])
In [5]: b[0] += 1                                                               
In [6]: b                                                                       
Out[6]: 
array([[1. , 1.2, 1.4, 1.6, 1.8],
       [0. , 0.2, 0.4, 0.6, 0.8],
       [0. , 0.2, 0.4, 0.6, 0.8]])

b gets its values from a but does not contain any of the a objects. The underlying data structure of b is very different from that of a, the list. If that isn't clear, you may want to review the numpy basics (which cover shape, strides, and data buffers).
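A quick sketch of this first case (rebuilding the same list as above): because np.array copies the data into its own buffer, mutating b leaves the original arrays untouched.

```python
import numpy as np

# The same list of equal-length arrays as in the question.
a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]) for _ in range(3)]
b = np.array(a)   # 2-D float64 array; the data is copied

b[0] += 1         # modifies only the copy
print(a[0])       # unchanged: [0.  0.2 0.4 0.6 0.8]
```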

In the second case, where the sub-arrays have unequal lengths (note that the last one below has only four elements), b is an object array, containing the same objects as a. (In recent NumPy versions this implicit fallback raises an error; you must pass dtype=object explicitly.)

In [8]: b = np.array(a)                                                         
In [9]: b                                                                       
Out[9]: 
array([array([0. , 0.2, 0.4, 0.6, 0.8]), array([0. , 0.2, 0.4, 0.6, 0.8]),
       array([0. , 0.2, 0.4, 0.6])], dtype=object)

This b behaves a lot like a: both contain the same array objects.
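To make that concrete, here is a sketch of the ragged case (with dtype=object passed explicitly, as recent NumPy requires): the object array holds references to the very same arrays, so an in-place modification through b is visible through a.

```python
import numpy as np

# A ragged list like the one above: the last array is shorter.
a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
     np.array([0.0, 0.2, 0.4, 0.6, 0.8]),
     np.array([0.0, 0.2, 0.4, 0.6])]
b = np.array(a, dtype=object)

print(b[0] is a[0])   # True: same object, not a copy
b[0] += 1             # in-place add mutates the shared array
print(a[0])           # [1.  1.2 1.4 1.6 1.8]
```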

The construction of this object array is quite different from the 2d numeric array. I think of the numeric array as the default, or normal, numpy behavior, while the object array is a 'concession', giving us a useful tool, but one which does not have the calculation power of the multidimensional array.

It is easy to make an object array by mistake, some say too easy. It can be harder to make one reliably by design. For example, with the original a (equal-length sub-arrays), we have to do:

In [17]: b = np.empty(3, object)                                                
In [18]: b[:] = a[:]                                                            
In [19]: b                                                                      
Out[19]: 
array([array([0. , 0.2, 0.4, 0.6, 0.8]), array([0. , 0.2, 0.4, 0.6, 0.8]),
       array([0. , 0.2, 0.4, 0.6, 0.8])], dtype=object)

or even: for i in range(3): b[i] = a[i]
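A sketch of the deliberate route, using the element-wise loop (with equal-length sub-arrays, slice assignment can hit a broadcast error in some NumPy versions, so the plain loop is the most dependable):

```python
import numpy as np

# With equal-length sub-arrays np.array(a) would build a 2-D numeric
# array, so we pre-allocate an object array and fill it ourselves.
a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]) for _ in range(3)]

b = np.empty(3, dtype=object)
for i, arr in enumerate(a):
    b[i] = arr        # store a reference, not a copy

print(b.dtype)        # object
print(b[0] is a[0])   # True: b holds the original array objects
```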


In a nutshell, this is a consequence of your data. You'll notice that this works (or does not work, depending on how you view it) because your arrays are not all the same size.

With equal sized sub-arrays, the elements can be compactly loaded into a memory efficient scheme where any N-D array can be represented by a compact 1-D array in memory. NumPy then handles the translation of multi-dimensional indexes to 1-D indexes internally. For example, index [i, j] of a 2-D array maps to i*N + j (in row-major order, where N is the row length). The data from the original list of arrays is copied into a compact 1-D array, so any modifications made to this array do not affect the original.
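This index mapping can be checked directly on a small array (a hypothetical 3x5 example, using ravel to view the flat buffer):

```python
import numpy as np

b = np.arange(15.0).reshape(3, 5)
rows, cols = b.shape
flat = b.ravel()          # view of the underlying 1-D buffer

# [i, j] maps to i*cols + j in the flat buffer (row-major order):
i, j = 1, 3
print(b[i, j] == flat[i * cols + j])   # True

# The strides say the same thing in bytes: (cols*8, 8) for float64.
print(b.strides)          # (40, 8)
```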

With ragged lists/arrays, this cannot be done. The array is effectively a Python list, where each element is a Python object. For efficiency, only the object references are copied, not the data. This is why you can mutate the original list elements in the second case but not the first.


In the first case, NumPy sees that the input to numpy.array can be interpreted as a 3x5, 2-dimensional array-like, so it does that. The result is a new array of float64 dtype, with the input data copied into it, independent of the input object. b[0] is a view of the new array's first row, completely independent of a[0], and modifying b[0] does not affect a[0].
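The view relationship described here can be verified with .base and np.shares_memory (a short sketch on the same list as the question):

```python
import numpy as np

a = [np.array([0.0, 0.2, 0.4, 0.6, 0.8]) for _ in range(3)]
b = np.array(a)

# b[0] is a view into b's own buffer, not into a[0]:
print(b[0].base is b)                # True
print(np.shares_memory(b[0], a[0]))  # False
```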

In the second case, since the lengths of the subarrays are unequal, the input cannot be interpreted as a 2-dimensional array-like. However, considering the subarrays as opaque objects, the list can be interpreted as a 1-dimensional array-like of objects, which is the interpretation NumPy falls back on. The result of the numpy.array call is a 1-dimensional array of object dtype, containing references to the array objects that were elements of the input list. b[0] is the same array object that a[0] is, and b[0] += 1 mutates that object.

This length dependence is one of the many reasons that trying to make jagged arrays or arrays of arrays is a really, really bad idea in NumPy. Seriously, don't do it.

Tags: Python, Numpy