Python: difference between mutating and re-assigning a list (_list = ... vs _list[:] = ...)

It's hard to answer this canonically because the actual details are implementation-dependent or even type-dependent.

For example, in CPython, when an object's reference count drops to zero it is disposed of and its memory is freed immediately. However, some types maintain an additional "pool" that references instances without you knowing it. For example, CPython keeps a "free list" of unused list instances. When the last reference to a list is dropped in Python code, it may be added to this free list instead of the memory being released (one would need to invoke something like PyList_ClearFreeList to reclaim that memory).
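As a rough illustration of reference counting (this is CPython-specific behaviour, and sys.getrefcount itself adds one temporary reference for its own argument):

import sys

a = []
print(sys.getrefcount(a))  # typically 2: the name `a` plus the argument to getrefcount
b = a
print(sys.getrefcount(a))  # 3: `a`, `b`, and the argument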

But a list is not just the memory needed for the list object itself: a list contains other objects. Even when the memory of the list is reclaimed, the objects that were in the list may remain, for example because there is still a reference to them somewhere else, or because their type also maintains a free list.
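A small sketch that makes this visible; Payload is a hypothetical stand-in for an element type, and the weakref module lets us observe whether the object is still alive:

import weakref

class Payload:
    pass

item = Payload()
container = [item]
probe = weakref.ref(item)

del container           # the list is gone...
print(probe() is None)  # False: the object survives via the name `item`

del item
print(probe() is None)  # True on CPython; on PyPy it may stay False until
                        # the garbage collector eventually runs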

If you look at other implementations such as PyPy, then even in the absence of a "pool", an object isn't disposed of immediately when nothing references it anymore; it's only disposed of "eventually".

So how does this relate to your examples, you may wonder. Let's have a look at the first one:

_list = [some_function(x) for x in _list]

Before this line runs there is one list instance assigned to the variable _list. Then the list comprehension creates a new list, which is assigned to the name _list. Shortly before this assignment there are two lists in memory: the old list and the one created by the comprehension. After the assignment, one list is referenced by the name _list (the new list) and the other has its reference count decremented by 1. If the old list isn't referenced anywhere else and thus reaches a reference count of 0, it may be returned to the pool, it may be disposed of immediately, or it may be disposed of eventually. The same goes for the contents of the old list.
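You can observe the rebinding with id(); here x * 2 is just a stand-in for some_function(x):

_list = [1, 2, 3]
old_id = id(_list)
_list = [x * 2 for x in _list]
print(id(_list) == old_id)  # False: the name now refers to a brand-new list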

What about the other example:

_list[:] = [some_function(x) for x in _list]

Before this line runs there is again one list assigned to the name _list. When the line executes it also creates a new list through the list comprehension. But instead of assigning the new list to the name _list, it replaces the contents of the old list with those of the new list. While it's replacing the contents of the old list, there are two lists kept in memory. After this assignment, the old list is still available through the name _list, but the list created by the comprehension isn't referenced anymore; it reaches a reference count of 0, and what happens to it depends on the implementation: it can be put in the pool of free lists, it could be disposed of immediately, or it could be disposed of at some unknown point in the future. The same goes for the original contents of the old list, which were cleared.
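The counterpart demonstration for slice assignment (again with x * 2 standing in for some_function(x)):

_list = [1, 2, 3]
alias = _list
old_id = id(_list)
_list[:] = [x * 2 for x in _list]
print(id(_list) == old_id)  # True: still the same list object
print(alias)                # [2, 4, 6]: any alias sees the new contents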

So where is the difference?

Actually, there is not a lot of difference. In both cases Python has to keep two complete lists in memory. However, the first approach releases the reference to the old list faster than the second approach releases the reference to the intermediate list, simply because the intermediate list has to be kept alive while its contents are copied.

However, releasing the reference faster does not guarantee that it actually results in "less memory", since the list might be returned to the pool, or the implementation may only free memory at some (unknown) point in the future.

A less memory-expensive alternative

Instead of creating and discarding lists, you could chain iterators/generators and consume them when you need to iterate over them (or when you need the actual list).

So instead of doing:

_list = list(range(10)) # Or whatever
_list = [some_function(x) for x in _list]
_list = [some_other_function(x) for x in _list]

You could do:

def generate_values(it):
    for x in it:
        x = some_function(x)
        x = some_other_function(x)
        yield x

And then simply consume that:

for item in generate_values(range(10)):
    print(item)

Or consume it with a list:

list(generate_values(range(10)))

These will not (except when you pass the generator to list) create any lists at all. A generator is a state machine that produces the elements one at a time, as they are requested.
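For completeness, the same pipeline can also be written as chained generator expressions; some_function and some_other_function below are hypothetical stand-ins:

def some_function(x):        # stand-in: increment
    return x + 1

def some_other_function(x):  # stand-in: double
    return x * 2

step1 = (some_function(x) for x in range(10))
step2 = (some_other_function(x) for x in step1)

for item in step2:           # nothing is materialised until this loop runs
    print(item)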


According to the CPython documentation:

Some objects contain references to other objects; these are called containers. Examples of containers are tuples, lists and dictionaries. The references are part of a container’s value. In most cases, when we talk about the value of a container, we imply the values, not the identities of the contained objects; however, when we talk about the mutability of a container, only the identities of the immediately contained objects are implied.

So when a list is mutated, the references contained in the list change, while the identity of the list object is unchanged. Interestingly, while two distinct mutable objects never share the same identity, immutable objects with equal values may share the same identity (because they are immutable!).

a = [1, 'hello world!']
b = [1, 'hello world!']
print([hex(id(_)) for _ in a])
print([hex(id(_)) for _ in b])
print(a is b)

#on my machine, I got:
#['0x55e210833380', '0x7faa5a3c0c70']
#['0x55e210833380', '0x7faa5a3c0c70']
#False

When the code:

_list = [some_function(x) for x in _list]

is used, a new list is created with an identity different from the old _list, and afterwards the old _list is garbage collected (assuming nothing else references it). But when a container is mutated, each value is retrieved, processed, and updated one by one, so the list itself is not duplicated.

Regarding processing efficiency, it's easy to compare:

import time

my_list = [_ for _ in range(1000000)]

start = time.time()
my_list[:] = [_ for _ in my_list]
print(time.time()-start)  # on my machine 0.0968618392944336 s


start = time.time()
my_list = [_ for _ in my_list]
print(time.time()-start)  # on my machine 0.05194497108459473 s
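Memory can be inspected in a similarly rough fashion with the standard library's tracemalloc; this is only a sketch, since the exact numbers depend on the interpreter and on what happens to be traced:

import tracemalloc

my_list = [_ for _ in range(1000000)]

tracemalloc.start()
my_list[:] = [_ for _ in my_list]
print(tracemalloc.get_traced_memory())  # (current, peak) in bytes
tracemalloc.stop()

my_list = [_ for _ in range(1000000)]

tracemalloc.start()
my_list = [_ for _ in my_list]
print(tracemalloc.get_traced_memory())
tracemalloc.stop()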

update: A list can be thought of as having two parts: the list object itself, which holds references to (the ids of) other objects, and the referenced values themselves. I used the following code to demonstrate the percentage of memory that the list object directly occupies relative to the total memory consumed (list object + referred objects):

import sys
my_list = [str(_) for _ in range(10000)]

values_mem = 0
for item in my_list:
    values_mem += sys.getsizeof(item)

list_mem = sys.getsizeof(my_list)

list_to_total = 100 * list_mem/(list_mem+values_mem)
print(list_to_total) #result ~ 14%

TLDR: You can't modify a list in place in Python without doing some kind of loop yourself or using an external library, but it probably isn't worth trying for memory-saving reasons anyway (premature optimisation). What might be worth trying is the Python map function and iterables, which don't store the results at all, but compute them on demand.


There are several ways to apply a modifying function across a list (i.e. performing a map) in Python, each with different implications for performance and side-effects:


New list

This is what both options in the question are actually doing.

[some_function(x) for x in _list]

This creates a new list, with values populated in order by running some_function on the corresponding value in _list. It can then be assigned as a replacement for the old list (_list = ...), or its values can replace the old values while keeping the object reference the same (_list[:] = ...). The former assignment happens in constant time and memory (it is just a reference replacement, after all), whereas the second has to iterate through the list to perform the assignment, which is linear in time. However, the time and memory required to create the list in the first place are both linear, so _list = ... is strictly faster than _list[:] = ..., but it's still linear in time and memory overall, so it doesn't really matter.

From a functional point of view, the two variants of this option have potentially dangerous consequences through side-effects. _list = ... leaves the old list hanging around, which isn't dangerous, but does mean that memory might not be freed. Any other code references to _list will immediately get the new list after the change, which again is probably fine, but might cause subtle bugs if you're not paying attention. _list[:] = ... changes the existing list, so anyone else with a reference to it will have the values changed under their feet. Bear in mind that if the list is ever returned from a method, or passed outside the scope you're working in, you might not know who else is using it.
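A short sketch of both side-effect patterns; replace_contents and rebind_name are hypothetical helper names:

def replace_contents(fn, values):
    values[:] = [fn(v) for v in values]   # mutates the caller's list in place

def rebind_name(fn, values):
    values = [fn(v) for v in values]      # only rebinds the local name

shared = [1, 2, 3]
view = shared

replace_contents(lambda v: v + 1, shared)
print(view)    # [2, 3, 4]: the alias sees the mutation

rebind_name(lambda v: v + 1, shared)
print(shared)  # still [2, 3, 4]: the caller's list is untouched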

The bottom line is that both of these methods are linear in both time and memory because they copy the list, and have side-effects which need to be considered.


In-place substitution

The other possibility hinted at in the question is changing the values in place. This would save on the memory of a copy of the list. Unfortunately there's no built-in function for doing this in Python, but it's not difficult to do it manually (as offered in various answers to this question).

for i, value in enumerate(_list):
    _list[i] = some_function(value)

Complexity-wise, this still has the linear time cost of performing the calls to some_function, but saves on the extra memory of keeping two lists. If it isn't referenced elsewhere, each item in the old list can be garbage collected as soon as it's been replaced.

Functionally, this is perhaps the most dangerous option, because the list is kept in an inconsistent state during the calls to some_function. As long as some_function makes no reference to the list (which would be pretty horrible design anyway), it should be as safe as the new-list solutions above. It also has the same dangers as the _list[:] = ... solution above, because the original list is being modified.


Iterables

The Python 3 map function acts on iterables rather than lists. Lists are iterables, but iterables aren't always lists, and when you call map(some_function, _list), it doesn't immediately run some_function at all. It only does it when you try to consume the iterable in some way.

list(map(some_other_function, map(some_function, _list)))

The code above applies some_function, followed by some_other_function to the elements of _list, and puts the results in a new list, but importantly, it doesn't store the intermediate value at all. If you only need to iterate on the results, or calculate a maximum from them, or some other reduce function, you won't need to store anything along the way.
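For example, a reduction over the chained maps never builds an intermediate list; double and add_one are hypothetical stand-ins for some_function and some_other_function:

def double(x):
    return x * 2

def add_one(x):
    return x + 1

print(max(map(add_one, map(double, range(1000000)))))  # 1999999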

This approach fits with the functional programming paradigm, which discourages side-effects (often the source of tricky bugs). Because the original list is never modified, even if some_function did make reference to it beyond the item it's considering at the time (which is still not good practice by the way), it wouldn't be affected by the ongoing map.

There are lots of functions for working with iterables and generators in the Python standard library module itertools.
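For instance, itertools.islice can lazily take a few items from an unbounded pipeline:

import itertools

squares = map(lambda x: x * x, itertools.count(1))  # infinite, lazy source
print(list(itertools.islice(squares, 5)))           # [1, 4, 9, 16, 25]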


A note on parallelisation

It's very tempting to consider how performing a map on a list could be parallelised, to reduce the linear time cost of the calls to some_function by sharing it between multiple CPUs. In principle, all of these methods can be parallelised, but Python makes it quite difficult to do. One way to do it is using the multiprocessing library, which has a map function. This answer describes how to use it.
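As a minimal sketch of that approach (square is a hypothetical stand-in for some_function):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool() as pool:                 # one worker process per CPU by default
        results = pool.map(square, range(10))
    print(results)                       # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]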