Does string slicing perform copy in memory?

Possible talking point (feel free to edit adding information).

I have just written this test to verify empirically what the answer to the question might be (this cannot and does not want to be a certain answer).

import sys

a = "abcdefg"

print("a id:", id(a))
print("a[2:] id:", id(a[2:]))
print("a[2:] is a:", a[2:] is a)

print("Empty string memory size:", sys.getsizeof(""))
print("a memory size:", sys.getsizeof(a))
print("a[2:] memory size:", sys.getsizeof(a[2:]))

Output:

a id: 139796109961712
a[2:] id: 139796109962160
a[2:] is a: False
Empty string memory size: 49
a memory size: 56
a[2:] memory size: 54

As we can see here:

the size of an empty string object is 49 bytes
a single character occupies 1 byte (Latin-1 encoding)
a and a[2:] ids are different
the occupied memory of each a and a[2:] is consistent with the memory occupied by a string with that number of char assigned

String slicing makes a copy in CPython.

Looking in the source, this operation is handled in unicodeobject.c:unicode_subscript. There is evidently a special-case to re-use memory when the step is 1, start is 0, and the entire content of the string is sliced - this goes into unicode_result_unchanged and there will not be a copy. However, the general case calls PyUnicode_Substring where all roads lead to a memcpy.

To empirically verify these claims, you can use a stdlib memory profiling tool tracemalloc:

# s.py
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
a = "." * 7 * 1024**2  # 7 MB of .....   # line 6, first alloc
b = a[1:]                                # line 7, second alloc
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, 'lineno')[:2]:
    print(stat)

You should see the top two statistics output like this:

/tmp/s.py:6: size=7168 KiB (+7168 KiB), count=1 (+1), average=7168 KiB
/tmp/s.py:7: size=7168 KiB (+7168 KiB), count=1 (+1), average=7168 KiB

This result shows two allocations of 7 meg, strong evidence of the memory copying, and the exact line numbers of those allocations will be indicated.

Try changing the slice from b = a[1:] into b = a[0:] to see that entire-string-special-case in effect: there should be only one large allocation now, and sys.getrefcount(a) will increase by one.

In theory, since strings are immutable, an implementation could re-use memory for substring slices. This would likely complicate any reference-counting based garbage collection process, so it may not be a useful idea in practice. Consider the case where a small slice from a much larger string was taken - unless you implemented some kind of sub-reference counting on the slice, the memory from the much larger string could not be freed until the end of the substring's lifetime.

For users that specifically need a standard type which can be sliced without copying the underlying data, there is memoryview. See What exactly is the point of memoryview in Python for more information about that.

Does string slicing perform copy in memory?

Tags:

Python

Python 3.X

Related

Recent Posts