About Vectors growth

The rate at which the capacity of a vector grows is implementation dependent. Implementations almost invariably choose exponential growth, in order to meet the amortized constant time requirement for the push_back operation. What amortized constant time means and how exponential growth achieves this is interesting.

Every time a vector's capacity is grown the elements need to be copied. If you 'amortize' this cost out over the lifetime of the vector, it turns out that if you increase the capacity by an exponential factor you end up with an amortized constant cost.

This probably seems a bit odd, so let me explain to you how this works...

  • size: 1 capacity 1 - No elements have been copied, the cost per element for copies is 0.
  • size: 2 capacity 2 - When the vector's capacity was increased to 2, the first element had to be copied. Average copies per element is 0.5
  • size: 3 capacity 4 - When the vector's capacity was increased to 4, the first two elements had to be copied. Average copies per element is (2 + 1 + 0) / 3 = 1.
  • size: 4 capacity 4 - Average copies per element is (2 + 1 + 0 + 0) / 4 = 3 / 4 = 0.75.
  • size: 5 capacity 8 - Average copies per element is (3 + 2 + 1 + 1 + 0) / 5 = 7 / 5 = 1.4
  • ...
  • size: 8 capacity 8 - Average copies per element is (3 + 2 + 1 + 1 + 0 + 0 + 0 + 0) / 8 = 7 / 8 = 0.875
  • size: 9 capacity 16 - Average copies per element is (4 + 3 + 2 + 2 + 1 + 1 + 1 + 1 + 0) / 9 = 15 / 9 = 1.67
  • ...
  • size 16 capacity 16 - Average copies per element is 15 / 16 = 0.938
  • size 17 capacity 32 - Average copies per element is 31 / 17 = 1.82

As you can see, every time the capacity jumps, the number of copies goes up by the previous size of the array. But because the array has to double in size before the capacity jumps again, the number of copies per element always stays less than 2.

If you increased the capacity by 1.5 * N instead of by 2 * N, you would end up with a very similar effect, except the upper bound on the copies per element would be higher (I think it would be 3).

I suspect an implementation would choose 1.5 over 2 both to save a bit of space, but also because 1.5 is closer to the golden ratio. I have an intuition (that is currently not backed up by any hard data) that a growth rate in line with the golden ratio (because of its relationship to the fibonacci sequence) will prove to be the most efficient growth rate for real-world loads in terms of minimizing both extra space used and time.


To be able to provide amortized constant time insertions at the end of the std::vector, the implementation must grow the size of the vector (when needed) by a factor K>1 (*), such that when trying to append to a vector of size N that is full, the vector grows to be K*N.

Different implementations use different constants K that provide different benefits, in particular most implementations go for either K = 2 or K = 1.5. A higher K will make it faster as it will require less grows, but it will at the same time have a greater memory impact. As an example, in gcc K = 2, while in VS (Dinkumware) K = 1.5.

(*) If the vector grew by a constant quantity, then the complexity of push_back would become linear instead of amortized constant. For example, if the vector grew by 10 elements when needed, the cost of growing (copy of all element to the new memory address) would be O( N / 10 ) (every 10 elements, move everything) or O( N ).


Just to add some mathematic proof on the time complexity on vector::push_back, say the size of vector is n, what we care about here is the number of copies happened so far, say y, notice the copy happens every time you grow the vector.

Grow by factor of K

  y = K^1 + K^2 + K^3 ... K^log(K, n)
K*y =     + K^2 + K^3 ... K^log(K, n) + K*K^log(K, n)

K*y-y = K*K^log(K, n) - K
y = K(n-1)/(K-1) = (K/(K-1))(n-1)

T(n) = y/n = (K/(K-1)) * (n-1)/n < K/(K-1) = O(1)

K/(K-1) is a constant, and see the most common cases:

  • K=2, T(n) = 2/(2-1) = 2
  • K=1.5, T(n) = 1.5/(1.5-1) = 3

and actually there is a reason of choosing K as 1.5 or 2 in different implementations, see this graph: as T(n) reaching the minimum when K is around 2, there is not much benefit on using a larger K, at the cost of allocating more memory

Grow by constant quantity of C

y = C + 2*C + 3*C + 4*C +  ... (n/C) * C
  = C(1+2+3+...+n/C), say m = n/C
  = C*(m*(m-1)/2)
  = n(m-1)/2

T(n) = y/n = (n(m-1)/2)/n = (m-1)/2 = n/2C - 1/2 = O(n)

As we could see it is liner

Tags:

C++

Visual C++