How to minimize object size of a large list of strings

The documentation for ByteCount states:

"ByteCount does not take account of any sharing of subexpressions. The results it gives assume that every part of the expression is stored separately. ByteCount will therefore often give an overestimate of the amount of memory currently needed to store a particular expression. When you manipulate the expression, however, subexpressions will often stop being shared, and the amount of memory needed will be close to the value returned by ByteCount."

So what you are measuring is the size the expression would have if every element were stored separately, with no sharing of repeated elements. It does not tell you how much memory Mathematica actually allocates. To measure that, compare MemoryInUse[] before and after creating the list:

Clear[data];
mem = MemoryInUse[];
data = Table["1", 1*^6];
MemoryInUse[] - mem

8002664

This is almost the same as R's 8,000,208 bytes.
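
As a rough side-by-side check, assuming the same one-million-element list is still stored in data, ByteCount can be evaluated on the whole list and on a single element. The exact values depend on the version and platform, so no outputs are shown here:

ByteCount[data]   (* counts every element as if it were stored separately *)
ByteCount["1"]    (* size of one copy of the string *)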

If you explicitly tell Mathematica that this is a list of repeated elements, you can save a little more memory, but not much in this case:

Clear[data];
mem = MemoryInUse[];
data = ConstantArray["1", 1*^6];
MemoryInUse[] - mem

7999080


If the strings are repeated frequently and the order of the data is not critical, store the Tally of the data instead of the data itself:

mem = MemoryInUse[];
data = Transpose[{Table["1", 5*^5], Table["2", 5*^5]}] // Flatten;
MemoryInUse[] - mem

(*  8003392  *)

mem = MemoryInUse[];
tally = Tally[data];
MemoryInUse[] - mem

(*  2656  *)

The sorted data can be reconstructed from the Tally:

Flatten[Table @@@ tally] === (data // Sort)

(*  True  *)
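
If the order does matter, one possible variant (only a sketch, and not part of the Tally approach above) is to keep each distinct string once and store a small integer code per element. It assumes version 12.0 or later for NumericArray and at most 255 distinct strings; keys, codes and compact are illustrative names:

keys = Union[data];  (* distinct strings, each stored once *)
codes = Lookup[AssociationThread[keys -> Range[Length[keys]]], data];  (* position of each element in keys *)
compact = NumericArray[codes, "UnsignedInteger8"];  (* one byte per element *)
keys[[Normal[compact]]] === data  (* round-trip check against the original list *)

Only keys and compact need to be kept; codes is an intermediate result and can be cleared afterwards.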
