Reading a file as a ByteArray?

Import[..., "String"] and Export[..., "String"] are meant precisely for this and will not cause problems. The resulting string is guaranteed to contain only character codes in the range 0 to 255, so it can represent the file contents exactly.

This differs significantly from Import[..., "Text"] which will handle character encodings, line endings, etc. and is meant for text, not for arbitrary binary data.
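
One way to check the byte-exactness claim on your own system is a round trip through a scratch file that contains every byte value once. This is only a sketch: the file name bytes-demo.bin is made up for the illustration, and I would expect both comparisons to give True rather than promise it for every version.

```wolfram
(* hypothetical scratch file, used only for this check *)
tmp = FileNameJoin[{$TemporaryDirectory, "bytes-demo.bin"}];

(* write each byte value 0..255 exactly once *)
out = OpenWrite[tmp, BinaryFormat -> True];
BinaryWrite[out, Range[0, 255], "Byte"];
Close[out];

(* read the file back as a string and look at its character codes *)
codes = ToCharacterCode[Import[tmp, "String"]];

codes === Range[0, 255]        (* did every byte survive unchanged? *)
codes === BinaryReadList[tmp]  (* same data as a raw byte read? *)

DeleteFile[tmp];
```

Running the same comparison with "Text" in place of "String" is a quick way to see whether character encoding or line-ending handling alters the data.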


I have not used ByteArray and I am not sure about its purpose, but I got the impression that it is meant for the cryptography functionality as I couldn't find any other high-level functions that work with it. It does work with basic list manipulation functions though.

I know I can later apply ByteArray to the result of Import, but I would prefer not to use so much memory for the intermediate result.

How about reading in chunks, and packing each chunk into a ByteArray, to avoid high memory usage?

chunkSize = 300*1024; (* 300 kB due to the size of my test file *)

stream = OpenRead["file.pdf", BinaryFormat -> True];

ba = Join @@ First@Last@Reap@While[True,
      res = BinaryReadList[stream, "Byte", chunkSize];
      If[res === {}, Break[]]; (* there was no more data to read *)
      Sow[ByteArray[res]]
     ]

Close[stream]

You can use ReadByteArray, which is new in version 11.3:

path = ExampleData[{"TestImage", "Mandrill"}, "FilePath"];
MaxMemoryUsed[ba = ReadByteArray[path]] //AbsoluteTiming
Head[ba]

{0.000439, 723768}

ByteArray


It seems that StringToByteArray, new in version 11.2, lets you reduce the memory requirements:

path = ExampleData[{"TestImage", "Mandrill"}, "FilePath"];

byteList = BinaryReadList[path];
string = Import[path, "String"];
byteList // ByteCount
string // ByteCount
byteArray = StringToByteArray[string]; // MaxMemoryUsed
byteArray // Head

5024024

945208

945504

ByteArray

As one can see from the above, a String requires roughly five times less memory than the packed array of integers returned by BinaryReadList, while StringToByteArray (without a second argument) takes only a tiny amount of additional memory.

A more memory-efficient representation of the data can be obtained by specifying the "ISO8859-1" encoding, at the cost of higher intermediate memory usage:

byteArray2 = StringToByteArray[string, "ISO8859-1"]; // MaxMemoryUsed
byteArray2 // Head

1573240

ByteArray

Hence it would be useful to have an option to import a file as a ByteArray directly.


String does store multibyte variable length characters (utf-8 style), right? It's not just a wchar_t/short/int array, is it?

Citing Itai Seggev:

We internally encode strings in a variant of UTF-8. Now, of course, any byte can be faithfully converted to/from ISO8859-1, but that encoding only equals UTF-8 for the lower 7 bits. For other values, you need to use multiple bytes per character. So using a string to store byte data is both less space efficient and time efficient (since you need to ensure to correct conversion between the two encodings.)
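
The size difference described in the quote can be seen on a single character. The sketch below only shows how the two named encodings serialize a code above 127; it does not inspect the kernel's internal string storage, which the quote describes as a variant of UTF-8. The variable names are mine.

```wolfram
ch = FromCharacterCode[200];   (* a character with a code above 127 *)

latin1Bytes = ToCharacterCode[ch, "ISO8859-1"]  (* {200}: one byte *)
utf8Bytes = ToCharacterCode[ch, "UTF8"]         (* {195, 136}: two bytes *)

(* codes below 128 take one byte in both encodings *)
asciiBytes = ToCharacterCode["A", "UTF8"]       (* {65} *)
```

So a file with many bytes above 127 would need more than one byte per character in a UTF-8 style representation, which fits the point that a string is not the most compact container for binary data.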