Why is there such a big difference between "Size" and "Size on disk"?

I'll assume you are using the FAT/FAT32 filesystem here, since you mention this is an SD card. NTFS and exFAT behave similarly with regard to allocation units. Other filesystems might be different, but they aren't supported on Windows anyway.

If you have a lot of small files, this is certainly possible. Consider this:

  • 50,000 files.

  • 32 kB cluster size (allocation units), which is the usual maximum for FAT32.

OK, the minimum space taken is now 50,000 * 32,000 bytes = 1.6 GB (using SI prefixes, not binary, to simplify the maths). The space each file takes on the disk is always a multiple of the allocation unit size - and here we're assuming each file is actually small enough to fit within a single unit, with some (wasted) space left over.

If each file averaged 2 kB, you'd get about 100 MB total - but you're also wasting 15x that (30 kB per file) on average due to the allocation unit size.
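
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the rounding rule, using the same SI-style numbers as the example; the function name `size_on_disk` is made up for illustration and is not a filesystem API.

```python
def size_on_disk(file_size: int, cluster_size: int) -> int:
    """Bytes allocated for one file: its size rounded up to whole clusters."""
    clusters = (file_size + cluster_size - 1) // cluster_size  # ceiling division
    return clusters * cluster_size

CLUSTER = 32_000     # 32 kB allocation unit (SI, as in the example above)
FILE_COUNT = 50_000
AVG_FILE = 2_000     # 2 kB average file

payload = FILE_COUNT * AVG_FILE                           # the "Size" figure
allocated = FILE_COUNT * size_on_disk(AVG_FILE, CLUSTER)  # the "Size on disk" figure
wasted = allocated - payload

print(f"size:         {payload / 1e6:.0f} MB")
print(f"size on disk: {allocated / 1e9:.1f} GB")
print(f"wasted:       {wasted / 1e9:.1f} GB ({wasted // payload}x the payload)")
```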


In-depth explanation

Why does this happen? Well, the FAT32 filesystem needs to keep track of where each file is stored. If it were to keep a list of every single byte, the table (like an address book) would grow at the same speed as the data - and waste a lot of space. So what they do is use "allocation units", also known as the "cluster size". The volume is divided into these allocation units, and as far as the filesystem is concerned, they cannot be subdivided - those are the smallest blocks it can address. Much like you have a house number, but your postman doesn't care how many bedrooms you have or who lives in them.

So what happens if you have a very small file? Well, the filesystem doesn't care if the file is 1 byte, 2 kB or even 15 kB, it'll give it the least space it can - in the example above, that's 32 kB. (Only a completely empty file gets no allocation unit at all.) Your file is only using a small amount of this space, and the rest is basically wasted, but still belongs to the file - much like a bedroom you leave unoccupied.

Why are there different allocation unit sizes? Well, it becomes a tradeoff between having a bigger table (address book, e.g. saying John owns a house at 123 Fake Street, 124 Fake Street, 666 Satan Lane, etc.), or more wasted space in each unit (house). If you have larger files, it makes more sense to use larger allocation units - because a file doesn't get a new unit (house) until all others are filled up. If you have lots of small files, well, you're going to have a big table (address book) anyway so may as well give them small units (houses).

Large allocation units, as a general rule, will waste a lot of space if you have lots of small files. There usually isn't a good reason to go above 4 kB for general use.
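
To put rough numbers on that tradeoff, here is a small sketch that counts how many units (table entries) and how much slack two different unit sizes give for the same 100 MB of data, stored either as many small files or as one big file. The helper `clusters_and_slack` and the sample sizes are illustrative assumptions, and real filesystems add some bookkeeping overhead that is ignored here.

```python
def clusters_and_slack(file_sizes, cluster_size):
    """Return (allocation units used, bytes wasted) for a list of file sizes."""
    clusters = sum((size + cluster_size - 1) // cluster_size for size in file_sizes)
    slack = clusters * cluster_size - sum(file_sizes)
    return clusters, slack

small_files = [2_000] * 50_000   # many small files, 100 MB in total
one_big_file = [100_000_000]     # the same 100 MB as a single file

for cluster_size in (4_000, 32_000):
    units_small, slack_small = clusters_and_slack(small_files, cluster_size)
    units_big, slack_big = clusters_and_slack(one_big_file, cluster_size)
    print(f"{cluster_size} B units: small files -> {units_small} units, "
          f"{slack_small} B wasted; big file -> {units_big} units, {slack_big} B wasted")
```

With the small files, the table needs 50,000 entries either way, so the larger unit only adds waste. With the single big file, the larger unit cuts the number of entries without wasting anything.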


Fragmentation?

Fragmentation shouldn't waste space in this manner. Large files may be fragmented, i.e. split up across allocation units that aren't next to each other, but each unit should be filled before the next one is started. Defragging might save a little space in the allocation tables, but this isn't your specific issue.


Possible solutions

As gladiator2345 suggested, your only real options at this point are to live with it or reformat with smaller allocation units.

Your card might be formatted in FAT16, which has a smaller limit on table size and therefore requires much larger allocation units in order to address a larger volume (with an upper limit of 2 GB with 32 kB allocation units). Source courtesy of Braiam. If that is the case, you should be able to safely format as FAT32 anyway.
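
The FAT16 limit follows from simple arithmetic, sketched below. It assumes the table can address a round 2^16 clusters (in practice a few values are reserved, so the true figure is slightly lower) and uses binary units; `min_cluster_size` is a hypothetical helper, not something a formatting tool exposes.

```python
FAT16_MAX_CLUSTERS = 2 ** 16  # rounded up; a handful of values are reserved in practice

def min_cluster_size(volume_bytes, max_clusters=FAT16_MAX_CLUSTERS, sector=512):
    """Smallest power-of-two cluster size that lets the table cover the whole volume."""
    cluster = sector
    while cluster * max_clusters < volume_bytes:
        cluster *= 2
    return cluster

GIB = 2 ** 30
print(min_cluster_size(2 * GIB))   # 32768: a 2 GiB volume needs 32 KiB clusters
print(min_cluster_size(GIB // 2))  # 8192: a 512 MiB volume gets by with 8 KiB
```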


This is one of those situations where compressing/archiving into a single file may help. What Bob said in his answer is true, but the solution may be easier than reformatting the disk as other answers suggest. If you compress or archive the directory (using zip, tar, or any other method), the file system will see a single big file instead of many small ones. Even without compression you will get back almost 1.4 GiB of space, because all those "small files" will be counted as a single big file.
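
As a rough illustration of the archiving idea, the sketch below writes a batch of small dummy files, packs them into an uncompressed zip with Python's standard `zipfile` module, and compares the space both layouts would occupy under an assumed 32 KiB cluster size. The file names, counts and the `allocated` helper are invented for the example; the allocation is estimated from the cluster model rather than read from the real filesystem.

```python
import tempfile
import zipfile
from pathlib import Path

CLUSTER = 32 * 1024  # assumed allocation unit size of the card

def allocated(size, cluster=CLUSTER):
    """Estimated bytes on disk: size rounded up to whole clusters."""
    return (size + cluster - 1) // cluster * cluster

def archive_savings(n_files=200, file_size=2048):
    """Return (estimated bytes on disk as separate files, as one stored zip)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "cache"
        src.mkdir()
        for i in range(n_files):
            (src / f"tile_{i:05d}.bin").write_bytes(b"\0" * file_size)

        archive = Path(tmp) / "cache.zip"
        # ZIP_STORED means no compression: any saving comes purely from packing.
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
            for path in sorted(src.iterdir()):
                zf.write(path, arcname=path.name)

        separate = sum(allocated(p.stat().st_size) for p in src.iterdir())
        packed = allocated(archive.stat().st_size)
        return separate, packed

separate, packed = archive_savings()
print(f"separate files: {separate} bytes, single archive: {packed} bytes")
```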

"Inside this, my maps app stores its cached maps and the app gets its map from Google Maps"

Maybe you should ask the developer to use an archive or a database instead of many separate files. This would probably also help keep the disk less fragmented, and it would surely save space, especially if it's a NAND flash drive. A situation where 100 MB of payload/useful data takes up 1.4 GiB on disk is ridiculous: it means there's something wrong with how the data is stored, and the developers should come up with a nicer solution.
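
For what the database option could look like, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. Nothing is known about how the app actually stores its cache, so the table layout (`tiles` keyed by `zoom`, `x`, `y`) and the function names are purely hypothetical.

```python
import sqlite3

def open_tile_store(path=":memory:"):
    """Open (or create) a single-file store; ':memory:' keeps the demo off disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tiles ("
        " zoom INTEGER, x INTEGER, y INTEGER, data BLOB,"
        " PRIMARY KEY (zoom, x, y))"
    )
    return conn

def put_tile(conn, zoom, x, y, data):
    """Insert or overwrite one cached tile."""
    conn.execute(
        "INSERT OR REPLACE INTO tiles (zoom, x, y, data) VALUES (?, ?, ?, ?)",
        (zoom, x, y, data),
    )
    conn.commit()

def get_tile(conn, zoom, x, y):
    """Return the tile bytes, or None if the tile is not cached."""
    row = conn.execute(
        "SELECT data FROM tiles WHERE zoom = ? AND x = ? AND y = ?",
        (zoom, x, y),
    ).fetchone()
    return row[0] if row else None

store = open_tile_store()
put_tile(store, 12, 2048, 1361, b"fake image bytes")
print(get_tile(store, 12, 2048, 1361))
```

All tiles then live in one file, so the per-file rounding to a whole allocation unit is paid once rather than 50,000 times.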


As already explained, the most common reason for the size difference is used space vs. allocated space. But it's not the only possible one: NTFS has a feature that lets hidden data be attached to files. This is the feature that was exploited by the ransomware that hit the healthcare industry in late 2019.

File fork and alternate data stream

"Resource fork" has been used by Apple since 1984 (Macintosh) to store the main content of a program (instructions) and the associated resources (like icons and menus) in the same file. Embedding resources in executable files is a common technique, but doing it with forks isn't.

Apple consistently designed the Macintosh file systems to support file forking, and when Microsoft designed NTFS to replace FAT, forks were also introduced there under the name of "alternate data stream" (ADS).

In NTFS, a file contains:

  • The mandatory unnamed data stream (UDS)
  • Zero or more optional alternate data stream(s) (ADS).

Hidden in plain sight

File forking isn't bad in itself, but NTFS ADS are not supported by common tools, including Windows Explorer. ADS is de facto a hidden feature, an unexpected gift for hackers. From Wikipedia:

"Alternate streams are not listed in Windows Explorer, and their size is not included in the file's size."

While the file size, which reports only the UDS size, isn't changed by ADS existence, the allocated size (clusters allocated to the file by the file system) reports the actual size of the file, all streams included.

Windows Explorer doesn't report ADS, and neither does the plain CMD command dir. However, ADS are visible with:

  • PowerShell Get-Item -Stream * (Windows)
  • CMD dir /r (Windows)
  • streams (Microsoft/SysInternals)
  • lads (Heysoft)
  • AlternateStreamView (NirSoft)

Note it's still possible to hide ADS from some of these tools by using file system reserved keywords (see Pierce's document linked below).

  • Windows uses ADS to tag a file as downloaded from Internet and to store other metadata.

  • Hackers use ADS to hide data and code for malicious activities.

Comprehensive description of ADS worth reading:

  • by Sean Pierce
  • by Marc Ochsenmeier

Malware use of ADS

Serious anti-malware tools watch for ADS, but malware still uses ADS, at large scale, because:

  • Some security suites are not even ADS aware, or can't identify malicious uses of ADS.
  • It's easy to redirect the execution of a legitimate file to an ADS (e.g. using a shortcut).

BitPaymer

The ransomware BitPaymer enters the computer as a normal and visible file, but when executed it copies itself into a legitimate file as an ADS, then deletes the initial file. As this doesn't change the size of the legitimate file, and ADS are not listed by common tools, the malware is now virtually hidden.

Operation Cobalt Kitty

The malware used in this operation also hides using ADS.

My point is: if you observe a big difference between size and size on disk (more than one cluster, typically 4 KB), don't overlook the possibility of ADS, and hidden malware.
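
That rule of thumb can be written down as a tiny check. It is only a hint, not a detector: it assumes a 4 KiB cluster and plain, uncompressed files, and it takes the two numbers you read off the Properties dialog as input rather than querying NTFS itself. The function name is made up for the example. Compressed, sparse or very small files can legitimately show a size on disk below the rounded-up size, which the check simply reports as zero.

```python
def unexplained_allocation(size, size_on_disk, cluster_size=4096):
    """Bytes allocated beyond what rounding 'size' up to whole clusters explains."""
    expected = (size + cluster_size - 1) // cluster_size * cluster_size
    return max(0, size_on_disk - expected)

# A 22-byte text file that occupies a bit over 1 MiB on disk deserves a closer look.
extra = unexplained_allocation(size=22, size_on_disk=1_052_672)
print(f"{extra} bytes are not explained by cluster rounding")
```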

Experiment with ADS yourself

To safely experiment with ADS, try this at DOS/CMD level...

Create and then display the content of a file in the root of C:

C:\> echo The main data stream> test.txt
C:\> type test.txt

Result:

The main data stream

Now add an ADS with the same method, just specify the ADS name in addition to the file name:

C:\> echo The secret message> test.txt:secret

You have just hidden the secret message in the file. Note that the file size in Explorer has not changed, even though we added bytes in the ADS "secret".

Try to display the ADS content:

C:\> type test.txt:secret

Result:

The filename, directory name, or volume label syntax is incorrect.

CMD type is not able to display the content of the ADS. We will use Notepad instead:

C:\> notepad test.txt:secret

In Notepad we can see the content of the ADS:

The secret message

You can also hide a full executable in an ADS of an innocent text file, and run it at any time. For hackers, you can never have too much of a good thing :-)
