Millions of (small) text files in a folder

This is perilously close to an opinion-based question/answer but I'll try to provide some facts with my opinions.

  1. If you have a very large number of files in a folder, any shell-based operation that tries to enumerate them (e.g. mv * /somewhere/else) may fail to expand the wildcard successfully, or the result may be too large to use.
  2. ls will take longer to enumerate a very large number of files than a small number of files.
  3. The filesystem will be able to handle millions of files in a single directory, but people will probably struggle.

One recommendation is to split the filename into two-, three- or four-character chunks and use those as subdirectories. For example, somefilename.txt might be stored as som/efi/somefilename.txt. If you are using numeric names, then split from right to left instead of left to right so that the distribution is more even. For example, 12345.txt might be stored as 345/12/12345.txt.
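A minimal sketch of the left-to-right split, assuming bash (the three-character chunk size and the .txt extension are just illustrative). Note that the glob here is expanded inside the shell itself, so it is only limited by memory, not by the command-line length limit discussed further down:

    for f in *.txt; do
        d="${f:0:3}/${f:3:3}"              # e.g. somefilename.txt -> som/efi
        mkdir -p "$d"
        mv -- "$f" "$d/$f"
    done

    # numeric variant, splitting from the right: 12345.txt -> 345/12/12345.txt
    # n=${f%.txt}; d="${n: -3}/${n:0:${#n}-3}"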

You can use the equivalent of zip -j zipfile.zip path1/file1 path2/file2 ... to avoid including the intermediate subdirectory paths in the ZIP file.
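With the layout above that would look something like this (the archive name and paths are illustrative):

    zip -j zipfile.zip som/efi/somefilename.txt 345/12/12345.txt

The archive then contains just somefilename.txt and 12345.txt, without the som/efi/ and 345/12/ prefixes.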

If you are serving these files from a webserver (I'm not entirely sure whether that's relevant), it is trivial to hide this structure behind a virtual directory with rewrite rules in Apache2. I would assume the same is true for Nginx.


The ls command, and even TAB-completion or wildcard expansion by the shell, will normally present results in alphanumeric order. To do that, they first have to read the entire directory listing and sort it. With ten million files in a single directory, this sorting operation will take a non-negligible amount of time.
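If you just need to see what is in the directory, one workaround is the -f option, which both GNU and BSD ls accept: it disables sorting (and implies -a), so output starts streaming immediately (the path is illustrative):

    ls -f /path/to/bigdir | head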

If you can resist the urge to use TAB-completion and, for example, write the names of the files to be zipped out in full, there should be no problems.

Another problem with wildcards is that expansion may produce more filenames than will fit on a command line of maximum length. The typical maximum command line length is more than adequate for most situations, but when we're talking about millions of files in a single directory, it is no longer a safe assumption. When the maximum command line length is exceeded in wildcard expansion, most shells will simply fail the entire command line without executing it.
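On Linux and most other Unix-like systems you can check the limit with getconf; the value covers the combined size of the arguments and the environment passed to exec:

    getconf ARG_MAX        # maximum size, in bytes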

This can be solved by doing your wildcard operations using the find command:

find <directory> -name '<wildcard expression>' -exec <command> {} \+

or similar syntax whenever possible. The find ... -exec ... \+ form automatically takes the maximum command line length into account and will execute the command as many times as required, fitting as many filenames as possible onto each command line.
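For example, to build a ZIP archive of every .txt file in a directory without worrying about the command line limit (the paths are illustrative; this works because zip adds to an existing archive when invoked again, so the batched -exec calls accumulate into one archive):

    find /mnt/files -maxdepth 1 -name '*.txt' -exec zip -j /tmp/files.zip {} +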


I run a website which handles a database for movies, TV and video games. For each of these there are multiple images, with TV shows having dozens of images per show (e.g. episode snapshots).

There end up being a lot of image files, somewhere in the 250,000+ range, all stored on a mounted block storage device where access time is reasonable.

My first attempt at storing the images was in a single folder as /mnt/images/UUID.jpg

I ran into the following challenges.

  • ls via a remote terminal would just hang. The process would stop responding and CTRL+C would not break it.
  • Before reaching that point, any ls command would quickly fill the output buffer and CTRL+C would not stop the endless scrolling.
  • Zipping 250,000 files from a single folder took about 2 hours. You must run the zip command detached from the terminal (see the sketch after this list), otherwise any interruption in the connection means you have to start over again.
  • I wouldn't risk trying to use the zip file on Windows.
  • The folder quickly became a no humans allowed zone.
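One minimal way to keep a long zip run alive if the connection drops (the archive and source paths are illustrative; screen or tmux would work just as well):

    nohup zip -r /mnt/backup/images.zip /mnt/images > /tmp/zip.log 2>&1 &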

I ended up storing the files in subfolders, using the creation time to build the path, such as /mnt/images/YYYY/MM/DD/UUID.jpg. This resolved all of the above problems and allowed me to create zip files that targeted a date.
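A rough sketch of that migration, assuming GNU tools; true creation time is not portably available, so this uses each file's modification time via GNU date -r:

    find /mnt/images -maxdepth 1 -type f -name '*.jpg' -print0 |
    while IFS= read -r -d '' f; do
        d=$(date -r "$f" +%Y/%m/%d)          # the file's mtime as YYYY/MM/DD
        mkdir -p "/mnt/images/$d"
        mv -- "$f" "/mnt/images/$d/${f##*/}"
    done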

If the only identifier you have for a file is a number, and these numbers tend to run in sequence, why not group them by 100,000s, 10,000s and 1,000s?

For example, if you have a file named 384295.txt, the path would be:

/mnt/file/300000/80000/4000/295.txt

If you know you'll reach a few million files, use zero prefixes for the 1,000,000 level:

/mnt/file/000000/300000/80000/4000/295.txt
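As a rough sketch (the function name and the zero-padding widths are my own choices), a small bash helper that maps a numeric id to this layout:

    shard_path() {
        local n=$(( 10#$1 ))                           # force base 10 in case of leading zeros
        local m=$(( n / 1000000 * 1000000 ))           # 1,000,000s component
        local h=$(( n % 1000000 / 100000 * 100000 ))   # 100,000s component
        local t=$(( n % 100000 / 10000 * 10000 ))      # 10,000s component
        local k=$(( n % 10000 / 1000 * 1000 ))         # 1,000s component
        local r=$(( n % 1000 ))                        # remainder becomes the filename
        printf '/mnt/file/%06d/%06d/%05d/%04d/%d.txt\n' "$m" "$h" "$t" "$k" "$r"
    }

    shard_path 384295      # -> /mnt/file/000000/300000/80000/4000/295.txt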