Namenode file quantity limit

Each file or directory or block occupies about 150 bytes in the namenode memory. [1] So a cluster with a namenode with 32G RAM can support a maximum of (assuming namenode is the bottleneck) about 38 million files. (Each file will also take up a block, so each file takes 300 bytes in effect. I am also assuming 3x replication. So each file takes up 900 bytes)

In practice however, the number will be much lesser because all of the 32G will not be available to the namenode for keeping the mapping. You can increase it by allocating more heap space to the namenode in that machine.

Replication also effects this to a lesser degree. Each additional replica adds about 16 bytes to the memory requirement. [2]

[1] https://blog.cloudera.com/small-files-big-foils-addressing-the-associated-metadata-and-application-challenges/

[2] http://search-hadoop.com/c/HDFS:/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfo.java%7C%7CBlockInfo


Cloudera recommends 1 GB of NameNode heap space per million blocks. 1 GB for every million files is less conservative but should work too.

Also you don't need to multiply by a replication factor, an accepted answer is wrong.

Using the default block size of 128 MB, a file of 192 MB is split into two block files, one 128 MB file and one 64 MB file. On the NameNode, namespace objects are measured by the number of files and blocks. The same 192 MB file is represented by three namespace objects (1 file inode + 2 blocks) and consumes approximately 450 bytes of memory.

One data file of 128 MB is represented by two namespace objects on the NameNode (1 file inode + 1 block) and consumes approximately 300 bytes of memory. By contrast, 128 files of 1 MB each are represented by 256 namespace objects (128 file inodes + 128 blocks) and consume approximately 38,400 bytes.

Replication affects disk space but not memory consumption. Replication changes the amount of storage required for each block but not the number of blocks. If one block file on a DataNode, represented by one block on the NameNode, is replicated three times, the number of block files is tripled but not the number of blocks that represent them.

Examples:

  1. 1 x 1024 MB file 1 file inode 8 blocks (1024 MB / 128 MB) Total = 9 objects * 150 bytes = 1,350 bytes of heap memory
  2. 8 x 128 MB files 8 file inodes 8 blocks Total = 16 objects * 150 bytes = 2,400 bytes of heap memory
  3. 1,024 x 1 MB files 1,024 file inodes 1,024 blocks Total = 2,048 objects * 150 bytes = 307,200 bytes of heap memory

Even more examples article in the origin article from cloudera.


(Each file metadata = 150bytes) + (block metadata for the file=150bytes)=300bytes so 1million files each with 1 block will consume=300*1000000=300000000bytes =300MB for replication factor of 1. with replication factor of 3 it requires 900MB.

So as thumb rule for every 1GB you can store 1million files.