Running out of memory running fsck on large filesystems

Solution 1:

A 64-bit kernel and plenty of RAM will let the fsck finish nice and fast. Alternatively, e2fsck now has an option that tells it to store all of its intermediate results in a directory instead of in RAM, which helps immensely. Create /etc/e2fsck.conf with the following contents:

[scratch_files]
directory = /var/cache/e2fsck

(And, obviously, make sure that directory exists and is on a partition with a good few GB of free space.) e2fsck will run SLLOOOOWWWWWWW, but at least it'll complete.
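
For anyone who wants to script that setup, here is a minimal sketch. The helper name write_e2fsck_conf is made up for illustration (it is not part of e2fsprogs), and the paths in the usage comment are just the ones from this answer; adjust them to wherever you actually have free space.

```shell
#!/bin/sh
# Sketch only: create the scratch directory and write a matching config file.
# write_e2fsck_conf is a hypothetical helper name, not an e2fsprogs tool.
write_e2fsck_conf() {
    conf_file=$1     # where to write the config, e.g. /etc/e2fsck.conf
    scratch_dir=$2   # where e2fsck should keep its scratch files

    # The directory has to exist before e2fsck runs.
    mkdir -p "$scratch_dir" || return 1

    # Write the [scratch_files] stanza shown above.
    cat > "$conf_file" <<EOF
[scratch_files]
directory = $scratch_dir
EOF
}

# Typical use, as root:
#   write_e2fsck_conf /etc/e2fsck.conf /var/cache/e2fsck
#   df -h /var/cache/e2fsck    # check that a few GB are actually free
```

Note that this overwrites any existing file at the config path, so back up /etc/e2fsck.conf first if you already have one.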

Of course, this won't work with the root FS, but if you've got swap then you're past mounting the root FS anyway.

Solution 2:

I ended up trying what womble suggested; here are some more details that may be useful if, like me, you haven't seen this new functionality in e2fsck before.

The "scratch_files" configuration option for e2fsck became available sometime during the 1.40.x releases. (In our case, we had to upgrade to the latest Debian distribution to get this functionality.)

As well as the "directory = /var/cache/e2fsck" option that was suggested, there are some further configuration options to fine-tune how the scratch file storage is used. I used "dirinfo = false", since the filesystem had a large number of files, but not such a large number of directories. If the situation were reversed, the "icount" option would be the one to adjust instead. These options are all documented in the man page for e2fsck.conf.
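
To make that concrete, here is a rough sketch that prints the kind of [scratch_files] stanza I mean. The helper name print_scratch_conf is hypothetical, and the values are only the ones from my case. As I read the man page, both "dirinfo" and "icount" default to true, and setting one to false keeps that particular data structure in memory rather than in the scratch directory; please check the man page for your e2fsprogs version before relying on that.

```shell
#!/bin/sh
# Sketch only: print an e2fsck.conf stanza with the scratch-file tuning options.
# print_scratch_conf is a hypothetical helper name, not an e2fsprogs tool.
# Usage: print_scratch_conf DIRECTORY [DIRINFO] [ICOUNT]
print_scratch_conf() {
    printf '[scratch_files]\n'
    printf 'directory = %s\n' "$1"
    # Assumption: both options default to true when left out of the file.
    printf 'dirinfo = %s\n' "${2:-true}"
    printf 'icount = %s\n' "${3:-true}"
}

# My case: lots of files, relatively few directories.
print_scratch_conf /var/cache/e2fsck false true
```

Redirect the output to /etc/e2fsck.conf (as root) once you are happy with it.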

BTW, Ted Ts'o wrote about these options in this thread.

I found that e2fsck ran extremely slowly, much more slowly than Ted predicted. It sat at 99.9% CPU utilization most of the time (on an extremely slow old processor), which suggests that storing these data structures on disk instead of in memory was not the main cause of the slowdown. It might be that something else about the contents of the filesystem made e2fsck particularly slow. In the end, I have abandoned the filesystem check for now. The filesystem was due for a check but didn't have errors (as far as I know), so I'm going to arrange to check it at a more convenient time, when we can afford a week-long outage.