I would like to find the largest file in each directory recursively

With GNU find, sort and sed (4.2.2 or above), sort once on the file sizes and again on directory paths:

find /some/dir -type f -printf '%s %f%h\0' | 
  sort -zrn |
  sort -zut/ -k2 |
  sed -zre 's: ([^/]*)(/.*): \2/\1:'

Explanation:

  • The file size, name and path are printed: the size is separated from the name by a space, and since %h here expands to an absolute path, its leading / separates the name from the directory. Each entry is terminated by the ASCII NUL character.
  • Then we sort numerically by size, treating the records as NUL-delimited, in reverse order so the largest files come first.
  • Then we use sort again to keep only the first entry for each directory, comparing everything from the second /-separated field onwards, which is the path to the directory containing the file. Since the previous sort put the largest files first, the surviving entry for each directory is its largest file.
  • Then we use sed to swap the directory and filenames, so that we get a normal path.
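To see the swap in isolation, here's a single hypothetical record (made-up size and path) pushed through the sed step, with the NUL converted to a newline for display:

```shell
# One record as the find above would produce it: size, space, filename,
# then the directory, whose leading / acts as the separator.
printf '3090885 syslog.1/var/log\0' |
  sed -zre 's: ([^/]*)(/.*): \2/\1:' |
  tr '\0' '\n'
# → 3090885 /var/log/syslog.1
```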

For readable output, replace the ASCII NUL with newlines:

find /some/dir -type f -printf '%s %f%h\0' | 
  sort -zrn |
  sort -zut/ -k2 |
  sed -zre 's: ([^/]*)(/.*): \2/\1:' |
  tr '\0' '\n'

Example output:

$ find /var/log -type f -printf '%s %f%h\0' | sort -zrn | sort -zt/ -uk2 | sed -zre 's: ([^/]*)(/.*): \2/\1:' | tr '\0' '\n'
3090885 /var/log/syslog.1
39789 /var/log/apt/term.log
3968 /var/log/cups/access_log.1
31 /var/log/fsck/checkroot
467020 /var/log/installer/initial-status.gz
44636 /var/log/lightdm/seat0-greeter.log
15149 /var/log/lxd/lxd.log
4932 /var/log/snort/snort.log
3232 /var/log/unattended-upgrades/unattended-upgrades-dpkg.log

Combining find and awk allows the averages to be calculated too:

find . -type f -printf '%s %h/%f\0'|awk 'BEGIN { RS="\0" } { SIZE=$1; for (i = 1; i <= NF - 1; i++) $i = $(i + 1); NF = NF - 1; DIR=$0; gsub("/[^/]+$", "", DIR); FILE=substr($0, length(DIR) + 2); SUMSIZES[DIR] += SIZE; NBFILES[DIR]++; if (SIZE > MAXSIZE[DIR] || !BIGGESTFILE[DIR]) { MAXSIZE[DIR] = SIZE; BIGGESTFILE[DIR] = FILE } }; END { for (DIR in SUMSIZES) { printf "%s: average %f, biggest file %s %d\n", DIR, SUMSIZES[DIR] / NBFILES[DIR], BIGGESTFILE[DIR], MAXSIZE[DIR] } }'

Laid out in a more readable manner, the AWK script is

BEGIN { RS="\0" }

{
  SIZE=$1
  for (i = 1; i <= NF - 1; i++) $i = $(i + 1)
  NF = NF - 1
  DIR=$0
  gsub("/[^/]+$", "", DIR)
  FILE=substr($0, length(DIR) + 2)
  SUMSIZES[DIR] += SIZE
  NBFILES[DIR]++
  if (SIZE > MAXSIZE[DIR] || !BIGGESTFILE[DIR]) {
    MAXSIZE[DIR] = SIZE
    BIGGESTFILE[DIR] = FILE
  }
}

END {
  for (DIR in SUMSIZES) {
    printf "%s: average %f, biggest file %s %d\n", DIR, SUMSIZES[DIR] / NBFILES[DIR], BIGGESTFILE[DIR], MAXSIZE[DIR]
  }
}

This expects null-separated input records (I stole this from muru’s answer); for each input record, it

  • stores the size (for later use),
  • shifts the size field out of the record, so that $0 holds just the path (this way filenames containing spaces survive; note that awk rebuilds $0 with single spaces, so runs of consecutive spaces in a name would be collapsed),
  • extracts the directory,
  • extracts the filename,
  • adds the size we stored earlier to the sum of sizes in the directory,
  • increments the number of files in the directory (so we can calculate the average),
  • if the size is larger than the stored maximum size for the directory, or if we haven’t seen a file in the directory yet, updates the information for the biggest file.

Once all that’s done, the script loops over the keys in SUMSIZES and outputs the directory, average size, largest file’s name and size.
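As a quick sanity check, here's the one-liner run over two hypothetical records (files x and y of 10 and 30 bytes in a made-up directory ./a; GNU awk assumed, since a NUL record separator isn't portable):

```shell
# Two made-up NUL-terminated "size path" records for the same directory
printf '%s\0' '10 ./a/x' '30 ./a/y' |
  awk 'BEGIN { RS="\0" } { SIZE=$1; for (i = 1; i <= NF - 1; i++) $i = $(i + 1); NF = NF - 1; DIR=$0; gsub("/[^/]+$", "", DIR); FILE=substr($0, length(DIR) + 2); SUMSIZES[DIR] += SIZE; NBFILES[DIR]++; if (SIZE > MAXSIZE[DIR] || !BIGGESTFILE[DIR]) { MAXSIZE[DIR] = SIZE; BIGGESTFILE[DIR] = FILE } }; END { for (DIR in SUMSIZES) { printf "%s: average %f, biggest file %s %d\n", DIR, SUMSIZES[DIR] / NBFILES[DIR], BIGGESTFILE[DIR], MAXSIZE[DIR] } }'
# → ./a: average 20.000000, biggest file y 30
```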

You can pipe the output into sort to sort by directory name. If you want to additionally format the sizes in human-friendly form, you can change the printf line to

printf "%.2f %d %s: %s\n", SUMSIZES[DIR] / NBFILES[DIR], MAXSIZE[DIR], DIR, BIGGESTFILE[DIR]

and then pipe the output into numfmt --field=1,2 --to=iec. You can still sort the result by directory name, you just need to sort starting with the third field: sort -k3.
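For example, numfmt rewrites the selected fields in place (the byte counts here are made up):

```shell
# Hypothetical "average max dir: file" line; fields 1 and 2 become IEC sizes
printf '1234567 89012 /var/log: syslog.1\n' | numfmt --field=1,2 --to=iec
# → 1.2M 87K /var/log: syslog.1
```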


Zsh's wildcard patterns would be very useful for the sort of things you're doing. Specifically, zsh can match files by attributes such as type, size, etc. through glob qualifiers. Glob qualifiers also allow sorting the matches.

For example, in zsh, *(.DOLN[1]) expands to the name of the largest file in the current directory. * is the pattern for the file name (match everything, except possibly dot files depending on shell options). The qualifier . restricts the matches to regular files, D causes * to include dot files, OL sorts by decreasing size (“length”), N causes the expansion to be empty if there is no matching file at all, and [1] selects only the first match.

You can enumerate directories recursively with **/. For example, the following loop iterates over all the subdirectories of the current directory and their subdirectories, recursively:

for d in **/*(/); do … done

You can use zstat to access a file's size and other metadata without having to rely on other tools for parsing.

zmodload -F zsh/stat b:zstat
files=(*(DNoL))
zstat -A sizes +size -- $files
total=0; for s in $sizes; do (( total += s )); done
if (( $#sizes > 0 )); then
  max=$sizes[-1]
  average=$(( total / $#sizes ))
  median=$sizes[$(( ($#sizes + 1) / 2 ))]
fi
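As a throw-away sanity check of the above, on three temporary files with made-up sizes:

```shell
# Files of 1, 2 and 4 bytes; oL sorts the matches in increasing size,
# so the last element of $sizes is the maximum.
zsh -c '
  zmodload -F zsh/stat b:zstat
  cd -- "$(mktemp -d)"
  printf x > a; printf xx > b; printf xxxx > c
  files=(*(DNoL))
  zstat -A sizes +size -- $files
  print -r -- "max=$sizes[-1] count=$#sizes"
'
# → max=4 count=3
```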
