How can I count files with a particular extension, and the directories they are in?

I haven't examined the output with symlinks but:

find . -type f -iname '*.c' -printf '%h\0' |
  sort -z |
  uniq -zc |
  sed -zr 's/([0-9]) .*/\1 1/' |
  tr '\0' '\n' |
  awk '{f += $1; d += $2} END {print f, d}'

  • The find command prints the directory name of each .c file it finds.
  • sort | uniq -c gives us how many files are in each directory. The sort is needed because uniq only merges adjacent records, and find can list a directory's files on either side of a subdirectory's contents.
  • with sed, I replace the directory name with 1, thus eliminating all possible weird characters, with just the count and 1 remaining
  • enabling me to convert to newline-separated output with tr
  • which I then sum up with awk, to get the total number of files and the number of directories that contained those files. Note that d here is essentially the same as NR. I could have omitted inserting 1 in the sed command, and just printed NR here, but I think this is slightly clearer.

Up until the tr, the data is NUL-delimited, safe against all valid filenames.
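
For readers who find the pipeline hard to follow, here is a rough Python sketch of the same stages applied to NUL-delimited input. The function name count_files_and_dirs and the sample bytes are made up for illustration; they are not part of the command above.

```python
from itertools import groupby

def count_files_and_dirs(find_output):
    """Return (file_count, dir_count) from NUL-delimited directory names."""
    # One record per matching file; drop the empty piece after the final NUL.
    records = [r for r in find_output.split(b"\0") if r]
    files = dirs = 0
    # sorted() plays the role of `sort -z`; groupby() merges only adjacent
    # equal records, just as `uniq -zc` does.
    for _, group in groupby(sorted(records)):
        files += sum(1 for _ in group)  # the count column from uniq -c
        dirs += 1                       # the constant 1 that sed inserts
    return files, dirs

# Hypothetical sample: ./src appears twice, and one name contains a newline.
sample = b"./src\0./src/lib\0./src\0./odd\nname\0"
print(*count_files_and_dirs(sample))  # prints: 4 3
```

Dropping the sorted() call would make the second ./src record count as a separate directory, which is the same mistake uniq would make on unsorted input.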


With zsh and bash, you can use printf %q to get a quoted string, which would not have newlines in it. So, you might be able to do something like:

shopt -s globstar dotglob nocaseglob
printf "%q\n" **/*.c | awk -F/ '{NF--; f++} !c[$0]++{d++} END {print f, d}'

However, even though ** is not supposed to expand for symlinks to directories, I could not get the desired output on bash 4.4.18(1) (Ubuntu 16.04).

$ shopt -s globstar dotglob nocaseglob
$ printf "%q\n" ./**/*.c | awk -F/ '{NF--; f++} !c[$0]++{d++} END {print f, d}'
34 15
$ echo $BASH_VERSION
4.4.18(1)-release

But zsh worked fine, and the command can be simplified:

$ printf "%q\n" ./**/*.c(D.:h) | awk '!c[$0]++ {d++} END {print NR, d}'
29 7

D enables this glob to select dot files, . selects regular files (so, not symlinks), and :h prints only the directory path and not the filename (like find's %h); see the sections on Filename Generation and Modifiers in the zsh manual. So with the awk command we just need to count the number of unique directories appearing, and the number of lines is the file count.


Python has os.walk, which makes tasks like this easy, intuitive, and automatically robust even in the face of weird filenames such as those that contain newline characters. This Python 3 script, which I had originally posted in chat, is intended to be run in the current directory (but it doesn't have to be located in the current directory, and you can change what path it passes to os.walk):

#!/usr/bin/env python3

import os

dc = fc = 0
for _, _, fs in os.walk('.'):
    c = sum(f.endswith('.c') for f in fs)
    if c:
        dc += 1
        fc += c
print(dc, fc)

That prints the count of directories that directly contain at least one file whose name ends in .c, followed by a space, followed by the count of files whose names end in .c. "Hidden" files--that is, files whose names start with .--are included, and hidden directories are similarly traversed.

os.walk recursively traverses a directory hierarchy. It enumerates all the directories that are recursively accessible from the starting point you give it, yielding information about each of them as a tuple of three values, root, dirs, files. For each directory it traverses to (including the first one whose name you give it):

  • root holds the pathname of that directory. Note that this is totally unrelated to the system's "root directory" / (and also unrelated to /root) though it would go to those if you start there. In this case, root starts at the path .--i.e., the current directory--and goes everywhere below it.
  • dirs holds a list of the names (not full paths) of all the subdirectories of the directory whose path is currently held in root.
  • files holds a list of the names (not full paths) of all the files that reside in the directory whose path is currently held in root but that are not themselves directories. Note that this includes other kinds of files than regular files, including symbolic links to non-directories, but it sounds like you don't expect any such entries to end in .c and are interested in seeing any that do.

In this case, I only need to examine the third element of the tuple, files (which I call fs in the script). Like the find command, Python's os.walk traverses into subdirectories for me; the only thing I have to inspect myself is the names of the files each of them contains. Unlike the find command, though, os.walk automatically provides me a list of those filenames.
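
If you would rather match what find -type f -iname '*.c' selects, the following is a minimal sketch of one way to do it, using all three elements of the tuple. The function name count_c_files is hypothetical, and the lower-cased suffix test is only an approximation of -iname.

```python
import os
import stat

def count_c_files(top="."):
    """Return (dir_count, file_count) for regular files named *.c, ignoring case."""
    dc = fc = 0
    for root, dirs, files in os.walk(top):
        c = 0
        for name in files:
            if not name.lower().endswith(".c"):
                continue
            # The entries in files are bare names, so join them with root.
            # lstat does not follow symlinks, so a symlink named x.c is skipped.
            mode = os.lstat(os.path.join(root, name)).st_mode
            if stat.S_ISREG(mode):
                c += 1
        if c:
            dc += 1
            fc += c
    return dc, fc

if __name__ == "__main__":
    print(*count_c_files("."))
```

The extra lstat call costs one system call per candidate name, which is usually negligible next to the directory traversal itself.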

That script does not follow symbolic links. You very probably don't want symlinks followed for such an operation, because they could form cycles, and because even if there are no cycles, the same files and directories may be traversed and counted multiple times if they are accessible through different symlinks.

If you ever did want os.walk to follow symlinks--which you usually wouldn't--then you can pass followlinks=True to it. That is, instead of writing os.walk('.') you could write os.walk('.', followlinks=True). I reiterate that you would rarely want that, especially for a task like this where you are recursively enumerating an entire directory structure, no matter how big it is, and counting all the files in it that meet some requirement.
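
If you do follow symlinks, os.walk itself does not remember which directories it has already visited, so you need your own guard against cycles. What follows is only a sketch of one possible guard, assuming a platform where the (st_dev, st_ino) pair identifies a directory. The function name count_following_links is made up for this example.

```python
import os

def count_following_links(top="."):
    """Return (dir_count, file_count) for names ending in .c, following symlinks."""
    seen = set()            # (device, inode) of every directory already visited
    dc = fc = 0
    for root, dirs, files in os.walk(top, followlinks=True):
        st = os.stat(root)  # stat follows symlinks, so aliases share one identity
        key = (st.st_dev, st.st_ino)
        if key in seen:
            dirs[:] = []    # prune in place so os.walk does not descend again
            continue
        seen.add(key)
        c = sum(name.endswith(".c") for name in files)
        if c:
            dc += 1
            fc += c
    return dc, fc

if __name__ == "__main__":
    print(*count_following_links("."))
```

This de-duplicates directories only. A regular file that is reachable through a file symlink in a different directory would still be counted once under each name.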


Find + Perl:

$ find . -type f -iname '*.c' -printf '%h\0' |
    perl -0 -ne '$k{$_}++; }{ print scalar keys %k, " $.\n" '
7 29

Explanation

The find command finds all regular files (so no symlinks or directories) whose names match, and prints the name of the directory each one is in (%h) followed by \0.

  • perl -0 -ne : read the input line by line (-n) and apply the script given by -e to each line. The -0 sets the input line separator to \0 so we can read null-delimited input.
  • $k{$_}++ : $_ is a special variable that takes the value of the current line. This is used as a key to the hash %k, whose values are the number of times each input line (directory name) was seen.
  • }{ : this is a shorthand way of writing END{}. Any commands after the }{ will be executed once, after all input has been processed.
  • print scalar keys %k, " $.\n": keys %k returns the list of keys in the hash %k, and scalar keys %k gives the number of keys, that is, the number of distinct directories seen. This is printed along with the current value of $., a special variable that holds the current input line number. Since this runs at the end, $. is the number of the last record read, which equals the total number of records (one per file).

You could expand the perl command to this, for clarity:

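find . -type f -iname '*.c' -printf '%h\0' |
    perl -0 -e 'while ($line = <STDIN>) {      # one NUL-terminated directory name per read
                    $dirs{$line}++;            # occurrences of this directory
                    $tot++;                    # running total of files
                }
                $count = scalar keys %dirs;    # number of distinct directories
                print "$count $tot\n" '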
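
If Perl is unfamiliar, the same hash-based counting can be sketched in Python. This is an illustrative equivalent, not a drop-in replacement. The function name count_dirs_and_files is hypothetical, and an in-memory stream stands in for the output of find.

```python
import io
from collections import Counter

def count_dirs_and_files(stream):
    """Return (dir_count, file_count) from a binary stream of NUL-terminated names."""
    data = stream.read()
    # Split on NUL and ignore the empty piece after the last terminator.
    per_dir = Counter(rec for rec in data.split(b"\0") if rec)
    # Distinct keys are directories; the summed counts are files.
    return len(per_dir), sum(per_dir.values())

# Hypothetical sample standing in for find output: ./a holds two matching files.
demo = io.BytesIO(b"./a\0./a/b\0./a\0")
print(*count_dirs_and_files(demo))  # prints: 2 3
```

To use it at the end of the find pipeline, you would pass sys.stdin.buffer instead of the demo stream. As with the Perl hash, no sorting is needed, because the counter is keyed by directory name.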