Count files in a directory with a specific string in the name?

Do you mean you want to search for snp in the file names? That would be a simple shell glob (wildcard), used like this:

ls -dq *snp* | wc -l

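Omit the -q flag if your version of ls doesn't recognise it. The flag makes ls print a ? in place of each non-printable character (including newlines), so every filename occupies exactly one line of output, even if it contains "strange" characters.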
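To see what -q changes, here is a small sketch that runs in a scratch directory. The directory and the file names are made up for the demonstration, and it assumes an ls that supports -q (GNU and BSD ls do).

```shell
# Work in a throwaway directory so nothing real is touched.
dir=$(mktemp -d) && cd "$dir" || exit 1

# A literal newline, built portably (no $'...' needed).
nl='
'
touch 'a.snps.tsv' "foosnp${nl}bar.tsv" 'unrelated.txt'

# Without -q the embedded newline is written as-is and adds an extra line.
raw=$(( $(ls -d *snp* | wc -l) ))
# With -q the newline is shown as "?", so each name stays on one line.
safe=$(( $(ls -dq *snp* | wc -l) ))

echo "without -q: $raw, with -q: $safe"
```

With GNU ls this should report 3 lines without -q and 2 with it, even though only two names match.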


If you stand quietly in the hallways of Unix&Linux and listen carefully, you’ll hear a ghostly voice, pitifully wailing, “What about filenames that contain newlines?”

ls -d *snp* | wc -l

or, equivalently,

printf "%s\n" *snp* | wc -l

will output all the filenames that contain snp, each followed by a newline, but also including any newlines in the filenames, and then count the number of lines in the output.  If there is a file whose name is

                                f o o s n p \n b a r . t s v

then that name will be written out as

foosnp
bar.tsv

which, of course, will be counted as two lines.

There are a few alternatives that do better in at least some cases:

printf "%s\n" * | grep -c snp

which counts the lines that contain snp, so the foosnp(\n)bar.tsv example from above counts only once.  A slight variation on this is

ls -f | grep -c snp

The above two commands differ in that:

  • The ls -f will include files whose names begin with .; the printf … * does not, unless the dotglob shell option is set.
  • printf is a shell builtin; ls is an external command.  Therefore, the ls might use slightly more resources.
  • When the shell processes a *, it sorts the filenames; ls -f does not sort the filenames.  Therefore, the ls might use slightly fewer resources.

But they have something in common: they will both give wrong results in the presence of filenames that contain a newline and have snp both before and after the newline.
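That failure case is easy to reproduce. The following is only a sketch in a scratch directory, with an invented file name that has snp on both sides of an embedded newline:

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
nl='
'
# One file only, with "snp" before and after the newline in its name.
touch "asnp${nl}bsnp.tsv"

# grep sees two lines, each containing snp, so one file is counted twice.
by_printf=$(printf '%s\n' * | grep -c snp)
by_ls=$(ls -f | grep -c snp)

echo "printf | grep: $by_printf, ls -f | grep: $by_ls"
```

Both pipelines report 2 here although the directory holds a single file.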

Another:

filenamelist=(*snp*)
echo "${#filenamelist[@]}"

This creates a shell array variable listing all the filenames that contain snp, and then reports the number of elements in the array.  The filenames are treated as strings, not lines, so embedded newlines are not an issue.  It is conceivable that this approach could have a problem if the directory is huge, because the list of filenames must be held in shell memory.
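One caveat with the array approach: by default bash leaves an unmatched glob as the literal string *snp*, so a directory with no matching files would report 1. A minimal sketch that works around this, assuming bash (arrays and the nullglob option are not POSIX sh features); count_snp is a hypothetical helper name, not something from the answer above:

```shell
# Hypothetical helper: count names matching *snp* in the current directory.
# The body runs in a subshell ( ... ), so nullglob is not left enabled
# in the calling shell.
count_snp() (
    shopt -s nullglob          # unmatched pattern expands to nothing
    names=(*snp*)
    echo "${#names[@]}"
)

dir=$(mktemp -d) && cd "$dir" || exit 1
empty=$(count_snp)             # nothing matches yet

nl='
'
touch 'a.snps.tsv' "foosnp${nl}bar.tsv" 'unrelated.txt'
filled=$(count_snp)            # two names match, one with a newline

echo "empty directory: $empty, after creating files: $filled"
```

The subshell body is a design choice to keep the option change local; setting and unsetting nullglob around the array assignment would also work.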

Yet another:

Earlier, when we said printf "%s\n" *snp*, the printf command repeated (reused) the "%s\n" format string once for each argument in the expansion of *snp*.  Here, we make a small change in that:

printf "%.0s\n" *snp* | wc -l

This will repeat (reuse) the "%.0s\n" format string once for each argument in the expansion of *snp*.  But "%.0s" means to print the first zero characters of each string — i.e., nothing.  This printf command will output only a newline (i.e., a blank line) for each file that contains snp in its name; and then wc -l will count them.  And, again, you can include the . files by setting dotglob.
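A side-by-side comparison of the two format strings may make the difference clearer. This is a sketch in a scratch directory with invented file names:

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
nl='
'
touch 'a.snps.tsv' "foosnp${nl}bar.tsv"

# "%s\n" prints each name in full, so the embedded newline adds a line.
full=$(( $(printf '%s\n' *snp* | wc -l) ))
# "%.0s\n" prints zero characters of each name: one newline per file.
blank=$(( $(printf '%.0s\n' *snp* | wc -l) ))

echo "with %s: $full lines, with %.0s: $blank lines"
```

Two files match, and only the "%.0s\n" variant reports 2; the "%s\n" variant reports 3.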


Abstract:

Works for files with "odd" names (including newlines).

set -- *snp* ; echo "$#"                             # change positional arguments

count=$(printf 'x%.0s' *snp*); echo "${#count}"      # most shells

printf -v count 'x%.0s' *snp*; echo "${#count}"      # bash

Description

As a simple glob will match every filename with snp in its name, a simple echo *snp* could be enough for this case, but to show clearly that only three files match I'll use:

$ ls -Q *snp*
"Codigo-0275_tdim.snps.tsv"  "foo * bar\tsnp baz.tsv"  "S134_tdim.snps.tsv"

The only issue remaining is to count the files. Yes, grep is a usual solution, and yes, counting newlines with wc -l is also a usual solution. Note that grep -c (count) counts matching lines, not files: if one file name contains a newline with snp on both sides of it, that name is counted twice and the count will be incorrect.

We can do better.

One simple solution is to set the positional arguments:

$ set -- *snp*
$ echo "$#"
3

To avoid changing the positional arguments we can transform each argument to one character and print the length of the resulting string (for most shells):

$ printf 'x%.0s' *snp*
xxx

$ count=$(printf 'x%.0s' *snp*); echo "${#count}"
3

Or, in bash, to avoid a subshell:

$ printf -v count 'x%.0s' *snp*; echo "${#count}"
3
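Another way to leave the caller's positional arguments alone, not mentioned above, is to do the counting inside a function, since a function gets its own "$@". The sketch below uses a hypothetical helper name, count_args, and also handles the case where the glob matches nothing; it assumes the usual sh/bash behaviour of passing an unmatched pattern through literally.

```shell
# Hypothetical helper: report how many arguments it was given.
count_args() {
    # If the glob matched nothing, the shell passes the literal pattern;
    # treat a single argument that does not exist as zero matches.
    if [ "$#" -eq 1 ] && [ ! -e "$1" ] && [ ! -L "$1" ]; then
        echo 0
    else
        echo "$#"
    fi
}

dir=$(mktemp -d) && cd "$dir" || exit 1
set -- keep these args            # stand-in for the caller's own arguments

none=$(count_args *snp*)          # nothing matches yet
touch 'a.snps.tsv' 'b snp c.tsv'
some=$(count_args *snp*)          # two names match

echo "no matches: $none, two files: $some, caller still has $# arguments"
```

The caller's three arguments survive both calls, which is the point of wrapping the count in a function.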

File list

List of files (from the original question, plus one whose name contains a newline):

a='
Codigo-0275_tdim.matches.tsv
Codigo-0275_tdim.snps.tsv
FloragenexTdim_haplotypes_SNp3filter17_single.tsv
FloragenexTdim_haplotypes_SNp3filter17.tsv
FloragenexTdim_SNP3Filter17.fas
S134_tdim.alleles.tsv
S134_tdim.snps.tsv
S134_tdim.tags.tsv'
$ touch $a

touch $'foosnp\nbar.tsv' 

That creates a file whose name has one newline in the middle:

f o o s n p \n b a r . t s v

And to test glob expansion:

$ touch $'foo * bar\tsnp baz.tsv'

That adds a file whose name contains an asterisk, which, if the name is ever expanded unquoted, will match the whole list of files.

Tags:

Bash