Find all the PDFs with at least three characters in their name

Here it's easier with standard wildcards:

find ~ -name '*???.[pP][dD][fF]'

Or with some find implementations (those that support -regex also support -iname):

find ~ -iname '*???.pdf'
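If the minimum length is not fixed at 3, the run of ? can be generated rather than typed. Below is a minimal sketch, assuming a POSIX sh; the function name find_pdf_min and its two arguments are hypothetical, not part of find:

```shell
# Hypothetical helper: list files under directory $1 whose name has at
# least $2 characters before a case-insensitive ".pdf" extension.
find_pdf_min() {
  dir=$1
  min=$2
  # Build a run of $min question marks, e.g. min=3 gives "???".
  marks=
  i=0
  while [ "$i" -lt "$min" ]; do
    marks="$marks?"
    i=$((i + 1))
  done
  # Same pattern as above, with the number of ? made variable.
  find "$dir" -name "*${marks}.[pP][dD][fF]"
}
```

Called as find_pdf_min ~ 5, it would list PDFs with at least five characters before the extension.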

For an arbitrary number of characters instead of 3, you may prefer to fall back on -iregex where available (see @Stephen Kitt's answer), or you could use zsh or ksh93 globs:

  • zsh:

    set -o extendedglob # best in ~/.zshrc
    printf '%s\n' ~/**/?(#c3,).(#i)pdf(D)
    

    (the (D) qualifier makes the glob consider hidden files and files in hidden directories, as find does)

    • (#cx,y) is the zsh wildcard equivalent of regexp {x,y}
    • (#i) for case insensitive
    • ? standard wildcard for any single character (like regexp .)
    • **/: any level of subdirectories (including 0)
  • ksh93:

    FIGNORE='@(.|..)' # to consider hidden files
    set -o globstar
    printf '%s\n' **/{3,}(?).~(i:pdf)
    
    • @(x|y): extended ksh wildcard operator similar to regexp (x|y).
    • FIGNORE: special variable which controls what files are ignored by globs. When set, the usual ignoring of hidden files is not done, but we still want to ignore the . and .. directory entries where present.
    • {x,y}(z) is ksh93's equivalent of regexp z{x,y}.
    • ~(i:...): case-insensitive matching.

Globs have some extra advantages over find here: you get a sorted list (you can disable that sorting in zsh with the oN glob qualifier, or use different sorting criteria), and they also work when filenames contain sequences of bytes that don't form valid characters. For instance, in a locale using the UTF-8 charset, the find approach would fail to report a $'St\xE9phane Chazelas - CV.pdf' file, because that \xE9 byte does not form a valid character and so is not matched by regexp . or wildcard ? or * with GNU find.


Assuming you’re using GNU find (which you probably are, since -iregex is a GNU extension to POSIX find), -regex and -iregex default to Emacs regular expressions, which don’t recognise {3,}. You need to specify a different type of regular expressions using the -regextype option; in addition, you need to adjust your regular expression to the fact that the expression matches against the full path:

find ~ -regextype posix-extended -iregex '.*/[^/]{3,}.pdf'

You should also escape the . so that it matches “.” rather than any character:

find ~ -regextype posix-extended -iregex '.*/[^/]{3,}\.pdf'

The regular expression can be simplified, since it is enough for the three characters just before “.pdf” to be non-“/” characters; the leading .* absorbs everything else:

find ~ -regextype posix-extended -iregex '.*[^/]{3}\.pdf'
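To check that the simplified expression selects the same paths as the longer one, both can be tried against a few sample paths. This is only a sketch: grep -Eix stands in for find -regextype posix-extended -iregex (an assumption; -x makes grep match the whole line, much as find matches the whole path), and the sample paths are made up:

```shell
# Hypothetical sample paths, one per line.
paths='/home/u/ab.pdf
/home/u/abc.pdf
/home/u/abcd.PDF
/home/u/abc.pdf/x'

# -E: POSIX extended syntax; -i: ignore case, like -iregex;
# -x: the expression must match the whole line.
long=$(printf '%s\n' "$paths" | grep -Eix '.*/[^/]{3,}\.pdf')
short=$(printf '%s\n' "$paths" | grep -Eix '.*[^/]{3}\.pdf')

printf '%s\n' "$short"
if [ "$long" = "$short" ]; then
  echo 'both expressions select the same paths'
fi
```

Only /home/u/abc.pdf and /home/u/abcd.PDF are selected: ab.pdf has too few characters before the extension, and the last path does not end in .pdf.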

For completeness, with FreeBSD or NetBSD find (another implementation that supports -iregex; not yours though, as .+ wouldn't work there without -E), you'd write:

find ~ -iregex '.*[^/]\{3\}\.pdf'

or:

find -E ~ -iregex '.*[^/]{3}\.pdf'

Without -E, those are basic regular expressions (like in grep); with -E, they are extended regular expressions (like in grep -E).

With ast-open's find:

find ~ -iregex '.*[^/]{3}\.pdf'

(that's extended regexps out of the box).


How do I know they're PDFs?

You don't unless you ask. Sure, I'm being pedantic, but you didn't ask about files with .pdf in their names. Just because a file has the characters .pdf in the filename does not make it a PDF file.

In fact, let's be all-the-way pedantic about this: if the last four characters of a file's name are .pdf, then it will always have more than three characters in its name.

So doing this the wrong way, you might say:

$ find . -type f -name "*???.pdf"
./Documents/McLaren 720s Coupe:Order Summary.pdf
./Documents/Setup_MagicISO.exe.pdf

See that second one? It's actually an executable. (I know, I changed the name.) And I'm also missing a PDF I coulda sworn was in the Documents directory...

$ ls Documents
McLaren 720s Coupe:Order Summary.pdf
Pioneer Premier DEH-P490IB CD Install Manual.PDF
Setup_MagicISO.exe.pdf

So using -iname we could find that one, but that's still turning up this not-a-PDF file.

What we really want to do in this case is examine the file's magic number using the file command. Its --mime option outputs the MIME type, which is simpler to parse. The find query then becomes a simple -name "???*".

$ find . -type f -name "???*" -print0|xargs -0 file --mime
./.bash_history:                                              text/plain; charset=us-ascii
./.bash_logout:                                               text/plain; charset=us-ascii
./.bashrc:                                                    text/plain; charset=us-ascii
./.profile:                                                   text/plain; charset=us-ascii
./Documents/McLaren 720s Coupe:Order Summary.pdf:             application/pdf; charset=binary
./Documents/Pioneer Premier DEH-P490IB CD Install Manual.PDF: application/pdf; charset=binary
./Documents/Setup_MagicISO.exe.pdf:                           application/x-dosexec; charset=binary
./Downloads/Setup_MagicISO.exe:                               application/x-dosexec; charset=binary
./Downloads/WindowsUpdate.diagcab:                            application/vnd.ms-cab-compressed; charset=binary

Let's use the colon as the field delimiter, keep the lines whose last field contains the MIME type application/pdf, then blank that last field and print what remains; sed strips the trailing colon left behind. Take note: one of my files has a colon in its name, so I can't just ask awk to test $2 and print $1.

$ find . -type f -name "???*" -print0|xargs -0 file --mime|awk -F: '($NF~"application/pdf"){OFS=":";$NF="";print}'|sed s/:$//
./Documents/McLaren 720s Coupe:Order Summary.pdf
./Documents/Pioneer Premier DEH-P490IB CD Install Manual.PDF
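The colon parsing can be sidestepped by asking file about one name at a time and printing the path ourselves. Here is a minimal sketch, assuming a file implementation that supports -b (brief, no filename prefix) and --mime-type; the function name find_real_pdfs is hypothetical. It runs one file process per candidate, so it trades speed for simpler parsing:

```shell
# Hypothetical helper: print regular files under $1 whose name has at
# least three characters and whose content is detected as a PDF.
find_real_pdfs() {
  find "$1" -type f -name '???*' -exec sh -c '
    for f do
      # -b omits the "name:" prefix, so the output is only the MIME type.
      if [ "$(file -b --mime-type "$f")" = application/pdf ]; then
        printf "%s\n" "$f"
      fi
    done' sh {} +
}
```

Called as find_real_pdfs . it should report the same files as the pipeline above, whatever characters their names contain apart from newlines.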

Now let's finish up by contriving PDF files named a and abc, neither of which has a .pdf extension; only abc has a name long enough to be reported:

$ mkdir Documents/other
$ cp -a Documents/McLaren\ 720s\ Coupe\:Order\ Summary.pdf Documents/other/a
$ cp -a Documents/Pioneer\ Premier\ DEH-P490IB\ CD\ Install\ Manual.PDF  Documents/other/abc
$ find . -type f -name "???*" -print0|xargs -0 file --mime|awk -F: '($NF~"application/pdf"){OFS=":";$NF="";print}'|sed s/:$//
./Documents/McLaren 720s Coupe:Order Summary.pdf
./Documents/Pioneer Premier DEH-P490IB CD Install Manual.PDF
./Documents/other/abc

That's all. I know I'll probably get dinged for being horribly pedantic, but in my job with thousands of NFS volumes to hunt and all kinds of poorly-named files, I wish more people would be pedantic.

Edited to add: in the real world, I might want to make use of updatedb to build a searchable file index, locate instead of find to read that index, and parallel instead of xargs to thread 'er up. That's somewhat outside the scope of this question though. I wrote that with a straight face, too. Why do I care so much? I might be looking for movie and audio files; or certain types of photographs; or binary executables in a project data directory.
