How can I operate on all files of a certain type if they might not have the right extension?

0. The script wants to do something like this.

The script shown in your question tries to enumerate files and check if they are JPEGs, but does neither reliably. It tries to pass all the paths to file in a single run and extract both filenames and types from the output of file, which is reasonable since it may be faster than running file again and again for each file. But to do it correctly, you need to be careful about how the paths are passed to file, how file delimits its output, and how you consume that output. You can use this:

#!/bin/bash

find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
    read -r mimetype
    case "$mimetype" in image/jpeg)
        # Bash placed the filename in "$REPLY" -- put commands that use it here.
        # You can have as many commands as you want before the closing ";;" token.
        ;;
    esac
done

That's one of several correct ways. (It does not need to set IFS=; see below.) find with + passes multiple path arguments to file and only runs it as many times as necessary to process them all, usually just once. Credit goes to αғsнιη for the idea of passing --mime-type to file to obtain the MIME type, which contains the information you actually want and is easy to parse.

A detailed explanation follows. I've used the specific task of JPEG compression as an example. That's what the script you showed is for, and lepton has some oddities that should be considered in deciding how to improve that script. If you just want to see a script that runs lepton on each JPEG file, you can skip to section 7. Putting It All Together.

The term path has several definitions. In this answer I use it to mean pathname.

1. Installing lepton

The script you showed is meant to traverse a directory hierarchy, find JPEG images, and process them with the lossless JPEG compressor lepton. For the main motivation of your question, the command may not really matter, but different commands have different syntax. Some commands accept multiple input filenames for a single run. Most accept -- to indicate the end of options. I'll use lepton as my example. The lepton command doesn't accept multiple input filenames and doesn't recognize --.

To use lepton, install it first. It's officially packaged for Ubuntu 17.04 and later (sudo apt install lepton). For earlier Ubuntu releases, or to use a newer version than is packaged for your release, clone its git repository (git clone https://github.com/dropbox/lepton.git) and build the source as instructed in the README. Or you might be able to find a PPA.

Depending how you install it, lepton may be in /usr/bin, /usr/local/bin, or elsewhere. Probably you will want it somewhere in $PATH; then you can run it as lepton. The script you showed uses absolute paths to lepton and the standard utilities mv and rm, but not to the other standard utilities file, find, grep and cut. (This is Bash, so echo--pointless in that script anyway--is a shell builtin. exit is always a builtin.) Though this isn't one of the script's serious flaws, there's no discernible reason for such inconsistency. Unless you're writing a script to tolerate not having $PATH set sensibly--in which case you must use absolute paths for all external commands--I suggest using relative paths for standard commands and those you've installed.
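
If you rely on $PATH, it can help to check up front that the commands are actually there. This is a minimal sketch, not part of the original script; missing_tools is a hypothetical helper name, and it only uses the shell's command -v builtin.

```shell
#!/bin/bash
# Hypothetical helper: print each named command that cannot be found in $PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: report anything the scripts in this answer need but cannot find.
missing=$(missing_tools find file lepton)
if [ -n "$missing" ]; then
    printf 'Not found in $PATH:\n%s\n' "$missing" >&2
fi
```

The check only tells you whether a name resolves; it says nothing about which version you get.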

2. Running lepton

Cautions and General Information

I tested with lepton v1.0-1.2.1-104-g209463a (from Git). lepton was released back in July 2016 so I'd guess the current syntax will keep working. But future versions may add features. If you're reading this years from now, you might check if lepton has added support for tasks that once required scripting.

Please be careful what command-line arguments you pass. For example, I tried running lepton with -verbose as the first argument and art.jpg as the second. It interpreted -verbose as an input filename and quit with an error, but not before truncating art.jpg--which it interpreted as an output filename--down to zero bytes. Fortunately I had a backup!

You can pass zero, one, or two paths to lepton. In all cases, it examines its input file or stream to see if it contains JPEG or Lepton data. JPEG is compressed to Lepton; Lepton is decompressed to JPEG. lepton will remove and add file extensions but doesn't use them to decide what to do.

Zero Filenames — lepton - reads from stdin and writes to stdout.

Thus lepton - < infile > outfile is one way to read from infile and write to outfile, even if their names start with - (like options do). But the method I'll use passes paths that start with ., so I won't have to worry about this.

One Filename — lepton infile reads infile and names its own output file.

This is how the script you showed uses lepton.

If the content of infile looks like a JPEG, lepton outputs a Lepton file; if its content looks like a Lepton file, lepton outputs a JPEG. lepton decides how it wants to name its output file by stripping an extension from infile, if any, and adding either a .jpg or .lep extension depending on what kind of file it is creating. But it does not use the extension it is removing (if any) to infer the type of file it is operating on.

It considers the last . and anything after it as an extension. If infile is a.b.c, you get a.b.lep or a.b.jpg. If the filename starts with a . with no other .s, lepton still regards that as an extension: from a JPEG called .abc you get .lep. Only . in the filename--not directory names--triggers this, so from a Lepton file x/fo.o/abc you get x/fo.o/abc.jpg (which you want), not x/fo.jpg (which would be bad).

If the output filename obtained this way names an existing file, _s are added to the end, after the extension, until it doesn't, and the name with added underscores is used: abc.lep, abc.lep_, abc.lep__, etc., or xyz.jpg, xyz.jpg_, xyz.jpg__, etc.
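
To see in advance what name to expect, the rules just described can be modeled in a few lines of Bash. This is only a sketch of the behavior as I observed it, not lepton's actual code; predict_output_name is a hypothetical helper, and you tell it which extension (lep or jpg) the output would get.

```shell
#!/bin/bash
# Hypothetical helper modeling the naming rules described above.
# $1 = input path, $2 = extension of the file to be created (lep or jpg)
predict_output_name() {
    local path=$1 ext=$2 dir base out
    case "$path" in
        */*) dir=${path%/*}/ base=${path##*/} ;;   # split directory from filename
        *)   dir='' base=$path ;;
    esac
    base=${base%.*}            # drop the last "." and everything after it, if any
    out=$dir$base.$ext
    while [ -e "$out" ]; do    # append "_" until the name is unused
        out+=_
    done
    printf '%s\n' "$out"
}

predict_output_name 'x/fo.o/abc' jpg   # only the filename's dots count
```

Because the result depends on which files already exist, run it from the directory you would run lepton in.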

This works best when your files are named in a sensible way.

Automatically removing and adding extensions and adding underscores avoids a problem you'd otherwise have to manage yourself--preventing data loss when the output file already exists. But it also exposes what might be a deep design flaw in the script you showed. If your files are named sensibly, then all your JPEG files end in .jpg or .jpeg (maybe capitalized), and no non-JPEG files are so named. But then you don't have to examine the files with file to find out which ones are JPEGs!

Thus the premise of the script you showed is that files might not be named reasonably. It's always bad for a script to behave incorrectly or unexpectedly on filenames containing spaces, *, and other special characters. So its behavior of splitting on whitespace and expanding globs (the outer unquoted command substitution, intended just to split separate filenames, does this) is especially bad. See Byte Commander's excellent answer for details. This is probably the worst flaw in the script you showed.

But it's also worth considering what happens to filenames whose last . doesn't conceptually begin a file extension. Suppose Pictures has four files, all JPEGs: 01. Milan wide-angle sunset, 01. Milan wide-angle sunset highres, 02. Kyle birthday party prep - blooper cakes, and 03. The subtle found art of unopened expired paint cans with peeling labels. Then for f in ~/Pictures/0*; do lepton "$f"; done creates 01.lep, 01.lep_, 02.lep, and 03.lep--probably not what you want.

If you have JPEGs not named .jpg or maybe .jpeg, the best general approach is to rename them that way and investigate any naming conflicts that arise while doing so. But that's beyond the scope of this answer.

Those renaming problems happen with JPEGs not named like JPEGs, not non-JPEGs named like JPEGs. Yet even then, there may be a better solution. If the problem is ._ files from macOS and you don't want to delete them, just exclude files with a leading ._ (or even a leading .). Still, passing just one path to lepton avoids data loss (due to its _ appending rules); if the main goal is to exclude non-JPEGs, the basic idea is sound even though the implementation needs fixing.
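
Excluding by name is something find can do by itself. Here is a small sketch in a throwaway directory (the filenames are made up for the demonstration) that uses find's standard ! and -name tests:

```shell
#!/bin/bash
demo=$(mktemp -d)
touch "$demo/photo.jpg" "$demo/._photo.jpg" "$demo/.hidden.jpg"

# Skip only names that start with "._".
without_dot_underscore=$(find "$demo" -type f ! -name '._*' | sort)

# Skip every name that starts with ".".
without_dotfiles=$(find "$demo" -type f ! -name '.*')

printf 'Excluding ._*:\n%s\nExcluding .*:\n%s\n' \
    "$without_dot_underscore" "$without_dotfiles"
rm -r "$demo"
```

The newline-separated output here is for display only. In a real script the same ! -name test would go before -exec, so the excluded files never reach the command.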

So I'll use the one-path lepton infile syntax. But anyone who considers automating lepton like this on strangely named files should remember the generated .lep files may be named in ways that don't reveal the input filenames.

Two Filenames — lepton infile outfile does exactly what you expect.

But just because you expect it doesn't make it the right thing to do.

As with the other ways to run lepton, lepton determines whether infile is a JPEG to be compressed or a Lepton file to be decompressed by examining its content. If infile is a JPEG, lepton writes a Lepton file named outfile; if infile is a Lepton file, lepton writes a JPEG named outfile. With this two-path syntax, lepton doesn't change your specified output filename in any way. It doesn't add or remove extensions or append _s to resolve naming conflicts. If outfile already exists, it is overwritten.

You may want that, but if not and you use this syntax then you have to solve the problem yourself by making your script adjust the output filenames. You may be able to do this in a way that serves you better than lepton's own scheme when run with just one path argument. But I won't try to guess your specific needs and preferences; I'll just use the one-path syntax.

3. Passing Multiple Paths From find to file

The script you showed tries to use file $(find ./ ) to pass one path per argument to file by running find in command substitution. This often won't work, because $(find ./ ) splits on whitespace, which filenames can contain. It is common for files--especially images!--and folders to have spaces in their names. The script you showed treats a path ./abc/foo bar.jpg as two paths, ./abc/foo and bar.jpg. In the best case, neither exists; if they do, you unintentionally operate on the wrong thing. And the original path won't be processed at all.

Although the breadth of this problem can be lessened by setting IFS=$'\n' so word splitting is only performed between lines (\n represents a newline character), this isn't a good solution. Besides being awkward, it can still fail, as file and directory names may contain newlines. I advise against naming files or directories with them except to test programs or scripts for bugs. But such names can be created, including by accident where you don't expect them. The only characters a filename cannot contain are the path separator / and the null character. The null character is thus the only one that can't appear in a path and the only safe choice to delimit lists of arbitrary paths. That's why find has a -print0 action and xargs has a -0 option.
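
The difference is easy to demonstrate. This sketch creates two files in a temporary directory, one with a space in its name, and counts how many items each approach yields; count_args is a hypothetical helper that just reports its argument count.

```shell
#!/bin/bash
count_args() { echo "$#"; }

demo=$(mktemp -d)
touch "$demo/foo bar.jpg" "$demo/plain.jpg"
cd "$demo" || exit 1

# Unquoted command substitution: the shell splits "./foo bar.jpg" in two.
split_count=$(count_args $(find . -type f))

# Null-delimited list: each path arrives intact.
safe_count=0
while IFS= read -rd '' path; do
    safe_count=$((safe_count + 1))
done < <(find . -type f -print0)

echo "word splitting: $split_count arguments; null-delimited: $safe_count paths"
cd / && rm -r "$demo"
```

With two files, word splitting reports three arguments while the null-delimited loop reports two paths.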

This can be done correctly with find . -print0 | xargs -0 ... but you don't need a third utility to pass paths from find to file. find's -exec action is sufficient. Arguments after -exec build the command to run, until \; or +. find ... -exec ... ; runs a command once per file, while find ... -exec ... + passes the command as many paths as it can per run, which is usually faster. Typically all the arguments fit and the command runs just once. In rare cases the command line would be too long and find runs the command more than once. So the + form is only safe for running commands that (a) take their path arguments at the end and (b) work the same in one run with multiple filenames as they do in separate runs.

lepton is an example of a command that must not be run using the + form of -exec because it does not accept multiple source filenames. The first would be the input, the second would be the output, and any others would be superfluous. But many commands do behave the same way when run once with several arguments as when run several times with one argument, and file is one of them.
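
To see the two forms of -exec side by side, this sketch runs a tiny inline sh command that prints how many path arguments it received. The three filenames are arbitrary.

```shell
#!/bin/bash
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c"

# Each run of the inner sh prints the number of paths it was given.
per_file=$(find "$demo" -type f -exec sh -c 'echo "$#"' sh {} \;)
batched=$(find "$demo" -type f -exec sh -c 'echo "$#"' sh {} +)

printf 'with \\;:\n%s\nwith +:\n%s\n' "$per_file" "$batched"
rm -r "$demo"
```

The \; form prints 1 three times (three runs); the + form prints 3 once (one run).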

This command will generate the table:

find . -exec file --mime-type -r0F '' {} +

find replaces the {} argument with a path when it invokes file, and replaces + with as many additional path arguments as will fit.

The options --mime-type -r0F '' passed to file are explained below.

Some people quote {}, e.g., '{}'. It's fine to do so, but neither Bash nor other Bourne-style shells require it. Bash and some other shells support brace expansion, but an empty pair of braces is not expanded. I choose not to quote {}, in light of the misconception that quoting {} prevents find from performing word splitting. Even if your shell required {} to be quoted, this would still have nothing to do with word splitting, because find never does that. (If you wanted word splitting, you'd have to tell find to -exec a shell.) And find can't tell if you've written {} or '{}'--the shell turns '{}' into {} (during quote removal) before passing it to find.
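
You can check this in Bash without involving find at all. In this sketch, show_args is a hypothetical stand-in that prints each argument exactly as an invoked command would receive it.

```shell
#!/bin/bash
# Print every argument, bracketed, exactly as the command receives it.
show_args() { printf '<%s>' "$@"; echo; }

unquoted=$(show_args {} +)
quoted=$(show_args '{}' +)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Both lines show <{}><+>, so the command cannot tell which spelling you typed.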

4. Emitting a Usable ⟨Path, File Type⟩ Table with file

The Problem

The reason I must pass some options to file--and can't just use find . -exec file {} +--is that the table file generates by default is ambiguous:

01. Milan wide-angle sunset:                  JPEG image data, JFIF standard 1.01, resolution (DPI), density 1x1, segment length 16, baseline, precision 8, 1400x1400, frames 3
02. Kyle birthday party prep - blooper cakes: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 512x512, frames 3
first line
second line:                       JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 500x500, frames 3

Those three rows look like four; one filename contains a newline. Filenames can also contain colons, so it won't always be clear where the filename ends. Way more confusing examples than shown above are possible.

The description column also has way more information than we need. Byte Commander explains one reason grepping for JPEG in each whole row returns wrong results: a non-JPEG file with JPEG in its name gives a false positive. (The point of checking the type is that you can't rely on the name, so this is quite a self-defeating bug in the script you showed.) But even when you know you're looking in the description column, it may still contain JPEG even if that's not the type:

$ touch empty.JPEG  # not a JPEG
$ gzip -k empty.JPEG
$ file empty.JPEG*
empty.JPEG:    empty
empty.JPEG.gz: gzip compressed data, was "empty.JPEG", last modified: Mon Aug 28 16:37:56 2017, from Unix

Byte Commander's answer solved this by (a) passing the -b option to file, causing it to omit the paths, : separator, and spaces in front of the type, then (b) using grep to check if the description begins with JPEG (the ^ anchor in the pattern ^JPEG image data, does this). This works if you keep track of the paths passed to file--not a problem for Byte Commander's method, which ran file separately for each path anyway.
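
The effect of the ^ anchor can be seen without running file, by feeding grep two of the descriptions shown above. This is just an illustration; is_jpeg_description is a hypothetical helper name.

```shell
#!/bin/bash
# Descriptions copied from the file output shown earlier in this answer.
jpeg_desc='JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 512x512, frames 3'
gzip_desc='gzip compressed data, was "empty.JPEG", last modified: Mon Aug 28 16:37:56 2017, from Unix'

# Succeeds only if the description begins with "JPEG image data,".
is_jpeg_description() { grep -q '^JPEG image data,' <<< "$1"; }

for desc in "$jpeg_desc" "$gzip_desc"; do
    if is_jpeg_description "$desc"; then
        echo "JPEG:     ${desc%%,*}"
    else
        echo "not JPEG: ${desc%%,*}"
    fi
done
```

An unanchored grep JPEG would have accepted the gzip description too, since it mentions empty.JPEG.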

The Solution

I must use a different solution, because my goal is to parse both paths and types from file's output so that file needn't be run separately for each file. Fortunately file in Ubuntu has many options. I use file --mime-type -r0F '' paths:

  • --mime-type prints a MIME type rather than a detailed description. This is all I need, and then I can just perform an exact match against the whole thing. For a JPEG, file --mime-type shows image/jpeg in the description column. (See also αғsнιη's answer.)
  • According to man file, -r causes unprintable characters not to be replaced with octal escapes like \003. I believe I would otherwise need to add a step to convert such sequences back to the actual characters, which probably can't be done reliably--what if such a sequence appears literally in a filename? (file doesn't escape \ as \\.) I say "I believe" as I haven't managed to get file to print out such an escape sequence, and I'm not sure it really does so in the filename column. Either way, -r is safe here.
  • -0 is the key option here. Without it, this method couldn't work reliably. It makes file print a null character--the one character that is never allowed in paths because it is usually used to mark the ends of strings in C programs--immediately after the filename. This marks the break, in each row, between the two columns of the table.
  • -F '' makes file print nothing ('' is an empty argument) instead of :. The colon is unreliable (it can appear in filenames) and of no benefit here since a null character is already being printed to indicate the end of the path column and the start of the description column.

To make find run file --mime-type -r0F '' paths I use -exec file --mime-type -r0F '' {} +. find's -exec action replaces {} + with the paths.

5. Consuming the Table

I created the table this way:

find . -exec file --mime-type -r0F '' {} +

As detailed above, this places a null character after each path. It would be handy if the description were also null-terminated, but file won't do that--the description always ends with a newline. So I must alternately read until a null character, then assume there is more text and read it until a newline. I must do this for each file and stop when nothing is left.

Reading Each Row

That combination--read text that may contain a newline until a null character, then read text that can't contain a newline until a newline--isn't how any of the common Unix utilities are normally used. The approach I will take is to pipe the output of find to a loop. Each iteration of the loop reads a single row of the table by using the read shell builtin twice, with different options.

To read the path, I use:

read -rd ''
  • -r is read's only standard option and you should almost always use it. Without it, backslash escapes like \n from the input are translated into the characters they represent. We don't want that.
  • Normally, read reads until it sees a newline. To ignore newlines and stop at a null character instead, I use the -d option, which Bash provides, to specify a different character. For a null character, pass the empty argument ''.
  • I'm already using a Bash extension (the -d option), so I may as well avail myself of Bash's default behavior when no variable name is passed to read. It puts everything it read--except the terminating character--in the special variable $REPLY. Normally read strips whitespace ($IFS characters) from the beginning and end of the input, and it's a common idiom to write IFS= read ... to prevent that. When reading implicitly to $REPLY in Bash, this is not necessary.
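
The whitespace behavior is easy to confirm with a here-string. This sketch reads the same padded input three ways; the variable names named and kept are arbitrary.

```shell
#!/bin/bash
read -r named     <<< '  padded  '   # named variable: whitespace is stripped
IFS= read -r kept <<< '  padded  '   # IFS= idiom: whitespace is kept
read -r           <<< '  padded  '   # implicit $REPLY: whitespace is kept

printf '[%s] [%s] [%s]\n' "$named" "$kept" "$REPLY"
```

This prints [padded] [  padded  ] [  padded  ].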

To read the description, I use:

read -r mimetype
  • No backslashes should appear in the MIME type, but it's good practice to pass -r to read unless you want \ escapes translated.
  • This time, I am specifying a variable name explicitly. Call it what you like. I've chosen mimetype.
  • This time, the absence of IFS= to prevent leading and trailing whitespace from being stripped is significant. I want it removed. This drops the spaces from the beginning of the description that file writes to make the table more human-readable when it is shown in a terminal.
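
Here is how the two read commands behave on a single row. To keep the sketch self-contained, the row is simulated with printf instead of coming from file; it has the shape described above (path, null character, padding, MIME type, newline), and the path deliberately contains a newline and trailing spaces.

```shell
#!/bin/bash
# Simulated row; no real file command is run here.
row() { printf './fo.o/abc\ndef   \0   image/jpeg\n'; }

{
    read -rd ''           # path goes to $REPLY with all whitespace preserved
    path=$REPLY
    read -r mimetype      # default IFS handling strips the padding
} < <(row)

printf '[%s] [%s]\n' "$path" "$mimetype"
```

The path keeps its newline and trailing spaces, while the MIME type comes out as a bare image/jpeg.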

Composing the Loop

The loop should continue as long as there is another path to be read. The read command returns true (in shell programming this is zero, unlike almost all other programming languages) when it successfully reads something, and false (in shell programming, any nonzero value) when it doesn't. So the common while read idiom is useful here. I pipe (|) the output of find--which is the output of one or (rarely) more file commands--to the while loop.

find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
    read -r mimetype
    # Commands using "$REPLY" and "$mimetype" go here.
done

Inside the loop, I read the rest of the row to obtain the description (read -r mimetype). I don't bother checking if this succeeded. file should only ever output complete rows even if it encounters errors. (file sends error and warning messages to standard error, so they won't appear in the pipeline to corrupt the table.) You should be able to rely on this.

If you want to check if read -r mimetype succeeded anyway, you can use if. Or you can include it in the while loop condition:

find . -exec file --mime-type -r0F '' {} + |
while read -rd '' && read -r mimetype; do
    # Commands using "$REPLY" and "$mimetype" go here.
done

You can see I also split the top line for readability. (No \ is required to split at |.)
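
One Bash detail worth knowing if you adapt the loop: by default, each part of a pipeline runs in a subshell, so variables set inside a piped while loop are gone once it finishes. If you need them afterward, you can feed the loop with process substitution instead. The following sketch assumes that variation and uses a simulated table (the table function and its three rows are made up) so it runs without file:

```shell
#!/bin/bash
# Simulated table: three rows in the format produced by file --mime-type -r0F ''.
table() {
    printf './a b.jpg\0 image/jpeg\n./notes.txt\0 text/plain\n./c\nd\0 image/jpeg\n'
}

jpegs=()
while read -rd '' && read -r mimetype; do
    case "$mimetype" in image/jpeg) jpegs+=("$REPLY");; esac
done < <(table)

printf '%d JPEG paths collected\n' "${#jpegs[@]}"
```

With a real directory, you would replace table with the find command shown above.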

Testing the Loop

If you want to test the loop before proceeding, you can put this command under (or instead of) the # Commands... comment:

    printf '[%s] [%s]\n\n' "$REPLY" "$mimetype"

The loop output looks something like this, depending on what you have in the directory (and I have left out most entries, for brevity):

[.] [inode/directory]

[./stuv] [inode/x-empty]

[./ghi
jkl] [inode/x-empty]

[./fo.o/abc
def   ] [image/jpeg]

[./fo.o/wyz.lep] [application/octet-stream]

[./fo.o/wyz] [image/jpeg]

This is just to see if the loop works right. Placing the table's entries in [ ] like this wouldn't help the script do what it needs to do, as paths may contain [, ], and consecutive newlines.

6. Using the Extracted Path and File Type

In each iteration of the loop, "$REPLY" contains the path and "$mimetype" contains the type description. To find out if "$REPLY" names a JPEG file, check if "$mimetype" is exactly image/jpeg.

You can compare strings using if and [/test (or [[) with =. But I prefer case:

find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
    read -r mimetype
    case "$mimetype" in image/jpeg)
        # Put commands here that use "$REPLY".
        ;;
    esac
done
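
For comparison, here is a sketch of the same exact-match test written with [, with [[, and with case. The three function names are hypothetical; all of them give the same answers for these inputs.

```shell
#!/bin/bash
check_bracket()  { if [ "$1" = image/jpeg ]; then echo yes; else echo no; fi; }
check_dbracket() { if [[ $1 == image/jpeg ]]; then echo yes; else echo no; fi; }
check_case()     { case "$1" in image/jpeg) echo yes;; *) echo no;; esac; }

for t in image/jpeg image/png; do
    printf '%s: %s %s %s\n' "$t" \
        "$(check_bracket "$t")" "$(check_dbracket "$t")" "$(check_case "$t")"
done
```

Which one to use is mostly a matter of taste; case has the advantage that adding more types later is just another pattern.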

If you just wanted to show the JPEGs' paths in the same format as above--to help test with paths containing newlines--the entire case...esac statement could be:

    case "$mimetype" in image/jpeg) printf '[%s]\n\n' "$REPLY";; esac

But the goal is to run lepton on each JPEG file. To do that, use:

    case "$mimetype" in image/jpeg) lepton "$REPLY";; esac

7. Putting It All Together

Adding that lepton command, and a hashbang line to run it with Bash, here's the complete script:

#!/bin/bash

find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
    read -r mimetype
    case "$mimetype" in image/jpeg) lepton "$REPLY";; esac
done

lepton reports what it is doing but it doesn't show filenames. This alternative script prints a message with each path before running lepton on it:

#!/bin/bash

find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
    read -r mimetype
    case "$mimetype" in image/jpeg)
        printf '\nProcessing "%s":\n' "$REPLY" >&2
        lepton "$REPLY"
    esac
done

I've printed the messages to standard error (>&2), since that's where lepton sends its own messages. That way, the output all stays together when piped or redirected. Running that script produces output like this (but more of it if you have more than two JPEGs):

Processing "./art.jpg":
lepton v1.0-1.2.1-104-g209463a
6777856 bytes needed to decompress this file
56363 86007
65.53%
2635854 bytes needed to decompress this file
56363 86007
65.53%

Processing "./fo.o/abc
def   ":
lepton v1.0-1.2.1-104-g209463a
6643508 bytes needed to decompress this file
36332 46875
77.51%
2456117 bytes needed to decompress this file
36332 46875
77.51%

The repetition in each stanza--which also appears when you run lepton without printing filenames--is because lepton checks that its output files can decompress correctly.

The script you showed had exit 0 at the end. You can do that if you like. It causes the script to always report success. Otherwise the script returns the exit status of the last command run--which is probably preferable. Either way, it may report success even if find, file, or lepton encountered problems, if the last lepton command succeeded. You can, of course, expand the script with more sophisticated error handling code.
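
As one possible starting point for such error handling, this sketch counts failed runs. It makes several assumptions: process_one is a hypothetical stand-in for lepton "$REPLY", the table is simulated rather than produced by file, and the loop reads from process substitution so that the counter survives the loop.

```shell
#!/bin/bash
# Hypothetical stand-in for: lepton "$REPLY". It fails for one made-up path.
process_one() { [ "$1" != ./bad.jpg ]; }

# Simulated table in the format produced by file --mime-type -r0F ''.
table() {
    printf './ok.jpg\0 image/jpeg\n./bad.jpg\0 image/jpeg\n./z.txt\0 text/plain\n'
}

failures=0
while read -rd '' && read -r mimetype; do
    case "$mimetype" in image/jpeg)
        process_one "$REPLY" || failures=$((failures + 1))
        ;;
    esac
done < <(table)

echo "failures: $failures" >&2
# A real script could finish with: exit "$((failures > 0))"
```

This still does not notice problems in find or file themselves; it only tracks the command run on each JPEG.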

8. Maybe You Want The Paths, Too

If you want to generate a list of paths separate from lepton's own output, you can take advantage of lepton's behavior of writing to standard error by printing the paths to standard output instead. In that case, you probably want to print just the paths and not a "Processing" message. You may optionally want to terminate the paths with null characters instead of newlines, as then you can process the list without breaking on paths that contain newlines.

#!/bin/bash

case "$1" in
    -0) format='%s\0';;
    *)  format='%s\n';;
esac

find . -exec file --mime-type -r0F '' {} + | while read -rd ''; do
    read -r mimetype
    case "$mimetype" in image/jpeg)
        printf "$format" "$REPLY"
        lepton "$REPLY"
    esac
done

When you run that script, you can pass the -0 flag to make it emit null characters instead of newlines. That script does not do proper Unix-style option processing: it only checks the first argument you pass; passing the flag repeatedly in the same argument (-00) doesn't work; and no option-related error messages are ever generated. This limitation is for brevity, and because you probably don't need anything more sophisticated, as the script doesn't support any non-option arguments and -0 is the only possible option.
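
If you do want more conventional option handling, the getopts builtin is the usual tool. This is a sketch only; parse_format is a hypothetical function that prints the format string the script above would use.

```shell
#!/bin/bash
# Hypothetical helper: choose the printf format based on an optional -0 flag.
parse_format() {
    local OPTIND=1 opt format='%s\n'
    while getopts 0 opt; do
        case "$opt" in
            0) format='%s\0' ;;
            *) return 2 ;;      # getopts has already printed an error message
        esac
    done
    printf '%s' "$format"
}

format=$(parse_format "$@") || exit
```

With this, -00 is accepted and unknown options produce an error message and a nonzero exit status.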

On my system I called that script jpeg-lep3 and put it in ~/source, then ran ~/source/jpeg-lep3 -0 > out, which printed just lepton's output to my terminal. If you do something like that, you can test that null characters were properly written between paths using:

xargs -0 printf '[%s]\n\n' < out

Code first:

Let's do this with Bash's special globs and a for loop:

#!/bin/bash
shopt -s globstar dotglob

for f in ./** ; do 
    if file -b -- "$f" | grep -q '^JPEG image data,' ; then 

        # do whatever you want with the JPEG file "$f" in here:
        md5sum -- "$f"

    fi
done

Explanation:

First of all, we need to make the Bash globs more useful by enabling the globstar and dotglob shell options. Here is their description from man bash in the SHELL BUILTIN COMMANDS section about shopt:

 dotglob 
    If set, bash includes filenames beginning with a `.' in the results of 
    pathname expansion.
 globstar
    If set, the pattern ** used in a pathname expansion context will match 
    all files and zero or more directories and subdirectories. If the pattern
    is followed by a /, only directories and subdirectories match.

Then we use this new "recursive glob" ./** in a for loop to iterate over all files and folders inside the current directory and all its subdirectories. Please always use absolute paths or explicit relative paths starting with a ./ or ../ in your globs, not just **, to prevent problems with special file names like ~.
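
To check what the recursive glob picks up, here is a small sketch that builds a throwaway directory (the filenames are made up) and counts the regular files that ./** reaches, including a hidden one and one in a subdirectory:

```shell
#!/bin/bash
shopt -s globstar dotglob

demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/top.txt" "$demo/.hidden" "$demo/sub/nested file.txt"
cd "$demo" || exit 1

regular_files=0
for f in ./**; do
    # The glob also yields directories, so count regular files only.
    if [ -f "$f" ]; then
        regular_files=$((regular_files + 1))
    fi
done

echo "regular files found: $regular_files"
cd / && rm -r "$demo"
```

All three files are found; without dotglob, .hidden would be skipped.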

Now we test each file (and folder) name with the file command for its contents. The -b option prevents it from printing the file name again before the content information string, which makes filtering safer.

Now we know that the content information of all valid JPG/JPEG files must start with JPEG image data,, which is what we test the output of file for with grep. We use the -q option to suppress any output, as we are only interested in grep's exit code, which indicates if the pattern matched or not.

If it matched, the code inside the if/then block will be executed. We can do anything we want in here. The current JPEG filename is available in the shell variable $f. We just have to make sure to always put it in double quotes to prevent the accidental evaluation of filenames with special characters like spaces, newlines, or symbols. It is also usually best to separate it from other arguments by placing it after --, which causes most commands to interpret it as a filename even if it's something like -v or --help that would otherwise be interpreted as an option.
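
Here is a quick sketch of why -- matters, using a file that is literally named -v in a temporary directory:

```shell
#!/bin/bash
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'hello\n' > ./-v            # a file literally named "-v"

f=-v
with_dashes=$(cat -- "$f")         # "--" ends option parsing, so -v is a filename
with_prefix=$(cat "./$f")          # an explicit ./ prefix works too

echo "$with_dashes / $with_prefix"
cd / && rm -r "$demo"
```

Both calls read the file's content. Without either safeguard, cat would treat -v as an option.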


Bonus question:

Time to blow up some code, for science! Here is the version from your question/book:

for jpeg in `echo "$(file $(find ./ ) | grep JPEG | cut -f 1 -d ':')"`
do
     /path/to/command "$jpeg"
done

First of all, allow me to mention how needlessly complex this is. We have 4 levels of nested subshells, using mixed command substitution syntaxes (`` and $()), which are only necessary because of the incorrect/suboptimal usage of find.

Here find just lists all files and prints their names, one per line. Then the full output is passed to file to examine each of them. But wait! One file name per line? What about file names containing newlines? Right, those will break it!

$ ls --escape ne*ne
new\nline
$ file $(find . -name 'ne*ne' )
./new: cannot open `./new' (No such file or directory)
line:  cannot open `line' (No such file or directory)

Actually even simple spaces break it too, because the shell's word splitting treats them as separators as well, before file ever sees the arguments. You can't even quote the "$(find ./ )" here as a remedy, because that would then quote the whole multi-line output as one single filename argument.

$ ls simple*
simple spaces.jpg
$ file $(find ./ -name 'simple*')
./simple:   cannot open `./simple' (No such file or directory)
spaces.jpg: cannot open `spaces.jpg' (No such file or directory)

Next step, the file output gets scanned with grep JPEG. Don't you think it's a bit easy to trick such a simple pattern, especially as the output of plain file always contains the file name as well? Basically everything with "JPEG" in its file name will trigger a match, no matter what it contains.

$ echo "to be or not to be" > IAmNoJPEG.txt
$ file IAmNoJPEG.txt | grep JPEG
IAmNoJPEG.txt: ASCII text

Okay, so we have the file output of all JPEG files (or those that pretend to be one), now they process all lines with cut to extract the original file name from the first column, separated by a colon... Guess what, let's try this on a file with a colon in its name:

$ ls colon*
colons:evil.jpeg
$ file colon* | grep JPEG | cut -f 1 -d ':'
colons

So to conclude, the approach from your book works, but only if the names of all files it checks contain no spaces, newlines, colons, or (probably) other special characters, and do not contain the string "JPEG" anywhere. It is also kind of ugly, but as beauty lies in the eye of the beholder, I'm not going to ramble about that.