Why *not* parse `ls` (and what to do instead)?

I am not at all convinced of this, but let's suppose for the sake of argument that you could, if you're prepared to put in enough effort, parse the output of ls reliably, even in the face of an "adversary" — someone who knows the code you wrote and is deliberately choosing filenames designed to break it.

Even if you could do that, it would still be a bad idea.

Bourne shell is not a good language. It should not be used for anything complicated, unless extreme portability is more important than any other factor (e.g. autoconf).

I claim that if you're faced with a problem where parsing the output of ls seems like the path of least resistance for a shell script, that's a strong indication that whatever you are doing is too complicated for shell and you should rewrite the entire thing in Perl or Python. Here's your last program in Python:

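import os, sys
# Walk the tree and print inode, directory, and basename for every entry.
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        # lstat so that symlinks are reported themselves, not their targets
        ino = os.lstat(os.path.join(subdir, f)).st_ino
        sys.stdout.write("%d %s %s\n" % (ino, subdir, f))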

This has no issues whatsoever with unusual characters in filenames -- the output is ambiguous in the same way the output of ls is ambiguous, but that wouldn't matter in a "real" program (as opposed to a demo like this), which would use the result of os.path.join(subdir, f) directly.
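As a rough sketch of what "using the result of os.path.join(subdir, f) directly" could look like, the generator below yields the joined path together with its inode number, so the caller never has to parse printed text. The name `iter_inodes` and its `root` parameter are made up for this illustration.

```python
import os

def iter_inodes(root="."):
    """Yield (path, inode) for every entry below root (hypothetical helper)."""
    for subdir, dirs, files in os.walk(root):
        for name in dirs + files:
            # The joined path is used as-is; nothing is printed and re-parsed.
            path = os.path.join(subdir, name)
            yield path, os.lstat(path).st_ino

if __name__ == "__main__":
    for path, ino in iter_inodes("."):
        # repr() keeps odd characters such as newlines visible and unambiguous
        print(ino, repr(path))
```

Because each path stays a Python string from start to finish, spaces, tabs, and newlines in filenames need no special handling.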

Equally important, and in stark contrast to the thing you wrote, it will still make sense six months from now, and it will be easy to modify when you need it to do something slightly different. By way of illustration, suppose you discover a need to exclude dotfiles and editor backups, and to process everything in alphabetical order by basename:

import os, sys
filelist = []
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        # skip dotfiles and editor backups
        if f.startswith('.') or f.endswith('~'):
            continue
        lstat = os.lstat(os.path.join(subdir, f))
        filelist.append((f, subdir, lstat.st_ino))

# sort alphabetically by basename
filelist.sort(key=lambda x: x[0])
for f, subdir, ino in filelist:
    sys.stdout.write("%d %s %s\n" % (ino, subdir, f))

That link is referenced a lot because the information is completely accurate, and it has been there for a very long time.


Yes, ls replaces non-printable characters with glob characters, but those characters aren't in the actual filename. Why does this matter? Two reasons:

  1. If you pass that name to a program, no file by that name actually exists. The program would have to expand the glob to get the real filename.
  2. The file glob might match more than one file.

For example:

$ touch a$'\t'b
$ touch a$'\n'b
$ ls -1
a?b
a?b

Notice how we have 2 files which look exactly the same. How are you going to distinguish them if they both are represented as a?b?
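The collision can be shown without touching the filesystem. In this minimal Python sketch, `ls_q_display` is a hypothetical helper that only approximates the behaviour described here (a `?` for each non-printable character); it is not how ls itself is implemented.

```python
def ls_q_display(name: str) -> str:
    """Approximate `ls -q`: show '?' in place of each non-printable character."""
    return "".join(ch if ch.isprintable() else "?" for ch in name)

names = ["a\tb", "a\nb"]            # two distinct filenames
shown = [ls_q_display(n) for n in names]

print(shown)                        # ['a?b', 'a?b']
print(len(set(names)), "names ->", len(set(shown)), "displayed form")
```

Two different names collapse into one displayed form, and nothing in that form records which character was replaced.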


The author calls it garbling filenames when ls returns a list of filenames containing shell globs and then recommends using a shell glob to retrieve a file list!

There is a difference here. When you get a glob back from ls, as shown, that glob might match more than one file. However, when you iterate over the results of expanding a glob, you get back each exact filename, not a glob.

For example:

$ for file in *; do printf '%s' "$file" | xxd; done
0000000: 6109 62                                  a.b
0000000: 610a 62                                  a.b

Notice how the xxd output shows that $file contained the raw characters \t and \n, not ?.

If you use ls, you get this instead:

$ for file in $(ls -1q); do printf '%s' "$file" | xxd; done
0000000: 613f 62                                  a?b
0000000: 613f 62                                  a?b
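The same check can be made outside the shell. This sketch, which assumes a POSIX filesystem that allows tabs and newlines in filenames, creates the two files in a temporary directory and reads the names back through `os.listdir`; `exact_names` is an illustrative helper, not something from the original discussion.

```python
import os
import tempfile

def exact_names(directory):
    """Return directory entries exactly as stored, sorted for stable output."""
    return sorted(os.listdir(directory))

with tempfile.TemporaryDirectory() as d:
    for name in ("a\tb", "a\nb"):
        open(os.path.join(d, name), "w").close()
    found = exact_names(d)
    for name in found:
        # hex of the raw bytes, comparable to the xxd output
        print(os.fsencode(name).hex())
```

As with the shell glob, the names come back with the real tab (09) and newline (0a) bytes rather than a `?` (3f).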

"I'm going to iterate anyway, why not use ls?"

The example you gave doesn't actually work. It looks like it works, but it doesn't.

I'm referring to this:

 for f in $(ls -1q | tr " " "?") ; do [ -f "$f" ] && echo "./$f" ; done

I've created a directory with a bunch of file names:

$ for file in *; do printf '%s' "$file" | xxd; done
0000000: 6120 62                                  a b
0000000: 6120 2062                                a  b
0000000: 61e2 8082 62                             a...b
0000000: 61e2 8083 62                             a...b
0000000: 6109 62                                  a.b
0000000: 610a 62                                  a.b

When I run your code, I get this:

$ for f in $(ls -1q | tr " " "?") ; do [ -f "$f" ] && echo "./$f" ; done
./a b
./a b

Where'd the rest of the files go?

Let's try this instead:

$ for f in $(ls -1q | tr " " "?") ; do stat --format='%n' "./$f"; done
stat: cannot stat ‘./a?b’: No such file or directory
stat: cannot stat ‘./a??b’: No such file or directory
./a b
./a b
stat: cannot stat ‘./a?b’: No such file or directory
stat: cannot stat ‘./a?b’: No such file or directory

Now let's use an actual glob:

$ for f in *; do stat --format='%n' "./$f"; done
./a b
./a  b
./a b
./a b
./a b
./a
b

With bash

The above example was with my normal shell, zsh. When I repeat the procedure with bash, I get a completely different set of results with your example:

Same set of files:

$ for file in *; do printf '%s' "$file" | xxd; done
0000000: 6120 62                                  a b
0000000: 6120 2062                                a  b
0000000: 61e2 8082 62                             a...b
0000000: 61e2 8083 62                             a...b
0000000: 6109 62                                  a.b
0000000: 610a 62                                  a.b

Radically different results with your code:

$ for f in $(ls -1q | tr " " "?") ; do stat --format='%n' "./$f"; done
./a b
./a b
./a b
./a b
./a
b
./a  b
./a b
./a b
./a b
./a b
./a b
./a b
./a
b
./a b
./a b
./a b
./a b
./a
b

With a shell glob, it works perfectly fine:

$ for f in *; do stat --format='%n' "./$f"; done
./a b
./a  b
./a b
./a b
./a b
./a
b

The reason bash behaves this way goes back to one of the points I made at the beginning of the answer: "The file glob might match more than one file".

ls is returning the same glob (a?b) for several files, so each time we expand this glob, we get every single file that matches it.
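The over-matching can be reproduced with Python's `fnmatch` module, used here as a stand-in for shell globbing (its rules are similar to, but not identical with, those of bash). The list mirrors the six test filenames, written with Python escapes.

```python
import fnmatch

names = ["a b", "a  b", "a\u2002b", "a\u2003b", "a\tb", "a\nb"]

# '?' matches any single character, so one pattern hits many names.
matches = [n for n in names if fnmatch.fnmatchcase(n, "a?b")]

for n in matches:
    print(repr(n))
print(len(matches), "of", len(names), "names match 'a?b'")
```

Every name with exactly one character between `a` and `b` matches, so expanding `a?b` once per line of ls output repeats the whole group each time.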


How to recreate the list of files I was using:

touch 'a b' 'a  b' a$'\xe2\x80\x82'b a$'\xe2\x80\x83'b a$'\t'b a$'\n'b

The hex-coded ones are the UTF-8 encodings of U+2002 (EN SPACE) and U+2003 (EM SPACE).


The output of ls -q isn't a glob at all. It uses ? to mean "There is a character here that can't be displayed directly". Globs use ? to mean "Any character is allowed here".

Globs have other special characters (* and [] at least, and inside the [] pair there are more). None of those are escaped by ls -q.

$ touch x '[x]'
$ ls -1q
[x]
x

If you treat the ls -1q output as a set of globs and expand them, not only will you get x twice, you'll miss [x] completely. As a glob, it doesn't match itself as a string.
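A short Python sketch of the same point, again using `fnmatch` as an approximation of shell glob matching. `glob.escape` is shown only to illustrate what a name would need to look like before it could safely be used as a pattern.

```python
import fnmatch
import glob

name = "[x]"

# As a pattern, "[x]" is a bracket expression that matches the single character x.
print(fnmatch.fnmatchcase("x", name))      # True
print(fnmatch.fnmatchcase(name, name))     # False: it does not match itself

# Escaping the glob metacharacters gives a pattern that matches the literal name.
escaped = glob.escape(name)
print(escaped)                             # [[]x]
print(fnmatch.fnmatchcase(name, escaped))  # True
```

ls -q performs no such escaping, which is why its output cannot be fed back to the shell as a list of patterns.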

ls -q is meant to save your eyes and/or terminal from crazy characters, not to produce something that you can feed back to the shell.
