Is there a convenient way to classify files as "binary" or "text"?

If you ask file for just the MIME type, you'll get many different ones, such as text/x-shellscript or application/x-executable, but checking only the part before the slash for "text" should give good results. For example (-b omits the filename from the output):

file -b --mime-type filename | sed 's|/.*||'
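The same check can be wrapped in a small function. This is only a sketch: the name is_text is made up for illustration, and it assumes a file implementation that supports --mime-type. Some textual formats may be reported under application/ rather than text/ depending on the version of file, so treat the result as a heuristic.

```shell
# Hypothetical helper: succeeds if file(1) reports a text/* MIME type for $1.
is_text() {
  case $(file -b --mime-type -- "$1") in
    text/*) return 0 ;;   # text/plain, text/x-shellscript, ...
    *)      return 1 ;;   # application/*, image/*, inode/*, ...
  esac
}
```

It can then be used as: is_text filename && echo "looks like text"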

Another approach would be to use isutf8 from the moreutils collection.

It exits with 0 if the file is valid UTF-8 (which includes plain ASCII). Otherwise it stops at the first invalid sequence, prints an error message (silence it with -q) and exits with 1.
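As a rough sketch of how the exit status could be used, the following function prints the names of the regular files among its arguments that are not valid UTF-8. The name list_non_utf8 is hypothetical, and the code assumes isutf8 from moreutils is installed.

```shell
# Hypothetical helper: print each argument that is a regular file
# and is not valid UTF-8 according to isutf8.
list_non_utf8() {
  for f in "$@"; do
    [ -f "$f" ] || continue                 # skip directories, devices, etc.
    isutf8 -q "$f" || printf '%s\n' "$f"    # -q: rely on the exit status only
  done
}
```

For example: list_non_utf8 ./*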


If you like the heuristic used by GNU grep, you could use it:

isbinary() {
  LC_MESSAGES=C grep -Hm1 '^' < "${1-$REPLY}" | grep -q '^Binary'
}

It searches for NUL bytes in the first buffer read from the file (a few kilobytes for a regular file, but possibly a lot less for a pipe, a socket or some devices like /dev/random). In UTF-8 locales, it also flags byte sequences that don't form valid UTF-8 characters. The function assumes that LC_ALL is either unset or set to an English-language locale, because LC_ALL takes precedence over LC_MESSAGES=C and the "Binary" message would otherwise be translated.
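The function above depends on the wording of grep's message, which as far as I know differs in newer GNU grep releases (where it is reported on standard error). As a hedged alternative, the NUL-byte part of the heuristic can be approximated with standard tools. In this sketch the name has_nul and the 32 KiB window are my own choices, not grep's actual values, and the check for invalid UTF-8 is not reproduced. It reads the file twice, so it is meant for regular files only.

```shell
# Hypothetical helper: succeeds if a NUL byte occurs in the first 32 KiB
# of the file named by $1. The body runs in a subshell so that the
# variables do not leak into the caller's environment.
has_nul() (
  total=$(head -c 32768 < "$1" | wc -c)
  kept=$(head -c 32768 < "$1" | tr -d '\000' | wc -c)
  # Arithmetic expansion strips the padding some wc implementations add.
  [ "$((total))" -ne "$((kept))" ]
)
```

If any byte was removed by tr, the two counts differ and the file is treated as binary. An empty file yields two zero counts and is therefore not flagged.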

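The ${1-$REPLY} form falls back to $REPLY when no argument is given. That is where zsh puts the candidate file name, so the function can also be used as a zsh glob qualifier: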

ls -ld -- *(.+isbinary)

would list the binary files in the current directory (the . qualifier restricts the match to regular files).
