rsync to get a list of only file names

After years of work, here is my solution to this age-old problem:

DIR=$(mktemp -d /tmp/rsync.XXXXXX)   # throwaway destination; nothing is copied because of -n
rsync -nr --out-format='%n' serveripaddress::pt/dir/files/ "$DIR" > output.txt
rmdir "$DIR"

Hoping the question will be moved to the appropriate site, I'll answer here nevertheless.

You could pipe the output through awk:

rsync ... | awk '{ $1=$2=$3=$4=""; print substr($0,5); }' >output.txt

This eliminates all the unwanted information by printing everything from the fifth field onward, but it works only if none of the first four fields in the output format contains additional whitespace (which is unlikely).
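For reference, one line of rsync's listing output looks roughly like this (the exact spacing and size formatting depend on the rsync version; the line shown here is invented for illustration), with the file name starting in the fifth whitespace-separated column:

-rw-r--r--        1234 2014/03/01 12:34:56 some file.txt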

This awk solution won't work if there are file names starting with whitespace.

An even more robust solution requires a rather complex program, which still has to make a few assumptions of its own.

It works this way: For each line,

  • Cut off the first 10 bytes. Verify that they are followed by a number of spaces. Cut them off as well.
  • Cut off all following digits. Verify that they are followed by one space. Cut that off as well.
  • Cut off the next 19 bytes. Verify that they contain a date and a time stamp in the appropriate format. (I don't know why the date's components are separated with / instead of -; that is not compliant with ISO 8601.)
  • Verify that now one space follows. Cut that off as well. Leave any following whitespace characters intact, as they belong to the file name.
  • If the line has passed all these verifications, it is likely that the remainder of the line is the file name.

It gets even worse: for very esoteric corner cases, there are even more things to watch: file names can be escaped. Certain unprintable bytes are replaced by an escape sequence (\#ooo, with ooo being their octal code; a newline in a file name, for example, shows up as \#012), and this escaping must be reversed.
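To make that escaping concrete, here is a minimal stand-alone sketch of the reversal (it performs the same substitution as the script further below; the sample name is invented for illustration):

import re

# Reverse rsync's escaping: turn each \#ooo back into the byte with octal code ooo.
quoted_re = re.compile(r'\\#(\d\d\d)')

def unescape(name):
    return quoted_re.sub(lambda m: chr(int(m.group(1), 8)), name)

print(repr(unescape(r'two\#012lines')))   # prints 'two\nlines'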

Thus, neither awk nor a simple sed script will do here if we want to do it properly.

Instead, the following Python script can be used:

def rsync_list(fileobj):
    import re
    # Regex for one listing line: 10-char permissions, spaces, size, date, time, file name
    line_re = re.compile(r'.{10} +\d+ ..../../.. ..:..:.. (.*)\n')
    # Regex for the \#ooo escape sequences
    quoted_re = re.compile(r'\\#(\d\d\d)')
    for line in fileobj:
        match = line_re.match(line)
        assert match, repr(line) # error if not found...
        quoted_fname = match.group(1) # the filename part ...
        # ... must be unquoted:
        fname = quoted_re.sub( # Substitute the matching part...
            lambda m: chr(int(m.group(1), 8)), # ... with the result of this function ...
            quoted_fname)                      # ... while looking at this string.
        yield fname

if __name__ == '__main__':
    import sys
    for fname in rsync_list(sys.stdin):
        #import os
        #print repr(fname), os.access(fname, os.F_OK)
        #print repr(fname)
        sys.stdout.write(fname + '\0')

This outputs the list of file names separated by NUL characters, similar to the way find -print0 and many other tools work, so that even a file name containing a newline character (which is valid!) is passed through correctly:

rsync . | python rsf.py | xargs -0 stat -c '%i'

correctly shows the inode number of every given file.

Certainly I may have missed one corner case or another, but I think the script handles the vast majority of cases correctly (I tested with all 255 conceivable one-byte file names as well as a file name starting with a space).
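For anyone who wants to repeat that test, here is a minimal sketch of how such a test corpus could be created; the directory name and the snippet itself are assumptions, not part of the answer above:

import os

# Hypothetical scratch directory for the test corpus.
testdir = b'/tmp/rsync-name-test'
os.makedirs(testdir, exist_ok=True)

for byte in range(1, 256):
    name = bytes([byte])
    # '/' cannot occur in a file name, and '.' already names the directory itself.
    if name in (b'/', b'.'):
        continue
    open(os.path.join(testdir, name), 'wb').close()

# One additional name that starts with a space.
open(os.path.join(testdir, b' starts with a space'), 'wb').close()

Feeding that directory through the pipeline above should then show whether every name survives the round trip.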