Copy files to local from multiple directories in HDFS for the last 24 hours

You can make this simpler by using "find" in combination with "cp"; -mtime -1 restricts the match to files modified within the last 24 hours, for example:

find /path/to/directory/ -type f -name "*.csv" -mtime -1 | xargs cp -t /path/to/copy

If you want to clean the directory of files older than 24 hours, you can use the following (note that -mtime counts whole 24-hour periods, so +0 matches anything at least 24 hours old, whereas +1 would only match files older than two days):

find /path/to/files/ -type f -name "*.csv" -mtime +0 | xargs rm -f

You could put both commands in a script and schedule it as a cron job.
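A minimal sketch, assuming the paths above; the script name and the hourly schedule are placeholders, and -print0/-0 just makes the pipe safe for file names containing spaces:

#!/bin/sh
# copy_recent_csv.sh (hypothetical name): copy CSVs modified in the last
# 24 hours, then remove copies that are older than 24 hours.
find /path/to/directory/ -type f -name "*.csv" -mtime -1 -print0 \
    | xargs -0 cp -t /path/to/copy
find /path/to/files/ -type f -name "*.csv" -mtime +0 -print0 \
    | xargs -0 rm -f

and a crontab entry to run it every hour, for example:

0 * * * * /path/to/copy_recent_csv.sh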


Note: I was unable to test this, but you can verify it step by step by looking at the output of each stage:

Normally I would say never parse the output of ls, but with Hadoop you have little choice here, as there is no real equivalent of find. (Since 2.7.0 there is a find command, but according to the documentation it is very limited.)
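For completeness, that built-in find only matches on names (it has no mtime test), so the most it could do here is something along these lines:

$ hadoop fs -find /path/to/folder/ -name '*.csv' -print

which is why the time filtering below is done by parsing the ls output instead.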

Step 1: recursive ls

$ hadoop fs -ls -R /path/to/folder/
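The listing has one entry per line; purely as an illustration (owner, group, sizes and names here are made up), a file line and a directory line look roughly like this:

-rw-r--r--   3 hdfs supergroup      12345 2017-06-01 09:30 /path/to/folder/sub/data.csv
drwxr-xr-x   - hdfs supergroup          0 2017-06-01 09:30 /path/to/folder/sub

so the permissions are field 1, the date and time are fields 6 and 7, and the full path is the last field.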

Step 2: use awk to select only regular files and only CSV files
Directories are recognized by their permission string starting with d, so we have to exclude those, and the CSV files are recognized by the last field ending in ".csv":

$ hadoop fs -ls -R /path/to/folder/ | awk '!/^d/ && /\.csv$/'

Make sure you do not end up with odd lines here that are empty or contain just a directory name.
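If you want to be defensive about that, one option is to additionally require a complete 8-field line, which drops empty lines and any "Found N items" style summary lines:

$ hadoop fs -ls -R /path/to/folder/ | awk 'NF >= 8 && !/^d/ && /\.csv$/'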

Step 3: continue using awk to process the timestamp. I assume a standard awk, so I will not use GNU extensions. Hadoop prints the modification time as yyyy-MM-dd HH:mm. This format sorts lexicographically, so a plain string comparison works, and it sits in fields 6 and 7:

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff)'
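You can check the cutoff string on its own first; note that -d '-24 hours' is a GNU date extension (on BSD/macOS the rough equivalent is date -v-24H '+%F %H:%M'):

$ date -d '-24 hours' '+%F %H:%M'

This prints a value in the same yyyy-MM-dd HH:mm format as fields 6 and 7, which is what makes the string comparison valid.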

Step 4: Copy files one by one:

First, check the command you are going to execute:

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) {
            print "migrating", $NF
            cmd="hadoop fs -get "$NF" /path/to/local/"
            print cmd
            # system(cmd)
         }'

(remove the # in front of system(cmd) if you want to execute it)

or

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) {
            print $NF
         }' | xargs -I{} echo hadoop fs -get '{}' /path/to/local/

(remove the echo if you want to execute it)
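If you go with this second form, a cron-friendly wrapper is essentially the same pipeline in a small script; this is only a sketch, with placeholder paths and a hypothetical script name:

#!/bin/sh
# pull_recent_csv.sh (hypothetical name): fetch CSV files modified in the
# last 24 hours from HDFS into a local directory.
HDFS_DIR=/path/to/folder/
LOCAL_DIR=/path/to/local/
cutoff="$(date -d '-24 hours' '+%F %H:%M')"
hadoop fs -ls -R "$HDFS_DIR" \
   | awk -v cutoff="$cutoff" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) { print $NF }' \
   | xargs -I{} hadoop fs -get '{}' "$LOCAL_DIR"

which you can then schedule with cron as mentioned above.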

Tags: bash, hadoop, hdfs