Randomly draw a certain number of lines from a data file

This might not be the most efficient way, but it works:

shuf <file> > tmp
head -n $m tmp > out1
tail -n +$(( m + 1 )) tmp > out2

Here $m holds the number of lines you want in the first output file.
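Put together with a throwaway temp file, the three steps above might look like this (a sketch; data.txt and the sample values from the question are illustrative):

```shell
#!/bin/sh
# Sample data (the values from the question):
printf '12345\n23456\n67891\n-20000\n200\n600\n20\n' > data.txt

m=4                                     # lines wanted in the first file
tmp=$(mktemp)
shuf data.txt > "$tmp"                  # shuffle every line
head -n "$m" "$tmp" > out1              # first m shuffled lines
tail -n +"$(( m + 1 ))" "$tmp" > out2   # everything after line m
rm -f "$tmp"
```

Together out1 and out2 always contain exactly the original lines, just partitioned at random.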


This bash/awk script chooses m lines at random and preserves the original line order in both output files.

awk -v m=4 -v N=$(wc -l < file) -v out1=/tmp/out1 -v out2=/tmp/out2 '
  BEGIN {
      srand()
      # Draw m distinct line numbers from 1..N
      # (assumes m <= N, otherwise this loop never terminates)
      do {
          lnb = 1 + int(rand() * N)
          if (!(lnb in R)) {
              R[lnb] = 1
              ct++
          }
      } while (ct < m)
  }
  {
      if (R[NR] == 1) print > out1   # selected lines, in original order
      else            print > out2   # everything else
  }' file
cat /tmp/out1
echo ========
cat /tmp/out2

Output, based on the data in the question:

12345
23456
200
600
========
67891
-20000
20

As with all things Unix, There's a Utility for That™.

Program of the day: split
split can divide a file in several ways: -b by bytes, -l by lines, -n into a given number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.

Now, the actual code. It's quite simple, really:

sort -R input_file | split -l $m output_prefix

This will make two files, named output_prefixaa and output_prefixab: one with m lines and one with N - m lines. Make sure m is the larger of the two sizes you want, or you'll get several files of m lines each (plus one with N % m lines).

If you want to ensure that you use the correct size, here's a little code to do that:

m=10                          # size you want one file to be
N=$(wc -l < input_file)       # total line count (note the <, so wc prints only the number)
m=$(( m > N/2 ? m : N - m ))  # use the larger of m and N - m
sort -R input_file | split -l "$m" output_prefix

Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;' for sort -R.