Command line tool to "cat" pairwise expansion of all rows in a file

Here's how to do it in awk so that it doesn't have to store the whole file in an array. This is basically the same algorithm as terdon's.

If you like, you can even give it multiple filenames on the command line and it will process each file independently, concatenating the results together.

#!/usr/bin/awk -f

# Cartesian product of records

{
    # For every input record, re-read the current file from the top
    # and print the record paired with each of its lines.
    file = FILENAME
    while ((getline line <file) > 0)
        print $0, line
    close(file)
}
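To run it, either make the script executable or hand it to awk with -f; the pairs.awk and sample file names below are just placeholders:

awk -f pairs.awk sample.txt              # pairwise expansion of one file
awk -f pairs.awk sample.txt other.txt    # each file paired with itself, results concatenated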

On my system, this runs in about 2/3 the time of terdon's perl solution.


I'm not sure this is better than doing it in memory, but with a sed that reads out its infile for every line in its infile, and another on the other side of a pipe alternating hold space with input lines...

cat <<\IN >/tmp/tmp
Row1,10
Row2,20
Row3,30
Row4,40
IN

</tmp/tmp sed -e 'i\
' -e 'r /tmp/tmp' | 
sed -n '/./!n;h;N;/\n$/D;G;s/\n/ /;P;D'

OUTPUT

Row1,10 Row1,10
Row1,10 Row2,20
Row1,10 Row3,30
Row1,10 Row4,40
Row2,20 Row1,10
Row2,20 Row2,20
Row2,20 Row3,30
Row2,20 Row4,40
Row3,30 Row1,10
Row3,30 Row2,20
Row3,30 Row3,30
Row3,30 Row4,40
Row4,40 Row1,10
Row4,40 Row2,20
Row4,40 Row3,30
Row4,40 Row4,40
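For the record, the first sed emits, for every input line, an inserted blank line, the line itself, and then a full copy of the file (the r command). Spread out with comments, my reading of the second sed script is roughly this (the one-liner above is what actually runs):

sed -n '
# on a blank separator line, fetch the next "outer" row
/./!n
# keep a copy of that outer row in hold space
h
# append the next line from the r-appended copy of the file
N
# if the appended line was blank, this outer row is finished - start over
/\n$/D
# tack the held outer row back on: outer\ninner\nouter
G
# turn the first newline into a space: "outer inner\nouter"
s/\n/ /
# print the pair up to the newline
P
# drop the printed pair and loop, with the outer row still in pattern space
D
'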

I did this another way. It does store a little in memory - it stores a string like:

"$1" -

... for each line in the file.

pairs(){ [ -e "$1" ] || return
    # Build a string of '"$1" - ' repeated once per line in the file,
    # adjusting the count for shells that split the zero-string differently.
    set -- "$1" "$(IFS=0 n=
        case "${0%sh*}" in (ya|*s) n=-1;; (mk|po) n=+1;;esac
        printf '"$1" - %s' $(printf "%.$(($(wc -l <"$1")$n))d" 0))"
    # cat the file once per line into a pipe; paste merges it with the file on the other side.
    eval "cat -- $2 </dev/null | paste -d ' \n' -- $2"
}

It is very fast. It cats the file to a pipe as many times as there are lines in the file. On the other side of the pipe, that input is merged with the file itself as many times as there are lines in the file.
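For the four-line /tmp/tmp above, the eval'd command line works out to roughly this (an illustrative expansion, not a literal trace):

cat -- /tmp/tmp - /tmp/tmp - /tmp/tmp - /tmp/tmp - </dev/null |
paste -d ' \n' -- /tmp/tmp - /tmp/tmp - /tmp/tmp - /tmp/tmp -

On the cat side every - reads the empty /dev/null, so the pipe just carries four copies of the file; on the paste side each - takes successive lines from that pipe, and the ' \n' delimiter list alternates space and newline, so every file line ends up joined to every piped line.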

The case stuff is just for portability - yash and zsh both add one element to the split, while mksh and posh both lose one. ksh, dash, busybox, and bash all split out to exactly as many fields as there are zeroes printed by printf. As written, the above renders the same results for every one of the above-mentioned shells on my machine.

If the file is very long, there may be ARG_MAX issues with too many arguments, in which case you would need to introduce xargs or similar as well.
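The limit itself can be checked with getconf:

getconf ARG_MAX    # max bytes of arguments plus environment for exec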

Given the same input I used before the output is identical. But, if I were to go bigger...

seq 10 10 10000 | nl -s, >/tmp/tmp

That generates a file almost identical to what I used before (sans 'Row') - but at 1000 lines. You can see for yourself how fast it is:

time pairs /tmp/tmp |wc -l

1000000
pairs /tmp/tmp  0.20s user 0.07s system 110% cpu 0.239 total
wc -l  0.05s user 0.03s system 32% cpu 0.238 total

At 1000 lines there is some slight variation in performance between shells - bash is invariably the slowest - but because the only work they do anyway is generate the arg string (1000 copies of filename -), the effect is minimal. The difference in performance between zsh - as above - and bash is a hundredth of a second here.

Here's another version that should work for a file of any length:

pairs2()( [ -e "$1" ] || exit
    # rpt: print arg 2, arg-1 times, one per line (n is reset to 0 at each call)
    rpt() until [ "$((n+=1))" -gt "$1" ]
          do printf %s\\n "$2"
          done
    [ -n "${1##*/*}" ] || cd -P -- "${1%/*}" || exit
    # $2: a semi-random /tmp symlink to the file; $3: the file's line count
    : & set -- "$1" "/tmp/pairs$!.ln" "$(wc -l <"$1")"
    ln -s "$PWD/${1##*/}" "$2" || exit
    # cat the link $3 times into a pipe saved as fd 3; sed prints each line $3 times; paste pairs the two
    n=0 rpt "$3" "$2" | xargs cat | { exec 3<&0
    n=0 rpt "$3" p | sed -nf - "$2" | paste - /dev/fd/3
    }; rm "$2"
)

It creates a soft-link to its first arg in /tmp with a semi-random name so that it won't get hung up on weird filenames. That's important because cat's args are fed to it over a pipe via xargs. cat's output is saved on file descriptor 3 while sed prints every line of the first arg as many times as there are lines in that file - its script is also fed to it via a pipe. Again paste merges its input, but this time it takes only two arguments: - for its standard input and /dev/fd/3 for the saved descriptor.

That last - the /dev/fd/[num] link - should work on any Linux system and many more besides, but if it doesn't, creating a named pipe with mkfifo and using that instead should work as well.
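A rough sketch of that fallback - it assumes pairs2's rpt helper and its $2 (the link) and $3 (the line count) are still in scope, so it is only meant to show the shape of the thing:

fifo=/tmp/pairs$$.fifo                       # hypothetical fifo name
mkfifo "$fifo" || exit
n=0 rpt "$3" "$2" | xargs cat >"$fifo" &     # the file, $3 times over, into the fifo
n=0 rpt "$3" p | sed -nf - "$2" | paste - "$fifo"
rm -f "$fifo"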

The last thing pairs2 does is rm the soft-link it created before exiting.

This version is actually faster still on my system. I guess that is because, though it execs more applications, it starts handing them their arguments immediately - whereas before it stacked them all up first.

time pairs2 /tmp/tmp | wc -l

1000000
pairs2 /tmp/tmp  0.30s user 0.09s system 178% cpu 0.218 total
wc -l  0.03s user 0.02s system 26% cpu 0.218 total

Well, you could always do it in your shell:

while read i; do 
    while read k; do echo "$i $k"; done < sample.txt 
done < sample.txt 

It is a good deal slower than your awk solution (on my machine, it took ~11 seconds for 1000 lines, versus ~0.3 seconds in awk) but at least it never holds more than a couple of lines in memory.

The loop above works for the very simple data you have in your example. It will choke on backslashes and it will eat trailing and leading spaces. A more robust version of the same thing is:

while IFS= read -r i; do 
    while IFS= read -r k; do printf "%s %s\n" "$i" "$k"; done < sample.txt 
done < sample.txt 
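To see what the simple loop does wrong, feed both versions a line with leading blanks and a backslash (a quick illustration):

printf ' a\\b \n' | { read i; printf '%s\n' "[$i]"; }            # prints [ab]: backslash gone, spaces trimmed
printf ' a\\b \n' | { IFS= read -r i; printf '%s\n' "[$i]"; }    # prints [ a\b ]: the line survives intact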

Another choice is to use perl instead:

perl -lne '$line1=$_; open(A,"sample.txt"); 
           while($line2=<A>){printf "$line1 $line2"} close(A)' sample.txt

The script above will read each line of the input file (-ln), save it as $line1, open sample.txt again, and print each line along with $line1. The result is all pairwise combinations while only 2 lines are ever stored in memory. On my system, that took only about 0.6 seconds on 1000 lines.