awk: split file by column name and add header row to each file

The solution is to store the header in a separate variable and print it on the first occurrence of each new $1 value (= output file name):

awk -F'|' 'FNR==1{hdr=$0;next} {if (!seen[$1]++) print hdr>$1; print>$1}' a.txt 
  • This stores the entire first line of a.txt in the variable hdr but otherwise leaves that line unprocessed.
  • On all subsequent lines, we first check whether the $1 value (= the desired output filename) has already been encountered, by looking it up in an array seen, which holds an occurrence count for each $1 value. If the counter is still zero for the current $1 value, the header is printed to the file named by $1; the counter is then incremented, which suppresses header output for all later occurrences. The rest you already figured out yourself.
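To make that concrete, here is a self-contained run of the one-liner on hypothetical sample data (the same rows used in the example output further down this answer):

```shell
# Hypothetical sample input mirroring the example data later in this answer
cat > a.txt <<'EOF'
filename|count|age
1.txt|1|15
1.txt|2|14
2.txt|3|1
41.txt|44|1
2.txt|1|3
EOF

# Split by $1, writing the stored header once per new output file
awk -F'|' 'FNR==1{hdr=$0;next} {if (!seen[$1]++) print hdr>$1; print>$1}' a.txt
```

Afterwards, 1.txt, 2.txt, and 41.txt each start with the header line followed by their own data rows, in the original input order.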

Addendum:

If you have more than one input file, which all have a header line, you can simply place them all as arguments to the awk call, as in

awk -F'|' ' ... ' a.txt b.txt c.txt ...

If, however, only the first file has a header line, you would need to change FNR to NR in the first rule.
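A minimal sketch of that NR variant, using two hypothetical input files where only the first one carries a header line:

```shell
# Hypothetical inputs: only a.txt has a header line
printf 'filename|count|age\n1.txt|1|15\n' > a.txt
printf '2.txt|3|1\n' > b.txt

# NR==1 (not FNR==1) skips only the very first line across all files,
# so b.txt's first line is treated as data
awk -F'|' 'NR==1{hdr=$0;next} {if (!seen[$1]++) print hdr>$1; print>$1}' a.txt b.txt
```

With FNR==1 instead, the first data line of b.txt would have been silently consumed as a header.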

Caveat

As noted by Ed Morton, the simple approach only works if the number of different output files is small (max. around 10). GNU awk will keep working, but will slow down because it automatically closes and reopens files in the background as needed; other awk implementations may simply fail with a "too many open files" error.
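One hedged workaround (a sketch, not part of the answer above) is to close each output file immediately after every write and reopen it in append mode, which bounds the descriptor count at the cost of extra open/close overhead. This should work in any POSIX awk:

```shell
# Hypothetical sample input
cat > a.txt <<'EOF'
filename|count|age
1.txt|1|15
1.txt|2|14
2.txt|3|1
EOF

# Close after every write; ">>" reopens in append mode, so nothing is lost.
# Only one output file descriptor is ever open at a time.
awk -F'|' '
    FNR==1 { hdr=$0; next }
    {
        if (!seen[$1]++) { print hdr > $1; close($1) }
        print >> $1
        close($1)
    }' a.txt
```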


This will work robustly and efficiently using any awk, sort, and cut:

$ cat tst.sh
#!/usr/bin/env bash

awk 'BEGIN{FS=OFS="|"} {print (NR>1), $1, NR, $0}' "$@" |
sort -t'|' -k1,1n -k2,2 -k3,3n |
cut -d'|' -f4- |
awk '
    BEGIN { FS=OFS="|" }
    NR == 1 { hdr = $0; next }
    $1 != prev {
        close(prev)
        print hdr " > " $1
        prev = $1
    }
    { print $0 " > " $1 }
'

$ ./tst.sh a.txt
filename|count|age > 1.txt
1.txt|1|15 > 1.txt
1.txt|2|14 > 1.txt
filename|count|age > 2.txt
2.txt|3|1 > 2.txt
2.txt|1|3 > 2.txt
filename|count|age > 41.txt
41.txt|44|1 > 41.txt

When you're done testing, change " > " to just > to actually create the output files.
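For reference, here is the same pipeline with that change applied, run inline on the hypothetical sample data rather than via "$@" (a sketch of the file-writing version, not additional functionality):

```shell
# Hypothetical sample input matching the example output above
cat > a.txt <<'EOF'
filename|count|age
1.txt|1|15
1.txt|2|14
2.txt|3|1
41.txt|44|1
2.txt|1|3
EOF

awk 'BEGIN{FS=OFS="|"} {print (NR>1), $1, NR, $0}' a.txt |
sort -t'|' -k1,1n -k2,2 -k3,3n |
cut -d'|' -f4- |
awk '
    BEGIN { FS=OFS="|" }
    NR == 1 { hdr = $0; next }
    $1 != prev {
        if (prev != "") close(prev)   # guard added so we never close an unopened file
        print hdr > $1
        prev = $1
    }
    { print > $1 }
'
```

After the run, 1.txt, 2.txt, and 41.txt each contain the header plus their rows in original input order, and at most one output file was open at any time.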

The leading awk|sort|cut stage groups all of the input lines by output file name ($1), so the final awk processes the content for one output file at a time and therefore has only one output file open at a time. That avoids the "too many open files" failure that non-gawk implementations hit once a dozen or so output files have been created, and the slowdown gawk incurs from juggling the opening and closing of output files.

Here's what happens at each of the earlier stages, which set up the data so that the final awk script can process it while keeping only one output file open at a time and preserving the original input order per output file name:

$ awk 'BEGIN{FS=OFS="|"} {print (NR>1), $1, NR, $0}' a.txt
0|filename|1|filename|count|age
1|1.txt|2|1.txt|1|15
1|1.txt|3|1.txt|2|14
1|2.txt|4|2.txt|3|1
1|41.txt|5|41.txt|44|1
1|2.txt|6|2.txt|1|3

$ awk 'BEGIN{FS=OFS="|"} {print (NR>1), $1, NR, $0}' a.txt |
    sort -t'|' -k1,1n -k2,2 -k3,3n
0|filename|1|filename|count|age
1|1.txt|2|1.txt|1|15
1|1.txt|3|1.txt|2|14
1|2.txt|4|2.txt|3|1
1|2.txt|6|2.txt|1|3
1|41.txt|5|41.txt|44|1

$ awk 'BEGIN{FS=OFS="|"} {print (NR>1), $1, NR, $0}' a.txt |
    sort -t'|' -k1,1n -k2,2 -k3,3n |
    cut -d'|' -f4-
filename|count|age
1.txt|1|15
1.txt|2|14
2.txt|3|1
2.txt|1|3
41.txt|44|1