Concatenate multiple files with same header

Another solution, similar to "cat+grep" from above, using tail and head:

  1. Write the header of the first file into the output:

    head -2 file1.txt > all.txt
    

    -- head -2 prints the first 2 lines of the file; head -n 2 is the portable spelling.

  2. Add the content of all the files:

    tail -n +3 -q file*.txt >> all.txt
    

    -- -n +3 makes tail print from the 3rd line to the end; -q suppresses the file-name banner that tail otherwise prints before each file when given several (see man tail); >> appends to the output file, whereas > would overwrite it.

You can also put both commands on one line:

head -2 file1.txt > all.txt; tail -n +3 -q file*.txt >> all.txt

or put && between them instead of ; so that tail runs only if head succeeds.
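The two steps can be tried end to end with the minimal sketch below. The sample file contents and the temporary directory are invented for the demonstration, and it assumes a tail that accepts -q and several file operands (GNU and BSD tail do).

```shell
#!/bin/sh
# Sketch: merge files that share a fixed two-line header, keeping it once.
# The sample data below is made up for illustration.
set -e
workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' '<header>h1' '<header>h2' 'A' 'B' > file1.txt
printf '%s\n' '<header>h1' '<header>h2' 'C' 'D' > file2.txt

# Header from the first file, then everything after line 2 of every file.
# all.txt does not match file*.txt, so it is never read as input.
head -n 2 file1.txt > all.txt && tail -q -n +3 file*.txt >> all.txt

cat all.txt
```

With this sample data, all.txt should hold the two header lines once, followed by A, B, C and D. Note that this approach relies on every file having the same number of header lines.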


If you know how to do it in R, then by all means do it in R. With classical Unix tools, this is most naturally done in awk.

awk '
    # On the first line of every file except the first, skip leading header lines.
    # Exit if getline reaches the end of the input, to avoid looping forever.
    FNR==1 && NR!=1 { while (/^<header>/) if ((getline) <= 0) exit }
    1 {print}
' file*.txt >all.txt

The first rule of the awk script matches the first line of a file (FNR==1) unless it is also the first line across all files (NR==1). When these conditions are met, the while loop runs: as long as the current line matches the regexp ^<header>, getline discards it and reads the next line. The check on getline's return value makes awk exit if the input ends while it is still skipping; without it, a file consisting only of header lines at the end of the list would cause an infinite loop. The second rule prints every line that was not skipped.
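One advantage of this approach is that the files need not have the same number of header lines. The sketch below runs a variant of the script (with a guard on getline's return value) on two made-up files, the second of which has only one header line; file names and contents are assumptions for the demonstration.

```shell
#!/bin/sh
# Sketch: awk-based merge where the header length differs between files.
# The sample data below is made up for illustration.
set -e
workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' '<header>h1' '<header>h2' 'A' 'B' > file1.txt
printf '%s\n' '<header>h1' 'C' 'D' > file2.txt   # only one header line

awk '
    # Skip leading header lines of every file except the first one.
    FNR==1 && NR!=1 { while (/^<header>/) if ((getline) <= 0) exit }
    # Print whatever line is current after the skipping.
    { print }
' file1.txt file2.txt > all.txt

cat all.txt
```

The header lines of file1.txt are kept because the first rule never fires when NR==1; the single header line of file2.txt is dropped.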


Try this:

$ cat file1.txt; grep -v "^<header" file2.txt
<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
A
B 
C
D
E 
F

NOTE

  • the -v flag inverts the match, so grep prints only the lines that do not match
  • ^ in a regex anchors the match at the beginning of the line
  • if you have a bunch of files, you can do:

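array=( files*.txt )
{ cat "${array[@]:0:1}"; grep -hv "^<header" "${array[@]:1}"; } > new_file.txt

This uses bash array slicing: ${array[@]:0:1} expands to the first file and ${array[@]:1} to all the remaining ones. The quotes keep file names containing spaces intact, and -h stops grep from prefixing each line with the file name when it is given more than one file.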
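If bash is not available, the same idea can be sketched in plain sh by using the positional parameters in place of the array. The sample files and the variable name first are assumptions for the demonstration, and -h is assumed to be supported by the local grep (GNU and BSD grep both have it).

```shell
#!/bin/sh
# Sketch: POSIX sh counterpart of the bash array slice.
# The sample data below is made up for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' '<header>h1' 'A' 'B' > files1.txt
printf '%s\n' '<header>h1' 'C' > files2.txt
printf '%s\n' '<header>h1' 'D' > files3.txt

set -- files*.txt      # "$@" now plays the role of the array
first=$1               # counterpart of ${array[@]:0:1}
shift                  # "$@" is now the counterpart of ${array[@]:1}

{
    cat "$first"
    # Run grep only if files remain; with no file operands it would read stdin.
    if [ "$#" -gt 0 ]; then
        grep -hv '^<header' "$@"
    fi
} > new_file.txt

cat new_file.txt
```

The output name new_file.txt does not match files*.txt, so the glob cannot pick up the file being written.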