How to split a file by using keyword boundaries

You can use awk for the job:

$ curl -O https://raw.githubusercontent.com/qtproject/qt-mobility\
/d7f10927176b8c3603efaaceb721b00af5e8605b/demos/qmlcontacts/contents/\
example.vcf

$ gawk ' /BEGIN:VCARD/ { close(fn); ++a; fn=sprintf("card_%02d.vcf", a); 
        print "Writing: ", fn } { print $0 > fn; } ' example.vcf
Writing:  card_01.vcf
Writing:  card_02.vcf
Writing:  card_03.vcf
Writing:  card_04.vcf
Writing:  card_05.vcf
Writing:  card_06.vcf
Writing:  card_07.vcf
Writing:  card_08.vcf
Writing:  card_09.vcf

$ cat card_0* > all.vcf
$ cmp example.vcf all.vcf
$ echo $?
0

Details

The awk line works like this: a is a counter that is incremented on each BEGIN:VCARD line; at the same time the output filename is constructed with sprintf and stored in fn. Every input line, including the BEGIN:VCARD line itself, is then written to the current file (named fn) via print $0 > fn.

The final echo $? prints 0, which means that cmp found no differences, i.e. the concatenation of all the single files is identical to the original example.vcf.

Note that output redirection in awk works differently than in the shell. With > fn, awk first checks whether the file is already open. If it is, awk appends to it; if it is not, awk opens and truncates it. So the file is truncated only once, on the first write, and not on every print.
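To make the difference concrete, here is a small sketch that writes the same three lines once from a shell loop and once from a single awk program. The temporary directory and the file names (in.txt, shell_out.txt, awk_out.txt) are made up for the demo and are not part of the original answer.

```shell
# Demo files live in a throwaway directory (hypothetical names).
dir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$dir/in.txt"

# Shell: each iteration reopens and truncates the file,
# so only the last line survives.
while IFS= read -r line; do
    printf '%s\n' "$line" > "$dir/shell_out.txt"
done < "$dir/in.txt"

# awk: the file is truncated on the first write only; later writes go
# to the already open file, so all lines survive.
awk -v out="$dir/awk_out.txt" '{ print $0 > out }' "$dir/in.txt"

shell_lines=$(wc -l < "$dir/shell_out.txt" | tr -d ' ')
awk_lines=$(wc -l < "$dir/awk_out.txt" | tr -d ' ')
echo "shell: $shell_lines line(s), awk: $awk_lines line(s)"

rm -rf "$dir"
```

The shell loop should leave a single line behind, while the awk output should keep all three.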

Because of this redirection logic we have to explicitly close the implicitly opened files, since otherwise the call would hit the open file limit in cases where the input file contains many records.
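A quick way to check the close() logic is to feed awk more records than a typical open file limit allows. The sketch below generates 2000 synthetic records (made-up data, and the assumption is that the limit is lower than that, see ulimit -n) and splits them with a slightly more defensive variant of the one-liner: it only calls close() once a file has been opened and skips any lines that appear before the first BEGIN:VCARD. The prefix variable and the four-digit numbering are choices for this demo only.

```shell
dir=$(mktemp -d)

# Generate a synthetic input with 2000 tiny records (hypothetical data).
awk 'BEGIN {
    for (i = 1; i <= 2000; i++)
        printf "BEGIN:VCARD\nFN:Person %d\nEND:VCARD\n", i
}' > "$dir/many.vcf"

awk -v prefix="$dir/card_" '
    /BEGIN:VCARD/ {
        if (fn != "") close(fn)          # release the previous output file
        fn = sprintf("%s%04d.vcf", prefix, ++n)
    }
    fn != "" { print $0 > fn }           # ignore lines before the first record
' "$dir/many.vcf"

# Count the pieces and verify that they concatenate back to the input.
file_count=$(ls "$dir" | grep -c '^card_')
if cat "$dir"/card_*.vcf | cmp -s - "$dir/many.vcf"; then
    roundtrip=ok
else
    roundtrip=differs
fi
echo "$file_count files, round trip: $roundtrip"

rm -rf "$dir"
```

Without the close() call, some awk implementations stop with a "too many open files" style error on input like this, while others work around the limit internally, so closing explicitly is the portable choice.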


Alternatively, with GNU csplit:

csplit -f vcard input.txt -z '/END:VCARD/+1' '{*}'
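For readers who want to see what the csplit options do, here is a minimal sketch on a two-record sample. It assumes GNU csplit (the -z option and the '{*}' repeat count are GNU extensions), the sample data is made up, and -s is added only to silence the byte counts that csplit normally prints.

```shell
dir=$(mktemp -d)
printf 'BEGIN:VCARD\nFN:Alice\nEND:VCARD\nBEGIN:VCARD\nFN:Bob\nEND:VCARD\n' \
    > "$dir/input.txt"

# -f PREFIX          name the pieces PREFIX00, PREFIX01, ...
# -z                 remove empty output files
# '/END:VCARD/+1'    split at the line after each END:VCARD
# '{*}'              repeat the previous pattern as often as possible
csplit -s -z -f "$dir/vcard" "$dir/input.txt" '/END:VCARD/+1' '{*}'

pieces=$(ls "$dir" | grep -c '^vcard')
first=$(sed -n 2p "$dir/vcard00")
second=$(sed -n 2p "$dir/vcard01")
echo "$pieces pieces: $first / $second"

rm -rf "$dir"
```

Because the split point is one line after END:VCARD, each piece ends with its own END:VCARD line, and -z drops the empty piece that would otherwise follow the last record.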

Using GNU Parallel you can do:

cat foo.vcf | parallel --pipe -N1 --recstart BEGIN:VCARD 'cat >{#}'

Or if you can refute http://oletange.blogspot.com/2013/10/useless-use-of-cat.html you can use this instead:

< foo.vcf parallel --pipe -N1 --recstart BEGIN:VCARD 'cat >{#}'

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

10-second installation:

$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
   fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 67bd7bc7dc20aff99eb8f1266574dadb
12345678 67bd7bc7 dc20aff9 9eb8f126 6574dadb
$ md5sum install.sh | grep b7a15cdbb07fb6e11b0338577bc1780f
b7a15cdb b07fb6e1 1b033857 7bc1780f
$ sha512sum install.sh | grep 186000b62b66969d7506ca4f885e0c80e02a22444
6f25960b d4b90cf6 ba5b76de c1acdf39 f3d24249 72930394 a4164351 93a7668d
21ff9839 6f920be5 186000b6 2b66969d 7506ca4f 885e0c80 e02a2244 40e8a43f
$ bash install.sh
