extract the 4th column from a csv file using unix command

UPDATE:

Actually, a much easier way is to set the record separator in gawk:

$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4}' myFile.csv
"col4
"4th column
"4th column2

However, this will remove the trailing " from the end of each column. To fix that you can print it yourself:

$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4"\""}' myFile.csv
"col4"
"4th column"
"4th column2"

If you don't want the quotes at all, you can set the field separator to ",":

$ gawk 'BEGIN{RS="\"\n"; FS="\",\""}{print $3}' myFile.csv
col3
4th column
4th column2

The only way I can think of One way of doing this is to first modify the file and then parse it. In your example, the newline that actually separates two records is always following a ":

"col1","col2","col3","col4"   <-- here 
1,"text1","<p>big             <-- no "

If that is the case for the entire file, you can replace all newlines that are not immediately after a " with a placeholder and so have everything in a single line. You can then parse normally with gawk and finally replace the placeholder with the newline again. I will use the string &%& as a placeholder since it is unlikely to exist in your file:

$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv | awk -F, '{print $4}'
"col4"
"4th column"
"4th column2"

The -p flag for perl means print each line of the input file after applying the script given by -e. Then there are 3 substitution (s/foo/bar/) commands:

  • s/"\s*\n/"&%&/ : This will find any " which is followed by 0 or more whitespace characters (\s*) and then a newline character (\n). It will replace that with "&%&. The quotes are added to preserve the format and the &%& is just a random placeholder, it could be anything that does not appear in your file.

  • s/\n//g; : since the real newlines have been replaced with the placeholder, we can now safely remove all remaining newlines in this record. This means that all lines of the current record have now been concatenated into the current line.

  • s/&%&/\n/ : This turns the placeholder back into a normal new line.

To understand the output of the command run it without gawk:

$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv 
"col1","col2","col3","col4"
1,"text1","<p>big      html     text</p>","4th column"
2,"text2","<p>big2      html2     text2</p>","4th column2"

So, you now have your long records on single lines and this is perfect food for gawk.

You can also do it directly in Perl:

perl -ne '$/="\"\n"; chomp;@a=split(/,/);print "$a[3]\"\n"' myFile.csv
"col4"
"4th column"
"4th column2"

This is using a bit more Perl magic. The $/special variable is the input record separator. By setting it to "\n we tell Perl to split lines not at \n but only at "\n" so that each record will be treated as a single line. Once that is done, chomp removes the newline from the end of the line (for printing later) and split splits each record (on ,) and saves it in the array @a. Finally, we print the 4th element of the array (arrays are numbered from 0 so that is $a[3]) which is the 4th column.

And even more magic, turn on auto spitting (-a) and split on commas (F","). This will split each record into the special @F array and you can print the 4th element of the array:

$ perl -F"," -ane '$/="\"\n";chomp;print "$F[3]"' myFile.csv
"col4"
"4th column"
"4th column2"

I would recommend using a battle-tested CSV parsing module. For example:

perl -MText::CSV -E '
    $csv = Text::CSV->new({binary=>1}); 
    while ($row = $csv->getline(STDIN)) {say $row->[3]}
' < file.csv
col4
4th column
4th column2

or this produces the same results:

ruby -rcsv -e 'CSV.foreach(ARGV.shift) {|row| puts row[3]}' file.csv

Python :

python -c "import csv,sys; print '\n'.join([ r[3] for r in csv.reader(open(sys.argv[1]))])" myfile.csv

A memory-conservative solution for large files that iterates through the file a line at a time unlike the above approach that loads the contents of the file into memory via a list

#!/usr/bin/env python
import sys
import csv
with open(sys.argv[1]) as f:
  for row in csv.reader(f):
    print(row[3])

TEST RESULT of all Solutions:

OS: Ubuntu 12.04

Public CSV Data Download from : http://seanlahman.com/baseball-archive/statistics/

Versions details

root@ubuntu:~# python --version
Python 2.7.3
root@ubuntu:~# ruby --version
ruby 1.8.7 (2011-06-30 patchlevel 352) [i686-linux]
root@ubuntu:~# perl --version

This is perl 5, version 14, subversion 2 (v5.14.2) built for i686-linux-gnu-thread-multi-64int

Result with time

root@ubuntu:~# time python -c "import csv,sys; print '\n'.join([ r[3] for r in csv.reader(open(sys.argv[1]))])" Master.csv > /tmp/python

real    0m1.112s
user    0m0.056s
sys     0m0.316s
root@ubuntu:~# time ruby -rcsv -e 'CSV.foreach(ARGV.shift) {|row| puts row[3]}' Master.csv > /tmp/ruby

real    0m24.582s
user    0m23.397s
sys     0m0.448s
root@ubuntu:~# time perl -MText::CSV -E '
>     $csv = Text::CSV->new({binary=>1});
>     while ($row = $csv->getline(STDIN)) {say $row->[3]}
> ' < Master.csv > /tmp/perl

real    0m7.049s
user    0m5.876s
sys     0m0.468s