What constitutes a 'field' for the cut command?

The term "field" is often associated with tools such as cut and awk. A field is like a column's worth of data: you take a line and split it on a specific character. The default separator depends on the tool: cut splits on a Tab, while awk splits on runs of whitespace (Spaces and Tabs).

However, as with most tools, the separator is configurable. For example:

  • awk = awk -F"," ... - would separate by commas (i.e. ,).
  • cut = cut -d"," ... - would separate by commas (i.e. ,).
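A quick way to check the defaults and the overrides side by side is the sketch below. The sample strings are made up for illustration, and printf is used rather than echo so that the Tab is produced portably.

```shell
# Sample line (made up): "a b", a Tab, then "c d".
line=$(printf 'a b\tc d')

# cut with no -d splits on Tab only, so field 1 keeps its space.
cut_default=$(printf '%s\n' "$line" | cut -f1)

# awk with no -F splits on any run of whitespace.
awk_default=$(printf '%s\n' "$line" | awk '{print $1}')

# Both tools can be told to split on commas instead.
cut_comma=$(printf 'x,y,z\n' | cut -d, -f2)
awk_comma=$(printf 'x,y,z\n' | awk -F, '{print $2}')

printf '%s\n' "$cut_default" "$awk_default" "$cut_comma" "$awk_comma"
```

The first two results differ ("a b" from cut, "a" from awk) even though the input line is the same, which is usually the source of the confusion.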

Examples

This first one shows how awk splits on whitespace by default.

$ echo "The rain in Spain." | awk '{print $1" "$4}'
The Spain.

This one shows cut splitting on spaces too, once it is told to with -d" ".

$ echo "The rain in Spain." | cut -d" " -f1,4
The Spain.

Here we have a CSV list of column data, and we're using cut to return columns 1 and 4.

$ echo "col1,col2,col3,co4" | cut -d"," -f1,4
col1,co4

Awk too can do this:

$ echo "col1,col2,col3,co4" | awk -F"," '{print $1","$4}'
col1,co4

Awk is also a little more adept at dealing with mixed separators. Here it handles Tabs and Spaces intermixed on the same line, because by default it treats any run of whitespace as a single separator:

$ echo -e "The\t rain\t\t in Spain." | awk '{print $1" "$4}'
The Spain.
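cut, by contrast, treats every single delimiter character as a field boundary, so two delimiters in a row produce an empty field. The sketch below uses a made-up string with a double space to show the difference, and one common workaround (squeezing repeats with tr -s before calling cut).

```shell
# Made-up sample with two spaces between the words.
line='The  rain'

# cut sees an empty field 2 between the two spaces.
cut_second=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk collapses the run of spaces, so field 2 is the second word.
awk_second=$(printf '%s\n' "$line" | awk '{print $2}')

# Squeezing repeated spaces with tr -s first makes cut agree with awk.
squeezed_second=$(printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f2)

printf '[%s] [%s] [%s]\n' "$cut_second" "$awk_second" "$squeezed_second"
```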

What about the -s switch to cut?

This switch simply tells cut not to print any lines that do not contain the delimiter character (Tab by default, or whatever is specified via the -d switch). Without -s, such lines are passed through unchanged.

Example

Say we had this file.

$ cat sample.txt 
This is a space string.
This is a space   and   tab string.
Thisstringcontainsneither.

NOTE: There are spaces and tabs in the 2nd string above.

Now when we process these strings using cut with and without the -s switch:

$ cut -d" " -f1-6 sample.txt 
This is a space string.
This is a space  
Thisstringcontainsneither.

$ cut -d" " -f1-6 -s sample.txt 
This is a space string.
This is a space  

In the 2nd example you can see that the -s switch has omitted from the output any line that does not contain the delimiter, Space.
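The same behavior can be reproduced without a file. This is a minimal sketch with made-up input, using a colon as the delimiter so the lines are easy to tell apart.

```shell
# Made-up input: two lines with a colon, one without.
input=$(printf 'a:b\nnodelim\nc:d')

# Without -s, the line lacking the delimiter is passed through whole.
without_s=$(printf '%s\n' "$input" | cut -d: -f1)

# With -s, that line is suppressed.
with_s=$(printf '%s\n' "$input" | cut -d: -f1 -s)

printf '%s\n' "$without_s" '---' "$with_s"
```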


A field in the POSIX shell is any part of a line delimited by any of the characters in IFS, the "input field separators" (often called the internal field separator). The default value is a space, followed by a horizontal tab, followed by a newline. With Bash you can run printf '%q\n' "$IFS" to see its value. Note that IFS governs the shell's own field splitting (for example in read); cut does not consult it and uses its own delimiter instead.
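IFS is what the shell itself uses when it splits text into fields, for instance in read; cut does not look at it. The sketch below, with made-up records, shows both halves of that: read splitting on a colon via IFS, and cut still splitting on Tab even when IFS is set in its environment.

```shell
# Made-up passwd-style record.
record='alice:x:1000'

# IFS is set only for this read: the shell splits the line on colons.
IFS=: read -r user pass uid <<EOF
$record
EOF

# cut ignores IFS entirely; with no -d it still splits on Tab,
# so the colon stays inside field 1.
cut_field=$(printf 'a:b\tc\n' | IFS=: cut -f1)

printf '%s %s %s %s\n' "$user" "$pass" "$uid" "$cut_field"
```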


It depends on the utility in question, but for cut, a "field" starts at the beginning of a line of text and includes everything up to the first tab. The second field runs from the character after the first tab up to the next tab, and so on for the third, fourth, ... In short, a field is everything between tabs, between start-of-line and a tab, or between a tab and end-of-line.

That is, unless you specify a field delimiter with the "-d" option: cut -d: -f2 would get you everything between the first and second colon (':') characters.
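As a small illustration of the -d option, here is a sketch on a made-up colon-separated record. It also shows that when several fields are selected, cut joins them in the output with the same delimiter.

```shell
# Made-up colon-separated record.
record='alice:x:1000:1000'

# Field 2 is the text between the first and second colon.
second=$(printf '%s\n' "$record" | cut -d: -f2)

# Lists and ranges are allowed; output fields are rejoined with ':'.
selection=$(printf '%s\n' "$record" | cut -d: -f1,3-4)

printf '%s\n' "$second" "$selection"
```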

Other utilities have different definitions, but a tab character is common. awk is a good fallback if cut is too strict, as awk divides fields on runs of one or more whitespace characters. That's a little more natural in a lot of situations, but you have to know a bit of syntax. To print the second field according to awk:

awk '{print $2}'

sort is the one that tricks me. My current sort man page says something like "non-blank to blank transition" for a field separator. For some reason it takes me a few tries to get sort fields defined correctly. join apparently uses "delimited by whitespace" fields, which is what awk purports to do by default.
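One way to take the guesswork out of sort is to name the separator explicitly with -t and bound the key with -k start,end. This is a sketch on made-up data, not the only way to do it.

```shell
# Made-up name:count records.
data=$(printf 'b:10\na:2\nc:1')

# Sort numerically on field 2 only (-k2,2n), splitting on ':'.
by_number=$(printf '%s\n' "$data" | sort -t: -k2,2n)

# Sort on field 1 only, same separator.
by_name=$(printf '%s\n' "$data" | sort -t: -k1,1)

printf '%s\n' "$by_number" '---' "$by_name"
```

Writing the key as -k2,2 rather than -k2 matters: without the end position, the key runs from field 2 to the end of the line.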

The moral of the story is to be careful, and experiment if you don't know.
