Filter column with awk and regexp

The way to write the script you posted:

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt

in awk so it will do what you SEEM to be trying to do is:

awk '$6 ~ /^(([1-9][0-9]?|100)[SM]){2}$/' file.txt

Post some sample input and expected output to help us help you more.


This should do the trick:

awk '$6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/' file

Regexplanation:

^                        # Match the start of the string
(([1-9]|[1-9][0-9]|100)  # Match a single digit 1-9 or double digit 10-99 or 100
[SM]                     # Character class matching the character S or M
){2}                     # Repeat everything in the parens twice
$                        # Match the end of the string
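
To sanity-check the pattern against a few values, here is a minimal sketch. The check_field helper and the test values are made up for illustration, and the {2} repetition is written out by hand on the assumption that some awk builds (older mawk, for example) do not support interval expressions.

```shell
# Hypothetical helper: report whether one value matches the pattern.
# The pattern is the one explained above with ){2} expanded by hand.
check_field() {
    printf '%s\n' "$1" |
        awk '{ print (($0 ~ /^([1-9]|[1-9][0-9]|100)[SM]([1-9]|[1-9][0-9]|100)[SM]$/) ? "match" : "no match") }'
}

for value in 12S23M 100M1S 0S5M 101S5M 12S23Mx; do
    printf '%-8s %s\n' "$value" "$(check_field "$value")"
done
```

Only 12S23M and 100M1S should be reported as a match; 0 and 101 are out of range, and the trailing x is rejected by the $ anchor.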

You have quite a few issues with your statement:

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
  • == is the string comparison operator. The regex matching operator is ~.
  • You don't quote regex literals (you never quote anything with single quotes in awk besides the script itself), and your script is missing its closing single quote.
  • [0-9] is the character class for the digit characters; it's not a numeric range. It matches any single character in the set 0,1,2,3,4,5,6,7,8,9, not any numerical value inside the range. So [1-100] is not the regular expression for numbers in the range 1 - 100; it matches a single character, either a 1 or a 0.
  • [SM] is equivalent to (S|M). What you tried, [S|M], is the same as (S|\||M). You don't need the OR operator in a character class.
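
The two bracket-expression points are easy to verify by feeding single characters to awk. This is only a sketch: class_members is a hypothetical helper, and the character set it tests is arbitrary.

```shell
# Hypothetical helper: list which characters from a small test set
# are matched by the regex passed as the first argument.
class_members() {
    printf '%s\n' 0 1 2 5 9 S M '|' |
        awk -v re="$1" '$0 ~ re { out = (out == "" ? $0 : out " " $0) } END { print out }'
}

class_members '^[1-100]$'   # 0 1    (just the characters 0 and 1)
class_members '^[S|M]$'     # S M |  (the pipe is matched literally)
class_members '^[SM]$'      # S M
```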

Awk uses the structure condition{action}. If the condition is true, the actions in the following block {} are executed for the current record. The condition in my solution is $6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/, which reads as "does the sixth column match the regular expression?" If it does, the line is printed, because when you don't give any action awk executes {print $0} by default.
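
As a small illustration of the default action, the two commands below should print the same thing. The sample variable and the simplified /^12S/ pattern are assumptions chosen only to keep the example short.

```shell
sample='a b c d e 12S23M
a b c d e 123S12M'

# Condition only: awk falls back to the default action and prints the record.
printf '%s\n' "$sample" | awk '$6 ~ /^12S/'

# The same program with the action spelled out.
printf '%s\n' "$sample" | awk '$6 ~ /^12S/ { print $0 }'
```

Both commands print only the first line, since 123S12M does not start with 12S.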


I would do the regex check and the numeric validation as separate steps. This code requires GNU awk, because the three-argument form of match() is a gawk extension:

$ cat data
a b c d e 132x123y
a b c d e 123S12M
a b c d e 12S23M
a b c d e 12S23Mx

We'd expect only the 3rd line to pass validation

$ gawk '
    match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) && 
    1 <= m[1] && m[1] <= 100 && 
    1 <= m[2] && m[2] <= 100 {
        print
    }
' data
a b c d e 12S23M

For maintainability, you could encapsulate that into a function:

gawk '
    function validate6() {
        return( match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) && 
                1<=m[1] && m[1]<=100 && 
                1<=m[2] && m[2]<=100 );
    }
    validate6() {print}
' data
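
If gawk is not available, the same two-step idea can be sketched in plain awk by using split() instead of the three-argument match(). The validate_data and valid names are hypothetical, and this variant is slightly looser than the gawk code above because it does not limit the digit count to three.

```shell
validate_data() {
    awk '
    # Hypothetical helper: 1 if s has the shape <n>[SM]<n>[SM] with both n in 1..100.
    function valid(s,    parts) {
        if (s !~ /^[0-9]+[SM][0-9]+[SM]$/) return 0   # shape check only
        split(s, parts, /[SM]/)                        # "12S23M" -> "12", "23", ""
        return (parts[1] + 0 >= 1 && parts[1] + 0 <= 100 &&
                parts[2] + 0 >= 1 && parts[2] + 0 <= 100)
    }
    valid($6)
    '
}

printf '%s\n' 'a b c d e 132x123y' 'a b c d e 123S12M' \
              'a b c d e 12S23M' 'a b c d e 12S23Mx' | validate_data
```

With the same four input lines as above, only a b c d e 12S23M should be printed.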

Regexes match characters, not numeric values, so "a number from 1 to 100" is awkward to express directly in a pattern. The simple thing a regex can check is "1-3 digits."

You want something like this (awk regexes don't support \d, so use [0-9] or [[:digit:]]):

/[0-9]{1,3}[SM][0-9]{1,3}[SM]/

Note that the character class [SM] doesn't have the | alternation character. You would only need that if you were writing it as (S|M).
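
To see what a digit-count check lets through, here is a minimal sketch. The loose_check helper is hypothetical, the ^ and $ anchors are an addition, and the 1-3 digit repetition is written as [0-9][0-9]?[0-9]? on the assumption that not every awk supports {1,3}.

```shell
# Hypothetical helper: does a value pass the "1-3 digits" shape check?
loose_check() {
    printf '%s\n' "$1" |
        awk '{ print (($0 ~ /^[0-9][0-9]?[0-9]?[SM][0-9][0-9]?[0-9]?[SM]$/) ? "accepted" : "rejected") }'
}

loose_check 12S23M     # accepted
loose_check 999S0M     # accepted, although 999 and 0 are outside 1-100
loose_check 1234S5M    # rejected: four digits
```

So a shape check like this still needs a separate numeric comparison if the 1-100 limit matters.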
