Is there a variable that stores the exact text matched by the field separator FS when FS is a regular expression, equivalent to RT for RS?

Is there a way to "repack" the fields using the specific separator that was matched when splitting each one of them?

GNU awk's split() takes an optional 4th argument, an array that receives the text matched by the separator regex after each field:

s="hello;how|are you"
awk 'split($0, flds, /[;|]/, seps) {for (i=1; i in seps; i++) printf "%s%s", flds[i], seps[i]; print flds[i]}' <<< "$s"

hello;how|are you

A more readable version:

s="hello;how|are you"
awk 'split($0, flds, /[;|]/, seps) {
   for (i=1; i in seps; i++)
      printf "%s%s", flds[i], seps[i]
   print flds[i]
}' <<< "$s"

Take note of the 4th parameter of split(), seps: it is an array that stores the text matched by the regular expression given as the 3rd parameter, i.e. /[;|]/.
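To make the relationship between the two arrays concrete, here is a minimal sketch that dumps each field next to the separator that followed it. It assumes GNU awk is installed as gawk (the 4-argument split() is a gawk extension), and show_seps is just a hypothetical helper name for illustration.

```shell
# Requires GNU awk: the 4-argument split() is a gawk extension.
# show_seps is a hypothetical helper name, not part of awk.
show_seps() {
    gawk '{
        # n is the number of fields; seps[i] is the text matched after flds[i]
        n = split($0, flds, /[;|]/, seps)
        for (i = 1; i <= n; i++)
            printf "flds[%d]=\"%s\" seps[%d]=\"%s\"\n", i, flds[i], i, seps[i]
    }'
}
```

Piping the sample line through it, e.g. printf 'hello;how|are you\n' | show_seps, should print one line per field, with seps[3] empty because nothing follows the last field.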

Of course, this is not as short and simple as the record-level equivalent using RS, ORS and RT, which can be written as:

awk -v RS='[;|]' '{ORS = RT} 1' <<< "$s"

As @anubhava mentions, gawk has split() (and patsplit() which is to FPAT as split() is to FS - see https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions) to do what you want. If you want the same functionality with a POSIX awk then:

$ cat tst.awk
function getFldsSeps(str,flds,fs,seps,  nf) {
    delete flds
    delete seps
    # str is the string to split and fs the separator to apply (do not overwrite str with $0)

    if ( fs == " " ) {
        fs = "[[:space:]]+"
        if ( match(str,"^"fs) ) {
            seps[0] = substr(str,RSTART,RLENGTH)
            str = substr(str,RSTART+RLENGTH)
        }
    }

    while ( match(str,fs) ) {
        flds[++nf] = substr(str,1,RSTART-1)
        seps[nf]   = substr(str,RSTART,RLENGTH)
        str = substr(str,RSTART+RLENGTH)
    }

    if ( str != "" ) {
        flds[++nf] = str
    }

    return nf
}

{
    print
    nf = getFldsSeps($0,flds,FS,seps)
    for (i=0; i<=nf; i++) {
        printf "{%d:[%s]<%s>}%s", i, flds[i], seps[i], (i<nf ? "" : ORS)
    }
}

Note the specific handling above of the case where the field separator is " ", because that value differs from all other field separator values in 2 ways:

1. Fields are actually separated by chains of any white space, and
2. Leading white space is to be ignored when populating $1 (or flds[1] in this case), and so that white space, if it exists, must be captured in seps[0] for our purposes since every seps[N] is associated with the flds[N] that precedes it.

For example, running the above on these 3 input files:

$ head file{1..3}
==> file1 <==
hello;how|are you

==> file2 <==
hello how are_you

==> file3 <==
    hello how are_you

we'd get the following output where each field is displayed as the field number then the field value within [...] then the separator within <...>, all within {...} (note that seps[0] is populated IFF the FS is " " and the record starts with white space):

$ awk -F'[,|]' -f tst.awk file1
hello;how|are you
{0:[]<>}{1:[hello;how]<|>}{2:[are you]<>}

$ awk -f tst.awk file2
hello how are_you
{0:[]<>}{1:[hello]< >}{2:[how]< >}{3:[are_you]<>}

$ awk -f tst.awk file3
    hello how are_you
{0:[]<    >}{1:[hello]< >}{2:[how]< >}{3:[are_you]<>}
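The two special properties of FS=" " described above can be checked directly by splitting the same line once with the default FS and once with a bracket expression that matches exactly one space. This is only an illustrative sketch; default_fs and single_space_fs are hypothetical helper names.

```shell
# Default FS (" "): leading blanks are skipped and runs of blanks act as one separator.
default_fs() { awk '{ print NF ":" $1 }'; }

# FS set to the regexp [ ]: every single space is a separator, so leading
# and repeated spaces produce empty fields.
single_space_fs() { awk -F'[ ]' '{ print NF ":" $4 }'; }

printf '   a  b\n' | default_fs        # prints 2:a
printf '   a  b\n' | single_space_fs   # prints 6:a
```

With the default FS the three leading spaces vanish and "a" is $1; with [ ] they yield three empty fields, so "a" only shows up as $4.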

An alternative to split() is to use match() to find each field separator and store it in an array:

awk -F'[;|]' '{
    str = $0                 # Work on a copy of the line
    cnt = 0                  # Reset the separator count for each record
    while (match(str, FS)) { # Loop through each match of the field separator
      map[++cnt] = substr(str, RSTART, RLENGTH) # Store the matched separator
      str = substr(str, RSTART+RLENGTH)         # Continue with the text after the match
    }
    for (i=1; i<=NF; i++) {  # Loop through each field, printing it followed by its separator
      printf "%s%s", $i, (i <= cnt ? map[i] : "")
    }
    printf "\n"
   }' <<< "hello;how|are you"
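The same match() loop can be used for the "repack" part of the question: save the separators first, modify a field, then rebuild the line from the fields and the saved separators instead of letting awk join them with OFS. The sketch below assumes the field to change is $2 and that upper-casing it is the desired edit; repack_upper2 is a hypothetical name, and only POSIX awk features are used.

```shell
# repack_upper2 is a hypothetical helper: it upper-cases field 2 while
# keeping every original separator in place.
repack_upper2() {
    awk -F'[;|]' '{
        n = 0
        rest = $0                      # copy taken before any field is modified
        while (match(rest, FS)) {      # collect each matched separator in order
            sep[++n] = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
        $2 = toupper($2)               # this rebuilds $0 with OFS, which we ignore
        out = ""
        for (i = 1; i <= NF; i++)      # re-join fields with their own separators
            out = out $i (i <= n ? sep[i] : "")
        print out
    }'
}

printf 'hello;how|are you\n' | repack_upper2   # prints hello;HOW|are you
```

The separators have to be captured before the assignment to $2, because that assignment makes awk rebuild $0 using OFS and the original separators are lost from that point on.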
