Is there a more robust way to edit the pattern matched, and then replace it?

Although it has fallen out of fashion, few languages can match Perl for text processing. For instance:

Assuming only one set of numbers per line, copy it to the end of the line:

 $ perl -pe 's/.*?a(\d+).*/$& $1/' file
 a11.t 11
 some text here
 a06.t 06
 some text here

Multiple sets of numbers, append all of them to the end of the line:

 $ cat file
 a11.t
 some text here
 a06.t
 some text here
 a11.t a54.g

 $ perl -pe '@nums=(/a(\d+)/g); s/$/ @nums/' file
 a11.t 11
 some text here 
 a06.t 06
 some text here 
 a11.t a54.g 11 54

sed here is the perfect tool for the task. However, note that you almost never need to pipe several sed invocations together, as a single sed script can be made of several commands.
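As a minimal sketch of that point, the snippet below runs the same two substitutions once as a pipeline of two sed processes and once as a single invocation with two -e commands. The sample input and the two substitutions are made up for illustration and are not part of the question.

```shell
# Hypothetical sample input, modelled on the file shown earlier.
sample='a11.t
some text here
a06.t'

# Two sed processes piped together...
piped=$(printf '%s\n' "$sample" | sed 's/^a/A/' | sed 's/\.t$/.txt/')

# ...and one sed process running both commands in order on each line.
single=$(printf '%s\n' "$sample" | sed -e 's/^a/A/' -e 's/\.t$/.txt/')

printf '%s\n' "$single"
```

Both variants should produce the same text; the single invocation just avoids the extra process and pipe.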

If you wanted to extract the first sequence of 2 decimal digits and, if found, append it after a space to the end of the line, you'd do:

sed 's/\([[:digit:]]\{2\}\).*$/& \1/' < your-file

If you wanted to do that only if it's found in second position on the line, right after an a:

sed 's/^a\([[:digit:]]\{2\}\).*$/& \1/' < your-file

And if you don't want to do it if that sequence of 2 digits is followed by more digits:

sed 's/^a\([[:digit:]]\{2\}\)\([^[:digit:]].*\)\{0,1\}$/& \1/' < your-file
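To see how the three commands above differ, here is a small sketch that wraps each one in a shell function and feeds them the same lines. The function names and the sample lines are invented for this illustration.

```shell
# The three sed commands from above, each reading standard input.
# Function names are illustrative only.
first_two_digits() { sed 's/\([[:digit:]]\{2\}\).*$/& \1/'; }
after_leading_a()  { sed 's/^a\([[:digit:]]\{2\}\).*$/& \1/'; }
exactly_two()      { sed 's/^a\([[:digit:]]\{2\}\)\([^[:digit:]].*\)\{0,1\}$/& \1/'; }

# Hypothetical test lines: a plain match, a different first letter,
# three digits instead of two, and digits at the very end of the line.
for line in 'a11.t' 'b22.t' 'a333.t' 'a44'; do
  printf '%-8s | ' "$line"
  printf '%s\n' "$line" | first_two_digits | tr '\n' '|'
  printf ' '
  printf '%s\n' "$line" | after_leading_a | tr '\n' '|'
  printf ' '
  printf '%s\n' "$line" | exactly_two
done
```

Only the first command changes b22.t, and only the third leaves a333.t alone, which is the kind of difference the requirements need to settle.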

In terms of robustness, it all boils down to answering the question: what should be matched, and what should not be? That's why it's important to specify your requirements clearly, and also to understand what the input may look like (for example: can there be digits in the lines where you don't want to find a match? Can there be non-ASCII characters in the input? Is the input encoded in the locale's charset?).

Above, depending on the sed implementation, the input will either be decoded into text based on the locale's charmap (see the output of locale charmap), or be interpreted as if each byte corresponded to a character, with bytes 0 to 127 interpreted as per the ASCII charmap (assuming you're not on an EBCDIC-based system).

For sed implementations in the first category, it may not work properly if the file is not encoded in the right charset. For those in the second category, it could fail if there are characters in the input whose encoding contains the encoding of decimal digits.
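One possible way to sidestep the decoding question, sketched here under the assumption that the input is ASCII or UTF-8, is to run sed in the C locale so that it works on bytes. In UTF-8, the bytes of a multi-byte character are all above 127, so they cannot be mistaken for ASCII digits. The function name and the sample line are made up for this example.

```shell
# Run the substitution byte-wise by forcing the C locale for sed only.
# Assumption: input is ASCII or UTF-8. Other multi-byte charsets may
# contain digit bytes inside characters, as described above.
append_digits_bytewise() {
  LC_ALL=C sed 's/^a\([[:digit:]]\{2\}\).*$/& \1/'
}

# Hypothetical line containing a UTF-8 encoded e-acute (bytes 0xC3 0xA9).
printf 'a11 caf\303\251\n' | append_digits_bytewise
```

The non-ASCII bytes pass through untouched and the digits are still appended, whatever the locale of the calling shell is.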

The simplest way is one of the following Perl one-liners:

$ perl -lne '$,=$"; print $_, /a(\d+)/' file
# or this 
$ perl -lpe 's/a(\d+).*\K/ $1/' file

Or, with awk:

$ awk '
    match($1, /^a[[:digit:]]+/) &&
    gsub(/$/, FS substr($1, RSTART+1, RLENGTH-1)) ||
  1' file

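Note: it is safe to use the result of substr in gsub's replacement here, since the match already guarantees that it consists only of digits, so it cannot contain & or a backslash, which are special in a replacement string.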
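The reason the replacement text matters can be shown with a small sketch: in awk's sub and gsub, an unescaped & in the replacement stands for the matched text, while a backslash-escaped one is taken literally. The variable names and the sample line are only for illustration.

```shell
# "&" in the replacement is replaced by the text that matched.
with_amp=$(printf 'a11.t\n' | awk '{ sub(/a11/, "<&>"); print }')

# "\\&" in the awk string reaches sub() as \& and gives a literal "&".
literal_amp=$(printf 'a11.t\n' | awk '{ sub(/a11/, "<\\&>"); print }')

printf '%s\n' "$with_amp" "$literal_amp"
```

This should print <a11>.t followed by <&>.t, so arbitrary text used as a replacement would need its & and backslash characters escaped first.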
