Sed -- Replace first k instances of a word in the file

The first section below describes using sed to change the first k occurrences on a line. The second section extends this approach to change only the first k occurrences in a file, regardless of which line they appear on.

Line-oriented solution

With standard sed, there is a command to replace the k-th occurrence of a word on a line. If k is 3, for example:

sed 's/old/new/3'

Or, one can replace all occurrences with:

sed 's/old/new/g'

Neither of these is what you want.

GNU sed offers an extension that will change the k-th occurrence and all after that. If k is 3, for example:

sed 's/old/new/g3'

These can be combined to do what you want. To change the first 3 occurrences:

$ echo old old old old old | sed -E 's/\<old\>/\n/g4; s/\<old\>/new/g; s/\n/old/g'
new new new old old

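The placeholder \n is safe here because sed strips the trailing newline from each line as it reads it, so we can be sure that a newline never already occurs in the line being edited.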

Explanation:

We use three sed substitution commands:

  • s/\<old\>/\n/g4

    This is the GNU extension to replace the fourth and all subsequent occurrences of old with \n.

    The GNU regex extensions \< and \> match the beginning and the end of a word, respectively. This assures that only complete words are matched. GNU sed accepts them in both basic and extended regular expressions, so the -E option shown here is not strictly required by these commands.

  • s/\<old\>/new/g

    Only the first three occurrences of old remain and this replaces them all with new.

  • s/\n/old/g

    The fourth and all remaining occurrences of old were replaced with \n in the first step. This returns them back to their original state.
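To see what each of the three steps does, it can help to run them one at a time. This is only a sketch, assuming GNU sed; it swaps the \n placeholder for a visible @ (my own stand-in, assumed not to occur in the input) so the intermediate results can be printed:

```shell
line='old old old old old'

# Step 1: hide the 4th and all later matches behind the placeholder.
step1=$(printf '%s\n' "$line" | sed 's/\<old\>/@/g4')
# Step 2: every match still visible is one of the first three.
step2=$(printf '%s\n' "$step1" | sed 's/\<old\>/new/g')
# Step 3: put the hidden matches back.
step3=$(printf '%s\n' "$step2" | sed 's/@/old/g')

printf '%s\n' "$step1" "$step2" "$step3"
# old old old @ @
# new new new @ @
# new new new old old
```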

Non-GNU solution

If GNU sed is not available and you want to change the first 3 occurrences of old to new, then use three s commands:

$ echo old old old old old | sed -E -e 's/\<old\>/new/' -e 's/\<old\>/new/' -e 's/\<old\>/new/'
new new new old old

This works well when k is a small number but scales poorly to large k.

Since some non-GNU seds do not support combining commands with semicolons, each command here is introduced with its own -e option. It may also be necessary to verify that your sed supports the word boundary symbols, \< and \>.
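When k is larger, the repeated -e expressions can be generated rather than typed out. A minimal sketch, assuming a POSIX shell; the function name first_k_portable is made up for illustration, and the pattern is passed in by the caller so that \< and \> can be dropped if your sed lacks them:

```shell
# Hypothetical helper: first_k_portable K PATTERN REPLACEMENT
# Builds K identical "s/PATTERN/REPLACEMENT/" expressions and hands them to sed.
# Assumes K >= 1 and that neither PATTERN nor REPLACEMENT contains a slash.
first_k_portable() {
    k=$1 old=$2 new=$3
    set --                          # reuse the positional parameters as an argument list
    i=0
    while [ "$i" -lt "$k" ]; do
        set -- "$@" -e "s/$old/$new/"
        i=$((i + 1))
    done
    sed "$@"
}

echo old old old old old | first_k_portable 3 '\<old\>' new
```

As with the hand-written version, this only makes sense when the replacement does not itself match the pattern, since each s command rescans the line from the start.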

File-oriented solution

We can tell sed to read the whole file in and then perform the substitutions. For example, to replace the first three occurrences of old using a BSD-style sed:

sed -E -e 'H;1h;$!d;x' -e 's/\<old\>/new/' -e 's/\<old\>/new/' -e 's/\<old\>/new/'

The sed commands H;1h;$!d;x read the whole file in.

Because the above does not use any GNU extension, it should work on BSD (OSX) sed. Note, though, that this approach requires a sed that can handle long lines. GNU sed should be fine. Those using a non-GNU version of sed should test its ability to handle long lines.
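To make the H;1h;$!d;x idiom less opaque, here is a small sketch on a three-line input. The first command only shows what ends up in pattern space; the second shows that plain s commands then count matches across line boundaries. The variable names are mine, and the word-boundary markers are left out to keep the demo portable:

```shell
# What the slurp leaves in pattern space (the l command shows newlines as \n).
slurped=$(printf 'a\nb\nc\n' | sed -n '
# H: append the current line to hold space, after a newline
H
# 1h: on line 1, overwrite hold space so it has no leading newline
1h
# $!d: on every line but the last, stop and read the next line
$!d
# x: on the last line, swap the accumulated text into pattern space
x
l
')
printf '%s\n' "$slurped"      # a\nb\nc$

# With the whole input in one buffer, three s commands count across lines.
replaced=$(printf 'old old\nold old\n' |
    sed -e 'H;1h;$!d;x' -e 's/old/new/' -e 's/old/new/' -e 's/old/new/')
printf '%s\n' "$replaced"
# new new
# new old
```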

With a GNU sed, we can further use the g trick described above, but with \n replaced with \x00, to replace the first three occurrences:

sed -E -e 'H;1h;$!d;x; s/\<old\>/\x00/g4; s/\<old\>/new/g; s/\x00/old/g'

This approach scales well as k becomes large. This assumes, though, that \x00 is not in your original string. Since it is impossible to put the character \x00 in a bash string, this is usually a safe assumption.
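If you use this often, the GNU version can be wrapped so that k, the word and the replacement become arguments. This is just a sketch, assuming GNU sed; first_k_in_file is a name I made up, and it assumes the word is a literal with no regex metacharacters or slashes, since the same text is used both as the pattern and to restore the hidden matches:

```shell
# Hypothetical helper: first_k_in_file K WORD REPLACEMENT [FILE...]
# Reads stdin when no FILE is given. GNU sed only (g<number> flag, \x00, \< \>).
first_k_in_file() {
    k=$1 old=$2 new=$3
    shift 3
    sed -e 'H;1h;$!d;x' \
        -e "s/\\<$old\\>/\\x00/g$((k + 1))" \
        -e "s/\\<$old\\>/$new/g" \
        -e "s/\\x00/$old/g" "$@"
}

printf 'old old\nold old\nold\n' | first_k_in_file 3 old new
# new new
# new old
# old
```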


Using Awk

The awk commands below replace the first N occurrences of a word with a replacement.
They only replace a word when it matches a whole whitespace-separated field, so bold or gold is left alone when replacing old.

In the examples below, I am replacing the first 27 occurrences of old with new.

Using sub

awk '{for(i=1;i<=NF;i++){if(x<27&&$i=="old"){x++;sub("old","new",$i)}}}1' file

This command loops through the fields of each line. Whenever a field equals old and the counter is still below 27, it increments the counter and substitutes new for old in that field. It then moves on to the next field or line and repeats.

Replacing the field manually

awk '{for(i=1;i<=NF;i++)if(x<27&&$i=="old"&&$i="new")x++}1' file

This is similar to the previous command, but since the loop already knows which field it is up to ($i), it simply assigns new to that field instead of calling sub.

Performing a check before

awk '/old/&&x<27{for(i=1;i<=NF;i++)if(x<27&&$i=="old"&&$i="new")x++}1' file

Checking that the line contains old and the counter is below 27 SHOULD provide a small speed boost as it won't process lines when these are false.

RESULTS

For example, the commands above change

old bold old old old
old old nold old old
old old old gold old
old gold gold old old
old old old man old old
old old old old dog old
old old old old say old
old old old old blah old

to

new bold new new new
new new nold new new
new new new gold new
new gold gold new new
new new new man new new
new new new new dog new
new new old old say old
old old old old blah old
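The awk commands above hard-code the count and both words. A parameterized sketch of the same idea follows; the wrapper name awk_first_n and the variable names are my own. Note that assigning to a field makes awk rebuild the line with single spaces (OFS) between fields, so runs of whitespace on changed lines are not preserved:

```shell
# Hypothetical helper: awk_first_n N WORD REPLACEMENT [FILE...]
awk_first_n() {
    n=$1 old=$2 new=$3
    shift 3
    awk -v n="$n" -v old="$old" -v new="$new" '
    {
        # Walk the fields only while replacements are still owed.
        for (i = 1; i <= NF && count < n; i++)
            if ($i == old) {
                $i = new    # assigning a field rebuilds $0 using OFS
                count++
            }
    }
    1   # print every line, changed or not
    ' "$@"
}

printf 'old bold old\nold old\n' | awk_first_n 3 old new
# new bold new
# new old
```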

Say you want to replace only the first three instances of a string...

seq 11 100 311 | 
sed -e 's/1/\
&/g'              \ #s/match string/\nmatch string/globally 
-e :t             \ #define label t
-e '/\n/{ x'      \ #newlines must match - exchange hold and pattern spaces
-e '/.\{3\}/!{'   \ #if not 3 characters in hold space do
-e     's/$/./'   \ #add a new char to hold space
-e      x         \ #exchange hold/pattern spaces again
-e     's/\n1/2/' \ #replace first occurring '\n1' string w/ '2' string
-e     'b t'      \ #branch back to label t
-e '};x'          \ #end match function; exchange hold/pattern spaces
-e '};s/\n//g'      #end match function; remove all newline characters

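Note: the comments above are there only to explain each expression. The shell will not accept a comment after a line-continuation backslash, so strip them before running the command.

...or, in my example case, of a '1'...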

OUTPUT:

22
211
211
311

There I use two notable techniques. In the first place every occurrence of 1 on a line is replaced with \n1. In this way, as I do the recursive replacements next, I can be sure not to replace the occurrence twice if my replacement string contains my replace string. For example, if I replace he with hey it will still work.

I do this like:

s/1/\
&/g

Secondly, I am counting the replacements by adding a character to hold space for each occurrence. Once I reach three no more occur. If you apply this to your data and change the \{3\} to the total replacements you desire and the /\n1/ addresses to whatever you mean to replace, you should replace only as many as you wish.
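For anyone who wants to run an annotated version, sed accepts # comments on their own lines inside the script, so the two techniques can be spelled out like this. It is the same script as above, just laid out differently, with printf standing in for the seq call and the result captured in a variable of my own naming:

```shell
out=$(printf '%s\n' 11 111 211 311 |
sed '
# Mark every match by putting a newline in front of it.
s/1/\
&/g
:t
# While the line still holds a marked match...
/\n/{
    # ...swap in the counter kept in hold space.
    x
    # Fewer than 3 characters counted so far?
    /.\{3\}/!{
        # Count one more, swap the line back,
        # replace the first marked match, and loop.
        s/$/./
        x
        s/\n1/2/
        b t
    }
    # Limit reached: swap the line back and fall through.
    x
}
# Drop the markers from matches that were left alone.
s/\n//g
')
printf '%s\n' "$out"
# 22
# 211
# 211
# 311
```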

I only did all of the -e stuff for readability. POSIXly, it could be written like this:

nl='
'; sed "s/1/\\$nl&/g;:t${nl}/\n/{x;/.\{3\}/!{${nl}s/$/./;x;s/\n1/2/;bt$nl};x$nl};s/\n//g"

And w/ GNU sed:

sed 's/1/\n&/g;:t;/\n/{x;/.\{3\}/!{s/$/./;x;s/\n1/2/;bt};x};s/\n//g'

Remember also that sed is line-oriented - it does not read in the entire file and then attempt to loop back over it as is often the case in other editors. sed is simple and efficient. That said, it is often convenient to bundle the script up into a little shell function so that it becomes a simply executed command:

firstn() { sed "s/$2/\\
&/g;:t
    /\n/{x
        /.\{$(($1))"',\}/!{
            s/$/./; x; s/\n'"$2/$3"'/
            b t
        };x
};s/\n//g'; }

So with that I can do:

seq 11 100 311 | firstn 7 1 5

...and get...

55
555
255
311

...or...

seq 10 1 25 | firstn 6 '\(.\)\([1-5]\)' '\15\2'

...to get...

10
151
152
153
154
155
16
17
18
19
20
251
22
23
24
25

...or, to match your example (on a smaller order of magnitude):

yes linux | head -n 10 | firstn 5 linux 'linux is an os kernel'
linux is an os kernel
linux is an os kernel
linux is an os kernel
linux is an os kernel
linux is an os kernel
linux
linux
linux
linux
linux