How does `yes` write to file so quickly?

nutshell:

yes exhibits similar behavior to most other standard utilities, which typically write to a FILE stream with output buffered by libc via stdio. These only make the write() syscall once per full buffer: every 4kb (or 16kb, or 64kb), or whatever the output block size BUFSIZ is. Your echo loop does one write() per "GNU" line. That's a lot of mode-switching (which is not, apparently, as costly as a context-switch).

And that's not even to mention that, apart from its initial optimization loop, yes is a very simple, tiny, compiled C loop, and your shell loop is in no way comparable to a compiler-optimized program.


but i was wrong:

When I said before that yes used stdio, I only assumed it did because it behaves a lot like the utilities that do. This was not correct: it only emulates their behavior in this way. What it actually does is closely analogous to the thing I did below with the shell: it first loops to concatenate its arguments (or y if none) until they can grow no more without exceeding BUFSIZ.

A comment from the source immediately preceding the relevant for loop states:

/* Buffer data locally once, rather than having the
large overhead of stdio buffering each item.  */

yes does its own write()s thereafter.
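
To make the buffer-filling idea concrete, here is a minimal shell sketch of it. It is not the code from the yes source: the BUFSIZ value of 8192 and the variable names are assumptions chosen for illustration only.

```shell
# Hypothetical sketch of the buffer-filling idea; BUFSIZ=8192 is an assumed value,
# not one read from any libc header.
BUFSIZ=8192
line='y
'
buf=$line
# Keep appending copies of the line while one more copy still fits in BUFSIZ.
while [ $(( ${#buf} + ${#line} )) -le "$BUFSIZ" ]
do    buf=$buf$line
done
buf_len=${#buf}
printf 'buffer holds %d characters\n' "$buf_len"
```

Once the buffer is built, something like `while printf %s "$buf"; do :; done` would emit the whole buffer on each iteration instead of one short line at a time, though the shell may still split a large string into several write()s.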


digression:

(As originally included in the question and retained for context to a possibly informative explanation already written here):

I've tried timeout 1 $(while true; do echo "GNU">>file2; done;) but unable to stop loop.

The timeout problem you have with the command substitution - I think I get it now, and can explain why it doesn't stop. timeout doesn't start because its command-line is never run. Your shell forks a child shell, opens a pipe on its stdout, and reads it. It will stop reading when the child quits, and then it will interpret all the child wrote for $IFS mangling and glob expansions, and with the results it will replace everything from $( to the matching ).

But if the child is an endless loop that never writes to the pipe, then the child never stops looping, and timeout's command-line is never completed before (as I guess) you do CTRL-C and kill the child loop. So timeout can never kill the loop which needs to complete before it can start.
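
One way around this, sketched here under the assumption that GNU timeout and a POSIX sh are available, is to hand the loop to timeout as a command for a child shell to run, rather than as a command substitution. The temporary file stands in for file2 and is purely illustrative.

```shell
# Illustrative only: a temporary file stands in for file2.
out=$(mktemp)
# timeout starts first, then runs the loop in a child sh and kills it after 1 second.
# The redirection sits outside the loop, so the file is opened only once.
timeout 1 sh -c 'while true; do echo "GNU"; done' >>"$out"
status=$?
lines=$(wc -l <"$out")
echo "exit status: $status, lines written: $lines"
rm -f "$out"
```

With GNU timeout the exit status is 124 when the time limit was reached; the line count will vary from machine to machine.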


other timeouts:

...simply aren't as relevant to your performance issues as the amount of time your shell program must spend switching between user- and kernel-mode to handle output. timeout, though, is not as flexible as a shell might be for this purpose: where shells excel is in their ability to mangle arguments and manage other processes.

As is noted elsewhere, simply moving your [fd-num] >> named_file redirection to the loop's output target rather than only directing output there for the command looped over can substantially improve performance because that way at least the open() syscall need only be done the once. This also is done below with the | pipe targeted as output for the inner loops.


direct comparison:

You might do like:

for cmd in  exec\ yes 'while echo y; do :; done'
do      set +m
        sh  -c '{ sleep 1; kill "$$"; }&'"$cmd" | wc -l
        set -m
done

256659456
505401

Which is kind of like the command sub relationship described before, but there's no pipe and the child is backgrounded until it kills the parent. In the yes case the parent has actually been replaced since the child was spawned, but the shell calls yes by overlaying its own process with the new one and so the PID remains the same and its zombie child still knows who to kill after all.


bigger buffer:

Now let's see about increasing the shell's write() buffer.

IFS="
";    set y ""              ### sets up the macro expansion       
until [ "${512+1}" ]        ### gather at least 512 args
do    set "$@$@";done       ### exponentially expands "$@"
printf %s "$*"| wc -c       ### 1 write of 512 concatenated "y\n"'s  

1024

I chose that number because output strings any longer than 1kb were getting split out into separate write()'s for me. And so here's the loop again:

for cmd in 'exec  yes' \
           'until [ "${512+:}" ]; do set "$@$@"; done
            while printf %s "$*"; do :; done'
do      set +m
        sh  -c $'IFS="\n"; { sleep 1; kill "$$"; }&'"$cmd" shyes y ""| wc -l
        set -m
done

268627968
15850496

That's about 30 times the amount of data written by the shell in the same amount of time for this test compared with the last (15850496 lines against 505401). Not too shabby. But it's not yes.


related:

As requested, there is a more thorough description than the mere code comments on what is done here at this link.


A better question would be why your shell is writing the file so slowly. Any self-contained compiled program that uses file-writing syscalls responsibly (not flushing one character at a time) would do it reasonably quickly. What you are doing is writing lines in an interpreted language (the shell), and in addition you do a lot of unnecessary input/output operations. What yes does:

  • opens a file for writing
  • calls optimized and compiled functions for writing to a stream
  • the stream is buffered, so a syscall (an expensive switch to kernel mode) happens very rarely, in large chunks
  • closes a file

What your script does:

  • reads in a line of code
  • interprets the code, making a lot of extra operations to actually parse your input and figure out what to do
  • for each iteration of while loop (which is probably not cheap in an interpreted language):
    • call the date external command and store its output (only in the original version - in the revised version you gain a factor of 10 by not doing this)
    • test whether the loop's termination condition is met
    • open a file in append mode
    • parse echo command, recognize it (with some pattern matching code) as a shell builtin, call parameter expansion and everything else on the argument "GNU", and finally write the line to the open file
    • close the file again
    • repeat the process

The expensive parts: the whole interpretation is extremely expensive (bash is doing an awful lot of preprocessing of all the input - your string could potentially contain variable substitution, process substitution, brace expansion, escape characters and more), every call of a builtin is probably a switch statement with a redirect to a function that deals with the builtin, and, very importantly, you open and close a file for each and every line of output. You could put >> file outside the while loop to make it a lot quicker, but you're still in an interpreted language. You are quite lucky that echo is a shell builtin, not an external command - otherwise, your loop would involve creating a new process (fork & exec) on every single iteration, which would grind the process to a halt - you saw how costly that is when you had the date command in the loop.
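
If you want to check which commands in a loop are builtins and which need a fork and exec, a small sketch like the following may help. It assumes a POSIX-style shell such as sh or bash with no aliases or functions shadowing these names; the variable names are only illustrative.

```shell
# command -v prints a bare name for a builtin and a full path for an external command.
echo_kind=$(command -v echo)
date_kind=$(command -v date)
printf 'echo -> %s\ndate -> %s\n' "$echo_kind" "$date_kind"
```

A bare `echo` in the first line of output means the shell handles it internally, while a path for date means every call spawns a new process.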


The other answers have addressed the main points. On a side note, you can increase the throughput of your while loop by writing to the output file at the end of the computation. Compare:

$ i=0;time while  [ $i -le 1000 ]; do ((++i)); echo "GNU" >>/tmp/f; done;

real    0m0.080s
user    0m0.032s
sys     0m0.037s

with

$ i=0;time while  [ $i -le 1000 ]; do ((++i)); echo "GNU"; done>>/tmp/f;

real    0m0.030s
user    0m0.019s
sys     0m0.011s