Why does writing to an inherited file handle from a python sub-process result in not all rows being written?

This problem is due to a combination of:

  • fork copying the file descriptor from parent to child; and
  • buffering; and
  • the lack of an implicit flush as each child exits

Forking a process results in the parent and child sharing a POSIX file descriptor (and, with it, a single file offset). With raw writes this should not result in data loss, but without any form of synchronisation between parent and child it will generally produce an unpredictable interleaving of their output.
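
A minimal sketch of what that sharing means (POSIX only; the filename is illustrative):

import os

# fork() duplicates the parent's descriptor into the child; both copies refer
# to the same open file description, so they share a single file offset.
fd = os.open("shared.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
pid = os.fork()
if pid == 0:                        # child
    os.write(fd, b"child\n")        # raw write: no user-space buffer to lose
    os._exit(0)
os.waitpid(pid, 0)
os.write(fd, b"parent\n")           # lands after the child's bytes, not over them
os.close(fd)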

However in the presence of independent buffering by the processes, data may be lost depending on how the buffered write is implemented.

So ... a useful experiment here is to replicate your problem with no buffering involved. This can be done in two ways:

  • using open(..., mode='ab', buffering=0) and then, as this is now a binary file, ensuring that all writes encode to bytes, for example (a fuller, runnable sketch of this variant follows the list):

    file_handle.write(bytes(s+"\n", encoding="utf-8"))
    

    Doing so results in a file with 30,000 lines totalling 3,030,000 bytes (as expected)

  • jumping through some hoops to open the file as an io.TextIOWrapper with non-default options that disable buffering. We cannot get an unbuffered text-mode file via open() directly, so instead we construct the wrapper ourselves:

    import io

    # buffer_size=1 plus write_through=True means each write reaches the OS immediately
    file_handle = io.TextIOWrapper(
        io.BufferedWriter(
            io.FileIO("out.txt", mode="a"),
            buffer_size=1),
        newline='', encoding="utf-8",
        write_through=True)
    

    This will also result in a file of 30,000 lines totalling 3,030,000 bytes (as expected)
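
For completeness, here is a minimal, runnable sketch of the first (unbuffered binary) variant. The row-generation details and the three-workers-by-10,000-rows setup are assumptions, since the original code is not reproduced here; the point is simply that the handle is opened with buffering=0 in the parent and inherited by each forked worker:

import multiprocessing
import os
import random
import string

# Unbuffered binary handle, opened in the parent and inherited by each forked
# worker (this assumes the "fork" start method, the default on Linux).
file_handle = open("out.txt", mode="ab", buffering=0)

def write_random_rows(n):
    for _ in range(n):
        s = "".join(random.choices(string.ascii_letters, k=100))
        # binary file: encode explicitly and append the newline ourselves
        file_handle.write(bytes(s + "\n", encoding="utf-8"))

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=write_random_rows, args=(10_000,))
               for _ in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(os.path.getsize("out.txt"))   # 3030000 bytes (30,000 lines) from an empty file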

On Python 3.7, as commenters have noted, the original code (three workers writing 10,000 rows each) results in a file with 29,766 lines rather than 30,000: 234 lines, or 78 lines per worker, are missing. Running that code with two workers produces a file with 19,844 lines, which is again 78 lines short per worker.

Why? It is standard practice to exit a forked child process using os._exit (which is what multiprocessing does under the hood), and it appears that this does not flush the remaining buffer in each child to disk. That explains the missing 78 lines per child exactly (the arithmetic is checked in the snippet after this list):

  • On my machine, the default buffer size (io.DEFAULT_BUFFER_SIZE) is 8192 bytes.
  • Each line consists of 101 bytes. This means the buffer will overrun and be flushed every ceil(8192 / 101) = 82 lines. That is, 81 lines will almost fill the buffer and the 82nd line will cause the preceding 81 lines and itself to be flushed.
  • Thus, after 10,000 iterations we have 10,000 % 82 = 78 lines remaining in the buffer in each child.
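
A quick sanity check of that arithmetic (assuming the 8192-byte default buffer and the 101-byte lines described above):

import io
import math

line_size = 101                                  # 100 characters + "\n"
buffer_size = io.DEFAULT_BUFFER_SIZE             # 8192 on this machine
lines_per_flush = math.ceil(buffer_size / line_size)
stranded = 10_000 % lines_per_flush              # lines left unflushed per child

print(lines_per_flush, stranded)                 # 82 78
print(3 * (10_000 - stranded))                   # 29766 lines actually on disk

The last figure matches the 29,766 lines observed with three workers.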

Thus it would appear the missing data is buffered data that has not been flushed. So, making the following change:

def write_random_rows(n):
    ...
    except Exception:
        traceback.print_exc()

    # flush the file
    file_handle.flush()

will result in the desired 30,000 lines.

NOTE:

In either case, it is almost always better to ensure a child process is not sharing a file handle with its parent, either by deferring the open to the child or by dup'ing any open file handles across the fork.
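
For example, a minimal sketch of the first option, deferring the open to the child, reusing the hypothetical row-writing worker from above:

import random
import string

def write_random_rows(n):
    # Each worker opens, buffers, flushes and closes its own handle; mode "a"
    # sets O_APPEND, so separate descriptors never overwrite each other's data.
    with open("out.txt", mode="a", encoding="utf-8") as fh:
        for _ in range(n):
            s = "".join(random.choices(string.ascii_letters, k=100))
            fh.write(s + "\n")
    # leaving the "with" block flushes the buffer before the child exits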


File descriptors and their positions are shared across fork() on POSIX systems, as described in this other answer. That is likely to cause all sorts of issues when writing concurrently; it is indeed curious that the loss is so consistent from run to run, though.

It makes sense that it is reliable when using separate file descriptors, though: with O_APPEND, POSIX guarantees that the file offset is moved to the end of the file immediately before each write, so writers on separate descriptors do not overwrite one another's data.
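
A sketch of what that looks like at the descriptor level (the filename is illustrative):

import os

# Each process opens its own descriptor with O_APPEND: the kernel repositions
# to end-of-file and writes in one step, so concurrent appenders never clobber
# one another's data.
fd = os.open("out.txt", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"one complete line per write call\n")
os.close(fd)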