How do programs that can resume failed file transfers know where to start appending data?

For clarity's sake (the real mechanics are more complicated, to provide even better safety), you can imagine the write-to-disk operation like this:

  • application writes bytes (1)
  • the kernel (and/or the file system IOSS) buffers them
  • once the buffer is full, it gets flushed to the file system:
    • the block is allocated (2)
    • the block is written (3)
    • the file and block information is updated (4)

If the process gets interrupted at (1), nothing gets to the disk; the file is intact and truncated at the previous block. You sent 5000 bytes, only 4096 are on the disk, so you restart the transfer at offset 4096.

If interrupted at (2), nothing has happened outside of memory; same as (1). If at (3), the data is written but no file metadata points to it. You sent 9000 bytes: 4096 got written and recorded, 4096 got written but are now orphaned, and the rest was simply lost. The transfer resumes at offset 4096.

If at (4), the data should by now have been committed to disk. The bytes still in flight may be lost. You sent 9000 bytes, 8192 got written, the rest is lost, and the transfer resumes at offset 8192.
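
From the application's point of view, the only handle on this machinery is when to hand bytes to the kernel and when to insist they be flushed. Here is a minimal C sketch of that boundary (hypothetical file name, error handling trimmed):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[4096] = {0};   /* ... the next chunk received from the network ... */

      int fd = open("download.part", O_WRONLY | O_APPEND | O_CREAT, 0644);
      if (fd < 0)
          return 1;

      /* Stage (1): the bytes go to the kernel's buffers, not necessarily to disk. */
      if (write(fd, buf, sizeof buf) < 0)
          return 1;

      /* Ask the kernel to push the data through stages (2)-(4) before returning;
         only after this is it reasonably safe to count these bytes as durable
         (and even then, stage (5) device caches may still be in play). */
      fsync(fd);

      close(fd);
      return 0;
  }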

This is a simplified take. For example, each "logical" write in stages 3-4 is not "atomic", but gives rise to another sequence (call it stage (5)) whereby the block, subdivided into sub-blocks suitable for the destination device (e.g. a hard disk), is sent to the device's host controller, which has its own caching mechanism, and is finally stored on the magnetic platter. This sub-sequence is not always completely under the system's control, so having sent data to the hard disk is no guarantee that it has actually been written and will be readable back.

Several file systems implement journaling, to make sure that the most vulnerable point, (4), is not actually vulnerable, by writing meta-data in, you guessed it, transactions that will work consistently whatever happens in stage (5).

If the system gets reset in the middle of a transaction, it can roll back or replay to the nearest intact checkpoint. The data in flight is still lost, same as case (1), but the resumed transfer will take care of that, so no information is ultimately lost.


Note: I have not looked at the sources of rsync or any other file transfer utility.

It is trivial to write a C program that jumps to the end of a file and gets the position of that location in bytes.

Both operations are done with a single call to the standard C library function lseek() (lseek(fd, 0, SEEK_END) returns the length of the file opened for file descriptor fd, measured in bytes).

Once that is done for the target file, a similar call to lseek() may be done on the source file to jump to the appropriate position: lseek(fd, pos, SEEK_SET). The transfer may then continue at that point, assuming the earlier portion of the source file has been identified as unchanged (different utilities may do this in different ways).
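
Put together, a bare-bones resume in C might look like this (file names are placeholders, error handling is omitted, and no real utility is quite this naive):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      int dst = open("copy.part", O_WRONLY | O_CREAT, 0644);
      int src = open("original", O_RDONLY);

      /* How many bytes already made it into the target file?
         This also leaves dst positioned at its end, ready for appending. */
      off_t pos = lseek(dst, 0, SEEK_END);

      /* Skip the part of the source that is already present in the target. */
      lseek(src, pos, SEEK_SET);

      /* Copy the remainder. */
      char buf[65536];
      ssize_t n;
      while ((n = read(src, buf, sizeof buf)) > 0)
          write(dst, buf, n);

      close(src);
      close(dst);
      return 0;
  }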

A file may be fragmented on the disk, but the filesystem will ensure that an application perceives the file as a contiguous sequence of bytes.


Regarding the discussion in comments about bits and bytes: The smallest unit of data that may be written to disk is a byte. A single byte requires at least one block of data to be allocated on disk. The size of a block is dependent on the type of filesystem and possibly also on the parameters used by the administrator when initializing the filesystem, but it's usually somewhere between 512 bytes and 4 KiB. Write operations may be buffered by the kernel, the underlying C library or by the application itself and the actual writing to disk may happen in multiples of the appropriate block size as an optimization.
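
You can see this with stat(): even a file holding a single byte gets at least one block allocated. A small sketch (the file name is just an example; st_blocks is counted in 512-byte units per POSIX):

  #include <stdio.h>
  #include <sys/stat.h>

  int main(void)
  {
      struct stat st;
      if (stat("one-byte-file", &st) != 0)
          return 1;

      printf("file size:        %lld bytes\n", (long long)st.st_size);
      printf("preferred I/O:    %lld bytes\n", (long long)st.st_blksize);
      printf("space allocated:  %lld bytes\n", (long long)st.st_blocks * 512);
      return 0;
  }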

It is not possible to write single bits to a file and if a write operation fails, it will not leave "half-written bytes" in the file.


These are basically two questions, because programs like curl and rsync work very differently.

HTTP clients like curl check the size of the partially downloaded file and then send a Range header with their request. The server either resumes sending that range of the file, using status code 206 (Partial Content) instead of 200 (OK), or it ignores the header and starts from the beginning, in which case the client has no choice but to re-download everything.
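
With libcurl this is only a couple of calls: check how large the partial file already is, then ask the library to resume from that offset, which makes it send the appropriate Range header (URL and file name below are made up):

  #include <stdio.h>
  #include <curl/curl.h>

  int main(void)
  {
      /* Open the partial download for appending and see how much we already have. */
      FILE *fp = fopen("file.part", "ab");
      fseek(fp, 0, SEEK_END);
      long already = ftell(fp);

      CURL *curl = curl_easy_init();
      curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/file.iso");
      curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);     /* append body to file.part */

      /* Makes libcurl send "Range: bytes=<already>-"; a cooperating server
         answers 206 Partial Content and sends only the missing tail. */
      curl_easy_setopt(curl, CURLOPT_RESUME_FROM_LARGE, (curl_off_t)already);

      curl_easy_perform(curl);
      curl_easy_cleanup(curl);
      fclose(fp);
      return 0;
  }

On the command line, curl -C - -O <url> does the same thing: the "-C -" option tells curl to work out the resume offset from the existing file.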

Further, the server may or may not send a Content-Length header. You may have noticed that some downloads do not show a percentage or file size. These are downloads where the server does not tell the client the total length, so the client only knows how much it has downloaded, not how many bytes will follow.

A Range header with explicit start and end positions is also used by some download managers to fetch a file from different sources at once, which speeds up the transfer if each mirror by itself is slower than your network connection.
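
In the libcurl sketch above, a download manager asking one mirror for just one segment would set an explicit byte range instead of a resume offset (the numbers are made up):

      /* Request only the first 4 MiB from this mirror; other segments are
         fetched from other mirrors in parallel and written at the matching
         offsets of the output file. */
      curl_easy_setopt(curl, CURLOPT_RANGE, "0-4194303");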

rsync, on the other hand, is an advanced protocol for incremental file transfers. It generates checksums of parts of the file on the server and client side to detect which bytes are the same, and then sends only the differences. This means it can not only resume a download, but can even transfer just the changed bytes if you modified a few bytes in the middle of a very large file, without re-sending the whole file.
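
The core idea can be sketched in a few lines of C: cut the file into fixed-size blocks, compute a checksum per block on each side, and only transfer the blocks whose checksums differ. (This is a deliberately simplified illustration, not rsync's actual algorithm, which combines a cheap rolling checksum with a stronger hash so it can also find blocks that merely moved within the file.)

  #include <stdio.h>

  #define BLOCK 4096

  /* Toy per-block checksum, a stand-in for rsync's rolling sum + strong hash. */
  static unsigned long block_sum(const unsigned char *p, size_t n)
  {
      unsigned long s = 0;
      for (size_t i = 0; i < n; i++)
          s = s * 31 + p[i];
      return s;
  }

  int main(void)
  {
      FILE *f = fopen("bigfile", "rb");
      if (!f)
          return 1;

      unsigned char buf[BLOCK];
      size_t n;
      long block = 0;

      /* Each side produces one checksum per block and exchanges the lists;
         blocks whose checksums match on both sides are not transferred again. */
      while ((n = fread(buf, 1, BLOCK, f)) > 0)
          printf("block %ld: %08lx\n", block++, block_sum(buf, n));

      fclose(f);
      return 0;
  }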

Another protocol designed for resumable transfers is BitTorrent: the .torrent file contains a list of checksums for the blocks of the file, so blocks can be downloaded and verified in arbitrary order, in parallel, from different sources.

Note that rsync and BitTorrent will verify the partial data on your disk, while resuming an HTTP download will not. So if you suspect the partial data is corrupted, you need to check its integrity by other means, e.g. by comparing a checksum of the final file. Simply interrupting the download or losing the network connection usually does not corrupt the partial file, whereas a power failure during the transfer may.
