Are files opened by processes loaded into RAM?

No, a file is not automatically read into memory by opening it. That would be awfully inefficient. sed, for example, reads its input line by line, as do many other Unix tools. It seldom has to keep more than the current line in memory.

With awk it's the same. It reads a record at a time, which by default is a line. If you store parts of the input data in variables, that will be extra, of course¹.
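
For example, both of the following pipelines only ever hold (roughly) the current line in memory, no matter how large the input is (huge.log is just a placeholder name):

sed 's/ERROR/error/' huge.log >fixed.log                  # one line at a time
awk -F '\t' '{ sum += $3 } END { print sum }' huge.log    # one record at a time; only the running sum is kept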

Some people have a habit of doing things like

for line in $(cat file); do ...; done

Since the shell will have to expand the $(cat file) command substitution completely before running even the first iteration of the for loop, this will read the whole of file into memory (into the memory used by the shell executing the for loop). This is a bit silly and also inelegant. Instead, one should do

while IFS= read -r line; do ...; done <file

This will process file line by line (but do read Understanding "IFS= read -r line").

Processing files line by line in the shell is seldom needed, though, as most utilities are line-oriented anyway (see Why is using a shell loop to process text considered bad practice?).
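
As a small illustration (counting the lines that contain foo in some file), a single utility invocation reads the file line by line just as well, and is much faster than the equivalent shell loop:

# shell loop: works, but slow and easy to get wrong
n=0
while IFS= read -r line; do
    case $line in (*foo*) n=$((n + 1));; esac
done <file
echo "$n"

# the usual way: let a line-oriented utility do the work
grep -c foo file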

I work in bioinformatics, and when processing huge amounts of genomic data I would not be able to do much unless I kept only the bits of the data that were absolutely necessary in memory. For example, when I need to strip the bits of data that could be used to identify individuals from a 1-terabyte dataset containing DNA variants in a VCF file (because that type of data can't be made public), I do line-by-line processing with a simple awk program (this is possible since the VCF format is line-oriented). I do not read the file into memory, process it there, and write it back out again! If the file were compressed, I would feed it through zcat or gzip -d -c, which, since gzip does stream processing of data, would also not read the whole file into memory.
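
As a minimal sketch of that kind of processing (not the actual program, and the file names are made up): dropping the per-sample genotype columns of a VCF, i.e. keeping only the first 8 fixed columns of each data line, can be done in a single streaming pass:

zcat variants.vcf.gz |
awk 'BEGIN { FS = OFS = "\t" }
     /^#/ { print; next }                       # header lines pass through
     { print $1, $2, $3, $4, $5, $6, $7, $8 }   # keep CHROM..INFO only
' | gzip >variants.stripped.vcf.gz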

Even with file formats that are not line-oriented, like JSON or XML, there are stream parsers that make it possible to process huge files without holding all of the data in RAM.
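
For instance (with hypothetical file names), jq can emit a stream of [path, value] events instead of building the whole document in memory, and xmllint has a similar streaming mode:

zcat big.json.gz | jq --stream 'select(.[0][-1] == "name")'   # only events whose path ends in "name"
xmllint --stream --noout big.xml                              # streaming parse / well-formedness check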

With executables, it's slightly more complicated since shared libraries may be loaded on demand, and/or be shared between processes (see Loading of shared libraries and RAM usage, for example).
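
On Linux you can see a concrete instance of this:

ldd /bin/ls              # the shared libraries the executable needs
grep libc /proc/$$/maps  # the same libc file mapped into this shell's address space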

Caching is something I haven't mentioned yet. It is the act of using RAM to hold frequently accessed pieces of data. Smaller files (for example executables) may be cached by the OS in the hope that the user will make many references to them. Apart from the first reading of the file, subsequent accesses will be made to RAM rather than to disk. Caching, like the buffering of input and output, is usually largely transparent to the user, and the amount of memory used for caching may change dynamically depending on how much RAM is allocated by applications etc.
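
You can see the effect easily (bigfile is a placeholder for some large file): the second read is typically served from the cache and is much faster, and free shows how much RAM is currently used for caching:

time wc -l bigfile    # first run: data read from disk
time wc -l bigfile    # second run: usually served from the cache
free -h               # the buff/cache column is memory the kernel gives back under pressure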


¹ Technically, most programs probably read a chunk of the input data at a time, either using explicit buffering or implicitly through the buffering that the standard I/O libraries do, and then present that chunk line by line to the user's code. It's much more efficient to read a multiple of the disk's block size than, say, one character at a time. This chunk size will seldom be larger than a handful of kilobytes though.
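
One way to see this on Linux is to trace the read(2) calls; even a tool that conceptually works line by line pulls the data in by the block (the exact chunk size varies):

strace -e trace=read sed -n 1p /etc/services >/dev/null   # sed reads a few-kilobyte chunk, not one character per call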


However when commands are being run, a copy of their files from the hard disk is put into the RAM,

This is wrong (in general). When a program is executed (through execve(2)...), the process running that program changes its virtual address space, and the kernel reconfigures the MMU for that purpose. Read also about virtual memory. Notice that application programs can change their virtual address space using mmap(2) & munmap(2) & mprotect(2), which are also used by the dynamic linker (see ld-linux(8)). See also madvise(2) & posix_fadvise(2) & mlock(2).
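
A quick way to watch this on Linux is to trace the relevant system calls when starting a trivial program; the executable and its shared libraries are opened and mmap'ed, not read wholesale into memory:

strace -e trace=execve,openat,mmap /bin/true 2>&1 | head -n 20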

Subsequent page faults are handled by the kernel, which lazily loads pages from the executable file as they are needed. Read also about thrashing.
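
The page-fault counts of a process can be observed directly; major faults are the ones that required actual disk I/O (GNU time and the Linux ps field names are assumed here):

/usr/bin/time -v ls >/dev/null     # reports minor and major page faults
ps -o min_flt,maj_flt,comm -p $$   # fault counters of the current shell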

The kernel maintains a large page cache. Read also about copy-on-write. See also readahead(2).
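
On Linux, the current size of the page cache can be seen directly:

grep -E '^(Buffers|Cached|Dirty):' /proc/meminfo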

OK, so what I wonder about is if the double life of a command, one on the hard disk, the other in the RAM is also true for other kind of files, for instance those who have no logic programmed, but are simply containers for data.

For system calls like read(2) & write(2), the page cache is also used. If the data to be read is already sitting in it, no disk I/O is done. If disk I/O is needed, the data read will very likely be put into the page cache. So, in practice, if you run the same command twice, it can happen that no physical I/O is done to the disk the second time (if you have an old rotating hard disk - not an SSD - you might hear that; or carefully observe your hard disk LED).
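
One way to demonstrate this on Linux (as root, with bigfile standing in for some large file) is to empty the page cache between two runs:

sync                                  # write dirty pages back first
echo 3 > /proc/sys/vm/drop_caches     # drop the page cache (root only)
time grep -c pattern bigfile          # cold: data comes from the disk
time grep -c pattern bigfile          # warm: served from the page cache, little or no disk I/O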

I recommend reading a book like Operating Systems: Three Easy Pieces (freely downloadable, one PDF file per chapter), which explains all this.

See also Linux Ate My RAM and run commands like xosview, top, htop or cat /proc/self/maps or cat /proc/$$/maps (see proc(5)).

PS. I am focusing on Linux, but other OSes also have virtual memory and a page cache.


No. While having gigabytes of RAM these days is fantastic, there was a time when RAM was a very limited resource (I learned programming on a VAX 11/750 with 2 MB of RAM), and the only things in RAM were the active executable and data pages of active processes, plus whatever file data sat in the buffer cache.
The buffer cache was flushed, and data pages were swapped out, frequently at times. Read-only executable pages were simply overwritten, and the page tables were marked so that if the program touched those pages again they were paged back in from the filesystem; data was paged in from swap. As noted above, the stdio library pulled in data in blocks, and the program obtained it as needed: fgetc, fgets, fread, etc. With mmap, a file could be mapped into the address space of a process, as is done with shared library objects or even regular files. Yes, you may have some degree of control over whether it is in RAM or not (mlock), but that only goes so far (see the error code section of mlock).
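
On Linux you can at least see the limit and the current amount of locked memory from the shell (a rough sketch; mlock itself has to be called from a program):

ulimit -l                    # per-process limit on locked memory, in KiB
grep VmLck /proc/$$/status   # how much memory the current shell has locked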

Tags: memory, lsof, files