Multiple RUN vs. single chained RUN in Dockerfile, which is better?

When possible, I always merge commands that create files with the commands that delete those same files into a single RUN line. Each RUN line adds a layer to the image; its output is quite literally the filesystem changes that you could view with docker diff on the temporary container it creates. If you delete a file that was created in a different layer, all the union filesystem does is register the deletion in a new layer. The file still exists in the previous layer, so it is still shipped over the network and stored on disk. So if you download source code, extract it, compile it into a binary, and then delete the tgz and source files at the end, you really want this all done in a single layer to reduce image size.
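As a rough sketch of the download, extract, compile, and delete case described above: the URL, the tool name mytool, and the assumption that the source builds with a plain make and make install are all hypothetical placeholders, not something from a real project.

```dockerfile
# Sketch only: the URL, "mytool", and the make-based build are assumptions.
FROM debian:bookworm-slim

# Everything that creates temporary files and everything that deletes them
# runs in one RUN, so the tarball and source tree never land in a layer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential ca-certificates curl \
 && curl -fsSL https://example.com/mytool-1.0.tgz -o /tmp/mytool.tgz \
 && mkdir -p /tmp/mytool \
 && tar -xzf /tmp/mytool.tgz -C /tmp/mytool --strip-components=1 \
 && make -C /tmp/mytool \
 && make -C /tmp/mytool install \
 && rm -rf /tmp/mytool /tmp/mytool.tgz /var/lib/apt/lists/*
```

If the final rm were moved to its own RUN line, the image would be no smaller, because the earlier layer would still contain the tarball and the source tree.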

Next, I personally split up layers based on their potential for reuse in other images and expected caching usage. If I have 4 images, all with the same base image (e.g. debian), I may pull a collection of utilities common to most of those images into the first RUN command so the other images benefit from caching.

Order in the Dockerfile is important when looking at image cache reuse. I look at any components that will update very rarely, possibly only when the base image updates, and put those high up in the Dockerfile. Towards the end of the Dockerfile, I include any commands that will run quickly and may change frequently, e.g. adding a user with a host-specific UID or creating folders and changing permissions. If the container includes interpreted code (e.g. JavaScript) that is being actively developed, that gets added as late as possible so that a rebuild only runs that single change.
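One way this ordering might look for a small Node.js app is sketched below. The file layout (package.json, package-lock.json, src/index.js) and the /data folder are assumptions for illustration; the node user is the one that ships with the official node image.

```dockerfile
# Sketch only: file names and the /data folder are hypothetical.
FROM node:20-slim

# Changes rarely: system packages, usually rebuilt only with the base image.
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Changes occasionally: dependency manifests and the install they trigger.
COPY package.json package-lock.json ./
RUN npm ci

# Quick to run and may change often: folders and permissions.
RUN mkdir -p /data && chown node:node /data

# Changes most often: the application source, added as late as possible.
COPY src/ ./src/

USER node
CMD ["node", "src/index.js"]
```

With this layout, editing a file under src/ should only invalidate the last COPY, so the package installs above it are served from cache.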

In each of these groups of changes, I consolidate as best I can to minimize layers. So if there are 4 different source code folders, those get placed inside a single folder so it can be added with a single command. Any package installs from something like apt-get are merged into a single RUN when possible to minimize the amount of package manager overhead (updating and cleaning up).
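A minimal sketch of that consolidation, assuming a Debian base, an arbitrary set of packages, and a hypothetical src/ folder that holds all the source sub-folders:

```dockerfile
# Sketch only: the package list and the src/ layout are assumptions.
FROM debian:bookworm-slim

# One RUN for all package installs: a single "apt-get update" and a single
# cleanup of the package lists, instead of one of each per package.
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
      ca-certificates \
      curl \
      git \
 && rm -rf /var/lib/apt/lists/*

# The source folders live under one parent folder in the build context,
# so a single COPY (one layer) brings them all in.
COPY src/ /app/
```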

Update for multi-stage builds:

I worry much less about reducing image size in the non-final stages of a multi-stage build. When these stages aren't tagged and shipped to other nodes, you can maximize the likelihood of a cache reuse by splitting each command to a separate RUN line.

However, this isn't a perfect solution to squashing layers since all you copy between stages are the files, and not the rest of the image metadata like environment variable settings, entrypoint, and command. And when you install packages in a Linux distribution, the libraries and other dependencies may be scattered throughout the filesystem, making a copy of all the dependencies difficult.
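To illustrate the metadata point, here is a small sketch assuming a Go module (with a go.mod) in the build context. The APP_MODE variable and the binary name app are made up for the example.

```dockerfile
# Sketch only: APP_MODE and the "app" binary name are hypothetical.
FROM golang:1.22 AS build
ENV APP_MODE=production
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /usr/local/bin/app .

FROM alpine:3.20
# COPY --from brings over files only. The ENV set in the build stage is not
# inherited here, so any metadata the final image needs is declared again.
COPY --from=build /usr/local/bin/app /usr/local/bin/app
ENV APP_MODE=production
ENTRYPOINT ["/usr/local/bin/app"]
```

A statically linked binary like this one is the easy case. Software installed through a package manager usually spreads files across many directories, which is why copying it between stages is harder.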

Because of this, I use multi-stage builds as a replacement for building binaries on a CI/CD server, so that my CI/CD server only needs to have the tooling to run docker build, and not have a jdk, nodejs, go, and any other compile tools installed.

The official answer is listed in Docker's best practices (official images MUST adhere to these):

Minimize the number of layers

You need to find the balance between readability (and thus long-term maintainability) of the Dockerfile and minimizing the number of layers it uses. Be strategic and cautious about the number of layers you use.

Since Docker 1.10, only the COPY, ADD and RUN statements add a new layer to your image. Be cautious when using these statements. Try to combine commands into a single RUN statement. Separate them only if it's required for readability.

More info: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#/minimize-the-number-of-layers

Update: Multi-stage builds in Docker 17.05 and later

With multi-stage builds you can use multiple FROM statements in your Dockerfile. Each FROM statement is a stage and can have its own base image. In the final stage you use a minimal base image like alpine, copy the build artefacts from previous stages and install runtime requirements. The end result of this stage is your image. So this is where you worry about the layers as described earlier.
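A minimal sketch of such a build, assuming a single hypothetical C++ source file main.cpp in the build context. The compiler toolchain stays in the first stage, and the final alpine stage only installs the C++ runtime library and copies in the binary.

```dockerfile
# Sketch only: main.cpp and the "hello" binary name are hypothetical.

# Build stage: the compiler toolchain lives here and is not shipped.
FROM alpine:3.20 AS build
RUN apk add --no-cache build-base
WORKDIR /src
COPY main.cpp .
RUN g++ -O2 -o hello main.cpp

# Final stage: minimal base image, runtime requirements, build artefact.
FROM alpine:3.20
RUN apk add --no-cache libstdc++
COPY --from=build /src/hello /usr/local/bin/hello
CMD ["hello"]
```

Only the layers of the final stage end up in the resulting image, so that is the stage where the number and size of layers matter.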

As usual, Docker has great docs on multi-stage builds. Here's a quick excerpt:

With multi-stage builds, you use multiple FROM statements in your Dockerfile. Each FROM instruction can use a different base, and each of them begins a new stage of the build. You can selectively copy artifacts from one stage to another, leaving behind everything you don’t want in the final image.

A great blog post about this can be found here: https://blog.alexellis.io/mutli-stage-docker-builds/

To answer your points:

Yes, layers are sort of like diffs. I don't think a layer is added if there are absolutely zero changes. The problem is that once you install or download something in layer #2, you cannot remove it in layer #3. So once something is written in a layer, the image size cannot be decreased by removing it later.
Although layers can be pulled in parallel, making the pull potentially faster, each layer undoubtedly increases the image size, even if it only removes files.
Yes, caching is useful if you're updating your Dockerfile. But it works in one direction. If you have 10 layers and you change layer #6, you'll still have to rebuild layers #6 through #10. So it's not too often that it will speed the build process up, but it's guaranteed to unnecessarily increase the size of your image.

Thanks to @Mohan for reminding me to update this answer.

It depends on what you include in your image layers. The key point is sharing as many layers as possible.

Bad Examples

Dockerfile

RUN yum install -y big-package && yum install -y package1

Dockerfile

RUN yum install -y big-package && yum install -y package2

Good Examples

Dockerfile

RUN yum install -y big-package
RUN yum install -y package1

Dockerfile

RUN yum install -y big-package
RUN yum install -y package2

Another suggestion: deleting is only useful if it happens in the same layer as the adding/installing action.
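Putting both suggestions together might look like the sketch below. The base image is an assumption (any yum-based image should behave the same way), and big-package and package1 are the placeholder names from the examples above.

```dockerfile
# Sketch only: the base image is an assumption; package names are placeholders.
FROM almalinux:9

# Shared layer: identical in every image that starts this way, so it can be
# cached and reused. The cache cleanup is in the same RUN as the install.
RUN yum install -y big-package && yum clean all

# Image-specific layer, again cleaned up in the same RUN that created the files.
RUN yum install -y package1 && yum clean all
```

A separate RUN yum clean all at the end would not shrink the image, since the cached files would already be stored in the earlier layers.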

It seems the answers above are outdated. The docs note this:

Prior to Docker 17.05, and even more, prior to Docker 1.10, it was important to minimize the number of layers in your image. The following improvements have mitigated this need:

[...]

Docker 17.05 and higher add support for multi-stage builds, which allow you to copy only the artifacts you need into the final image. This allows you to include tools and debug information in your intermediate build stages without increasing the size of the final image.

and this:

Notice that this example also artificially compresses two RUN commands together using the Bash && operator, to avoid creating an additional layer in the image. This is failure-prone and hard to maintain.

Best practice seems to have changed to using multi-stage builds and keeping the Dockerfiles readable.
