How to speed up a complex image processing?

0. Two approaches

Basically, this challenge can be tackled in two different ways, or a combination of the two:

  1. Construct your commands as cleverly as possible.
  2. Trade speed-up gains for quality losses.

The next few sections discuss both approaches.

1. Check which ImageMagick you've got: 'Q8', 'Q16', 'Q32' or 'Q64'?

First, check your exact ImageMagick version by running:

convert -version

In case your ImageMagick has a Q16 (or even Q32 or Q64, which is possible, but overkill!) in its version string: this means all of ImageMagick's internal functions treat images as having 16-bit (or 32- or 64-bit) channel depths. This gives you better quality in image processing. But it also requires twice the memory compared to Q8, so at the same time it means a performance degradation.

Hence: you could test what performance benefits you'd achieve by switching to a Q8 build. (The Q stands for the 'quantum depth' supported by an ImageMagick build.)

You'll pay for possible Q8 performance gains with quality loss, though. Just check what speed-up you achieve with Q8 over Q16, and what quality losses you suffer. Then decide whether you can live with the drawbacks or not...

In any case, Q16 will use twice as much RAM per image to process, and Q32 will again use twice the amount of Q16. This is independent of the actual bits per pixel seen in the input files. 16-bit image files, when saved, will also consume more disk space than 8-bit ones.

With Q16 or Q32 requiring more memory, you always have to ensure that you have enough of it, because exceeding your physical memory would be very bad news. If a larger Q makes a process swap to disk, performance will plummet. A 1024 x 768 pixel image (width x height) will require the following amounts of virtual memory, depending on the quantum depth:

Quantum                   Virtual Memory
  Depth    (consumed by 1 image 1024x768)
-------    ------------------------------
      8         3,840 kiB  (=~  3.75 MiB)
     16         7,680 kiB  (=~  7.50 MiB)
     32        15,360 kiB  (=~ 15.00 MiB)
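These figures can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes roughly 5 bytes per pixel per quantum byte (e.g. 4 channel bytes plus 1 index byte at Q8); the exact per-pixel cost depends on the build:

```shell
#!/bin/bash
# Per-image memory footprint for a 1024x768 image at various quantum depths.
# Assumption: ~5 bytes per pixel per quantum byte (build-dependent).
width=1024; height=768; bytes_per_px_per_qbyte=5

for qbytes in 1 2 4; do            # Q8, Q16, Q32
    kib=$(( width * height * bytes_per_px_per_qbyte * qbytes / 1024 ))
    echo "Q$(( qbytes * 8 )): ${kib} KiB"
done
```

Each doubling of the quantum depth doubles the footprint, which is why the table's rows double each time.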
     

Also keep in mind that some 'optimized' processing pipelines (see below) will need to keep several copies of an image in virtual memory! Once virtual memory demands cannot be satisfied by available RAM, the system will start to swap and claim "memory" from the disk. In that case, all clever command pipeline optimization is of course gone, and the gains tip over into the very reverse.

ImageMagick was born in an era when CPUs could handle only 8 bits at a time. That was decades ago. Since then CPU architecture has changed a lot. 16-bit operations used to take twice as long as 8-bit operations, or even longer. Then 16-bit processors arrived and 16-bit ops became standard. CPUs were optimized for 16-bit: suddenly some 8-bit operations could take even longer than their 16-bit equivalents.

Nowadays, 64-bit CPUs are common. So the Q8 vs. Q16 vs. Q32 argument may in real terms even be void. Who knows? I'm not aware of any serious benchmarking about this. It would be interesting if someone (with really deep know-how about CPUs and about benchmarking real-world programs) would take on such a project one day.

Yes, I see you are using Q16 on Windows. But I still wanted to mention it, for completeness' sake... In the future there will be other users reading this question and the answers given.

Very likely, since your input TIFFs are black and white only, the image quality output of a Q8 build will be good enough for your workflow. (I just don't know if it would also be significantly faster: this largely depends on the hardware resources you are running this on...)

In addition, if your installation sports HDRI support (high dynamic range imaging), this may also cause some speed penalty. So building IM with the configure options --disable-hdri --with-quantum-depth=8 may or may not lead to speed improvements. Nobody has ever tested this in a serious way... The only thing we know: these options will decrease image quality. However, most people will not even notice this, unless they take really close looks and make direct image-by-image comparisons...

 

2. Check your ImageMagick's capabilities

Next, check if your ImageMagick installation comes with OpenCL and/or OpenMP support:

convert -list configure | grep FEATURES

If it does (like mine), you should see something like this:

FEATURES      DPC HDRI OpenCL OpenMP Modules

OpenCL (Open Computing Language) utilizes ImageMagick's parallel computing features (if compiled in). It makes use of your computer's GPU in addition to the CPU for image processing operations.

OpenMP (Open Multi-Processing) does something similar: it allows ImageMagick to execute in parallel on all the cores of your system. So if you have a quad-core system and resize an image, the resizing happens on 4 cores (or even 8 if you have hyperthreading).

The command

convert -version 

prints some basic info about supported features. If OpenCL/OpenMP are available, you will see one of them (or both) in the output.
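If you want to check for both features in a script, a small sketch like the following will do (it assumes the FEATURES line format shown above, and degrades gracefully when convert is not installed):

```shell
#!/bin/bash
# Report whether the installed ImageMagick build advertises OpenMP/OpenCL.
features=$(convert -list configure 2>/dev/null | grep '^FEATURES' || true)

for f in OpenMP OpenCL; do
    case "$features" in
        *"$f"*) echo "$f: available" ;;
        *)      echo "$f: not available" ;;
    esac
done
```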

If neither of the two shows up: look into getting the most recent version of ImageMagick that has OpenCL and/or OpenMP support compiled in.

If you build the package yourself from source, make sure OpenCL/OpenMP are enabled. Do this by including the appropriate parameters in your 'configure' step:

./configure  [...other options...]  --enable-openmp  --enable-opencl

ImageMagick's documentation about OpenMP and OpenCL is here:

  • Parallel Execution With OpenMP. Read it carefully. Because OpenMP is not a silver bullet, and it does not work under all circumstances...
  • Parallel Execution With OpenCL. The same as above applies here. Additionally, not all ImageMagick operations are OpenCL-enabled. The link here has a list of those which are. -resize is one of them.

Hints and instructions to build ImageMagick from sources and configure the build, explaining various options, are here:

  • ImageMagick Advanced Unix Installation

This page also includes a short discussion of the --with-quantum-depth configure option.

3. Benchmark your ImageMagick

You can now also use the builtin -bench option to make ImageMagick run a benchmark for your command. For example:

convert logo: -resize 500% -bench 10 logo.png

  [....]
  Performance[4]: 10i 1.489ips 1.000e 6.420u 0:06.510

The above command with -resize 500% tells ImageMagick to run convert and scale the built-in logo: image by 500% in each direction. The -bench 10 part tells it to run that same command 10 times in a loop and then print the performance results:

  • Since I have OpenMP enabled, I have 4 threads (Performance[4]:).
  • It reports that it ran 10 iterations (10i).
  • The speed was nearly 1.5 iterations per second (1.489ips).
  • Total user time was 6.420 seconds (6.420u).

If your result includes Performance[1]:, and only one line, then your system does not have OpenMP enabled. (You may be able to switch it on if your build supports it: run convert -limit thread 2.)
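If you run -bench from scripts, the Performance line can be picked apart with a little awk. A sketch, assuming the output format shown above:

```shell
#!/bin/bash
# Extract thread count, iteration count and iterations-per-second from a
# `-bench` Performance line (format as in the sample output above).
perf='Performance[4]: 10i 1.489ips 1.000e 6.420u 0:06.510'

echo "$perf" | awk '{
    match($1, /\[[0-9]+\]/)                       # "[4]" in "Performance[4]:"
    threads = substr($1, RSTART + 1, RLENGTH - 2)
    sub(/i$/,   "", $2)                           # iterations, e.g. "10i"
    sub(/ips$/, "", $3)                           # speed, e.g. "1.489ips"
    printf "threads=%s iterations=%s ips=%s\n", threads, $2, $3
}'
```

This prints `threads=4 iterations=10 ips=1.489`, which you can then log or compare across runs.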

4. Tweak your ImageMagick's resource limits

Find out how your system's ImageMagick is set up regarding resource limits. Use this command:

identify -list resource
  File       Area     Memory     Map       Disk    Thread         Time
  --------------------------------------------------------------------
   384    8.590GB       4GiB    8GiB  unlimited         4    unlimited

The above shows my current system's settings (not the defaults -- I did tweak them in the past). The numbers are the maximum amount of each resource ImageMagick will use. You can use each of the keywords in the column headers to tune your system: use convert -limit <resource> <number> to set a new limit.

Maybe your result looks more like this:

identify -list resource
  File       Area     Memory     Map       Disk    Thread         Time
  --------------------------------------------------------------------
   192    4.295GB       2GiB    4GiB  unlimited         1    unlimited
  • The file limit defines the maximum number of concurrently opened files which ImageMagick can use.
  • The memory, map, area and disk resource limits are defined in bytes. To set them to different values you can use SI prefixes, e.g. 500MB.
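Instead of repeating -limit on every command, the same limits can be set once per shell via environment variables (a configuration sketch; the values here are only examples, pick your own):

```shell
# Environment-variable equivalents of `-limit`, picked up by all
# ImageMagick commands started from this shell. Example values only:
export MAGICK_MEMORY_LIMIT=2GiB
export MAGICK_MAP_LIMIT=4GiB
export MAGICK_DISK_LIMIT=8GiB
export MAGICK_THREAD_LIMIT=2
```

Afterwards, identify -list resource should report the new limits.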

When you do have OpenMP for ImageMagick on your system, you can run:

convert -limit thread 2

This enables 2 parallel threads as a first step. Then re-run the benchmark and see if it really makes a difference, and if so, how much. After that you could set the limit to 4 or even 8 and repeat the exercise...

5. Use Magick Pixel Cache (MPC) and/or Magick Persistent Registry (MPR)

Finally, you can experiment with a special internal format of ImageMagick's pixel cache. This format is called MPC (Magick Pixel Cache).

When an MPC is created, the processed input image is kept in RAM as an uncompressed raster. So basically, MPC is the native in-memory uncompressed format of ImageMagick, and saving it is simply a direct memory dump to disk. A read is a fast memory map from disk to memory as needed (similar to memory page swapping), and no image decoding is needed.

(More technical details: MPC as a format is not portable. It also isn't suitable as a long-term archive format. It is only suitable as an intermediate format for high-performance image processing. Note that it requires two files to represent one image.)

If you still want to save this format to disk, be aware of this:

  • Image attributes are written to a file with the extension .mpc.
  • Image pixels are written to a file with the extension .cache.

Its main advantage is experienced when...

  1. ...processing very large images, or when
  2. ...applying several operations to one and the same image in "operation pipelines".

MPC was designed especially for workflow patterns which match the criteria "read many times, write once".

Some people say that performance improves for such operations, but I have no personal experience with it.

Convert your base picture to MPC first:

convert input.jpeg input.mpc

and only then run:

convert input.mpc [...your long-long-long list of crops and operations...]

Then see if this saves you significantly on time.

Most likely you can use this MPC format even "inline" (using the special mpc: notation, see below).
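A minimal sketch of the two-step variant, using ImageMagick's built-in logo: image as a stand-in for your input so it runs without any files (the file names are only examples):

```shell
#!/bin/bash
# Materialize the pixel cache on disk once...
convert logo: input.mpc            # creates input.mpc plus input.cache

# ...then run the heavy processing against the cache; reading it back
# needs no decoding, only a memory map:
convert input.mpc -resize 50% output.jpg
```

The pay-off grows with the number of commands that re-read the same intermediate image.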

The MPR format (Magick Persistent Registry) does something similar. It reads the image into a named memory register. Your process pipeline can then read the image again from that register, should it need to access it multiple times. The image persists in the register until the current command pipeline exits.

But I've never applied this technique to a real world problem, so I can't say how it works out in real life.

6. Construct a suitable IM processing pipeline to do all tasks in one go

As you describe your process, it is composed of 4 distinct steps:

  1. Convert a TIFF to a JPEG.
  2. Resize the JPEG image to xx (?? what value ??)
  3. Crop the JPEG to 200px.
  4. Add a text watermark.

Please tell me if I understand your intentions correctly from reading your code snippets:

  • You have 1 input file, a TIFF.
  • You want 2 final output files:
    1. 1 thumbnail JPEG, sized 200x200 pixels;
    2. 1 labelled JPEG, with a width of 1024 pixels (height keeping the aspect ratio of the input TIFF).
  • 1 (unlabelled) JPEG is only an intermediate file which you do not really want to keep.

Basically, each step uses its own command -- 4 different commands in total. This can be sped up considerably by using a single command pipeline which performs all the steps on its own.

Moreover, you do not seem to really need to keep the unlabelled JPEG as an end result -- yet one of your commands generates it as an intermediate temporary file and saves it to disk. We can try to skip this step altogether and achieve the final result without this extra write to disk.

There are different approaches possible to this change. I'll show you (and other readers) only one for now -- and only for the CLI, not for PHP. I'm not a PHP guy -- it's your own job to 'translate' my CLI method into appropriate PHP calls.

(But by all means: please test with my commands first, really using the CLI, to see if the effort is worthwhile before translating the approach to PHP!)

But please first make sure that you really understand the architecture and structure of more complex ImageMagick command lines! For this, please refer to this other answer of mine:

  • ImageMagick Command-Line Option Order (and Categories of Command-Line Parameters)

Your 4 steps translate into the following individual ImageMagick commands:

convert image.tiff image.jpg

convert image.jpg -resize 1024x image-1024.jpg

convert image-1024.jpg -thumbnail 200x200 image-thumb.jpg

convert -background white image-1024.jpg label:12345 -append image-labelled.jpg

Now to transform this workflow into one single pipeline command... The following command does this, and it should execute faster (regardless of what your results are when following my steps 0.-4. above):

convert image.tiff                                                             \
 -respect-parentheses                                                          \
 +write mpr:XY                                                                 \
  \( mpr:XY -resize 1024x      +write image-1024.jpg \)     \
  \( mpr:XY -thumbnail 200x200 +write image-thumb.jpg \)    \
  \( mpr:XY -resize 1024x -background white label:12345     \
            -append            +write image-labelled.jpg \) \
  null:

Explanations:

  • -respect-parentheses : required so that the sub-commands executed inside the \( .... \) parentheses are really independent from each other.
  • +write mpr:XY : used to write the input file to an MPR memory register. XY is just a label (you can use anything), needed to later re-call the same image.
  • +write image-1024.jpg : writes result of subcommand executed inside the first parentheses pair to disk.
  • +write image-thumb.jpg : writes result of subcommand executed inside the second parentheses pair to disk.
  • +write image-labelled.jpg : writes result of subcommand executed inside the third parentheses pair to disk.
  • null: : terminates the command pipeline. Required because otherwise the command would end with the last subcommand's closing parenthesis.
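Before translating anything to PHP, you can dry-run the pattern anywhere with ImageMagick's built-in logo: image. This sketch leaves out the label: step (which needs fonts configured) and uses hypothetical output names:

```shell
#!/bin/bash
# Same mpr: pattern as above, exercised against the built-in logo: image.
convert logo:                                              \
 -respect-parentheses                                      \
 +write mpr:XY                                             \
  \( mpr:XY -resize 1024x      +write demo-1024.jpg  \)    \
  \( mpr:XY -thumbnail 200x200 +write demo-thumb.jpg \)    \
  null:
```

If this produces demo-1024.jpg and demo-thumb.jpg, your build handles the parenthesized side processing and the mpr: register correctly.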

7. Benchmarking 4 individual commands vs. the single pipeline

In order to get a rough feeling for my suggestion, I ran the commands below.

The first one runs the sequence of the 4 individual commands 100 times (and saves all resulting images under different file names).

time for i in $(seq -w 1 100); do
   convert image.tiff                                                          \
                                               image-indiv-run-${i}.jpg
   convert image-indiv-run-${i}.jpg -sample 1024x                              \
                                               image-1024-indiv-run-${i}.jpg
   convert image-1024-indiv-run-${i}.jpg -thumbnail 200x200                    \
                                               image-thumb-indiv-run-${i}.jpg
   convert -background white image-1024-indiv-run-${i}.jpg label:12345 -append \
                                               image-labelled-indiv-run-${i}.jpg
   echo "DONE: run indiv $i ..."
done

My result for 4 individual commands (repeated 100 times!) is this:

real  0m49.165s
user  0m39.004s
sys   0m6.661s

The second command times the single pipeline:

time for i in $(seq -w 1 100); do
    convert image.tiff                                        \
     -respect-parentheses                                     \
     +write mpr:XY                                            \
      \( mpr:XY -resize 1024x                                 \
                +write image-1024-pipel-run-${i}.jpg     \)   \
      \( mpr:XY -thumbnail 200x200                            \
                +write image-thumb-pipel-run-${i}.jpg    \)   \
      \( mpr:XY -resize 1024x                                 \
                -background white label:12345 -append         \
                +write image-labelled-pipel-run-${i}.jpg \)   \
     null:
   echo "DONE: run pipeline $i ..."
done

The result for single pipeline (repeated 100 times!) is this:

real   0m29.128s
user   0m28.450s
sys    0m2.897s

As you can see, the single pipeline is about 40% faster than the 4 individual commands!
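The "about 40%" can be checked directly from the two real timings above with a one-liner:

```shell
#!/bin/bash
# Relative speed-up of the single pipeline over the 4 individual commands,
# using the two `real` timings measured above.
t_indiv=49.165   # 100x four individual commands
t_pipe=29.128    # 100x single pipeline

awk -v a="$t_indiv" -v b="$t_pipe" \
    'BEGIN { printf "pipeline is %.1f%% faster\n", (a - b) / a * 100 }'
# prints: pipeline is 40.8% faster
```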

Now you can also invest in multi-CPU, much RAM, fast SSD hardware to speed things up even more :-)

But first translate this CLI approach into PHP code...


There are a few more things to be said about this topic. But my time runs out for now. I'll probably return to this answer in a few days and update it some more...


Update: I had to update this answer with new numbers for the benchmarking: initially I had forgotten to include the -resize 1024x operation (stupid me!) into the pipelined version. Having included it, the performance gain is still there, but not as big any more.


8. Use -clone 0 to copy image within memory

Here is another alternative to try instead of the mpr: approach with a named memory register as suggested above.

It uses (again within 'side processing inside parentheses') the -clone 0 operation. This works as follows:

  1. convert reads the input TIFF from disk once and loads it into memory.
  2. Each -clone 0 operator makes a copy of the first loaded image (because it has index 0 in the current image stack).
  3. Each "within-parenthesis" sub-pipeline of the total command pipeline performs some operation on the clone.
  4. Each +write operation saves the respective result to disk.

So here is the command to benchmark this:

time for i in $(seq -w 1 100); do
    convert image.tiff                                         \
     -respect-parentheses                                      \
      \( -clone 0 -thumbnail 200x200                           \
                  +write image-thumb-pipel-run-${i}.jpg    \)  \
      \( -clone 0 -resize 1024x                                \
                  -background white label:12345 -append        \
                  +write image-labelled-pipel-run-${i}.jpg \)  \
     null:
   echo "DONE: run pipeline $i ..."
done

My result:

real   0m19.432s
user   0m18.214s
sys    0m1.897s

To my surprise, this is faster than the version which used mpr: !

9. Use -scale or -sample instead of -resize

This alternative will most likely speed up your resizing sub-operation. But it will likely lead to somewhat worse image quality (you'll have to verify whether the difference is noticeable).

For some background info about the difference between -resize, -sample and -scale see the following answer:

  • What is the difference between sample/resample/scale/resize/adaptive-resize/thumbnail in ImageMagick convert?

I tried it too:

time for i in $(seq -w 1 100); do
    convert image.tiff                                         \
     -respect-parentheses                                      \
      \( -clone 0 -thumbnail 200x200                           \
                  +write image-thumb-pipel-run-${i}.jpg    \)  \
      \( -clone 0 -scale 1024x                                 \
                  -background white label:12345 -append        \
                  +write image-labelled-pipel-run-${i}.jpg \)  \
     null:
   echo "DONE: run pipeline $i ..."
done

My result:

real   0m16.551s
user   0m16.124s
sys    0m1.567s

This is the fastest result so far (I combined it with the -clone variant).

Of course, this modification can also be applied to your initial method running 4 different commands.
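If you want to quantify the quality difference on your own images, ImageMagick's compare tool can give you a pixel-by-pixel metric. A sketch using the built-in logo: image (file names are only examples; the RMSE number is what to watch -- the smaller, the closer the two results are):

```shell
#!/bin/bash
# Produce the same resize with two different operators, then measure
# their pixel-by-pixel difference (root mean squared error).
convert logo: -resize 1024x resized.png
convert logo: -scale  1024x scaled.png

# compare prints the metric on stderr and exits non-zero when the
# images differ, hence the redirect and the `|| true`:
compare -metric RMSE resized.png scaled.png null: 2>&1 || true
echo
```

Run the same comparison on one of your real TIFFs to decide whether the quality loss is acceptable.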

10. Emulate the Q8 build by adding -depth 8 to the commands

I did not actually run and measure this, but the complete command would be:

time for i in $(seq -w 1 100); do
    convert image.tiff                                            \
     -respect-parentheses                                         \
      \( -clone 0 -thumbnail 200x200 -depth 8                     \
                  +write d08-image-thumb-pipel-run-${i}.jpg    \) \
      \( -clone 0 -scale 1024x       -depth 8                     \
                  -background white label:12345 -append           \
                  +write d08-image-labelled-pipel-run-${i}.jpg \) \
     null:
   echo "DONE: run pipeline $i ..."
done

This modification is also applicable to your initial "I run 4 different commands"-method.

11. Combine it with GNU parallel, as suggested by Mark Setchell

This is of course only applicable and reasonable for you if your overall work process allows for such parallelization.

For my little benchmark testing it is applicable. For your web service, it may be that you only ever have one job at a time...

time for i in $(seq -w 1 100); do                                 \
    cat <<EOF
    convert image.tiff                                            \
      \( -clone 0 -scale  1024x         -depth 8                  \
                  -background white label:12345 -append           \
                  +write d08-image-labelled-pipel-run-${i}.jpg \) \
      \( -clone 0 -thumbnail 200x200  -depth 8                    \
                  +write d08-image-thumb-pipel-run-${i}.jpg   \)  \
       null:
    echo "DONE: run pipeline $i ..."
EOF
done | parallel --will-cite

Results:

real  0m6.806s
user  0m37.582s
sys   0m6.642s

The apparent contradiction between user and real time can be explained: the user time is the sum of all time ticks which were clocked on 8 different CPU cores.

From the point of view of a user looking at their watch, it was much faster: less than 10 seconds.
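As a rough sanity check of that explanation: dividing the total CPU time by the wall-clock time gives the effective parallelism (figures taken from the timings above):

```shell
#!/bin/bash
# Effective parallelism = total CPU time / wall-clock time,
# using the `time` output from the GNU parallel run above.
user=37.582; sys=6.642; real=6.806

awk -v u="$user" -v s="$sys" -v r="$real" \
    'BEGIN { printf "effective parallelism: %.1fx\n", (u + s) / r }'
# prints: effective parallelism: 6.5x
```

Roughly 6.5 of the 8 cores were kept busy, which is about as good as it gets once I/O is involved.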

12. Summary

Pick your own preferences -- combine different methods:

  1. Some speedup can be gained (with identical image quality as currently) by constructing a more clever command pipeline. Avoid running several commands (where each convert leads to a new process, and has to read its input from disk). Pack all image manipulations into one single process. Make use of the "parenthesized side processing". Make use of -clone or mpr: or mpc:, or even combine these.

  2. Additional speedups can be gained by trading image quality for performance. Some of your choices are:

    1. -depth 8 (has to be declared on the OP's system) vs. -depth 16 (the default on the OP's system)
    2. -resize 1024x vs. -sample 1024x vs. -scale 1024x
  3. Make use of GNU parallel if your workflow permits this.


As always, @KurtPfeifle has provided an excellently reasoned and explained answer, and everything he says is solid advice which you would do well to listen to and follow carefully.

There is a bit more that can be done, though, and it is more than I can add as a comment, so I am putting it in another answer -- though it is only an enhancement of Kurt's...

I do not know what size of input image Kurt used, so I made one of 3000x2000 pixels and compared my run times with his to see if our setups were comparable, since we have different hardware. The individual commands ran in 42 seconds on my machine and the pipelined ones in 36 seconds, so I guess my image size and hardware are broadly similar.

I then used GNU Parallel to run the jobs in parallel - I think you will get a lot of benefit from that on a Xeon. Here is what I did...

time for i in $(seq -w 1 100); do
    cat <<EOF
    convert image.tiff                                        \
     -respect-parentheses                                     \
     +write mpr:XY                                            \
      \( mpr:XY -resize 1024x                                 \
                +write image-1024-pipel-run-${i}.jpg     \)   \
      \( mpr:XY -thumbnail 200x200                            \
                +write image-thumb-pipel-run-${i}.jpg    \)   \
      \( mpr:XY -background white label:12345 -append         \
                +write image-labelled-pipel-run-${i}.jpg \)   \
     null:
   echo "DONE: run pipeline $i ..."
EOF
done | parallel

As you can see, all I did was echo the commands that need running onto stdout and piped them into GNU Parallel. Run that way, it takes just 10 seconds on my machine.

I also had a try at imitating the functionality using ffmpeg, and came up with this, which seems pretty similar on my test images - your mileage may vary.

#!/bin/bash
for i in $(seq -w 1 100); do
    echo ffmpeg -y -loglevel panic -i image.tif ff-$i.jpg 
    echo ffmpeg -y -loglevel panic -i image.tif -vf scale=1024:682 ff-$i-1024.jpg
    echo ffmpeg -y -loglevel panic -i image.tif -vf scale=200:200 ff-$i-200.jpg
done | parallel

That runs in 7 seconds on my iMac with a 3000x2000 image.tif input file.

I failed miserably to get libjpeg-turbo installed with ImageMagick under homebrew.