Parallelization works, but does not use all CPU power

This is normal. With Intel CPUs that support HyperThreading, Mathematica will launch only as many kernels as there are physical cores. The number of logical cores is typically twice the number of physical cores, so your operating system ends up reporting 50% CPU usage. You can manually launch more parallel kernels (if you have the license for it, see LaunchKernels), but these will either give only a very small speedup or none at all (at worst, they'll slow the calculation down). They won't give you a 2x speedup.

I am less familiar with AMD CPUs, but I believe some of them have a similar simultaneous-multithreading feature, and AMD typically advertises the number of logical cores, not physical ones. Check how many physical cores you actually have.
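
If you want a quick check from within Mathematica, the following is a minimal sketch (note that $ProcessorCount may report logical rather than physical cores, depending on the system, so compare it against your CPU's specifications):

    (* the core count Mathematica detects; on many HyperThreading
       systems this is the logical-core count, not the physical one *)
    $ProcessorCount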


Before Mathematica 10, the default number of kernels launched was the same as the number of logical cores. Launching one kernel per logical core may improve performance slightly, depending on the specific application, but it has enough disadvantages that I don't think it was a good default choice:

  • The performance increase is often very small.
  • It can potentially reduce performance because Mathematica's parallelization has relatively high overhead.
  • The amount of memory taken is proportional to the number of subkernels, so too many kernels may result in running out of memory sooner. In practice this causes the OS to start to swap, which may practically lock up the machine. (Yes, this happened to me because I launched too many kernels, trying to squeeze out that last bit of performance.)
  • Using all your cores at 100% may affect the responsiveness of the computer. This may not be worth a very small performance increase. You may want to do something else while waiting for a long parallel calculation to finish.

This doesn't mean that you shouldn't use all of your logical cores with Mathematica's parallel tools. It just means that doing so automatically is not a good default.

You can always choose to launch more kernels manually using LaunchKernels, or you can change the default number of subkernels in Preferences -> Parallel -> Local Kernels.
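
For example, a minimal sketch (the kernel count here is just an illustration, and is subject to your license limit):

    CloseKernels[];      (* shut down any currently running subkernels *)
    LaunchKernels[8];    (* relaunch with an explicit count of your choosing *)
    $KernelCount         (* confirm how many subkernels are now running *)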

Experiment and find out if it's worth doing so for your application.


EDIT: 3 July 2016 - added Test Case and Timings

On a Mac Pro (with 6 physical cores and 12 logical cores):

  • Under Mma 9: the default Mma behaviour is to launch all 12 kernels to give full use of the computer's capability

  • Under Mma 10: the default Mma behaviour is to artificially restrict Mma to use only 6 of the 12 cores

User szabolcs asserts above that this is for the best since launching all kernels:

"will either give only a very small speedup or none at all" --- [ szabolcs ]

So let us test it out ...

Simple test code

I came across a very nice question on the Stats SE site which asks:

how many rolls of a die are needed until each side has appeared 3 times

See: https://stats.stackexchange.com/questions/211967/expected-number-of-times-to-roll-a-die-until-each-side-has-appeared-3-times/

Of course, the answer to this question is not an exact number -- rather, the number of rolls is a random variable with a distribution that needs to be found. And we can find the exact distribution, as a function of the number of rolls, with just a couple of lines of Mma code:

    (* exact P[every face has appeared at least 3 times within n rolls],
       computed in parallel for n = 18, ..., 60 *)
    cdf = ParallelTable[
       Probability[
         x1 >= 3 && x2 >= 3 && x3 >= 3 && x4 >= 3 && x5 >= 3 && x6 >= 3,
         {x1, x2, x3, x4, x5, x6} \[Distributed]
           MultinomialDistribution[n, Table[1/6, 6]]],
       {n, 18, 60}] // AbsoluteTiming

Solving the above is computationally intensive, and so provides a nice simple test of szabolcs's assertion that artificially restricting Mma to half your available kernels doesn't matter. And here are the results ...

AbsoluteTiming Results -- running Mma 10.4.1

  • Mac Pro - Full Power (12 cores): 132 seconds
  • Mac Pro - Restricted (new default): 234 seconds

In summary: the new default behaviour that artificially restricts the number of kernels to only run on half the available cores slows down a Mac Pro's performance from 132 seconds to 234 seconds ... a 77% increase in time taken.
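
If you want to reproduce the comparison on your own machine, a minimal harness along these lines should work (a sketch, assuming your license allows 12 subkernels; timeWithKernels is just an illustrative helper name):

    (* time the same computation under an explicit number of subkernels *)
    p[n_] := Probability[
       x1 >= 3 && x2 >= 3 && x3 >= 3 && x4 >= 3 && x5 >= 3 && x6 >= 3,
       {x1, x2, x3, x4, x5, x6} \[Distributed]
         MultinomialDistribution[n, Table[1/6, 6]]];

    timeWithKernels[k_Integer] := (
       CloseKernels[];              (* stop whatever is running *)
       LaunchKernels[k];            (* launch exactly k subkernels *)
       DistributeDefinitions[p];    (* push p to the subkernels *)
       First@AbsoluteTiming[ParallelTable[p[n], {n, 18, 60}]]);

    timeWithKernels /@ {6, 12}      (* seconds with 6 vs 12 kernels *)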

Given that people who specially purchase multi-core machines presumably do so with the intention of solving parallel problems, artificially numbing the power of the machine doesn't make much sense to me. The above test case illustrates that the benefits are very real and substantive, given problems that can be solved in parallel.

Original Answer

This issue is discussed more fully here:

Mma 10: Half the parallel power (Macs)?

v9 vs v10

  • Under Mma 9, Mathematica automatically takes advantage of your computer's full core capability (e.g. on a Mac Pro with 6 physical cores, it automatically uses all 12 logical cores ... not just the 6 physical ones).

  • Under Mma 10, by default, Mma only uses as many kernels as physical cores, effectively leaving up to half of your potential processing power unused. I suspect the change was made to limit the RAM demands of Mathematica on entry-level systems (even if this unfortunately downgraded the performance of Mma itself, especially for power users).

How to get full performance from your computer

Is there something I can do?

Yes: if you go to:

  • Evaluation Menu -> Parallel Kernel Configuration

... the automatic setting for:

  • "Number of kernels to use" is set to Automatic (which Mma sets to 6 on my Mac Pro)

Change this to:

  • Manual setting

and set it to 12 (or whatever your machine's full capacity is) ... then it will use 12 (subject to licensing - many licenses are restricted to 8).

Performance

Szabolcs wrote:

You can manually launch more parallel kernels (if you have the license for it, see LaunchKernels), but these will either give only a very small speedup or none at all (at worst, they'll slow the calculation down).

I haven't seen any performance tests that would support this conclusion. In fact, the performance tests I have seen suggest quite the opposite: namely that the restriction to just the physical cores has a substantial performance downgrade for problems that would benefit from more cores.

We ran the full mathStatica (primarily symbolic) benchmark suite on a Mac Pro with 6 physical cores and compared:

  • the artificially reduced half-power default setting (6 kernels) - Mac Pro v10 default
  • full-use of your computer's core capability (12 kernels) - Mac Pro v10 manual setting

This suite is not specially designed to test for parallelism - it is just a collection of real-world problems, and provides a measure of real-world performance. Of the benchmark test problems that can make use of more than 6 cores, all show significant slowdowns as a result of artificially limiting the number of cores that Mma can run on:

$$\begin{array}{|c|c|c|} \hline & \text{ Full power (12 cores)} & \text{Restricted (new default)} \\ \hline \text{Varcov matrix} & 7.7 \text{ seconds} & 10.1 \text{ seconds} \\ \hline \text{Multivariate probability} & 15.9 \text{ seconds} & 19.5 \text{ seconds} \\ \hline \text{QQ Plot} & 2.4 \text{ seconds} & 2.9 \text{ seconds} \\ \hline \text{Kernel density} & 5.7 \text{ seconds} & 8.5 \text{ seconds} \\ \hline \end{array}$$

Each example now takes 20% to 50% longer to compute, on the same machine, using the same version of Mathematica: just slower because the new default cripples the machine's power. And these examples are not even optimised for 12 cores. Better gains should be expected with data-based examples: it is easy to map (or should I say ParallelMap) a function over a data set.
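
As a hedged illustration of that last point (the expensive function below is just a stand-in for real per-element work):

    (* map a costly function over a data set in parallel; ParallelMap
       distributes the work, and the definition, to the subkernels *)
    expensive[x_] := Length@Select[Range[x, x + 10^4], PrimeQ];
    data = RandomInteger[{10^6, 10^7}, 48];
    ParallelMap[expensive, data] // AbsoluteTiming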

The test results fall into two categories:

  • For problems that have more than 6 separate components: using 12 kernels is ALWAYS unambiguously faster, and significantly so.

  • For problems that have 6 or fewer separate components: the 6-kernel automatic setting is sometimes marginally faster than the 12-kernel case (presumably due to parallel overhead), but the difference is tiny and essentially unnoticeable.

In summary: for problems that CAN benefit from more than 6 kernels, the default Mma 10 (automatic) setting of restricting Mma to run on half the available cores on a Mac Pro appears to be sub-optimal, and fails to take advantage of the full capability of the machine. This problem is new to v10, and did not occur under v9. Manual overrides do exist.

A better approach

My view is that people make a purposeful choice to purchase multi-core machines: they spend considerable extra money on those cores because they care about performance. To artificially deny them the performance gains they have paid for, and to intentionally cripple that performance by design, makes little sense to me. There are, of course, users who don't care about performance, and they always have the option of not running in parallel, or of restricting the number of kernels to fit their available RAM.

It seems that what is needed is a more sophisticated approach to managing this problem: one that looks at available RAM, the number of cores, and so on, and suggests a number of kernels that is optimal for your system. The optimal setting will differ between a user with 64 GB of RAM and one with 8 GB, on otherwise identical machines.
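
A rough sketch of the kind of heuristic I have in mind (the 2 GB per-kernel figure is an assumed placeholder, not a measured value, and MemoryAvailable requires a recent Mathematica version):

    (* cap the suggested kernel count by both core count and free RAM *)
    suggestedKernels[] := Min[
       $ProcessorCount,
       Max[1, Floor[MemoryAvailable[]/(2*10^9)]]  (* assume ~2 GB/kernel *)
      ]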


I am not a Mathematica user but can talk about the rest. "Logical cores" are virtual. They are a way for the CPU to squeeze some performance out of otherwise inefficient parallel workloads AND to make better use of its internal facilities. Each physical core has multiple separate integer and floating-point units. Running just one thread per core allows some parallel use of these units, but relies on run-time instruction reordering, which is limited by instruction dependencies (if A depends on B, then A has to run after B and cannot run in parallel with it).

If a single core "pretends" to be two, it has more chances of finding instructions it can run in parallel (say, one integer and one floating-point) from two different threads. But the two threads still share resources, including caches and internal buses, and if both want to run the same kind of instruction, they compete for the same unit. So the speedup from HyperThreading is opportunistic and typically modest, not a doubling of compute power.