Mathematica Parallelization on HPC

What you need to launch subkernels across several nodes of an HPC cluster is the following:

  1. Figure out how to request several compute nodes for the same job
  2. Find the names of the nodes that have been allocated for your job
  3. Find out how to launch subkernels on these nodes from within the main kernel

All of these depend on the grid engine your cluster uses, as well as on your local setup; you'll need to check the grid engine's documentation and ask your administrator about the details. I have an example for our local setup (complete with a jobfile), which might be helpful for you to study:

https://bitbucket.org/szhorvat/crc/src

Our cluster uses the Sun Grid Engine. The names of the nodes (and information about them) are listed in a "hostfile" which you can find by retrieving the value of the PE_HOSTFILE environment variable. (I think this works the same way with PBS, except the environment variable is called something else.)
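On SGE the hostfile has one line per allocated host, roughly of the form "hostname slots queue processor-range". Here is a minimal sketch of extracting the host names and slot counts; it uses a fake hostfile for illustration (node names are made up), since the real one only exists inside a running job, where you would read "$PE_HOSTFILE" instead:

```shell
# Create a fake SGE-style hostfile for illustration; in a real job,
# use "$PE_HOSTFILE" instead of this temporary file.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
node17 4 all.q@node17 <NULL>
node23 4 all.q@node23 <NULL>
EOF

# Column 1 is the host name, column 2 the number of slots on that host.
awk '{ print $1, $2 }' "$hostfile"
# prints:
#   node17 4
#   node23 4

rm -f "$hostfile"
```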

Note that if you request multiple nodes in a single job file, the job script will be run on only one of the nodes, and you'll be launching the processes across all nodes manually (at least on SGE and PBS).

Launching processes on different nodes is usually possible with ssh: run ssh nodename command to run command on nodename. You may also need to set up passphraseless (key-based) authentication if it is not set up by default. To launch subkernels, you'll need to pass the -f option to ssh so that it returns immediately after it has launched the remote process.
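As a sketch, this can be done from within the main kernel using the SubKernels`RemoteKernels` package. The node names below are hypothetical, and the exact kernel command and its path depend on your installation, so treat this as a template rather than a ready-made recipe:

```mathematica
(* Sketch: launch subkernels over ssh from the main kernel, assuming
   "math" is on the PATH of the remote nodes. Node names are made up. *)
Needs["SubKernels`RemoteKernels`"]

(* In the template, `1` is the host, `2` the link name, `3` the user name,
   `4` the link protocol options; -f makes ssh return immediately after
   starting the remote kernel. *)
$RemoteCommand =
  "ssh -x -f -l `3` `1` math -mathlink -linkmode Connect `4` -linkname '`2`' -subkernel";

(* Launch 4 subkernels on each of two (hypothetical) allocated nodes. *)
LaunchKernels[RemoteMachine["node17", 4]]
LaunchKernels[RemoteMachine["node23", 4]]
```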

Some setups use rsh instead of ssh. To launch a command in the background using rsh, you'll need to do

rsh -n nodename "command >& /dev/null &"

To run the remote process in the background, it is important to redirect the output (both stdout and stderr), because there's a bug in rsh (also described in its man page) that prevents it from returning immediately otherwise.

Another thing to keep in mind about rsh is that you can't rsh to the local machine, so subkernels that run on the same machine as the main kernel must be launched without rsh.
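The local-versus-remote dispatch can be sketched as follows; the node names are made up, and echo stands in for the real launch commands so the sketch runs anywhere:

```shell
# Sketch: start one subkernel launcher per allocated node, handling the
# local node without rsh. "echo" stands in for the real commands.
this_node=$(hostname)
nodes="node17 node23 $this_node"    # pretend this came from the hostfile

for node in $nodes; do
  if [ "$node" = "$this_node" ]; then
    # Local node: start the subkernel directly in the background (no rsh).
    echo "starting subkernel locally on $node"
  else
    # Remote node: the real command would be
    #   rsh -n "$node" "math ... >& /dev/null &"
    echo "rsh -n $node ..."
  fi
done
```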

See my example for details.


Update

The node names allocated to a job can be accessed through environment variables such as PBS_NODEFILE and HOSTNAME, so launching subkernels on the correct nodes can be automated.
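A sketch of such automation, assuming HOSTNAME names the node running the main kernel and that its value matches the names listed in the PBS node file (conventions vary between clusters, so check yours):

```mathematica
(* Sketch: read the allocated node names from the PBS node file and
   launch subkernels automatically. *)
Needs["SubKernels`RemoteKernels`"]

nodes = ReadList[Environment["PBS_NODEFILE"], String]; (* one line per core *)
this  = Environment["HOSTNAME"];

(* Remote nodes: Tally gives {node, coreCount} pairs; launch one
   RemoteMachine spec per node that is not the local one. *)
LaunchKernels[RemoteMachine[#1, #2]] & @@@
  Select[Tally[nodes], First[#] =!= this &];

(* Local node: plain local subkernels, no ssh/rsh needed. *)
LaunchKernels[Count[nodes, this]]
```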


I also run subkernels from a main kernel on an HPC cluster. I usually submit an interactive job on the HPC, run Mathematica kernels inside it, and then connect back to the front end on my laptop. My waiting time in the queue for an interactive job is very short, so it is convenient for me to work interactively. Here is how I do it; your setup may not be the same, but I hope it helps.

Submit an interactive job

qsub -V -I -l walltime=01:00:00,nodes=2:ppn=16 -A hpc_atistartup

It will return something like this:

qsub: waiting for job 48488.mike3 to start
qsub: job 48488.mike3 ready

--------------------------------------
Running PBS prologue script
--------------------------------------
PBS has allocated the following nodes:

mike054
mike067

A total of 32 processors on 2 nodes allocated
---------------------------------------------
Check nodes and clean them of stray processes
---------------------------------------------
Checking node mike054 15:43:46 
Checking node mike067 15:43:48 
Done clearing all the allocated nodes
------------------------------------------------------
Concluding PBS prologue script - 01-Sep-2013 15:43:48
------------------------------------------------------
[aaa@mike054 ~]$ 

We can see I get nodes mike054 and mike067, and the shell is on node mike054.

Start remote master kernel

From the menu of the local front end (my laptop), go to Evaluation ==> Kernel Configuration Options and add a remote kernel; here I added one called superMike. Select "Advanced Options" and fill it with "-LinkMode Listen -LinkProtocol TCPIP".

[screenshot: the remote kernel configuration dialog]

Then evaluate a command in a notebook, for example $Version. A window like this will pop up:

[screenshot: the connection dialog showing the port and IP address]

Your port and IP address will differ from mine.

With this pop-up window open, go to the shell we just got on the HPC and run the command math to launch command-line Mathematica. At the Mathematica prompt, enter the following, substituting the two port@ip pairs shown in the pop-up window:

$ParentLink = LinkConnect["port1@ip,port2@ip", LinkProtocol->"TCPIP"]

and hit Enter. Then click the "OK" button of that pop-up window. If the connection succeeds, a message window pops up with

Out[1]= LinkObject[port1@ip,port2@ip, 59, 2]

and the $Version command should return the results:

[screenshot: $Version output from the remote kernel]

For details of the remote kernel connection, see the post here.

Start subKernels

Open the Remote Kernels tab in Evaluation ==> Parallel Kernel Configuration and click "Add Host" to add the other nodes we got in the interactive job. In this case I got nodes mike054 and mike067, and my shell is on node mike054, so I will add mike067 by filling in the hostname, setting the number of kernels, and checking "Enable".

[screenshot: the Remote Kernels configuration tab]

After that we can go to Evaluation ==> Parallel Kernel Status and check whether the subkernels are working. If everything went well, we can see something like this:

[screenshot: Parallel Kernel Status showing the running subkernels]

We can see that we've launched 16 subkernels on node mike054 and 16 subkernels on node mike067.

Hope it will help.