Submitting Parallel Jobs on a Cluster

Recently, I’ve been running simulations on our school’s computing cluster (JHPCE), which schedules jobs using the Open Grid Engine. Each simulation takes about half a day to complete, so I could run them sequentially and wait a week to get 14 simulation points. Or I could run them in parallel and get 14 simulation points in less than a day!

In theory, running my simulations in parallel should be a very straightforward task. It’s an “embarrassingly parallel” problem – every simulation I need to run uses the same function, with changes only to the input parameters, and they can all be run independently. However, since this was something I’d never done before, it took a while to figure out.¹

Luckily, I found Tianchen’s GitHub repository, which briefly describes how to do this. I don’t know if it’s really the optimal method, but it’s simple and works perfectly well for my needs. In the interest of sharing some useful knowledge and documenting it for my own future use, I will walk through the steps of submitting parallel jobs on the cluster.

1. First Things First

I am going to assume that you are already familiar with how to submit a job on the cluster. If not, go read the tutorial. The important thing is that you need a batch job script (.sh) that calls your Python (.py) or R (.R) script, or a script in whatever language you are using. You then submit the batch job script with the qsub command.

2. Submit A Job with Multiple Tasks

To run things in parallel, we are going to submit a job consisting of multiple tasks. Each task will correspond to a different simulation. To do this, type the following in the terminal:

qsub -t 1-10 your_script.sh

Here, we are submitting one job with 10 tasks. Each task is assigned an ID from 1 to 10, stored in the SGE_TASK_ID environment variable.²

3. Use the Task ID to Set Parameters

In order to change the parameters for each simulation, you need to be able to “read in” the SGE_TASK_ID variable. In Python, you can do this with the following code chunk:

import os
task_id = int(os.getenv("SGE_TASK_ID"))
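
One thing to watch for: the one-liner above raises an error when the script is run outside an array job (for example, while testing on your laptop), because the variable is then missing. A minimal sketch of a more forgiving reader is shown here. The helper name get_task_id and its default argument are my own illustration, and the "undefined" value for non-array jobs is what I understand Grid Engine to set, so treat that as an assumption.

```python
import os


def get_task_id(default=1):
    """Return the array-task ID as an int, or `default` outside an array job.

    Hypothetical helper, not part of any library.
    """
    raw = os.getenv("SGE_TASK_ID")
    # Unset when run outside the scheduler; assumed to be the string
    # "undefined" when the job was submitted without -t.
    if raw is None or not raw.isdigit():
        return default
    return int(raw)


task_id = get_task_id()
print(f"Running task {task_id}")
```

This way the same script can be tested locally (it simply behaves like task 1) and submitted to the cluster unchanged.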

Then once you have the task ID, you can easily set the parameters of your simulation as needed. For example:

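import numpy as np

# Parameter options (one per task, so this example expects task IDs 1 to 3)
parameters_arr = np.array([50, 100, 200])
# Select parameter (task IDs start at 1, Python indices at 0)
parameter = parameters_arr[task_id - 1]

# Run simulation (runSimulation is your own simulation function)
runSimulation(parameter=parameter)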
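
If your simulation varies more than one parameter, the same idea extends to a grid: list every combination once, then let the task ID pick one row. The sketch below assumes a made-up grid of sample sizes and noise levels; the names sample_sizes, noise_levels, and params_for_task are hypothetical and only there for illustration.

```python
import itertools
import os

# Hypothetical parameter grid: 3 sample sizes x 2 noise levels = 6 combinations
sample_sizes = [50, 100, 200]
noise_levels = [0.1, 0.5]
grid = list(itertools.product(sample_sizes, noise_levels))


def params_for_task(task_id):
    """Map a 1-based task ID to one (sample_size, noise_level) pair."""
    if not 1 <= task_id <= len(grid):
        raise ValueError(f"task ID {task_id} is outside 1..{len(grid)}")
    return grid[task_id - 1]


# The default of "1" is only for local testing outside the scheduler
task_id = int(os.getenv("SGE_TASK_ID", "1"))
sample_size, noise_level = params_for_task(task_id)
print(f"task {task_id}: sample_size={sample_size}, noise_level={noise_level}")
```

With this grid the job needs exactly six tasks. The explicit range check makes a mismatch between the number of tasks and the number of combinations fail loudly instead of silently picking the wrong parameters.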

Similarly, if you are using R, you can read in the task ID and change the parameters of the simulation.³

# Get task ID
task_id <- as.integer(Sys.getenv("SGE_TASK_ID")) 
# Parameter options
parameters <- c(50, 100, 200)                         
# Select parameter
parameter <- parameters[task_id]                  

# Run simulation
runSimulation(parameter = parameter)

This allows you to run simulations with different parameters all at the same time. You can similarly use the task ID to save your simulation results with an informative file name that tells you which simulation/task was run.
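
As a rough sketch of that last point, here is one way to build such a file name in Python. The results directory, the sim_taskNNN_paramX.json naming pattern, and the helper names are all my own assumptions, so adapt them to whatever your simulation actually returns.

```python
import json
from pathlib import Path


def result_path(task_id, parameter, out_dir="results"):
    """Build a file name that records both the task ID and the parameter."""
    return Path(out_dir) / f"sim_task{task_id:03d}_param{parameter}.json"


def save_result(result, task_id, parameter, out_dir="results"):
    """Write one task's result to its own file and return the path."""
    path = result_path(task_id, parameter, out_dir)
    # Tasks may start in any order, so each one creates the directory if needed
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump({"task_id": task_id, "parameter": parameter, "result": result}, f)
    return path
```

For example, save_result(0.42, task_id=2, parameter=100) would write results/sim_task002_param100.json. Because every task writes to its own file, tasks running at the same time never overwrite each other's output.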

Final thoughts:

This only works for “embarrassingly parallel” problems, where each task can be run independently without needing any information from the other tasks.
Your computing cluster likely has restrictions on the number of scripts you can submit at once. Be careful not to overload the cluster with too many tasks, especially if they are memory-intensive and/or time-intensive. The maximum number of queue slots for the JHPCE cluster is ~200.
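
If you have more simulations than the cluster will let you submit as tasks, one possible workaround (not something I have needed myself) is to let each task run a small batch of simulations. The sketch below splits the simulation indices into contiguous chunks; the function name simulations_for_task is hypothetical.

```python
import math


def simulations_for_task(task_id, n_simulations, n_tasks):
    """Return the 0-based simulation indices handled by a 1-based task ID."""
    chunk = math.ceil(n_simulations / n_tasks)  # simulations per task, rounded up
    start = (task_id - 1) * chunk
    stop = min(start + chunk, n_simulations)    # the last chunk may be shorter
    return list(range(start, stop))
```

For instance, 14 simulations spread over 4 tasks gives chunks of 4, 4, 4, and 2. The trade-off is that each task now takes several times longer, so this only makes sense when the task limit, not the run time, is the bottleneck.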

¹ I also spent a lot of time figuring out how to run a Python script on the cluster at all. In case you are facing the same problem: the default Python version did not have scipy, so I first had to load a different version in my .sh file (i.e. module load python/3.6.3) before I could import scipy successfully.
² More documentation available here.
³ Fair warning: I based the R code on Tianchen’s and haven’t actually tested it out myself.