Joseph Lizier -- Simple cluster jobs framework
Home > Software > Simple cluster jobs framework
A framework for running processes on a cluster using qsub commands
Copyright (C) 2011, Joseph Lizier. joseph.lizier at gmail.com
This set of shell scripts aims to provide a simple way for you to submit large numbers of processes (generally scientific computational tasks) to a cluster.
While originally written for java process, you can use it for any sort of code - the only change required is to alter the command in startProcessTemplate.sh to start your code instead of a java process.
This document describes the usage for this framework. It is included in the distribution as readme.txt.
Licensed under GNU General Public License v3 or any later version - see license-gplv3.txt included in the distribution.
You are asked to reference your use of this framework in any resulting publications as:
Joseph T. Lizier, "Simple cluster jobs framework", 2011.
Download the most recent distribution of the framework here
You will typically have two ways to access the front-end machine of your cluster:
- SCP/FTP access to upload and download files to the file system.
- SSH/telnet access to submit jobs.
The front-end machine is the master, which controls access to the compute nodes of the cluster. You will (or at least should) only ever have direct access to the front-end to access files and to submit your jobs. When your jobs are running on the compute nodes, they should have access to the same file system as you see on the front-end.
To get started running your jobs on the cluster:
- Using your scp/ftp access, send all of the files in the source folder here to the folder on the cluster that you will run your jobs from.
Using your ssh/telnet access, make all of the .sh files executable by typing "chmod u+x *.sh" in the command line.
- Create the following folders underneath the folder you will run from (on the cluster):
- bin -- which is where your code goes
- err -- where std err files will be logged
- processshellscripts -- where scripts to start each individual process will be created
- log -- where std out files will be logged
- props -- where input properties files for each run will be created
- results -- e.g. where results will be stored (if configured in your program - we do this via our properties files). This can of course be altered.
- Move your own code to the bin directory, possibly in a jar file.
- Use an input properties file for your code, so that you can run many instances of the code at once with different input parameters.
Our example input properties file is input.properties.
Your input properties file here will be a template from which tailored properties file for each run will be created. This will be done later by writing specific values into placeholders "[@P1]" etc. We will talk about how these placeholders are filled shortly.
- startProcessTemplate.sh is a template from which we will create individual shell scripts to run each required process. When we submit a job, we'll automatically create the required shell script pointing to the required input properties file and log files.
Update startProcessTemplate.sh to address points 1, 2 and 3 in the file:
- Set current working directory to this one.
- Load any required modules or set paths. E.g. load the java module, or point to the java executable if either is required. Make sure it's the right version of java. You may or may not need the java path specified in the java.path file also.
- Construct the template command needed to run your processes. Include all files and options. Note specifically the placeholders for the input properties file and the log files - these will be filled in automatically when your job is submitted. Also note for java processes the classpath which you need to fill in, and the -Xmx2048m parameter which specifies how much memory you require (important for memory intensive jobs). It is much easier if you can specify all variable arguments in a single input properties file rather than on command line.
- runProcess.sh is a shell script that you can call to submit one job to the cluster.
It creates a job id from the datestamp at submission time. This job id is used to create the start script name (in processshellscripts), the properties filename (in properties), and the log files (in log and err). As such, you should only make one call to runProcess.sh every second.
runProcess.sh takes three arguments on command line which it substitutes into the created properties file for the placeholders "[@P1]" etc. You can add more command line arguments in this script if you like.
The job is then submitted using the qsub command. Normally, you need to specify how much memory and execution time you require here. Unfortunately I've found that different systems require this to be done in different ways (I think this is because they use different frameworks implementing qsub, e.g. on one cluster it was SGE, on others it was PBS) - you will need to check how your system needs this and edit the qsub line.
Often you need to play around with these values - if you make them too large your priority on the cluster drops; but if you make them too small your process will require more resources and get killed.
With all of this in place, you could call e.g. "runProcess a b c" and this will create a properties file with "a" substituted for [@P1] in the properties template, etc, and submit a job to the cluster to run with the newly created properties file. Usually though, you want to run many jobs at once ... (see next point)
- runManyProcesses.sh is an example shell scripts which has loops to submit many jobs to the cluster at once. Basically it loops through many argument combinations to supply to runProcess.sh. You could of course have nested loops, supplying as many parameters [@P1], [@P2] etc as runProcess.sh is able to replace in input.properties.
Note that after each runProcess call the process sleeps for 2 seconds to ensure a unique timestamp is created.
Note also the two different types of condition statements in this scripts - I found that different linux systems required different formats, but I haven't really looked into why.
- After kicking off your cluster jobs, you may find the following commands and scripts useful:
- qstat | grep <username> - where <username> is your username on the system. Displays your jobs and whether they are queued or running or completed, and how long they have been running for. If none are listed, then your jobs have either all finished or been killed.
- qdell <jobid> - stops one of your running jobs. Obtain the jobid's from the qstat command.
- qDelAll.sh - kills all of your running jobs. You will need to edit the grep statement to search on your username.
- Once all of your jobs have finished, check the results files.
Beware that some of your jobs may have been killed (for exceeding their time), so you need a method to check that all of the output files you expect to see are actually there.
Make sure that the output files created by your code all have unique names (they could use timestamps as well, or have the parameter values built in), otherwise they could overwrite eachother. You could simply use the std out log files to record the results of course.
Take a copy of your output files from the cluster file system - most sysadmins warn that they purge all of your data regularly to keep the system clean!
- If results are ok, then run cleanup.sh to clean all the runtime log files, the process script files and the properties files. You may need to edit the first rm line here - this is intended to delete the job files created by the cluster management scripts (the names of these seems to vary system to system). Make sure you've any std out files if you needed them!
Delete manually the result files if necessary after you've taken a copy. (A good idea, since you normally have limited disk space.)
Then it's ready for another run if required.
Original guide by Rose Wang firstname.lastname@example.org; updated by Joe Lizier email@example.com