
Batch processing is the execution of a series of jobs on a computer without manual intervention (non-interactively). The series of steps in a batch process is often called a “job” or “batch job”. In computing, a batch queue is a scheduler-managed system that decides when and where your jobs run once they are submitted to a queue. On Aeolus, TORQUE is coupled with MAUI to handle batch job management.

  • TORQUE Resource Manager controls batch jobs and distributed computing resources. It accepts job submissions, starts and stops jobs, and monitors job and node status. It is based on the open-source PBS resource management system.
  • MAUI is an external scheduler for the Torque resource management system.

To recap, Torque manages the job queue and the compute resources, while Maui queries the PBS server to obtain up-to-date job and node information. Using this information, Maui directs the PBS server to start or stop jobs in accordance with the configured Maui policies, priorities, and/or reservations.

General Rules of Using Aeolus

In addition to the User Policy, the following more specific rules apply to use of the system. Accounts found in violation of these rules may be temporarily banned, and the account holder may be required to review the documentation and have a discussion with the group/lab PI.

  1. Do not run computation on the ssh/login node.
  2. Do not compile software on the ssh/login node.
  3. Do not write computational output to your home directory (especially true for MPI jobs using parallel I/O).
  4. You may not over-subscribe or under-subscribe requested resources.
    • Over-subscription: the resources requested (CPU & RAM) are more than the resources required or used.
    • Under-subscription: the resources requested (CPU & RAM) are less than the resources required.
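
As a rough illustration of rule 4, the sketch below compares a requested amount with what a job actually used. The function name classify_request and the 20% tolerance are assumptions made for this example, not Aeolus policy; the "used" figure would have to come from the usage your job reports.

```shell
# Hypothetical helper: classify a resource request against actual usage.
# Arguments: requested amount, used amount (same unit, e.g. MB of RAM),
# and an optional tolerance in percent (default 20, an assumed value).
classify_request() {
    requested=$1
    used=$2
    tolerance=${3:-20}
    # Allowed headroom above actual usage, using integer arithmetic.
    upper=$(( used + used * tolerance / 100 ))
    if [ "$requested" -lt "$used" ]; then
        echo "under-subscribed"
    elif [ "$requested" -gt "$upper" ]; then
        echo "over-subscribed"
    else
        echo "ok"
    fi
}

classify_request 1024 900    # request is within 20% of usage
classify_request 8192 900    # far more requested than used
classify_request 512 900     # less requested than used
```

The three calls print ok, over-subscribed, and under-subscribed, in that order.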

Software Packages and Modules

Aeolus uses the module system to provide consistent versions of applications across the compute nodes. To learn the basics, use module --help. To learn more, use man module.

The module system can be used to build an environment unique for your application needs. This can be done before submitting a job or from a script submitted as a job.

Basic command structure looks like:

module avail 
module help [modulefile name]
module list 
module <add | load> [modulefile name] 
module <rm | unload> [modulefile name] 
module <switch | swap> [modulefile name from] [modulefile name to] 
module clear


# query, load, and list a module 
module avail 
module load python/3.6.1 
module list # confirm the module is active in your environment 
which python 

If you know you will always want to have specific modules loaded, you can call them from your ~/.bashrc file, the same way you would load them interactively.
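
A minimal sketch of such a ~/.bashrc entry, assuming python/3.6.1 (the module from the example above) is the one you always want. The function name load_default_modules and the guard are additions for this sketch; the guard skips the load in shells where the module command is not defined.

```shell
# Load default modules only when the module command is available,
# so that shells without the module system do not report an error.
load_default_modules() {
    if command -v module >/dev/null 2>&1; then
        module load python/3.6.1
    fi
}

load_default_modules
```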

Torque PBS Job Submission

Torque manages jobs that users submit to various queues available on Aeolus. Each queue represents a group of resources with attributes to help identify the queue. Commonly used Torque commands include:

  • qsub - used to submit both batch and interactive jobs to the cluster
  • qstat - used to monitor the status of a job
  • qdel - used to terminate a job prior to its completion
  • showq - list the jobs in the queue
  • pbsnodes - list nodes, properties, and associated resources

Torque includes numerous directives, which are used to specify resource requirements and other attributes for batch and interactive jobs. Torque directives can appear as header lines (lines that start with #PBS) in a batch job script or as command-line options to the qsub command.

For help using Torque to submit and manage jobs, see the Submitting and managing jobs chapter of Adaptive Computing’s Torque Guide. For a complete appendix of Torque commands, see the Adaptive Computing website, which also provides a complete descriptive list of qsub options.

Torque Environment Variables

This is not an exhaustive list, but it should provide a good base to work with. For anything beyond it, see the previously listed Adaptive Computing documentation.

Variable        Description
PBS_ARRAYID     zero-based value of job array index for this job
PBS_JOBNAME     user specified job name
PBS_NODEFILE    file containing line-delimited list of nodes allocated to the job
PBS_NODENUM     node offset number
PBS_QUEUE       job queue
PBS_TASKNUM     number of tasks requested
PBS_JOBID       unique PBS job ID
PBS_O_HOME      home directory of submitting user
PBS_O_HOST      host on which the qsub command was run
PBS_O_LANG      language variable for job
PBS_O_LOGNAME   name of submitting user
PBS_O_PATH      path variable used to locate executables within job script
PBS_O_SHELL     script shell
PBS_O_WORKDIR   job’s submission directory
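
The snippet below sketches how a job script might use a few of these variables. The function name job_summary is made up, and the fallback values after :- are assumptions added so that the snippet also runs outside of a Torque job, where the PBS_* variables are unset.

```shell
# Change to the directory the job was submitted from.
# Falls back to the current directory when run outside of Torque.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Summarize how the job is running.
job_summary() {
    echo "job name: ${PBS_JOBNAME:-none}"
    echo "queue:    ${PBS_QUEUE:-none}"
    # Count the lines of the node file when it is set and readable.
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        echo "procs:    $(wc -l < "$PBS_NODEFILE" | tr -d ' ')"
    else
        echo "procs:    unknown"
    fi
}

job_summary
```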

Batch Job Scripts

To run a job in batch mode on Aeolus, first prepare a job script that specifies the application you want to run and the resources required to run it, then submit the script to Torque using the qsub command. Torque passes your job and its requirements to the system’s job scheduler, which then dispatches your job whenever the required resources are available.

A very basic job script might contain just a bash or tcsh shell script. However, Torque job scripts most commonly contain at least one executable command preceded by a list of directives that specify resources and other attributes needed to execute the command (e.g. wall-clock time, number of nodes, number of processors, filenames for output and errors). These directives are listed in header lines (lines that start with #PBS), which should precede any executable lines in your job script.

Additionally, your Torque job script (which will be executed under your preferred login shell) should begin with a line that specifies the command interpreter under which it should run.

NOTE: For some job scripts, you may need to manually load the required module files

Serial Job Example

A Torque job script for a serial job might look like this:

#!/bin/bash
# -------- -------- -------- --------
# PBS qsub - Scheduler Request
# * two octothorpes "##" indicate a comment
# -------- -------- -------- --------
## Export all environment variables in the qsub command's environment to the 
## batch job. 
#PBS -V 

## Define a job name 
#PBS -N dev_serial_01 

## Define compute options 
#PBS -l nodes=1:amd:ppn=1 
##PBS -l nodes=1:intel:ppn=1 
#PBS -l mem=1024mb
#PBS -l walltime=00:05:00
#PBS -q batch 

## Define path for output & error logs
#PBS -k o
##PBS -j oe
#PBS -e /fastscratch/[username]/dev_serial_01.e
#PBS -o /fastscratch/[username]/dev_serial_01.o 

## Define path for reporting 
#PBS -M [username] 
#PBS -m abe 

# -------- -------- -------- -------- 
# Actually do something 
# -------- -------- -------- -------- 

touch dev_serial_01.test 

In the above example, the Torque directives in the header lines mean the following:

Torque Directive             Description
#PBS -V                      Export your environment variables (see: printenv) in the qsub command’s environment to the batch job.
#PBS -N dev_serial_01        Define a name for the job, which will show up in the queue, email, & logs.
#PBS -l nodes=1:amd:ppn=1    Indicates the job requires one node with the amd property and one processor per node.
#PBS -l mem=1024mb           Indicates the job requires 1024 MB of RAM.
#PBS -l walltime=00:05:00    Indicates the job requires 5 minutes of wall-clock time.
#PBS -q batch                Submits the job to the queue named batch.
#PBS -k o                    Keeps the job output.
#PBS -j oe                   Not used in this instance; would combine the standard output and standard error.
#PBS -e                      Defines the output location for standard error.
#PBS -o                      Defines the output location for standard output.
#PBS -M [email]              Sends job-related email to the specified address, according to the rules listed on the next line.
#PBS -m abe                  Sends email if the job is (a) aborted, when it (b) begins, and when it (e) ends.
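
Since the example treats lines starting with two octothorpes as comments, it can be hard to tell at a glance which directives are active. The sketch below lists only the active #PBS lines of a script; the function name active_directives and the sample file are made up for illustration.

```shell
# Print only the active Torque directives of a job script:
# lines that begin with "#PBS", not the commented-out "##PBS".
active_directives() {
    grep '^#PBS' "$1"
}

# Build a small sample script to run the helper against.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#PBS -N dev_serial_01
#PBS -l nodes=1:amd:ppn=1
##PBS -l nodes=1:intel:ppn=1
#PBS -q batch
touch dev_serial_01.test
EOF

active_directives "$sample"
rm -f "$sample"
```

The output contains the -N, the amd -l, and the -q lines; the intel request and the touch command are skipped.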

Parallel Job Example

The base job settings are the same for parallel jobs as for serial jobs. The main difference is the number of processors and/or nodes requested. This example covers both shared-memory and distributed-memory MPI jobs.

The total processor count equals: nodes * ppn
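
The formula can be checked with shell arithmetic. The helper below is a sketch: total_procs is a made-up name, and defaulting a missing nodes or ppn value to 1 is an assumption of this example.

```shell
# Total processors = nodes * ppn, parsed from a "nodes=N:...:ppn=M" request.
total_procs() {
    request=$1
    nodes=$(echo "$request" | sed -n 's/.*nodes=\([0-9][0-9]*\).*/\1/p')
    ppn=$(echo "$request" | sed -n 's/.*ppn=\([0-9][0-9]*\).*/\1/p')
    # Assumed defaults when a value is not given in the request.
    nodes=${nodes:-1}
    ppn=${ppn:-1}
    echo $(( nodes * ppn ))
}

total_procs "nodes=2:amd:ppn=8"    # 2 nodes * 8 processors per node
```

For the request used in the example below, this prints 16.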

NOTE: pbsnodes and lscpu can provide useful information about the CPU properties of your specific node.

#!/bin/bash
## Export all environment variables in the qsub command's environment to the
## batch job.
#PBS -V

## Define a job name
#PBS -N dev_parallel_01

## Define compute options
#PBS -l nodes=2:amd:ppn=8
##PBS -l nodes=2:intel:ppn=8
#PBS -l mem=2gb
#PBS -l walltime=00:05:00
#PBS -q batch

## Define path for output & error logs
#PBS -k o
##PBS -j oe
#PBS -e /fastscratch/[username]/dev_parallel_01.e
#PBS -o /fastscratch/[username]/dev_parallel_01.o

## Define path for reporting
#PBS -M [username]
#PBS -m abe

# -------- -------- -------- --------
# Start the script itself
# -------- -------- -------- --------


# Run job on all scheduled threads (processors)
# NProc=$(wc -l $PBS_NODEFILE | awk '{print $1}')
# mpirun -np $NProc -machinefile $PBS_NODEFILE ~/bin/binaryname

# ---- OR ----

# Calculate: physical cores = processors / threads per core
NProc=$(wc -l < "$PBS_NODEFILE")
NTpC=$(lscpu | awk '/Thread/ {print $4}')
NCore=$((NProc / NTpC))

# Create a list of cores or threads (needed when not using all threads)
awk -v tpc="$NTpC" 'NR % tpc == 0' "$PBS_NODEFILE" > distributed_cores

# Run job on physical cores only (not threads, depends on node type)
mpirun -np "$NCore" -machinefile distributed_cores ~/bin/binaryname

In the above example, the mpirun line launches ~/bin/binaryname on the machines listed in $PBS_NODEFILE (or in the distributed_cores subset derived from it) and spreads the load evenly across the requested nodes. With nodes=2 and ppn=8, $NProc is 16, so the script starts 8 processes when each core has two threads and 16 when each core has one.
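
To see what the awk filter in the script does, it can be run on a hand-made node file. This is a sketch: the node names node01 and node02, the function name physical_core_list, and the value of 2 threads per core are made up, and it assumes the node file lists one line per processor with each node's lines adjacent.

```shell
# Keep every tpc-th line of a node file, leaving one entry per physical core.
# Arguments: node file path, threads per core.
physical_core_list() {
    awk -v tpc="$2" 'NR % tpc == 0' "$1"
}

# Stand-in for $PBS_NODEFILE: 2 nodes with 4 processors (threads) each.
nodefile=$(mktemp)
for node in node01 node02; do
    for slot in 1 2 3 4; do
        echo "$node"
    done
done > "$nodefile"

# With 2 threads per core, 8 lines are reduced to 4.
physical_core_list "$nodefile" 2
rm -f "$nodefile"
```

The filter keeps lines 2, 4, 6, and 8, so each node appears twice in the result.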

Gathering Useful Information

Beyond getting a serial or parallel job submitted to the system, it helps to log information about each run to support your development and research.

After all of the Torque PBS directives in your job script, you can log useful information that will appear in your standard output:

echo "--------- environment ---------"
env | grep PBS

echo "--------- where am i  ---------"

echo "--------- what i do   ---------"
echo Test scheduler via test_script
echo Running time on host `hostname`
echo Time is `date`
echo Directory is `pwd`

echo "--------- end of job  ---------"
echo ""

Submitting Jobs

To submit your job script(s), use the Torque qsub command. If the command runs successfully, it will return a job ID to standard output, for example:

> 123456.mgt2-ib.local

If your job needs attribute values that differ from the defaults and from those specified in the job’s #PBS directives (while staying below the maximum allowed), you can specify them inline with the -l (lowercase L, for “limit”) option. For example, the following command uses the -l walltime option to indicate that the job needs a wall time different from the one specified in the script.

qsub -l walltime=02:00:00

NOTE: command-line arguments override Torque directives in your job script

To include multiple options on the command line, use either one -l flag with several comma-separated options, or multiple -l flags, each separated by a space. For example, the following two commands are equivalent:

qsub -l ncpus=16,mem=1024mb
qsub -l ncpus=16 -l mem=1024mb

Other useful qsub options include:

qsub option        Description
-q [queue_name]    Specifies a user-selected queue
-r [y|n]           Declares whether the job is re-runnable
-a [date_time]     Executes the job only after a specific date and time
-V                 Exports environment variables in the current environment to the job
-I                 Makes the job run interactively

Job Dependencies

It is possible to break down the computational requirements of a large job and build a chain of execution where the next job depends on the previous job finishing successfully.

This is done by passing depend=dependency_list to qsub through its -W option. For a more complete list of options, please see the official Torque documentation.

Dependency                 Description
after:jobid[:jobid…]       Will schedule a job for execution after jobs jobid have started
afterok:jobid[:jobid…]     Will schedule a job for execution after jobs jobid have terminated with no errors
before:jobid[:jobid…]      When this job has begun execution, then jobs jobid may begin
beforeok:jobid[:jobid…]    When this job has terminated with no errors, then jobs jobid may begin
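
A two-step chain might be sketched as follows. The helper names depend_arg and submit_chain and the job scripts step1.pbs and step2.pbs are hypothetical; submit_chain is only defined here, not called, because it needs a working qsub.

```shell
# Build a dependency list such as "afterok:123:456" from a type and job IDs.
depend_arg() {
    arg=$1
    shift
    for id in "$@"; do
        arg="$arg:$id"
    done
    echo "$arg"
}

# Submit step 2 so that it runs only after step 1 ends without errors.
# step1.pbs and step2.pbs are hypothetical job scripts.
submit_chain() {
    first=$(qsub step1.pbs) || return 1
    qsub -W depend="$(depend_arg afterok "$first")" step2.pbs
}

# Show the dependency list for two made-up job IDs.
depend_arg afterok 123456.mgt2-ib.local 123457.mgt2-ib.local
```

The last line prints afterok:123456.mgt2-ib.local:123457.mgt2-ib.local, which is the form the -W depend= option expects.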

Monitoring Jobs

To monitor the status of a queued or running job, use the qstat command.

Useful qstat options include:

qstat option             Description
qstat -q                 display all queues
qstat -a                 display all jobs
qstat -u [user,list]     display jobs for users listed in csv [user,list]
qstat -r                 display running jobs
qstat -f                 display the full listing of job(s) information
qstat -Qf [queue_id]     display information about queue
qstat -rn1               display running job & nodes allocated

Other useful monitoring commands include:

Command                    Description
showq                      use Maui to monitor jobs
showq -i                   use Maui to monitor queued jobs
checkjob [job_id]          show detailed description of the job job_id
showstart [job_id]         give an estimate of the expected start time of the job job_id
diagnose -c [LAB/Queue]    give detailed information about a specific queue
diagnose -j [job_id]       give detailed information about the job
pbsnodes                   versatile PBS node command; when passed nothing, lists compute node system status
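
For scripted monitoring, the state of a job can be pulled out of the full listing. The sketch below assumes that qstat -f prints a line of the form "job_state = R"; the function names job_state and wait_for_job and the 30-second polling interval are choices made for this example. wait_for_job is only defined, not called, because it needs a working qstat.

```shell
# Extract the job state letter from "qstat -f" output read on stdin.
job_state() {
    awk -F' = ' '$1 ~ /job_state/ {print $2; exit}'
}

# Poll until the job is no longer queued (Q) or running (R).
wait_for_job() {
    while :; do
        state=$(qstat -f "$1" 2>/dev/null | job_state)
        case $state in
            Q|R) sleep 30 ;;
            *)   return 0 ;;
        esac
    done
}

# Try the parser on a single hand-written line.
printf '    job_state = R\n' | job_state
```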

Deleting Jobs

To delete queued or running jobs, use the qdel command:

qdel [job_id]

Should a node become unresponsive to Torque, add the -W option with a delay (in seconds) to the command:

qdel -W [delay] [job_id]

Compiling and Testing

As noted in the General Rules of Using Aeolus, you should NOT run or compile software on the ssh/login node(s). This is not for a lack of resources; the login node is simply not the intended location for these tasks.

To request resources for compiling software, we suggest starting an interactive session. This allocates appropriate resources and gives you an environment (on a compute node) in which to interact with the build process. qsub -I requests interactive mode and qsub -l specifies the resources for the request.

qsub -I -l nodes=1:dev:ppn=1,mem=1024mb,walltime=00:05:00 [~/bin/testapp/]