Batch processing is the execution of a series of jobs on a computer without manual intervention (non-interactively). The series of steps in a batch process is often called a “job” or “batch job”. In a computing context, a batch queue is an automated, scheduler-managed system that decides when and where your jobs run once they are submitted to a queue. On Aeolus, TORQUE is coupled with MAUI to handle batch job management.
- TORQUE Resource Manager controls batch jobs and distributed computing resources. It handles job submission; starts, stops, and monitors jobs; and tracks node status. It is based on the open-source PBS resource management system.
- MAUI is an external scheduler for the Torque resource management system.
To recap: Torque manages the job queue and the compute resources, while Maui queries the PBS server to obtain up-to-date job and node information. Using this information, Maui directs the PBS server to start or stop jobs in accordance with specified Maui policies, priorities, and/or reservations.
General Rules of Using Aeolus
Aside from the User Policy, these more specific rules apply to use of the system. Accounts found in violation of these rules may be temporarily banned and required to review the documentation and have a discussion with the group/lab PI.
- Do not run computation on the ssh/login node.
- Do not compile software on the ssh/login node.
- Do not write computational output to your home directory (especially true for MPI jobs using parallel I/O).
- You may not oversubscribe or undersubscribe the requested resources.
  - Oversubscription: the resources requested (CPU & RAM) are more than the resources required or used.
  - Undersubscription: the resources requested (CPU & RAM) are less than the resources required.
Software Packages and Modules
Aeolus uses the module system to provide version-consistent distributed applications to compute nodes. To learn the basics, use module --help.
The module system can be used to build an environment unique for your application needs. This can be done before submitting a job or from a script submitted as a job.
Basic command structure looks like:
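A minimal sketch (gcc is an illustrative module name, not necessarily one installed on Aeolus):

```shell
module avail          # list software available on the system
module load gcc       # load a module into your environment
module list           # show currently loaded modules
module unload gcc     # remove a module from your environment
module purge          # clear all loaded modules
```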
If you know you will always want to have specific modules loaded, you can call them from your
~/.bashrc file, the same way you would load them interactively.
Torque PBS Job Submission
Torque manages jobs that users submit to various queues available on Aeolus. Each queue represents a group of resources with attributes to help identify the queue. Commonly used Torque commands include:
- qsub: submit both batch and interactive jobs to the cluster
- qstat: monitor the status of a job
- qdel: terminate a job prior to its completion
- showq: list the jobs in the queue
- pbsnodes: list nodes, their properties, and associated resources
Torque includes numerous directives, which are used to specify resource requirements and other attributes for batch and interactive jobs. Torque directives can appear as header lines (lines that start with #PBS) in a batch job script or as command-line options to the qsub command.
For help using Torque to submit and manage jobs, see the Submitting and managing jobs chapter of Adaptive Computing’s Torque Guide. For a complete Appendix of Commands for Torque, see the Adaptive Computing website. This is also where you can get a complete Descriptive List of Options for QSUB.
Torque Environment Variables
This is not an exhaustive list, but it should provide a good base to work with. Beyond this, there is the previously listed Adaptive Computing documentation.
| Variable | Description |
| --- | --- |
| PBS_ARRAYID | zero-based value of the job array index for this job |
| PBS_JOBNAME | user-specified job name |
| PBS_NODEFILE | file containing a line-delimited list of nodes allocated to the job |
| PBS_NODENUM | node offset number |
| PBS_TASKNUM | number of tasks requested |
| PBS_O_HOME | home directory of the submitting user |
| PBS_O_HOST | host on which the qsub command was run |
| PBS_JOBID | unique PBS job ID |
| PBS_O_LANG | language variable for the job |
| PBS_O_LOGNAME | name of the submitting user |
| PBS_O_PATH | path variable used to locate executables within the job script |
| PBS_O_WORKDIR | job’s submission directory |
Batch Job Scripts
To run a job in batch mode on Aeolus, first prepare a job script that specifies the application you want to run and the resources required to run it, then submit the script to Torque using the
qsub command. Torque passes your job and its requirements to the system’s job scheduler, which then dispatches your job whenever the required resources are available.
A very basic job script might contain just a bash or tcsh shell script. However, Torque job scripts most commonly contain at least one executable command preceded by a list of directives that specify the resources and other attributes needed to execute the command (e.g. wall-clock time, number of nodes, number of processors, filenames for output and errors). These directives are listed in header lines (lines beginning with #PBS), which should precede any executable lines in your job script.
Additionally, your Torque job script (which will be executed under your preferred login shell) should begin with a line that specifies the command interpreter under which it should run.
NOTE: For some job scripts, you may need to manually load the required module files.
Serial Job Example
A Torque job script for a serial job might look like this:
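A sketch assembled from the directives explained in the table below; the stderr/stdout paths and program name are placeholders, and [email] stands in for your address:

```shell
#!/bin/bash
#PBS -V
#PBS -N dev_serial_01
#PBS -l nodes=1:amd:ppn=1
#PBS -l mem=1024mb
#PBS -l walltime=00:05:00
#PBS -q batch
#PBS -k o
#PBS -e /path/to/job.err
#PBS -o /path/to/job.out
#PBS -M [email]
#PBS -m abe

# Move to the directory the job was submitted from
cd $PBS_O_WORKDIR
# Placeholder for the actual work
./serial_program
```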
In the above example, the Torque directive header lines mean the following:
| Directive | Meaning |
| --- | --- |
| #PBS -V | Export your current environment variables to the job. |
| #PBS -N dev_serial_01 | Define a name for the job, which will show up in the queue, email, and logs. |
| #PBS -l nodes=1:amd:ppn=1 | Indicates the job requires one node with the amd property and one processor per node. |
| #PBS -l mem=1024mb | Indicates the job requires 1024 MB of RAM. |
| #PBS -l walltime=00:05:00 | Indicates the job requires 5 minutes of wall-clock time. |
| #PBS -q batch | Submit the job to the queue named batch. |
| #PBS -k o | Keeps the job’s standard output. |
| #PBS -j oe | Not used in this instance; would combine the standard output and standard error. |
| #PBS -e | Defines the output location for standard error. |
| #PBS -o | Defines the output location for standard output. |
| #PBS -M [email] | Sends job-related email to the specified address, according to the rules listed on the next line. |
| #PBS -m abe | Sends email if the job is (a) aborted, when it (b) begins, and when it (e) ends. |
Parallel Job Example
The base job settings are the same for parallel jobs as for serial jobs. The main difference is the number of processors and/or nodes requested. This example covers both shared-memory and distributed-memory MPI jobs.
The total processor count equals nodes * ppn.
lscpu can provide useful information about the CPU properties of a specific node.
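A sketch under the same conventions as the serial example; the resource values are illustrative, and ~/bin/binaryname stands in for your MPI executable:

```shell
#!/bin/bash
#PBS -V
#PBS -N dev_parallel_01
#PBS -l nodes=2:ppn=4
#PBS -l mem=4096mb
#PBS -l walltime=00:30:00
#PBS -q batch

cd $PBS_O_WORKDIR
# Launch 8 MPI processes (nodes * ppn) on the allocated machines
mpiexec -np 8 -machinefile $PBS_NODEFILE ~/bin/binaryname
```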
In the above example, the mpiexec line uses the mpiexec command to execute ~/bin/binaryname on either 4 or 8 processes from the machines listed in $PBS_NODEFILE, evenly spreading the load across as many nodes as requested.
Gathering Useful Information
Beyond getting a serial or parallel job submitted to the system, it is important that you can arm yourself with information to further your development and research.
After all of the Torque PBS server settings have been set, you can log useful information that will appear in your standard output:
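A minimal sketch; the set of variables echoed here is illustrative, drawn from the environment variable table above:

```shell
# Log useful job context to standard output
echo "Job ID:           $PBS_JOBID"
echo "Job name:         $PBS_JOBNAME"
echo "Submitted from:   $PBS_O_HOST"
echo "Work directory:   $PBS_O_WORKDIR"
echo "Node file:        $PBS_NODEFILE"
echo "Running on host:  $(hostname)"
echo "Started at:       $(date)"
```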
To submit your job script(s), use the Torque
qsub command. If the command runs successfully, it will return a job ID to standard output, for example:
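The job ID shown here is illustrative; the format is a numeric sequence followed by the PBS server’s hostname:

```shell
$ qsub job.sh
12345.aeolus.example.edu
```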
If your job requires attribute values greater than the defaults, less than the maximum allowed, or different from those specified in the job’s #PBS options, you can specify them inline with the -l (lowercase L, for “limit”) option. For example, the following command submits job.sh, using the -l walltime option to indicate the job needs a wall time different from that specified in the script.
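A sketch (the walltime value is illustrative):

```shell
qsub -l walltime=10:00:00 job.sh
```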
NOTE: Command-line arguments override Torque directives in your job script.
To include multiple options on the command line, use either one
-l flag with several comma-separated options, or multiple
-l flags, each separated by a space. For example, the following two commands are equivalent:
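A sketch with illustrative resource values:

```shell
qsub -l nodes=1:ppn=2,mem=4gb,walltime=01:00:00 job.sh
qsub -l nodes=1:ppn=2 -l mem=4gb -l walltime=01:00:00 job.sh
```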
Other useful qsub options include:
| Option | Description |
| --- | --- |
| -q [queue_name] | Specifies a user-selected queue |
| -r | Sets the job to be rerunnable |
| -a [date_time] | Executes the job only after the specified date and time |
| -V | Exports environment variables in the current environment to the job |
| -I | Makes the job run interactively |
It is possible to break down the computational requirements of a large job and build a chain of execution where the next job depends on the previous job finishing successfully.
This functionality is provided by the depend=dependency_list attribute, passed to qsub with the -W option. For a more complete list of options, please see the official Torque documentation.
| Dependency | Effect |
| --- | --- |
| after:jobid[:jobid…] | Will schedule the job for execution after the listed jobs have started |
| afterok:jobid[:jobid…] | Will schedule the job for execution after the listed jobs have terminated with no errors |
| before:jobid[:jobid…] | When this job has begun execution, the listed jobs may begin |
| beforeok:jobid[:jobid…] | When this job has terminated with no errors, the listed jobs may begin |
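For example (script names illustrative), qsub prints the new job’s ID, which can be captured and used in the dependency list:

```shell
JOB1=$(qsub preprocess.sh)
qsub -W depend=afterok:$JOB1 analyze.sh
```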
To monitor the status of a queued or running job, use the qstat command. Useful qstat options include:
| Command | Description |
| --- | --- |
| qstat -q | display all queues |
| qstat -a | display all jobs |
| qstat -u [user,list] | display jobs for users listed in csv [user,list] |
| qstat -r | display running jobs |
| qstat -f | display the full listing of job(s) information |
| qstat -Qf [queue_id] | display information about a queue |
| qstat -rn1 | display running jobs & nodes allocated |
Other useful monitoring commands include:

| Command | Description |
| --- | --- |
| showq | use Maui to monitor jobs |
| showq -i | use Maui to monitor queued jobs |
| checkjob [job_id] | show a detailed description of the job job_id |
| showstart [job_id] | give an estimate of the expected start time of the job job_id |
| diagnose -c [LAB/Queue] | give detailed information about a specific queue |
| diagnose -j [job_id] | give detailed information about the job |
| pbsnodes | versatile PBS node command; with no arguments, lists compute node system status |
To delete queued or running jobs, use the qdel command. Should a node become unresponsive to Torque, specify a delay and add the -W option to the command:
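A sketch (the job ID and delay value are illustrative):

```shell
qdel 12345
# For an unresponsive node: wait 30 seconds between SIGTERM and SIGKILL
qdel -W 30 12345
```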
Compiling and Testing
If you recall the General Rules of Using Aeolus, you should NOT run or compile software on the ssh/login node(s). This is not for lack of resources; it is because that is not the intended location for the task.
To request resources for compiling software, we suggest you enter an interactive session. This will allocate appropriate resources for you to compile your software and maintain an environment (on a compute node) in which you can interact with the build process.
qsub -I requests interactive mode, and qsub -l specifies resource properties for the request.
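A sketch combining the two (resource values are illustrative):

```shell
# Request an interactive session on a compute node for compiling and testing
qsub -I -l nodes=1:ppn=4,mem=4gb,walltime=02:00:00
```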