Job management commands
- sacct: accounting data for completed and pending jobs (sacct -j <jobid>)
- sstat: statistics for running jobs (sstat -j <jobid> --format=AveCPU,AveRSS,AveVMSize,MaxRSS,MaxVMSize)
- scontrol show: display details, e.g. scontrol show job or scontrol show partition
- scontrol update: modify the job
- scontrol hold: hold the job (keep it from being scheduled)
- scontrol release: release a held job
- sprio: displays job priority
- scancel: cancel the job
Monitoring jobs
List all current jobs for a user:
squeue -u <username>
The output of the squeue command consists of several columns including job ID, partition, job name, username, job state, elapsed time, number of nodes, node list, etc.
JOBID   PARTITION  NAME      USER  ST  TIME        NODES  NODELIST(REASON)
499980  longcpu    vega208t  user  PD  0:00        1      (Resources)
499981  longcpu    vega192t  user  PD  0:00        1      (Priority)
449911  longcpu    bxe_t280  user  R   1-01:23:39  1      cn0402
499889  longcpu    vega256t  user  R   3:29:24     1      cn0011
449133  longcpu    bxe_t240  user  R   1-03:38:21  1      cn0401
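The TIME column uses the form days-hours:minutes:seconds, with days and hours printed only when needed. For scripting it can be handy to convert such a value to seconds. The following is a minimal sketch; `elapsed_to_seconds` is a hypothetical helper name, not part of Slurm, and the inputs are taken from the listing above.

```shell
# elapsed_to_seconds: hypothetical helper, not a Slurm command.
# Converts a squeue TIME value ([days-][hours:]minutes:seconds) to seconds.
elapsed_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $1
        # Split off the optional "days-" prefix.
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        # Remaining part is [hours:]minutes:seconds; missing fields count as 0.
        n = split(t, p, ":")
        s = p[n]
        m = (n >= 2) ? p[n-1] : 0
        h = (n >= 3) ? p[n-2] : 0
        print ((days * 24 + h) * 60 + m) * 60 + s
    }'
}

elapsed_to_seconds 1-01:23:39   # prints 91419
elapsed_to_seconds 3:29:24      # prints 12564
```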
Job state is listed in the ST column of the output of the squeue command. The most common job state codes are:
- R: Running
- PD: Pending
- CG: Completing
- CA: Cancelled
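For a quick summary of how many jobs are in each state, the ST column can be tallied with awk. This is only a sketch: `count_states` is a hypothetical helper name, and it assumes the default squeue column order, in which the state is the fifth field. It reads from standard input, so it can be tried on the sample listing without access to a cluster.

```shell
# count_states: hypothetical helper, not a Slurm command.
# Reads squeue-style output on stdin, skips the header line and
# counts jobs per state code (assumed to be the fifth field).
count_states() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample listing copied from the squeue output shown earlier.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
499980 longcpu vega208t user PD 0:00 1 (Resources)
499981 longcpu vega192t user PD 0:00 1 (Priority)
449911 longcpu bxe_t280 user R 1-01:23:39 1 cn0402
499889 longcpu vega256t user R 3:29:24 1 cn0011
449133 longcpu bxe_t240 user R 1-03:38:21 1 cn0401'

printf '%s\n' "$sample" | count_states
# prints:
# PD 2
# R 3
```

On a cluster, the same helper could be fed live data with `squeue -u <username> | count_states`.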
List all running jobs for a user:
squeue -u <username> -t RUNNING
List all pending jobs for a user:
squeue -u <username> -t PENDING
List all current jobs in the shared partition for a user:
squeue -u <username> -p shared
List detailed information for a job (useful for troubleshooting):
scontrol show jobid -dd <jobid>
List status info for a currently running job:
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
To get statistics on completed jobs by jobID:
sacct -j <jobid> --format=jobID,JobName%20,NNodes,NTasks,NCPUS,MaxRSS,AveRSS,Elapsed,ExitCode
To view the same information for all jobs of a user:
sacct -u <username> -S <start date> -E <end date> --format=jobID,JobName%20,NNodes,NTasks,NCPUS,MaxRSS,AveRSS,Elapsed,ExitCode
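The memory fields reported by sacct (MaxRSS, AveRSS) are typically printed with a unit suffix, for example a value ending in K. To compare values across jobs it can help to normalise them to one unit. The sketch below is an illustration under stated assumptions: `rss_to_mib` is a hypothetical helper, the input values are made up for the example, and the suffixes are assumed to be binary multiples (1M = 1024K).

```shell
# rss_to_mib: hypothetical helper, not a Slurm command.
# Converts a memory value such as "2048K", "512M" or "1.5G" to MiB.
# Assumption: suffixes are binary multiples; a bare number means bytes.
rss_to_mib() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        v = $1 + 0                    # numeric prefix; the suffix is ignored
        u = substr($1, length($1))    # last character is the unit, if any
        if (u == "K") v /= 1024
        else if (u == "G") v *= 1024
        else if (u == "T") v *= 1024 * 1024
        else if (u != "M") v /= 1024 * 1024
        printf "%.1f\n", v
    }'
}

rss_to_mib 2048K   # prints 2.0
rss_to_mib 1.5G    # prints 1536.0
```

As an alternative, sacct itself accepts a `--units` option (for example `--units=M`) that requests a fixed unit for these fields.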
If you work on multiple projects, it is important to select the correct account, as described at this link.
Cancelling jobs
You may want to terminate running jobs or remove waiting jobs from the queue. The command for this is scancel; see "man scancel" for details. To cancel specific jobs, pass their job IDs. For example, to cancel two jobs:
$ scancel <Job ID> <Job ID>
To cancel all your jobs interactively, with a confirmation prompt for each job:
$ scancel -i -u your_account_name
To cancel all your pending jobs:
$ scancel -u your_account_name --state=pending
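When only a subset of jobs should be cancelled, one option is to select job IDs from the squeue listing and pass them to scancel. The following sketch is a dry run that prints the command instead of executing it. `ids_by_name` is a hypothetical helper; it assumes the default squeue column order (JOBID first, NAME third), and note that squeue may truncate long job names in its default format.

```shell
# ids_by_name: hypothetical helper, not a Slurm command.
# Reads squeue-style output on stdin and prints the IDs of jobs
# whose name starts with the given prefix.
ids_by_name() {
    awk -v p="$1" 'NR > 1 && index($3, p) == 1 { print $1 }'
}

# Sample listing copied from the squeue output shown earlier.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
499980 longcpu vega208t user PD 0:00 1 (Resources)
499981 longcpu vega192t user PD 0:00 1 (Priority)
449911 longcpu bxe_t280 user R 1-01:23:39 1 cn0402
499889 longcpu vega256t user R 3:29:24 1 cn0011
449133 longcpu bxe_t240 user R 1-03:38:21 1 cn0401'

ids=$(printf '%s\n' "$sample" | ids_by_name bxe_)

# Dry run: show the command that would be executed.
# $ids is left unquoted on purpose so the IDs become separate arguments.
echo scancel $ids
# prints: scancel 449911 449133
```

Once the printed command looks right, dropping the `echo` would run the actual cancellation.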
To hold a particular job from being scheduled:
scontrol hold <jobid>
To release a particular job to be scheduled:
scontrol release <jobid>
To requeue (cancel and rerun) a particular job:
scontrol requeue <jobid>