How much memory did my job use?
You can check how much memory your job used with the sacct command. Replace YYYY-MM-DD with the date you ran the job and your_job-id with the ID of the job:
sacct --starttime=YYYY-MM-DD --jobs=your_job-id --format=User,JobName,JobId,MaxRSS
If you’d like to monitor memory usage on jobs that are currently running, use the sstat command:
sstat --jobs=your_job-id --format=JobID,MaxRSS
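MaxRSS is usually printed with a unit suffix such as K, M or G. As a minimal sketch, the hypothetical helper rss_to_gb below (it is not part of Slurm) converts such a value to gigabytes. It assumes 1024-based units and treats a value without a suffix as bytes; check this against the output on your cluster.

```shell
# Hypothetical helper (not part of Slurm): convert a MaxRSS value such as
# "2097152K" to GB. Assumes 1024-based units; a bare number is taken as bytes.
rss_to_gb() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)    # last character, e.g. "K"
        num = v + 0                       # numeric prefix of the value
        if (unit == "K")      bytes = num * 1024
        else if (unit == "M") bytes = num * 1024 ^ 2
        else if (unit == "G") bytes = num * 1024 ^ 3
        else                  bytes = num
        printf "%.2f\n", bytes / 1024 ^ 3
    }'
}

rss_to_gb 2097152K   # prints 2.00
```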
How can I check what allocations I belong to?
You can check the allocations you belong to by running the sacctmgr command from a login node:
sacctmgr -p show associations user=$USER
This will print out an assortment of information, including the allocations and QoS available to you.
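The -p flag makes sacctmgr print pipe-delimited output, which is hard to read when there are many columns. The sketch below keeps only two of them. The function name list_accounts_qos is hypothetical, and the header names Account and QOS are assumed to match what your Slurm version prints; the columns are located by name rather than by position.

```shell
# Sketch: print the Account and QOS columns from pipe-delimited
# `sacctmgr -p show associations` output read on standard input.
# Columns are found by header name ("Account", "QOS"), which is an assumption.
list_accounts_qos() {
    awk -F'|' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Account") acct = i
                if ($i == "QOS")     qos = i
            }
            next
        }
        acct && qos { print $acct "\t" $qos }
    '
}

# Usage on a login node:
#   sacctmgr -p show associations user=$USER | list_accounts_qos
```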
How can I see my current FairShare priority?
You can check your current fair share priority level using the sshare command:
sshare -U -l
The sshare command prints a table of information about your usage and priority on all allocations. The -U flag restricts the output to the current user, and the -l flag adds more detail to the table. The field we are looking for is LevelFS. LevelFS holds a number from 0 to infinity that describes the fair share of an association relative to its siblings in an account. Over-serviced accounts have a LevelFS between 0 and 1. Under-serviced accounts have a LevelFS greater than 1. Accounts that have not run any jobs have a LevelFS of infinity (inf).
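The ranges above can be sketched as a small shell function. The name classify_levelfs is hypothetical, and the label for a value of exactly 1 is an assumption, since the description above does not cover that case.

```shell
# Hypothetical helper: interpret a LevelFS value copied from `sshare -U -l`.
# The "at fair share" label for exactly 1 is an assumption.
classify_levelfs() {
    awk -v lfs="$1" 'BEGIN {
        if (lfs == "inf")      print "no usage yet"
        else if (lfs + 0 < 1)  print "over-serviced"
        else if (lfs + 0 > 1)  print "under-serviced"
        else                   print "at fair share"
    }'
}

classify_levelfs 0.25   # prints over-serviced
classify_levelfs inf    # prints no usage yet
```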
Why is my job pending with reason ‘ReqNodeNotAvail’?
The ‘ReqNodeNotAvail’ message usually means that a node your job needs has been reserved for maintenance during the time window your job script requests. This message often occurs in the days leading up to our regularly scheduled maintenance. You can confirm whether the requested node has a reservation by typing scontrol show reservation, which lists all active reservations.
If you receive this message, the following solutions are available:
1) Run a shorter job that does not intersect the maintenance window.
2) Wait until after maintenance.
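For the first option, the longest time limit that still ends before the reservation can be worked out from the reservation start time. The sketch below is one way to do it; the function name time_before_maintenance and the default 300-second safety margin are assumptions. It only helps if the job can start right away, because time spent waiting in the queue also counts against the window.

```shell
# Hypothetical helper: given the current time and the reservation start time
# (both as Unix epoch seconds), print the longest D-HH:MM:SS time limit that
# ends before the reservation, minus a safety margin in seconds (default 300).
time_before_maintenance() {
    now=$1; start=$2; margin=${3:-300}
    left=$(( start - now - margin ))
    if [ "$left" -le 0 ]; then
        echo "no time left before the reservation" >&2
        return 1
    fi
    printf '%d-%02d:%02d:%02d\n' \
        $(( left / 86400 )) $(( left % 86400 / 3600 )) \
        $(( left % 3600 / 60 )) $(( left % 60 ))
}

time_before_maintenance 0 90300   # prints 1-01:00:00
```

The result can be passed to the --time option of sbatch, with the current time taken from date +%s and the start time taken from the scontrol show reservation output.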
Job failed with error cgroup out-of-memory handler
In this case, the error output typically looks like this:
/var/spool/slurmd/job000000/slurm_script: line 17: 614149 Killed perl /ceph/hpc/data/stXXXC-user-users/User/Marco/NOVOPlasty/NOVOPlasty4.3.1.pl -c config_prova.txt
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=223809.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Answer: The job ran out of memory. Request more memory for the job, for example with the --mem or --mem-per-cpu option in your job script.
You can check the resource usage of your job with the seff command:
[root@vglogin0001 ~]# ./seff 000000
Job ID: 000000
Cluster: vega
User/Group: user/user
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 40
CPU Utilized: 00:41:33
CPU Efficiency: 2.72% of 1-01:25:20 core-walltime
Job Wall-clock time: 00:38:08
Memory Utilized: 87.76 GB
Memory Efficiency: 109.70% of 80.00 GB
The State field shows that the job ran out of memory: it used 87.76 GB, which is more than the 80.00 GB it requested.
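A new memory request can be estimated from the memory usage figure that seff reports (87.76 GB in the example above). The sketch below adds some headroom and rounds up to a whole gigabyte. The function name suggest_mem and the default 20% headroom are assumptions, not a site recommendation, and the result must still fit within the memory of the nodes you request.

```shell
# Hypothetical helper: suggest a --mem value in whole GB from the memory
# usage reported by seff, adding headroom in percent (default 20, an assumption).
suggest_mem() {
    awk -v used="$1" -v pct="${2:-20}" 'BEGIN {
        need = used * (1 + pct / 100)
        gb = int(need)
        if (gb < need) gb++               # round up to a whole GB
        printf "--mem=%dG\n", gb
    }'
}

suggest_mem 87.76   # prints --mem=106G
```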