Accounting¶
The accounting functionality of the SLURM batch system keeps track of the resources spent by users while running jobs on the Devana cluster. In order to submit a job on the cluster, the user must have access to a project (either testing, regular or commercial) which has a sufficient free allocation to run the job. You can find more information on how to get access and create a user project here.
There are two kinds of allocations available for a project:
- CPU allocation (to be spent on the general-purpose nodes)
- GPU allocation (for jobs that require accelerated nodes)
The appropriate allocation is spent automatically, simply by selecting a partition for job execution. You can check your remaining allocation using the sprojects command.
If you are out of project quota, your job submission will still work normally; however, the jobs will stay in a pending state, as in the following example:
Out of project quota
login01:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
224 short example user PD 0:00 1 (QOSGrpBillingMinutes)
Billing Units¶
The project allocations are by default awarded in corehours, and since one compute node has 64 cores, running your job for one hour on one node subtracts 64 corehours from your project allocation. There are, however, situations when your job does not utilize all 64 CPU cores while still occupying the whole node (e.g. some memory-intensive calculations). Job accounting is therefore performed in arbitrary billing units (BU), which ensure fair tracking of job resources. Billing units are calculated as follows:
BU=MAX(# of cores, memory in GB * 0.256, # of GPUs * 16)
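For example, a hypothetical job requesting 16 cores, 200 GB of memory and no GPUs would be billed at BU = MAX(16, 200 * 0.256, 0) = MAX(16, 51.2, 0) = 51.2 per hour of runtime, because in this case the memory request dominates the core count.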
You can check your job's billing rate with the scontrol show job JOBID command. The following examples demonstrate the billing units behaviour in more detail:
Different job utilizations
In this case we are running a job on all 64 cores within 1 node. The billing rate is therefore 64.
demovic@login02 ~ > srun -n 64 --pty bash
srun: job 30925 queued and waiting for resources
srun: job 30925 has been allocated resources
demovic@n043 ~ > scontrol show job 30925
JobId=30925 JobName=bash
UserId=demovic(187000051) GroupId=demovic(187000051) MCS_label=N/A
Priority=18994 Nice=0 Account=p70-23-t QOS=p70-23-t
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:07 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-08T15:08:13 EligibleTime=2023-08-08T15:08:13
AccrueTime=2023-08-08T15:08:13
StartTime=2023-08-08T15:08:14 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-08T15:08:14 Scheduler=Main
Partition=ncpu AllocNode:Sid=login02:9298
ReqNodeList=(null) ExcNodeList=(null)
NodeList=n043
BatchHost=n043
NumNodes=1 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=64,mem=250G,node=1,billing=64
Socks/Node=* NtasksPerN:B:S:C=0:0:: CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/home/demovic
Power=
Similar to the above, but utilizing 2 full nodes. The billing rate is therefore 128 (2 × 64).
demovic@login02 ~ > srun -N 2 --ntasks-per-node=64 --pty bash
srun: job 30927 queued and waiting for resources
srun: job 30927 has been allocated resources
demovic@n037 ~ > scontrol show job 30927
JobId=30927 JobName=bash
UserId=demovic(187000051) GroupId=demovic(187000051) MCS_label=N/A
Priority=18993 Nice=0 Account=p70-23-t QOS=p70-23-t
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:04 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-08T15:20:25 EligibleTime=2023-08-08T15:20:25
AccrueTime=2023-08-08T15:20:25
StartTime=2023-08-08T15:20:26 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-08T15:20:26 Scheduler=Main
Partition=ncpu AllocNode:Sid=login02:9298
ReqNodeList=(null) ExcNodeList=(null)
NodeList=n[037-038]
BatchHost=n037
NumNodes=2 NumCPUs=128 NumTasks=128 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=128,mem=500G,node=2,billing=128
Socks/Node=* NtasksPerN:B:S:C=64:0:: CoreSpec=*
MinCPUsNode=64 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/home/demovic
Power=
When using only a part of the node, the billing rate is adjusted proportionally. In this case we are utilizing 32 cores, so the billing rate is halved (32).
demovic@login02 ~ > srun -n 32 --pty bash
srun: job 30929 queued and waiting for resources
srun: job 30929 has been allocated resources
demovic@n043 ~ > scontrol show job 30929
JobId=30929 JobName=bash
UserId=demovic(187000051) GroupId=demovic(187000051) MCS_label=N/A
Priority=18994 Nice=0 Account=p70-23-t QOS=p70-23-t
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:04 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-08T15:26:13 EligibleTime=2023-08-08T15:26:13
AccrueTime=2023-08-08T15:26:13
StartTime=2023-08-08T15:26:14 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-08T15:26:14 Scheduler=Main
Partition=ncpu AllocNode:Sid=login02:9298
ReqNodeList=(null) ExcNodeList=(null)
NodeList=n043
BatchHost=n043
NumNodes=1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=32,mem=125G,node=1,billing=32
Socks/Node=* NtasksPerN:B:S:C=32:0:: CoreSpec=*
MinCPUsNode=32 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/home/demovic
Power=
This example shows the behaviour when a user requests all available memory within a node. Although the job requests only one core, the billing rate corresponds to whole-node utilization (64), as there is no space left for other users' jobs.
demovic@login02 ~ > srun -n 1 --mem=250GB --pty bash
srun: job 30930 queued and waiting for resources
srun: job 30930 has been allocated resources
demovic@n043 ~ > scontrol show job 30930
JobId=30930 JobName=bash
UserId=demovic(187000051) GroupId=demovic(187000051) MCS_label=N/A
Priority=18994 Nice=0 Account=p70-23-t QOS=p70-23-t
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:05 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-08T15:33:50 EligibleTime=2023-08-08T15:33:50
AccrueTime=2023-08-08T15:33:50
StartTime=2023-08-08T15:33:51 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-08T15:33:51 Scheduler=Main
Partition=ncpu AllocNode:Sid=login02:9298
ReqNodeList=(null) ExcNodeList=(null)
NodeList=n043
BatchHost=n043
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=1,mem=250G,node=1,billing=64
Socks/Node=* NtasksPerN:B:S:C=1:0:: CoreSpec=*
MinCPUsNode=1 MinMemoryNode=250G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/home/demovic
Power=
The same as above, but with 32 cores per node. This example shows that the higher utilization factor (memory in this case) takes precedence.
demovic@login02 ~ > srun -n 32 --mem=250GB --pty bash
srun: job 30931 queued and waiting for resources
srun: job 30931 has been allocated resources
demovic@n043 ~ > scontrol show job 30931
JobId=30931 JobName=bash
UserId=demovic(187000051) GroupId=demovic(187000051) MCS_label=N/A
Priority=18994 Nice=0 Account=p70-23-t QOS=p70-23-t
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:02 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-08T15:36:12 EligibleTime=2023-08-08T15:36:12
AccrueTime=2023-08-08T15:36:12
StartTime=2023-08-08T15:36:13 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-08T15:36:13 Scheduler=Main
Partition=ncpu AllocNode:Sid=login02:9298
ReqNodeList=(null) ExcNodeList=(null)
NodeList=n043
BatchHost=n043
NumNodes=1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=32,mem=250G,node=1,billing=64
Socks/Node=* NtasksPerN:B:S:C=0:0:: CoreSpec=*
MinCPUsNode=1 MinMemoryNode=250G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/home/demovic
Power=
Since the accelerated nodes have 4 GPUs, allocating just one of them sets the billing rate to ¼ of the total rate per node (16). Notice that the RAM allocation has been set to 62.5 GB (¼ of the total RAM) by SLURM automatically.
demovic@login02 ~ > srun --partition=ngpu -G 1 --pty bash
srun: job 30934 queued and waiting for resources
srun: job 30934 has been allocated resources
demovic@n143 ~ > scontrol show job 30934
JobId=30934 JobName=bash
UserId=demovic(187000051) GroupId=demovic(187000051) MCS_label=N/A
Priority=18994 Nice=0 Account=p70-23-t QOS=p70-23-t_gpu
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:03 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-08T15:52:39 EligibleTime=2023-08-08T15:52:39
AccrueTime=2023-08-08T15:52:39
StartTime=2023-08-08T15:52:40 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-08T15:52:40 Scheduler=Main
Partition=ngpu AllocNode:Sid=login02:9298
ReqNodeList=(null) ExcNodeList=(null)
NodeList=n143
BatchHost=n143
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=1,mem=62.50G,node=1,billing=16,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=0:0:: CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/home/demovic
Power=
MemPerTres=gpu:64000
TresPerJob=gres:gpu:1
When using 2 GPUs, the billing rate is 32.
demovic@login02 ~ > srun --partition=ngpu -G 2 --pty bash
srun: job 30937 queued and waiting for resources
srun: job 30937 has been allocated resources
demovic@n143 ~ > scontrol show job 30937
JobId=30937 JobName=bash
UserId=demovic(187000051) GroupId=demovic(187000051) MCS_label=N/A
Priority=18994 Nice=0 Account=p70-23-t QOS=p70-23-t_gpu
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:04 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-08T16:04:17 EligibleTime=2023-08-08T16:04:17
AccrueTime=2023-08-08T16:04:17
StartTime=2023-08-08T16:04:17 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-08T16:04:17 Scheduler=Main
Partition=ngpu AllocNode:Sid=login02:9298
ReqNodeList=(null) ExcNodeList=(null)
NodeList=n143
BatchHost=n143
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=1,mem=125G,node=1,billing=32,gres/gpu=2
Socks/Node=* NtasksPerN:B:S:C=0:0:: CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/home/demovic
Power=
MemPerTres=gpu:64000
TresPerJob=gres:gpu:2
And finally, although the job needs just one GPU, it is billed at the rate of 64, since it uses the node's entire memory.
demovic@login02 ~ > srun --partition=ngpu -G 1 --mem=250GB --pty bash
srun: job 30936 queued and waiting for resources
srun: job 30936 has been allocated resources
demovic@n143 ~ > scontrol show job 30936
JobId=30936 JobName=bash
UserId=demovic(187000051) GroupId=demovic(187000051) MCS_label=N/A
Priority=18994 Nice=0 Account=p70-23-t QOS=p70-23-t_gpu
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-08-08T15:59:37 EligibleTime=2023-08-08T15:59:37
AccrueTime=2023-08-08T15:59:37
StartTime=2023-08-08T15:59:38 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-08T15:59:38 Scheduler=Main
Partition=ngpu AllocNode:Sid=login02:9298
ReqNodeList=(null) ExcNodeList=(null)
NodeList=n143
BatchHost=n143
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=1,mem=250G,node=1,billing=64,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=0:0:: CoreSpec=*
MinCPUsNode=1 MinMemoryNode=250G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/home/demovic
Power=
MemPerTres=gpu:64000
TresPerJob=gres:gpu:1
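If you only need the billing rate itself, you can filter the scontrol output. The one-liner below is a minimal sketch using standard tools, applied to the first example job above:
demovic@login02 ~ > scontrol show job 30925 | grep -o 'billing=[0-9]*'
billing=64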
Monitoring Past Jobs Efficiency¶
Walltime estimation and job efficiency
By default, none of the regular jobs you submit can exceed a walltime of 4 days (4-00:00:00).
However, it is in your strong interest to estimate the walltime of your jobs accurately.
While this is not always possible, and it can be quite hard to guess at the beginning of a job campaign (where you will probably ask for the maximum walltime possible), you should look back at your historical usage and review the efficiency and elapsed time of your previously completed jobs using the seff utility.
Update the time constraint (#SBATCH -t [...]) of your jobs accordingly, as shorter jobs are scheduled faster.
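For instance, if seff shows that a given type of job reliably finishes well within two days, you can request a padded estimate instead of the 4-day maximum. The job script header below is a minimal sketch; the partition and time value are purely illustrative:
#!/bin/bash
#SBATCH --partition=ncpu
#SBATCH --ntasks=64
#SBATCH --time=1-18:00:00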
The seff utility helps you track the CPU/memory efficiency of jobs which have already completed and exited the queue; if you run it while a job is still in the R (RUNNING) state, it might report incorrect information. The command is invoked as:
login01:~$ seff <jobid>
Jobs with different CPU/Memory efficiency
login01:~$ seff <jobid>
Job ID: <jobid>
User/Group: user1/group1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 41-01:38:14
CPU Efficiency: 99.64% of 41-05:09:44 core-walltime
Job Wall-clock time: 1-11:19:38
Memory Utilized: 2.73 GB
Memory Efficiency: 2.13% of 128.00 GB
login01:~$ seff <jobid>
Job ID: <jobid>
User/Group: user1/group1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 14:24:49
CPU Efficiency: 23.72% of 2-12:46:24 core-walltime
Job Wall-clock time: 03:47:54
Memory Utilized: 193.04 GB
Memory Efficiency: 75.41% of 256.00 GB
login01:~$ seff <jobid>
Job ID: <jobid>
User/Group: user1/group1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 87-16:58:22
CPU Efficiency: 86.58% of 101-07:16:16 core-walltime
Job Wall-clock time: 1-13:59:19
Memory Utilized: 212.39 GB
Memory Efficiency: 82.96% of 256.00 GB
This illustrates a very bad job in terms of CPU/memory efficiency (below 4%): the user effectively wasted 4 hours of computation while occupying a full node and its 64 cores.
login01:~$ seff <jobid>
Job ID: <jobid>
User/Group: user1/group1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 00:08:33
CPU Efficiency: 3.55% of 04:00:48 core-walltime
Job Wall-clock time: 00:08:36
Memory Utilized: 55.84 MB
Memory Efficiency: 0.05% of 112.00 GB
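To review several recently finished jobs at once, you can feed job IDs from sacct into seff. The loop below is a minimal sketch using standard Slurm tools; the one-week window and the COMPLETED filter are just illustrative choices:
login01:~$ for jobid in $(sacct -n -X --starttime=now-7days --state=COMPLETED --format=JobID); do seff "$jobid"; done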
sprojects - View Projects Information¶
This command displays information about the projects available to a user, such as available allocations, shared directories and members of the project team.
The sprojects script shows the available Slurm accounts (projects) for the selected user ID. If no user is specified (with -u), the script will display the information for the current user.
Show available accounts for the current user
user1@login01:~$ sprojects
The following slurm accounts are available for user user1:
p70-23-t
The -a option forces the script to display just the allocations (in corehours or GPU hours), shown as SPENT/AWARDED.
Show all available allocations for the current user
login01:~$ sprojects -a
+=================+=====================+
| Project | Allocations |
+-----------------+---------------------+
| p70-23-t | CPU: 10/50000 |
| | GPU: 0/12500 |
+=================+=====================+
With the -f option, the script will display more details (including available allocations).
Show full info for the current user
login01:~$ sprojects -f
+=================+=========================+============================+=====================+
| Project | Allocations | Shared storages | Project users |
+-----------------+-------------------------+----------------------------+---------------------+
| p371-23-1 | CPU: 182223/500000 | /home/projects/p371-23-1 | user1 |
| | GPU: 542/1250 | /scratch/p371-23-1 | user2 |
| | | | user3 |
+-----------------+-------------------------+----------------------------+---------------------+
| p81-23-t | CPU: 50006/50000 | /home/projects/p81-23-t | user1 |
| | GPU: 766/781 | /scratch/p81-23-t | user2 |
+-----------------+-------------------------+----------------------------+---------------------+
| p70-23-t | CPU: 485576/5000000 | /home/projects/p70-23-t | user1 |
| | GPU: 544/31250 | /scratch/p70-23-t | user2 |
| | | | user4 |
| | | | user5 |
| | | | user6 |
| | | | user7 |
+=================+=========================+============================+=====================+