Greetings to all Deep Thought HPC Users!
This is Scott, one of the HPC Support officers – this is a notice of upcoming changes to put some new features in place for our HPC.
Please read this (rather long) notice right to the end, as there are lots of MAJOR changes that I am excited to put into place.
The Short-List:
- Expected Length of Downtime: 1 Day
- Scheduled Date of Downtime: 09/06/2020
ALL JOBS STILL RUNNING WILL BE LOST. THIS IS A DESTRUCTIVE UPDATE TO THE SLURM SCHEDULER AND YOU CANNOT RESUME AN OLD JOB FROM THE PREVIOUS VERSION. ENSURE ALL JOBS ARE COMPLETE BEFORE THE PRESCRIBED DATE
Change List:
- Change in the Resource allocation system
- Removal of HPC Partitions
- Limitations on job resources
- Implementation of Fairshare & Priority System
- GPUs online for usage
- Expansion of the Accounting System
What this means for you as a user:
Change in the Resource allocation system:
Too long, didn't read (TL;DR):
- Instead of a Whole Node, you get whatever resources you asked for, nothing more, nothing less.
- If you use more than what you asked for, your job WILL be TERMINATED
- Needed to track GPUs & Fairshare
Currently, SLURM allocates you an entire node per job, which was fine while we were getting everything up and running. Going forwards, SLURM will allocate you the exact resources that you ask for, and nothing more. As a side effect, if you ask for 10 GB of RAM and your job tries to use 20 GB, SLURM will terminate your job due to over-utilization of resources.
This change, tracking CPU & RAM as separate resources that can be allocated independently, means that we can have more jobs running at one time (which is good for you!).
As this changes the entire way SLURM tracks jobs, handles resume data and schedules new jobs, it is a destructive change. We cannot resume any jobs that were running under the old scheduler, as they are incompatible at a base level.
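To give you a rough idea, a job script under the new model might look something like the sketch below (the job name, program and resource figures are placeholders only; size them to your actual job):

    #!/bin/bash
    #SBATCH --job-name=my_job        # placeholder name
    #SBATCH --ntasks=1               # a single task
    #SBATCH --cpus-per-task=4        # you get exactly 4 cores, no more
    #SBATCH --mem=10G                # going past 10 GB will terminate the job
    #SBATCH --time=02:00:00          # requested wall time

    srun ./my_program                # placeholder executable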
Removal of HPC Partitions:
TL;DR:
- No more hpc_cse, hpc_cmph, etc.; just one big pool that everybody gets to play in, called hpc_general, which will be the default
- You MUST UPDATE your SLURM scripts to either submit to this new partition or remove the --partition=X line, letting SLURM allocate you to the default queue
When we implement Fairshare, we can also get rid of most of the artificial barriers currently present on the HPC. This means that you will no longer be artificially segmented to specific nodes; the whole HPC (minus the private nodes) is open for compute usage.
This is a better deal both for you as a user (more resources!) and for us as management (less configuration!). The sharing/priority allocation system is managed by your Fairshare score.
When submitting jobs, you will no longer have to specify a --partition=X, as the default job queue you get allocated to will be the new hpc_general queue. You can either remove the --partition=X directive or change the partition to hpc_general in your SLURM scripts.
MELFU USERS: You still need to specify the partition, as your job queue is unchanged. If you need access to the GPUs, however, you will need to submit the job to the hpc_general queue.
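For everyone else, as a quick sketch of the change (the old partition name is purely illustrative): a script that used to contain

    #SBATCH --partition=hpc_cse

should either have that line removed entirely, or have it changed to

    #SBATCH --partition=hpc_general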
Limitations on job resources:
TL;DR:
- If you use more than what you ask for, your job is terminated.
- If you leak into another user's allocation by going over what you asked for, there is a massive performance drain across the whole compute node.
- This is required for GPU allocations to function correctly
When you ask for resources, they are allocated exclusively – that is, nobody else has access to them. If you 'leak' over into another user's allocation, there is a severe performance decrease across the entire compute node. So, to prevent this, and because we are breaking down the existing barriers into one giant pool, SLURM will now track your "Requested" vs your "Usage". If you go over what you requested, your job will be terminated.
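If you want to keep an eye on what a running job is actually consuming (so you can right-size your requests before anything gets terminated), SLURM's sstat tool can report live usage; a minimal sketch, with the job ID as a placeholder:

    # The .batch step holds the usage of the batch script itself
    sstat -j 1234.batch --format=JobID,MaxRSS,AveCPU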
This tracking is also needed because GPUs work a little differently – you need a certain amount of CPU & RAM to feed a GPU and make decent use of it. As such, if you ask for a GPU, 250 MB of RAM and 1 CPU, your job will be boosted to a level that can actually supply data to the GPU at a reasonable rate. SLURM will override your job settings in this case: a Tesla V100 GPU has 32 GB of VRAM on it, so we (the administrators) monitor the GPU/CPU/RAM ratios and set the 'minimum' amounts. If you ask for more, that is fine. Think of it as a 'minimum to use a GPU correctly'. This change also means that if you try to submit to a partition that is not yours (or that you do not have permissions on, e.g. hpc_melfu), SLURM will deny your job outright. These values can be tweaked by the support team on the fly, so we can tailor them to how the GPUs are being used.
Implementation of Fairshare & Priority System:
TL;DR:
- Every job gets a ‘Priority Number’
- Greedy users (lots of Compute) cannot start jobs if others with a better score are waiting
- This is adjustable ‘on-the-fly’ by administrators
- The more resources you ASK for (ASK, not use!) the worse your score. You ask for 64 Cores, you get billed for 64 Cores, even if you only used 3.
- Jobs will be stopped from STARTING at this point
Fairshare is the job allocation algorithm for SLURM that allows us (the admins of the HPC) to set weightings against specific resources (hence the change to how we allocate resources!), timed back-offs, dampening factors, account-level usage weightings, contention, time waiting, job size and a whole host of other factors. If you want to read more on it, head on over to the SLURM home page, as it's quite complicated.
What it boils down to is that the less compute you use, the less you are penalized relative to other users. This works by tracking the resources that get allocated to you. If you skipped the TL;DR: whatever resources you ask for is what you are billed for, so I would highly advise using the 'seff' tool to compare your usage against what you requested and tailor your SLURM scripts accordingly.
This means that if a user sets a heavy, 200-job array going, it is entirely possible that SLURM will suspend the start of step 187 (i.e., step 186 just finished and step 187 wants to start) to let a different job run, due to the priority system. In other words, if your Fairshare score is terrible (remember, this is comparative to others) and users with lower usage than you (and therefore a better Fairshare score) are waiting, your job will be put into 'PENDING, Priority' to indicate that it is held due to Fairshare, and it will run when it gets slotted in.
This score is calculated for ALL jobs, both RUNNING and WAITING every 3 minutes.
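If you want to check why one of your jobs is still sitting in the queue, squeue can show the reason column; a small sketch (the format string is just an illustration):

    # %T is the job state, %r is the reason it is waiting (e.g. Priority)
    squeue -u $USER -o "%.10i %.12P %.20j %.8T %.12r"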
New Commands:
- ‘sprio’, displays the priority table for the cluster.
- ‘seff $JOB_ID’ displays some efficiency calculations on resource usage.
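A quick example of both (the job ID is a placeholder):

    # Long listing of pending-job priorities across the cluster
    sprio -l
    # CPU and memory efficiency summary for a finished job
    seff 1234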
GPUs online for usage:
TL;DR:
- Ask for #SBATCH --gres=gpu:X, where X is 1 or 2
- Get allocated the number of GPUs you ask for
You will now be able to request a resource type of gres/gpu:X, where X is 1 or 2. There are 4 GPUs in total, set up as 2 per node. You ask for the GPUs and SLURM will allocate them to you as needed.
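A minimal GPU job sketch (the CPU and memory figures here are purely illustrative; the enforced minimums are whatever we set, per the resource-limits section above):

    #!/bin/bash
    #SBATCH --partition=hpc_general   # the GPUs live in the general pool
    #SBATCH --gres=gpu:1              # request 1 GPU (2 per node maximum)
    #SBATCH --cpus-per-task=4         # illustrative CPU count to feed the GPU
    #SBATCH --mem=32G                 # illustrative memory request

    srun ./my_gpu_program             # placeholder executable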
This requires a massive amount of setup for all aspects of the HPC (Software, Operating System, Modules, Hardware, Scheduler and Management Software) and while we will do our best to have them online the same day, they may not make the initial day of downtime, and worst case scenario, could take several weeks to bring them online. However, this will be a non-destructive change, and we can do this invisibly to you (the users), in the background.
A good note: because we only have 4 GPUs for the whole cluster, they impact your Fairshare score in a LARGE way. As an example, a CPU core is worth 1, while a GPU will start at a weighting of 5. So, do not be greedy!
Expansion of the Accounting System:
TL;DR:
- We can now track individual resources + esoteric ones
- Better historical data for usage tracking
This one goes hand in hand with all the other changes – the 'sacct' command will have fewer empty fields in it, as we expand the tracking to include more individual resources. As a user, you can query the accounting database via this tool to grab historical data and to see everything SLURM has tracked regarding your resource usage – up to 30 metrics! It's quite a detailed tool, but it is mainly for administrative tracking – lots of reports for us to really get into the fine-grained usage of the HPC and track down any potential bottlenecks or issues before they become a problem.
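If you want to pull your own history out of it, here is a small example (the format fields are standard sacct options; pick whichever ones you need):

    # Recent jobs for your user, with requested vs. peak memory, core count and state
    sacct -u $USER --format=JobID,JobName,Partition,ReqMem,MaxRSS,AllocCPUS,Elapsed,State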
Thanks for reading all the way to the end, and if you have any questions, please don't hesitate to email me (Scott Anderson) via scott.anderson@flinders.edu.au