====== Running Jobs with Slurm ======

In the following, the basic concepts are described.

**Cluster**\\
A collection of networked computers intended to provide compute capabilities.

**Node**\\
One of these computers, also called a host.

**Frontend**\\
A special node provided to interact with the cluster via shell commands. gwdu101 and gwdu102 are our frontends.

**Task or (Job-)Slot**\\
Compute capacity for one process (or "thread") at a time, usually one processor core, or CPU for short.

**Job**\\
A compute task consisting of one or several parallel processes.

**Batch System**\\
The management system distributing job processes across job slots. In our case [[https://slurm.schedmd.com|Slurm]], which is operated by shell commands on the frontends.

**Serial job**\\
A job consisting of one process using one job slot.

**SMP job**\\
A job with shared memory parallelization (often realized with OpenMP), meaning that all processes need access to the memory of the same node. Consequently, an SMP job uses several job slots //on the same node//.

**MPI job**\\
A job with distributed memory parallelization, realized with MPI. It can use several job slots on several nodes and needs to be started with ''mpirun'' or the Slurm substitute ''srun''.

**Partition**\\
A label to sort jobs by general requirements and intended execution nodes. Formerly called "queue".

===== The sbatch Command: Submitting Jobs to the Cluster =====

''sbatch'' submits information on your job to the batch system:

  * What is to be done? (path to your program and required parameters)
  * What are the requirements? (for example partition, number of processes, maximum runtime)

Slurm then matches the job's requirements against the capabilities of the available job slots. Once sufficient suitable job slots are found, the job is started. Slurm starts jobs in the order of their priority.

===== Available Partitions =====

We currently have two meta partitions, **medium** and **fat**, corresponding to broad application profiles, as well as the **fat+** and **gpu** partitions for special requirements:

**medium**\\
This is our general purpose partition, usable for serial and SMP jobs with up to 24 tasks, but it is especially well suited for large MPI jobs. Up to 1024 cores can be used in a single MPI job, and the maximum runtime is 48 hours.

**fat**\\
This is the partition for SMP jobs, especially those requiring lots of memory. Serial jobs with very high memory requirements also belong in this partition. Up to 24 cores and up to 512 GB of memory are available on one host. The maximum runtime is 48 hours.\\
The nodes of the fat+ partition are also present in this partition, but they will only be used if they are not needed for bigger jobs submitted to the fat+ partition.

**fat+**\\
This partition is meant for very memory-intensive jobs that require more than 512 GB of RAM on a single node. Its nodes have 1.5 or 2 TB of RAM. You are required to specify your memory needs on job submission to use these nodes (see [[en:services:application_services:high_performance_computing:running_jobs_slurm#resource_selection|resource selection]]).\\
As general advice: try your jobs on the smaller nodes in the fat partition first, work your way up, and don't be afraid to ask for help here.

**gpu**\\
A partition for nodes containing GPUs. Please refer to [[en:services:application_services:high_performance_computing:running_jobs_slurm#gpu_selection]].

==== Runtime limits (QoS) ====

The default maximum time limit you can request using ''-t'' / ''--time'' is 48 hours.
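As a minimal illustration (the script name ''myjob.sh'' and the chosen values are placeholders, not recommendations), a job requesting the full 48-hour limit on the medium partition could be submitted like this:

<code bash>
# Hypothetical example: request 48 hours of runtime on the medium partition.
# "myjob.sh" stands in for your own job script.
sbatch -p medium -t 48:00:00 myjob.sh
</code>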
You can use a "Quality of Service" or **QoS** to modify this limit on a per-job basis. (You still have to specify the actual runtime for your job using ''-t''.) Noteworthy QoS are:

**normal**\\
Used by default. Unlike the other QoS, this does not change/override the per-partition setting for the maximum runtime.

**2h**\\
Here, the maximum runtime is decreased to two hours. In turn, the queue has a higher base priority, but it also has limited job slot availability. That means that as long as only a few jobs are submitted using ''--qos 2h'', there will be minimal waiting times. This is intended for testing and development, not for massive production.

**96h**\\
Increases the maximum runtime to 96 hours. This is available on request, but you must have good reasons why you need the increased duration. Under normal circumstances, jobs should not run any longer than 48 hours. You have to prove that you have tried and exhausted other technical solutions such as snapshotting, splitting the job into independent parts, or using more CPUs/nodes to reduce the runtime. If none of these are feasible for your workload, you can get access to this and even longer QoS. The longer the runtime you need, the more time and effort you must demonstrate having invested in looking for an alternative solution.

===== How to submit jobs =====

Slurm supports different ways to submit jobs to the cluster: interactively or in batch mode. We generally recommend using batch mode. If you need to run a job interactively, you can find information about that in the [[en:services:application_services:high_performance_computing:running_jobs_slurm#interactive_session_on_the_nodes|corresponding section]].

Batch jobs are submitted to the cluster using the ''sbatch'' command and a job script or a command:\\
''sbatch [jobscript.sh | --wrap=<command>]''\\

**sbatch** can take many options to give more information on the specifics of your job, e.g. where to run it, how long it will take, and how many nodes it needs. We will examine a few of these options in the following paragraphs. For a full list of options, refer to the manual with ''man sbatch''.

==== sbatch/srun options ====

**-A all**\\
Specifies the account 'all' for the job. This option is //mandatory// for users who have access to special hardware and want to use the general partitions.

**-p <partition>**\\
Specifies in which partition the job should run. Multiple partitions can be specified in a comma-separated list.

**-t <time>**\\
Maximum runtime of the job. If this time is exceeded, the job is killed. Acceptable