When debugging and optimizing your application, it is important to know what is actually happening inside the node. You can check this either while the job is running or after it has finished.
While the job is running, you can use ssh to get onto the nodes where your job runs. Use squeue --me to see which nodes those are. Once logged in to a node, you can take a look at the resource usage with standard Linux commands such as htop or free. Please keep in mind that these show ALL resources of the node, not just those allocated to you.
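A minimal sketch of this workflow, assuming your job runs on a node named gwdd042 (the node name is a placeholder; use the one that squeue lists for your job):

```bash
# Find out which node(s) your job is running on
squeue --me

# Log in to one of the listed nodes (placeholder node name)
ssh gwdd042

# Inspect the live resource usage of the whole node
htop        # interactive view of CPU and memory usage
free -h     # memory usage in human-readable units
```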
To get resource usage information about your job after it has finished, you can use the tool reportseff. This tool queries Slurm for your allocated resources and compares them to the resources actually used (as reported by Slurm). This gives you a good overview of your usage: Did I use all my cores and all my memory? Was my time limit too long?
Usage example:
```
# Display your recent jobs
gwdu101:121 14:00:00 ~ > module load py-reportseff
gwdu101:121 14:00:05 ~ > reportseff -u $USER
     JobID      State    Elapsed  TimeEff  CPUEff  MemEff
  12671730  COMPLETED   00:00:01     0.0%     ---    0.0%
  12671731  COMPLETED   00:00:00     0.0%     ---    0.0%
  12701482  CANCELLED   00:04:20     7.2%   49.6%    0.0%

# Give a specific job ID:
gwdu102:29 14:07:17 ~ > reportseff 12701482
     JobID      State    Elapsed  TimeEff  CPUEff  MemEff
  12701482  CANCELLED   00:04:20     7.2%   49.6%    0.0%
```
As you can see in the last example, I used only 4:20 minutes of my allocated 1 hour, resulting in a time efficiency of 7.2%. I used only half of my allocated cores (I allocated two and used only one) and basically none of my allocated memory. Next time, I should reduce the time limit, request one core less, and definitely request less memory.
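Applying that advice means tightening the resource requests in the job script. The following sketch shows what the adjusted #SBATCH directives could look like for the job above; the concrete values are only an illustration and depend on your application.

```bash
#!/bin/bash
# Hypothetical, tightened resource requests based on the reportseff output above
#SBATCH --time=00:10:00     # was 1 hour, the job only needed about 4:20 minutes
#SBATCH --cpus-per-task=1   # was 2 cores, only one was actually busy
#SBATCH --mem=1G            # was much larger, almost no memory was used

./my_application            # placeholder for the actual program
```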
Another tool is profit-hpc. For jobs longer than one hour, it prints a detailed usage summary of all nodes, CPUs and GPUs and even gives some general advice. You can use it with
```
profit-hpc <jobid>
```
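For example (12345678 is a hypothetical job ID; substitute the ID of one of your own finished jobs):

```bash
# 12345678 is a placeholder job ID
profit-hpc 12345678
```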