Differences

This shows you the differences between two versions of the page.

en:services:application_services:high_performance_computing:running_jobs_slurm:resource_usage [2022/07/28 14:10] mboden
en:services:application_services:high_performance_computing:running_jobs_slurm:resource_usage [2023/12/11 11:38] (current) – [During Runtime] skrey
Line 1: Line 1:
 +====== Resource Monitoring and Reports ======
  
 +When debugging and optimizing your application, it is important to know what is actually happening on the node. You can check this either while the job is running or once it has finished.
 +
 +===== During Runtime =====
 +
 +While the job is running, you can use ''ssh'' to log in to the nodes where your job runs. Use ''%%squeue --me%%'' to see which nodes your job runs on. Once logged in to a node, you can look at the resource usage with standard Linux commands such as [[https://linuxhint.com/htop-colors-meaning/|htop]] or [[https://linuxize.com/post/free-command-in-linux/|free]]. Please keep in mind that these show ALL resources of the node, not just those allocated to you.
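The steps above could be wrapped in two small helper functions. This is only a sketch: the function names ''my_job_nodes'' and ''my_procs'' are made up for illustration, the node name in the comments is a placeholder, and ''ps --sort'' assumes the procps version of ''ps'' found on most Linux systems. Because ''htop'' and ''free'' show the whole node, ''my_procs'' restricts the view to your own processes.

```bash
# Hypothetical helper: list job ID, state and node list of your own jobs.
# --noheader drops the header line; %i = job ID, %T = state, %N = nodes.
my_job_nodes() {
    squeue --me --noheader --format="%i %T %N"
}

# Hypothetical helper: show only your own processes on the current node,
# sorted by CPU usage (highest first).
my_procs() {
    ps -u "$(id -un)" -o pid,pcpu,pmem,rss,comm --sort=-pcpu
}

# Typical session (replace <nodename> with a node printed by my_job_nodes):
#   my_job_nodes
#   ssh <nodename>
#   my_procs
#   free -h
```

Run ''my_job_nodes'' on the login node, then define and run ''my_procs'' on the compute node after logging in with ''ssh''.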
 +
 +===== After the Job Has Finished / Reports =====
 +
 +To get resource usage information about your job after it has finished, you can use the tool [[https://github.com/troycomi/reportseff|reportseff]]. This tool queries Slurm for your allocated resources and compares them to the resources actually used (as reported by Slurm). This gives you a good overview of your usage: Did I use all my cores and all my memory? Was my time limit too long? \\ Usage example:
 +<code bash>
 +# Display your recent jobs
 +gwdu101:121 14:00:05 ~ > reportseff -u $USER
 +     JobID    State       Elapsed  TimeEff   CPUEff   MemEff 
 +
 +  12671730  COMPLETED    00:00:01   0.0%      ---      0.0%  
 +  12671731  COMPLETED    00:00:00   0.0%      ---      0.0%  
 +  12701482  CANCELLED    00:04:20   7.2%     49.6%     0.0%
 +  
 +  
 +# Give specific Job ID:
 +gwdu103:29 14:07:17 ~ > reportseff 12701482
 +     JobID    State       Elapsed  TimeEff   CPUEff   MemEff 
 +  12701482  CANCELLED    00:04:20   7.2%     49.6%     0.0%  
 +</code>
 +
 +As you can see in the last example, I used only 4 minutes and 20 seconds (260 s) of the 1 hour (3600 s) I had allocated, resulting in a time efficiency (''TimeEff'') of 7.2%. I used only half of my allocated cores (I allocated two and used only one) and basically none of my allocated memory. Next time, I should reduce the time limit, request one core less and definitely request less memory.
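The same reasoning can be automated for a whole list of jobs. The following is a minimal sketch, not part of reportseff: the function name ''flag_inefficient'' and the default threshold of 50% are assumptions chosen for illustration, and the function assumes plain-text input with the six columns shown in the example above.

```bash
# Hypothetical helper: print a hint for every job whose efficiency is below
# a threshold (in percent, default 50). Reads reportseff-style text on stdin
# with the columns: JobID State Elapsed TimeEff CPUEff MemEff
flag_inefficient() {
    awk -v limit="${1:-50}" '
        BEGIN {
            # Resource names for columns 4, 5 and 6.
            n = split("time cores memory", name, " ")
        }
        # Only data rows: six fields and a job ID starting with a digit.
        NF == 6 && $1 ~ /^[0-9]+/ {
            hints = ""
            for (i = 1; i <= n; i++) {
                value = $(i + 3)
                if (value == "---") continue   # no data reported
                sub(/%$/, "", value)           # strip the percent sign
                if (value + 0 < limit) hints = hints " " name[i]
            }
            if (hints != "") print $1 ": consider requesting less" hints
        }'
}
```

For example, ''%%reportseff -u $USER | flag_inefficient 50%%'' would list every job that used less than half of its time limit, cores or memory.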
 +
 +Another tool is ''profit-hpc''. For jobs longer than one hour, it prints a detailed usage summary of all nodes, CPUs and GPUs, and even gives some general advice. You can use it with ''profit-hpc <jobid>''.