Resource Monitoring and Reports
When debugging and optimizing your application, it is important to know what is actually happening inside the node. This can be done either while the job is running, or after it has finished.
While the job is running, you can use ssh to log in to the nodes where it runs. Use `squeue --me` to see which nodes your job runs on. Once logged in to a node, you can take a look at the resource usage with standard Linux commands such as `htop` or `free`. Please keep in mind that these show ALL resources of the node, not just those allocated to you.
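As a quick illustration of what `free` reports, the snippet below turns a "Mem:" line from `free -m` into a single usage percentage. This is a minimal sketch: the sample numbers are illustrative placeholders, not values from a real node, and on a node you would pipe the live `free -m` output into the same `awk` command instead.

```shell
# Parse a sample `free -m` "Mem:" line: total, used, free (in MiB).
# The numbers below are illustrative placeholders, not real node values.
sample='Mem: 385583 12044 373539'
echo "$sample" | awk '{ printf "node memory used: %d of %d MiB (%.1f%%)\n", $3, $2, 100*$3/$2 }'
# → node memory used: 12044 of 385583 MiB (3.1%)
```

Remember that this is the memory usage of the whole node, so on a shared node it includes other users' jobs as well.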
After the Job Finished / Reports
To get resource usage information about your job after it has finished, you can use the tool `reportseff`. This tool queries Slurm for your allocated resources and compares them to the resources actually used (as reported by Slurm). This gives you a good overview of your usage: Did I use all my cores and all my memory? Was my time limit too long?
```
# Display your recent jobs
gwdu101:121 14:00:05 ~ > reportseff -u $USER
     JobID      State    Elapsed  TimeEff  CPUEff  MemEff
  12671730  COMPLETED   00:00:01     0.0%     ---    0.0%
  12671731  COMPLETED   00:00:00     0.0%     ---    0.0%
  12701482  CANCELLED   00:04:20     7.2%   49.6%    0.0%

# Give specific Job ID:
gwdu103:29 14:07:17 ~ > reportseff 12701482
     JobID      State    Elapsed  TimeEff  CPUEff  MemEff
  12701482  CANCELLED   00:04:20     7.2%   49.6%    0.0%
```
As you can see in the last example, I used only 4:20 minutes of my allocated 1 hour, resulting in a time efficiency of 7.2%. I used only half of my allocated cores (I allocated two and used only one) and basically none of my allocated memory. Next time, I should reduce the time limit, request one core less, and definitely request less memory.
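The TimeEff figure above is just the elapsed time divided by the time limit. Using the numbers from the example (4 minutes 20 seconds elapsed, 1 hour limit), the arithmetic can be checked directly:

```shell
# TimeEff = elapsed seconds / time-limit seconds, as a percentage.
# 4 min 20 s = 260 s elapsed, 1 h = 3600 s limit (numbers from the example above).
awk 'BEGIN { printf "%.1f%%\n", 100 * (4*60 + 20) / 3600 }'
# → 7.2%
```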
Another tool is `profit-hpc`. For jobs longer than one hour, it prints a detailed usage summary of all nodes, CPUs, and GPUs, and even gives some general advice. You can use it with