Please add this to your acknowledgements in your paper:
This work used the Scientific Compute Cluster at GWDG, the joint data center of Max Planck Society for the Advancement of Science (MPG) and University of Göttingen.
Additionally, we would be happy if you notified us by email.
Problem: Your job is killed with a message like this:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=[JOBID].batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Solution: Your job ran out of memory, i.e. your program used more memory (RAM) than you requested. Please request more memory.
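For example, in your jobscript (a minimal sketch; the 8 GB value, runtime and program name are placeholders, adjust them to what your program actually needs):

#!/bin/bash
#SBATCH -t 01:00:00         # placeholder runtime
#SBATCH --mem=8G            # total memory per node; use --mem-per-cpu to request memory per core instead

./my_program                # placeholder for your program

You can check how much memory a finished job actually used with, for example, sacct -j <JobID> --format=JobID,MaxRSS (MaxRSS is the peak resident memory Slurm recorded for each step).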
The MPI program crashes with Illegal instruction, sometimes buried in lots of errors like this:
[1623330291.399962] [dmp029:32333:0] debug.c:1358 UCX WARN ucs_recursive_spinlock_destroy() failed (-15)
[1623330291.429578] [dmp029:32333:0] debug.c:1358 UCX WARN ucs_recursive_spinlock_destroy() failed (-15)
[dmp029:32333:0:32333] Caught signal 4 (Illegal instruction: tkill(2) or tgkill(2))
==== backtrace (tid: 32332) ====
 0 0x0000000000051ffe ucs_debug_print_backtrace()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/debug/debug.c:656
 1 0x0000000000053096 ucs_debug_save_original_sighandler()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/debug/debug.c:1208
 2 0x0000000000053096 ucs_set_signal_handler()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/debug/debug.c:1245
 3 0x0000000000053096 ucs_debug_init()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/debug/debug.c:1319
 4 0x000000000003f30c ucs_init()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/sys/init.c:91
 5 0x000000000000f9c3 _dl_init_internal()  :0
 6 0x000000000000117a _dl_start_user()  :0
This is usually due to differences in processor architecture between the machine the code was compiled on and the machine the program ran on.
Specifically: if you compiled your code on a newer system, such as our frontends gwdu101 and gwdu102, and try to run it on one of the older nodes, such as the dmp or dfa nodes, it will crash with an error like this.
To mitigate this, please add #SBATCH -C cascadelake to your jobscript to limit it to nodes with a Cascade Lake processor.
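For example, a jobscript restricted to Cascade Lake nodes might look like this (a minimal sketch; core count, runtime and program name are placeholders):

#!/bin/bash
#SBATCH -C cascadelake      # only run on nodes with a Cascade Lake processor
#SBATCH -n 16               # placeholder core count
#SBATCH -t 02:00:00         # placeholder runtime

mpirun ./my_mpi_program     # placeholder for your MPI program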
If your jobs are pending with (QOSGrpCpuLimit), it means that all the global job slots for the QoS are currently in use. It has nothing to do with your user being limited. We have a global limit of 2000 cores being used simultaneously in the long QoS. Your job has to wait until enough cores are available.
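You can display the reason a pending job is waiting yourself, for example with squeue (a sketch; the format string just adds a column for the pending reason):

squeue -u $USER -o "%.18i %.9P %.8T %.20r"   # job id, partition, state and pending reason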
If the directory of your output/error file (#SBATCH -o/-e) does not exist, Slurm cannot create the output file and the job crashes. Slurm then has no way of telling you what went wrong, as it cannot write the error anywhere, which results in these silent crashes.
This is especially frequent when using /scratch as your work directory. Your directory structure may exist on one of the scratch file systems, but not on the other. So if you do not specify #SBATCH -C scratch[2], you may end up on the wrong file system and the error described above happens.
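A sketch of how to avoid this (the paths are placeholders): create the directory on the intended file system before submitting, and pin the job to that file system:

mkdir -p /scratch/users/$USER/myproject/logs     # create the directory first; placeholder path

and in the jobscript:

#SBATCH -C scratch                               # or scratch2, matching where the directory exists
#SBATCH -o /scratch/users/$USER/myproject/logs/job-%j.out
#SBATCH -e /scratch/users/$USER/myproject/logs/job-%j.err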
A: Write a mail to support@gwdg.de requesting more quota. Please include your username and the amount of storage space you need. Please always consider using scratch or moving old files into the archive beforehand.
If you would like to know your current quota and how much of it you are already using, you can use the command Quota.
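For example, simply run it on a frontend node:

Quota    # prints your current usage and limits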
No, you cannot use the HPC system with your account associated with the stud.uni-goettingen.de domain. You need a full GWDG account. If you are employed by the University, the Max Planck Society or the University Medical Center, you have an “Einheitlicher Mitarbeiter Account” (unified employee account). For more information, see Account Activation.
We do not encourage runtimes longer than five days, as the probability of a failure increases with runtime. Instead, we highly recommend job dependency chains and checkpointing your program, as sketched below. If such chains are not feasible in your case, please contact hpc-support@gwdg.de.
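A minimal sketch of such a dependency chain (the jobscript names are placeholders): split the work into parts that each fit within the runtime limit, and let every part resume from the checkpoint written by the previous one:

JOBID=$(sbatch --parsable part1.sh)             # --parsable makes sbatch print only the job id
sbatch --dependency=afterok:$JOBID part2.sh     # part2 only starts after part1 finished successfully

How each part writes and restores its checkpoint depends on your program.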
All our services, including the usage of our HPC resources, are accounted in Arbeitseinheiten (AE, work units). For the current pricing, see the Dienstleistungskatalog (service catalogue).
A: You can use scontrol show job <JobID> | grep StartTime to get an estimate of the start time of your job. However, this information is not always available or accurate, and depends on many factors.
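As an alternative sketch, squeue can print the same estimate for a pending job:

squeue --start -j <JobID>    # shows the expected start time, if the scheduler has computed one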
A: Access to Gaussian is restricted due to license requirements. More information can be found here.