en:services:application_services:high_performance_computing:faq [2021/07/05 11:51] – [How can I acknowledge the GWD HPC system in my paper?] mboden
en:services:application_services:high_performance_computing:faq [2022/07/22 08:51] (current) – [I'm running out of space. How can I increase my quota?] vend
 +====== Frequently Asked Questions ======
 +
 +===== How can I acknowledge the GWDG HPC system in my paper? =====
 +Please add the following to the acknowledgements of your paper:
 +> This work used the Scientific Compute Cluster at GWDG, the joint data center of Max Planck Society for the Advancement of Science (MPG) and University of Göttingen.
 +Additionally, we would appreciate it if you sent us an [[mailto:hpc-support@gwdg.de|email]] to let us know.
 +
 +===== Job Killed with oom-kill events or out-of-memory handler =====
 +Problem: Your job is killed with a message like this:
 +<code>slurmstepd: error: Detected 1 oom-kill event(s) in StepId=[JOBID].batch
 +cgroup. Some of your processes may have been killed by the cgroup
 +out-of-memory handler.</code>
 +Solution: Your job ran out of memory, i.e. your program used more memory (RAM) than you requested. Please request [[en:services:application_services:high_performance_computing:running_jobs_slurm#memory_selection|more memory]].
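A minimal job script sketch that raises the memory request. The value ''16G'', the log file name and the commented-out program name are placeholders and not recommendations; the linked memory selection page describes the options that apply on this cluster.

```shell
#!/bin/bash
#SBATCH --mem=16G         # memory per node; 16G is a placeholder value
#SBATCH -o job-%j.out     # %j is replaced by the job ID

# Hypothetical program; replace with your own command.
# srun ./my_program
echo "job script finished"
```

If job accounting is enabled, ''sacct -j <JobID> --format=JobID,ReqMem,MaxRSS'' lets you compare the requested memory with the peak usage of a finished job, which helps in choosing a realistic value.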
 +
 +===== MPI Program crashes with "Illegal instruction" =====
 +The MPI program crashes with ''Illegal instruction'', sometimes buried in lots of errors like this:
 +<code>[1623330291.399962] [dmp029:32333:0]          debug.c:1358 UCX  WARN  ucs_recursive_spinlock_destroy() failed (-15)
 +[1623330291.429578] [dmp029:32333:0]          debug.c:1358 UCX  WARN  ucs_recursive_spinlock_destroy() failed (-15)
 +[dmp029:32333:0:32333] Caught signal 4 (Illegal instruction: tkill(2) or tgkill(2))
 +==== backtrace (tid:  32332) ====
 + 0 0x0000000000051ffe ucs_debug_print_backtrace()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/debug/debug.c:656
 + 1 0x0000000000053096 ucs_debug_save_original_sighandler()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/debug/debug.c:1208
 + 2 0x0000000000053096 ucs_set_signal_handler()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/debug/debug.c:1245
 + 3 0x0000000000053096 ucs_debug_init()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/debug/debug.c:1319
 + 4 0x000000000003f30c ucs_init()  /dev/shm/spack/build/parallel/spack-stage-ucx-1.9.0-kdxeqz6csc6whvnmxeqijga5e5nugrgs/spack-src/src/ucs/sys/init.c:91
 + 5 0x000000000000f9c3 _dl_init_internal()  :0
 + 6 0x000000000000117a _dl_start_user()  :0
 +</code>
 +
 +This is usually due to differences in processor architecture between the machine the code was compiled on and the machine the program ran on.
 +
 +Specifically: If you compiled your code on a newer system, such as our frontends gwdu101 and gwdu102, and try to run it on one of the older nodes, such as the dmp or dfa nodes, it will crash with an error like this.
 +
 +To mitigate this, either add ''#SBATCH -C cascadelake'' to your job script to limit it to nodes with a Cascade Lake processor, or compile the code on our older frontend gwdu103.
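As a sketch of the first option, a job script could carry the constraint and also log the CPU model of the node it landed on, which makes a mismatch easier to spot afterwards. The program name is hypothetical, and reading ''/proc/cpuinfo'' assumes a Linux node.

```shell
#!/bin/bash
#SBATCH -C cascadelake    # constraint named in the text above

# Log the CPU model of the node the job landed on (Linux only).
grep -m1 'model name' /proc/cpuinfo 2>/dev/null || echo "CPU model not available"

# Hypothetical program; replace with your own command.
# srun ./my_mpi_program
```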
 +
 +===== Jobs Pending with (QOSGrpCpuLimit) =====
 +If your jobs are pending with ''(QOSGrpCpuLimit)'', all the global job slots for the QoS are currently in use. This is not a limit on your user account. We have a global limit of 2000 cores being used simultaneously in the ''long'' QoS, so your job has to wait until enough cores are available.
 +
 +===== Job crashes without output or error (especially when using scratch) =====
 +If the directory of your output/error file (''#SBATCH -o/-e'') does not exist, Slurm cannot create the output file and the job crashes. Slurm then has no way of telling you what went wrong, because it cannot write the error anywhere, which results in these silent crashes.
 +
 +This is especially frequent when using /scratch as your work directory. Your directory structure may exist on one of the scratch file systems, but not on the other. So if you do not specify ''#SBATCH -C scratch[2]'', you may end up on the wrong file system and the error above happens. Keep in mind that the file system on frontend-fas (aka gwdu103) is scratch2, not scratch.
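One way to avoid the silent crash is to create the log directory before submitting. The following is a sketch only: the directory name ''slurm-logs'' and the script name ''job.sh'' are hypothetical defaults, and the ''sbatch'' call is skipped where Slurm is not installed.

```shell
#!/bin/bash
# Hypothetical paths; adjust to your own layout.
log_dir="${LOG_DIR:-$PWD/slurm-logs}"
job_script="${JOB_SCRIPT:-job.sh}"

# Create the log directory before submitting; -p makes this a no-op if it exists.
mkdir -p "$log_dir"

# Submit only where Slurm is installed, so the sketch is harmless elsewhere.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --output="$log_dir/%j.out" --error="$log_dir/%j.err" "$job_script"
else
    echo "sbatch not found; created $log_dir only"
fi
```

When the log directory lives on /scratch, run this on a machine that mounts the same scratch file system the job is constrained to; otherwise the directory is created on the wrong one.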
 +
 +===== I'm running out of space. How can I increase my quota? =====
 +A: Send an email to [[mailto:support@gwdg.de?subject=Increase Quota|support@gwdg.de]] requesting more quota. Please include your username and the amount of storage space you need. Before you do, please consider using scratch or moving old files into the archive.
 +
 +If you would like to know your current quota and how much of it you are already using, you can use the command ''Quota''.
 +
 +===== Can I get HPC access with my student account? =====
 +No, you cannot use the HPC system with your account associated with the stud.uni-goettingen.de domain. You need a full GWDG account. If you are employed by the University, the Max Planck Society or the University Medical Center, you have an "Einheitlicher Mitarbeiter Account". For more information, see [[en:services:application_services:high_performance_computing:account_activation|Account Activation]].
 +
 +===== The time limit of 120 hours (i.e. five days) is too short, my jobs need more time. Is that possible? =====
 +We do not encourage runtimes longer than five days, as the probability of failure increases over time. Instead, we highly recommend job dependency chains and checkpointing your program. If those chains are not feasible in your case, please contact hpc-support@gwdg.de.
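A job dependency chain can be sketched with a small shell function that submits each script so that it starts only after the previous job has ended. The function name ''submit_chain'' and the script names are hypothetical; ''--parsable'' and ''--dependency'' are standard ''sbatch'' options. The sketch assumes each script resumes from the checkpoint written by its predecessor.

```shell
#!/bin/bash
# Submit the given job scripts as a chain: each job waits for the previous one.
# afterany releases the next job once the previous one has ended in any state,
# including hitting the time limit; use afterok to continue only after success.
submit_chain() {
    local prev="" script jobid
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency=afterany:"$prev" "$script") || return 1
        fi
        # --parsable may print "jobid;cluster"; keep only the job ID.
        prev=${jobid%%;*}
        echo "submitted $script as job $prev"
    done
}

# Usage (hypothetical script names):
# submit_chain part1.sh part2.sh part3.sh
```

Each link of the chain then fits within the five-day limit, and a failure costs at most the work done since the last checkpoint.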
 +
 +===== How much does HPC usage cost? =====
 +All our services, including the usage of our HPC resources, are accounted for in [[https://www.gwdg.de/about-us/catalog/kontingentierung/hauptmerkmale-der-kontingentierung|Arbeitseinheiten (AE)]]. For the current pricing, see the [[http://lotus1.gwdg.de/gwdgdb/Katalog.nsf/376aa1b45ea0519ac12569bc004b3ba1/d0255685063c2e2ac1256b90002b27b2?OpenDocument|Dienstleistungskatalog]].
 +
 +===== When will my job start? =====
 +A: You can use ''scontrol show job <JobID> | grep StartTime'' to get an estimate of the start time of your job. However, this information is not always available or accurate, and it depends on many factors.
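If you check this often, the command can be wrapped in a small helper that prints only the ''StartTime'' field. The function name ''job_start_time'' is hypothetical, and the sketch assumes a ''grep'' that supports ''-o''.

```shell
#!/bin/bash
# Print only the StartTime=... field of a job; $1 is the job ID.
job_start_time() {
    scontrol show job "$1" | grep -o 'StartTime=[^ ]*'
}

# Usage (hypothetical job ID):
# job_start_time 1234567
```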
 +
 +===== Why can't I use Gaussian? =====
 +A: Access to Gaussian is restricted due to license requirements. More information can be found [[en:services:application_services:high_performance_computing:software:gaussian|here]].
 +