====== Hail ======
===== Introduction =====
//Hail is an open-source, scalable framework for exploring and analyzing genomic data.// ([[https://hail.is/|hail.is]])
The HPC system runs version ''0.2 beta'', which can be obtained from [[https://github.com/hail-is/hail|GitHub]]. The cluster installation mostly follows the instructions for [[https://hail.is/docs/devel/installation.html#running-on-a-spark-cluster|Running on a cluster]].
===== Preparing a Spark Cluster =====
Hail runs on top of an [[https://spark.apache.org/docs/latest/cluster-overview.html|Apache Spark]] cluster. Before starting an interactive Hail session, a Standalone Spark cluster, consisting of a master and several workers, needs to be prepared.
==== Environment Variables ====
Start by loading the modules for the ''Oracle JDK 1.8.0'' and ''Spark 2.3.1'':
<code bash>
module load JAVA/jdk1.8.0_31 spark/2.3.1
</code>
Spark will attempt to write logs into the global installation directory, which is read-only, so please specify a log directory via the environment variable ''SPARK_LOG_DIR''. For example, to use the directory ''spark-logs'' in your home directory, enter (or add to ''~/.bashrc'') the following:
<code bash>
export SPARK_LOG_DIR=$HOME/spark-logs
</code>
==== Submitting Spark Applications ====
If you're just interested in running Hail, you can safely [[en:services:application_services:high_performance_computing:hail#running_hail|skip ahead]].
Applications can be submitted almost as described in the [[https://spark.apache.org/docs/latest/submitting-applications.html#submitting-applications|Spark documentation]], but the submission has to be wrapped inside a Slurm batch job like the one given by the following script:
<code bash>
#!/bin/bash
#SBATCH -p medium
#SBATCH -N 4
#SBATCH --ntasks-per-node=1
#SBATCH -t 01:00:00
lsf-spark-submit.sh $SPARK_ARGS
</code>
where ''spark-submit'' has been replaced by ''lsf-spark-submit.sh'' and ''$SPARK_ARGS'' contains the usual submit arguments except for ''--master'', which is added automatically depending on which cluster node the master has been launched on. The option ''-N 4'' requests 4 nodes in total, and ''--ntasks-per-node=1'' ensures that one worker is started per node. A sketch of a complete job script is shown below.
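As an illustration, the following sketch fills in ''$SPARK_ARGS'' for a hypothetical application; the JAR name ''my-app.jar'', the main class and the executor memory setting are placeholders and not part of the cluster installation:
<code bash>
#!/bin/bash
#SBATCH -p medium
#SBATCH -N 4
#SBATCH --ntasks-per-node=1
#SBATCH -t 01:00:00

# Hypothetical application JAR, main class and options: replace with your own.
# Note that --master is deliberately omitted here.
SPARK_ARGS="--class org.example.MyApp --executor-memory 4G my-app.jar"

# The wrapper adds the matching --master argument automatically.
lsf-spark-submit.sh $SPARK_ARGS
</code>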
==== Interactive Sessions ====
A Spark cluster to be used with Scala from the [[https://spark.apache.org/docs/latest/quick-start.html|interactive console]] can be spawned in a similar fashion, except that we start an interactive job and use the wrapper script ''lsf-spark-shell.sh'' instead:
<code bash>
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-spark-shell.sh
</code>
===== Running Hail =====
The Hail user interface requires at least ''Python 3.6'' so we load the corresponding module as well as the one for the application itself:
<code bash>
module load python/3.6.3 HAIL/0.2
</code>
Currently, the ''HAIL/0.2'' module also loads the following Python packages:
^ Package ^ Version ^
| bokeh | 0.13.0 |
| Jinja2 | 2.10 |
| MarkupSafe | 1.0 |
| numpy | 1.15.0 |
| packaging | 17.1 |
| pandas | 0.23.3 |
| parsimonious | 0.8.1 |
| pip | 18.0 |
| pyparsing | 2.2.0 |
| pyspark | 2.3.1 |
| python-dateutil | 2.7.3 |
| pytz | 2018.5 |
| PyYAML | 3.13 |
| scipy | 1.1.0 |
| setuptools | 28.8.0 |
| six | 1.11.0 |
| tornado | 5.1 |
| wheel | 0.29.0 |
Do you need additional Python packages for your Hail workflow that might also be of interest to other users? In that case, please create an [[mailto:hpc-support@gwdg.de|HPC support ticket]]. Alternatively, you can use ''HAIL/0.2_novenv'' instead: this module relies on a user-provided virtual environment, so you can manage the packages yourself. However, at least the following packages are required for Hail to function correctly: ''bokeh pandas parsimonious scipy''. A sketch of how such an environment could be set up is shown below.
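The following is a minimal sketch of creating such a virtual environment, assuming the ''python/3.6.3'' module provides a ''python3'' binary and that the directory ''$HOME/hail-env'' (an arbitrary example name) is a suitable location; adapt paths and package versions to your workflow.
<code bash>
# Load the Python module first so the virtual environment is based on Python 3.6.
module load python/3.6.3

# Create and activate a virtual environment; "hail-env" is just an example name.
python3 -m venv $HOME/hail-env
source $HOME/hail-env/bin/activate

# Install at least the packages Hail needs to function correctly.
pip install bokeh pandas parsimonious scipy
</code>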
An interactive job running the ''pyspark''-based console for Hail can then be started as follows:
<code bash>
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-pyspark-hail.sh
</code>
Once the console is running, initialize Hail with the global Spark context ''sc'' in the following way:
<code python>
import hail as hl
hl.init(sc)
</code>
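Continuing in the same console, a minimal first step might look like the sketch below; the input path ''/path/to/sample.vcf'' is a placeholder for your own data, and ''hl.import_vcf'' and ''count'' are standard Hail 0.2 functions.
<code python>
# "hl" has already been imported and initialized with the Spark context above.

# Hypothetical input file: replace the path with your own VCF.
mt = hl.import_vcf('/path/to/sample.vcf')

# Count the variants (rows) and samples (columns) of the resulting matrix table.
n_variants, n_samples = mt.count()
print(n_variants, n_samples)
</code>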
--- //[[christian.koehler@gwdg.de|ckoehle2]] 2018/08/03 15:21//