====== Hail ======

===== Introduction =====

//Hail is an open-source, scalable framework for exploring and analyzing genomic data.// ([[https://hail.is/|hail.is]])

The HPC system runs version ''0.2 beta'', which can be obtained from [[https://github.com/hail-is/hail|GitHub]]. The cluster installation mostly follows the instructions for [[https://hail.is/docs/devel/installation.html#running-on-a-spark-cluster|Running on a cluster]].

===== Preparing a Spark Cluster =====

Hail runs on top of an [[https://spark.apache.org/docs/latest/cluster-overview.html|Apache Spark]] cluster. Before starting an interactive Hail session, a standalone Spark cluster, consisting of a master and several workers, needs to be prepared.

==== Environment Variables ====

Start by loading the modules for ''Oracle JDK 1.8.0'' and ''Spark 2.3.1'':

<code bash>
module load JAVA/jdk1.8.0_31 spark/2.3.1
</code>

Spark will attempt to write logs into the global installation directory, which is read-only, so please specify a log directory via the environment variable ''SPARK_LOG_DIR''. For example, to use the directory ''spark-logs'' in your home directory, enter (or add to ''~/.bashrc'') the following:

<code bash>
export SPARK_LOG_DIR=$HOME/spark-logs
</code>

==== Submitting Spark Applications ====

If you are just interested in running Hail, you can safely [[en:services:application_services:high_performance_computing:hail#running_hail|skip ahead]].

Applications can be submitted almost as described in the [[https://spark.apache.org/docs/latest/submitting-applications.html#submitting-applications|Spark documentation]], but the submission has to be wrapped inside a Slurm batch job like the one given by the following script:

<code bash>
#!/bin/bash
#SBATCH -p medium
#SBATCH -N 4
#SBATCH --ntasks-per-node=1
#SBATCH -t 01:00:00

lsf-spark-submit.sh $SPARK_ARGS
</code>

Here ''spark-submit'' has been replaced by ''lsf-spark-submit.sh'', and ''$SPARK_ARGS'' stands for the usual submit arguments without the ''--master'' argument - this is added automatically, depending on which cluster node the master has been launched on. Because of ''-N 4'' there are 4 nodes in total, and ''--ntasks-per-node=1'' ensures that one worker is started per node.

==== Interactive Sessions ====

A Spark cluster to be used with Scala from the [[https://spark.apache.org/docs/latest/quick-start.html|interactive console]] can be spawned in a similar fashion, except that we start an interactive Slurm job and use the wrapper script ''lsf-spark-shell.sh'' instead:

<code bash>
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-spark-shell.sh
</code>

===== Running Hail =====

The Hail user interface requires at least ''Python 3.6'', so we load the corresponding module as well as the one for the application itself:

<code bash>
module load python/3.6.3 HAIL/0.2
</code>

Currently, the following Python packages are loaded by ''HAIL/0.2'' as well:

^ Package         ^ Version ^
| bokeh           | 0.13.0  |
| Jinja2          | 2.10    |
| MarkupSafe      | 1.0     |
| numpy           | 1.15.0  |
| packaging       | 17.1    |
| pandas          | 0.23.3  |
| parsimonious    | 0.8.1   |
| pip             | 18.0    |
| pyparsing       | 2.2.0   |
| pyspark         | 2.3.1   |
| python-dateutil | 2.7.3   |
| pytz            | 2018.5  |
| PyYAML          | 3.13    |
| scipy           | 1.1.0   |
| setuptools      | 28.8.0  |
| six             | 1.11.0  |
| tornado         | 5.1     |
| wheel           | 0.29.0  |

Do you need additional Python packages for your Hail workflow that might also be of interest to other users? In that case, please create an [[mailto:hpc-support@gwdg.de|HPC support ticket]]. Alternatively, you can use ''HAIL/0.2_novenv'' instead - this module relies on user-provided virtual environments, so you can manage the environment yourself. However, at least the following set of packages is required for Hail to function correctly: ''bokeh pandas parsimonious scipy''
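A minimal sketch of how such an environment could be prepared is shown below; the environment path and the assumption that ''HAIL/0.2_novenv'' picks up an already activated virtual environment are illustrative only, not an exact recipe:

<code bash>
# Load the Python interpreter and the Hail module variant without a bundled environment
module load python/3.6.3 HAIL/0.2_novenv

# Create and activate a virtual environment (the path is just an example)
python3 -m venv $HOME/hail-venv
source $HOME/hail-venv/bin/activate

# Install the minimal set of packages Hail needs
pip install bokeh pandas parsimonious scipy
</code>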
A Slurm job running the ''pyspark''-based console for Hail can then be submitted as follows:

<code bash>
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-pyspark-hail.sh
</code>

Once the console is running, initialize Hail with the global Spark context ''sc'' in the following way:

<code python>
import hail as hl
hl.init(sc)
</code>

 --- //[[christian.koehler@gwdg.de|ckoehle2]] 2018/08/03 15:21//
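As a quick way to verify that the Hail session works, the following minimal sketch generates a small random dataset with Hail's built-in ''balding_nichols_model'' simulator and inspects it; the parameters are arbitrary and this is not part of the site-specific setup:

<code python>
# Assumes the pyspark-based Hail console above, i.e. hl.init(sc) has already been called.
# Generate a small random genotype dataset (3 populations, 50 samples, 100 variants).
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)

mt.describe()      # print the schema of the resulting MatrixTable
print(mt.count())  # (number of variants, number of samples)
</code>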