Hail
Introduction
Hail is an open-source, scalable framework for exploring and analyzing genomic data. (hail.is)
The HPC system runs version 0.2 beta, which can be obtained from GitHub. The cluster installation was performed largely by following the instructions for Running on a cluster.
Preparing a Spark Cluster
Hail runs on top of an Apache Spark cluster. Before starting an interactive Hail session, a Standalone Spark cluster, consisting of a master and several workers, needs to be prepared.
Environment Variables
Start by loading the modules for the Oracle JDK 1.8.0 and Spark 2.3.1:
module load JAVA/jdk1.8.0_31 spark/2.3.1
Spark will attempt to write logs into the global installation directory, which is read-only, so please specify a log directory via the environment variable SPARK_LOG_DIR. For example, to use the directory spark-logs in your home directory, enter (or add to ~/.bashrc) the following:
export SPARK_LOG_DIR=$HOME/spark-logs
Submitting Spark Applications
If you're just interested in running Hail, you can safely skip ahead to the Running Hail section.
Applications can be submitted almost as described in the Spark documentation, except that the submission has to be wrapped inside a Slurm batch job like the one given by the following script:
#!/bin/bash
#SBATCH -p medium
#SBATCH -N 4
#SBATCH --ntasks-per-node=1
#SBATCH -t 01:00:00

lsf-spark-submit.sh $SPARK_ARGS
Here spark-submit has been replaced by lsf-spark-submit.sh, and $SPARK_ARGS stands for the usual submit arguments without the --master argument - this is added automatically, depending on which cluster node the master has been launched on. Because of -N 4 there are 4 nodes in total, and --ntasks-per-node=1 ensures that one worker is started per node.
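For illustration, such a job could run a small PySpark application. The following is only a minimal sketch (the file name pi.py and the sample count are arbitrary choices, not part of the cluster setup) that estimates Pi and relies on the wrapper to supply the --master URL:
# pi.py - minimal PySpark sketch; file name and sample count are arbitrary.
import random
from pyspark.sql import SparkSession

# No master URL is set here; lsf-spark-submit.sh adds --master automatically.
spark = SparkSession.builder.appName("PiEstimate").getOrCreate()
sc = spark.sparkContext

n = 10000000  # number of random samples

def inside(_):
    # Draw a random point in the unit square and test if it lies in the circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

count = sc.parallelize(range(n)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()
With this, $SPARK_ARGS in the batch script above would simply be pi.py (plus any further spark-submit options you need).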
Interactive Sessions
A Spark cluster to be used with Scala from the interactive console can be spawned in a similar fashion, except that we start an interactive Slurm job and use the wrapper script lsf-spark-shell.sh instead:
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-spark-shell.sh
Running Hail
The Hail user interface requires at least Python 3.6, so we load the corresponding module as well as the one for the application itself:
module load python/3.6.3 HAIL/0.2
Currently, the following Python packages are loaded by HAIL/0.2 as well:
Package          Version
---------------  -------
bokeh            0.13.0
Jinja2           2.10
MarkupSafe       1.0
numpy            1.15.0
packaging        17.1
pandas           0.23.3
parsimonious     0.8.1
pip              18.0
pyparsing        2.2.0
pyspark          2.3.1
python-dateutil  2.7.3
pytz             2018.5
PyYAML           3.13
scipy            1.1.0
setuptools       28.8.0
six              1.11.0
tornado          5.1
wheel            0.29.0
Do you need additional Python packages for your Hail workflow that might also be of interest to other users? In that case, please create an HPC support ticket. Alternatively, you can use the module HAIL/0.2_novenv instead - it relies on user-provided virtual environments, so you can manage the environment yourself. However, at least the following packages are required for Hail to function correctly: bokeh, pandas, parsimonious, scipy.
A Slurm job running the pyspark-based console for Hail can then be submitted as follows:
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-pyspark-hail.sh
Once the console is running, initialize Hail with the global Spark context sc in the following way:
import hail as hl
hl.init(sc)
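A first Hail workflow in this console could then look like the following minimal sketch; it assumes the initialization above has succeeded and that a bgzipped VCF exists at the hypothetical path data/example.vcf.bgz:
# Import the VCF into a Hail MatrixTable (the path is only an example).
mt = hl.import_vcf('data/example.vcf.bgz')

# Compute per-sample quality-control metrics and inspect a few of them.
mt = hl.sample_qc(mt)
mt.cols().select('sample_qc').show(5)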
— ckoehle2 2018/08/03 15:21