Hail

Introduction

Hail is an open-source, scalable framework for exploring and analyzing genomic data. (hail.is)

The HPC system runs Hail version 0.2 beta, which can be obtained from GitHub. The cluster installation mostly follows the instructions for Running on a cluster.

Preparing a Spark Cluster

Hail runs on top of an Apache Spark cluster. Before starting an interactive Hail session, a Standalone Spark cluster, consisting of a master and several workers, needs to be prepared.

Environment Variables

Start by loading the modules for the Oracle JDK 1.8.0 and Spark 2.3.1:

module load JAVA/jdk1.8.0_31 spark/2.3.1

Spark will attempt to write logs into the global installation directory, which is read-only, so please specify a log directory via the environment variable SPARK_LOG_DIR. For example, to use the directory spark-logs in your home directory, enter (or add to ~/.bashrc) the following:

export SPARK_LOG_DIR=$HOME/spark-logs

Submitting Spark Applications

If you're just interested in running Hail, you can safely skip ahead to the section Running Hail.

Applications can be submitted almost as described in the Spark documentation, but the submission has to be wrapped inside a Slurm job like the one given by the following script:

#!/bin/bash
#SBATCH -p medium
#SBATCH -N 4
#SBATCH --ntasks-per-node=1
#SBATCH -t 01:00:00

lsf-spark-submit.sh $SPARK_ARGS

where spark-submit has been replaced by lsf-spark-submit.sh and $SPARK_ARGS are the submit arguments without the --master argument. The --master argument is added automatically, depending on which cluster node the master has been launched on. Because of -N 4, the job uses 4 nodes in total, and --ntasks-per-node=1 ensures that one worker is started per node.

Interactive Sessions

A Spark cluster to be used with Scala from the interactive console can be spawned in a similar fashion, except that we start an interactive Slurm job and use the wrapper script lsf-spark-shell.sh instead:

srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-spark-shell.sh

Running Hail

The Hail user interface requires at least Python 3.6, so we load the corresponding module as well as the one for the application itself:

module load python/3.6.3 HAIL/0.2

Currently, the following Python packages are loaded by HAIL/0.2 as well:

Package         Version
--------------- -------
bokeh           0.13.0 
Jinja2          2.10   
MarkupSafe      1.0    
numpy           1.15.0 
packaging       17.1   
pandas          0.23.3 
parsimonious    0.8.1  
pip             18.0   
pyparsing       2.2.0  
pyspark         2.3.1  
python-dateutil 2.7.3  
pytz            2018.5 
PyYAML          3.13   
scipy           1.1.0  
setuptools      28.8.0 
six             1.11.0 
tornado         5.1    
wheel           0.29.0

If you need additional Python packages for your Hail workflow that might also be of interest to other users, please create an HPC support ticket. Alternatively, you can use HAIL/0.2_novenv instead. This module relies on a user-provided virtual environment, so you can manage the environment yourself. However, at least the following set of packages is required for Hail to function correctly: bokeh pandas parsimonious scipy
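
When using HAIL/0.2_novenv, a quick check like the following sketch can confirm that the required packages are importable in the active virtual environment. The helper name missing_packages is hypothetical and not part of Hail or the module; it relies only on the Python standard library.

```python
import importlib.util

# Minimum set of packages that Hail needs, as listed above.
REQUIRED = ["bokeh", "pandas", "parsimonious", "scipy"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

Run it with the Python interpreter of the virtual environment you intend to use with Hail; any package it reports as missing would need to be installed there first.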

A Slurm job running the pyspark-based console for Hail can then be submitted as follows:

srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-pyspark-hail.sh

Once the console is running, initialize Hail with the global Spark context sc in the following way:

import hail as hl
hl.init(sc)
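
To verify that the session works, a small smoke test along the following lines may help. It is only a sketch: it assumes that hl.balding_nichols_model, the simulated-data generator of Hail 0.2, is available in the installed beta, and the helper name hail_smoke_test is hypothetical. The Hail module is passed in as an argument, so the function does not depend on how the session was initialized.

```python
def hail_smoke_test(hl, n_samples=50, n_variants=100):
    """Simulate a small dataset with Hail and return its dimensions.

    `hl` is the imported and initialized hail module.
    Returns the result of MatrixTable.count(): (n_variants, n_samples).
    """
    # Assumption: balding_nichols_model exists in the installed 0.2 beta.
    mt = hl.balding_nichols_model(
        n_populations=3, n_samples=n_samples, n_variants=n_variants
    )
    # count() triggers evaluation, so it exercises the Spark cluster.
    return mt.count()
```

In the console, after hl.init(sc), calling hail_smoke_test(hl) should return the variant and sample counts as a pair. An error at this point would suggest a problem with the Spark cluster or the Hail setup rather than with your own analysis code.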

ckoehle2 2018/08/03 15:21