====== Hail ======
===== Introduction =====
//Hail is an open-source, scalable framework for exploring and analyzing genomic data.// ([[https://hail.is/|hail.is]])
  
The HPC system runs version ''0.2 beta'', which can be obtained from [[https://github.com/hail-is/hail|GitHub]]. The cluster installation was performed mostly by following the instructions for [[https://hail.is/docs/devel/installation.html#running-on-a-spark-cluster|Running on a Spark cluster]].
===== Preparing a Spark Cluster =====
Hail runs on top of an [[https://spark.apache.org/docs/latest/cluster-overview.html|Apache Spark]] cluster. Before starting an interactive Hail session, a Standalone Spark cluster, consisting of a master and several workers, needs to be prepared.
==== Environment Variables ====
Start by loading the modules for the ''Oracle JDK 1.8.0'' and ''Spark 2.3.1'':
<code>
module load JAVA/jdk1.8.0_31 spark/2.3.1
</code>
Spark will attempt to write logs into the global installation directory, which is read-only, so please specify a log directory via the environment variable ''SPARK_LOG_DIR''. For example, to use the directory ''spark-logs'' in your home directory, enter (or add to ''~/.bashrc'') the following:
<code>
export SPARK_LOG_DIR=$HOME/spark-logs
</code>
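For convenience, the module loads and the log directory setup can be combined in ''~/.bashrc''. This is only a minimal sketch, reusing the ''spark-logs'' directory name from above (any writable location works):
<code>
# load Java and Spark on login
module load JAVA/jdk1.8.0_31 spark/2.3.1

# let Spark write its logs to a writable directory in $HOME
export SPARK_LOG_DIR=$HOME/spark-logs
mkdir -p "$SPARK_LOG_DIR"
</code>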
==== Submitting Spark Applications ====
<WRAP center round info 60%>
If you're just interested in running Hail, you can safely [[en:services:application_services:high_performance_computing:hail#running_hail|skip ahead]].
</WRAP>

Applications can be submitted almost as described in the [[https://spark.apache.org/docs/latest/submitting-applications.html#submitting-applications|Spark documentation]], but the submission has to be wrapped inside a Slurm batch job like the one created by the following script:
<code>
#!/bin/bash
#SBATCH -p medium
#SBATCH -N 4
#SBATCH --ntasks-per-node=1
#SBATCH -t 01:00:00

lsf-spark-submit.sh $SPARK_ARGS
</code>
Here ''spark-submit'' has been replaced by ''lsf-spark-submit.sh'', and ''$SPARK_ARGS'' stands for the usual submit arguments without the ''--master'' option, which is added automatically depending on which cluster node the master has been launched on. ''-N 4'' requests 4 nodes in total, and ''<nowiki>--ntasks-per-node=1</nowiki>'' ensures that one worker is started per node. A possible set of submit arguments is sketched below.
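As an illustration only, the following batch script submits Spark's bundled ''SparkPi'' example; the jar path and the use of ''$SPARK_HOME'' are assumptions and depend on the local Spark installation:
<code>
#!/bin/bash
#SBATCH -p medium
#SBATCH -N 4
#SBATCH --ntasks-per-node=1
#SBATCH -t 01:00:00

# submit arguments as usual, but without --master (added by the wrapper)
lsf-spark-submit.sh --class org.apache.spark.examples.SparkPi \
    --executor-memory 4G \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.1.jar 1000
</code>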
==== Interactive Sessions ====
A Spark cluster to be used with Scala from the [[https://spark.apache.org/docs/latest/quick-start.html|interactive console]] can be spawned in a similar fashion, except that we start an interactive Slurm job and use the wrapper script ''lsf-spark-shell.sh'' instead:
<code>
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-spark-shell.sh
</code>
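Once the Scala prompt appears, a quick sanity check can confirm that the workers are usable. This is only a sketch, using the pre-defined Spark context ''sc'':
<code>
// distribute a range across the workers and sum it up
val rdd = sc.parallelize(1 to 1000000)
println(rdd.sum())  // expected output: 5.000005E11
</code>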
===== Running Hail =====
The Hail user interface requires at least ''Python 3.6'', so we load the corresponding module as well as the one for the application itself:
<code>
module load python/3.6.3 HAIL/0.2
</code>
Currently, the following Python packages are loaded by ''HAIL/0.2'' as well:
<code>
Package         Version
--------------- -------
bokeh           0.13.0
Jinja2          2.10
MarkupSafe      1.0
numpy           1.15.0
packaging       17.1
pandas          0.23.3
parsimonious    0.8.1
pip             18.0
pyparsing       2.2.0
pyspark         2.3.1
python-dateutil 2.7.3
pytz            2018.5
PyYAML          3.13
scipy           1.1.0
setuptools      28.8.0
six             1.11.0
tornado         5.1
wheel           0.29.0
</code>
<WRAP center round help 60%>
Do you need additional Python packages for your Hail workflow that might also be of interest to other users? In that case, please create an [[mailto:hpc-support@gwdg.de|HPC support ticket]]. Alternatively, you can use ''HAIL/0.2_novenv'' instead - this module relies on user-provided virtual environments, so you can manage the environment yourself. However, at least the following set of packages is required for Hail to function correctly: ''bokeh pandas parsimonious scipy''
</WRAP>
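A user-provided environment for ''HAIL/0.2_novenv'' could be set up roughly as follows. This is only a sketch; the location ''$HOME/hail-venv'' is arbitrary, and the exact interplay between activating the environment and loading the module may differ:
<code>
# create and activate a virtual environment (location is arbitrary)
module load python/3.6.3
python3 -m venv $HOME/hail-venv
source $HOME/hail-venv/bin/activate

# install the minimum set of packages required by Hail
pip install bokeh pandas parsimonious scipy

# load the Hail module that expects a user-provided environment
module load HAIL/0.2_novenv
</code>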

An interactive Slurm job running the ''pyspark''-based console for Hail can then be submitted as follows:
<code>
srun -p int -N 4 --ntasks-per-node=20 -t 01:00:00 lsf-pyspark-hail.sh
</code>
Once the console is running, initialize Hail with the global Spark context ''sc'' in the following way:
<code>
import hail as hl
hl.init(sc)
</code>
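Afterwards, a short self-contained check can confirm that work is distributed across the Spark workers. This is only a sketch with arbitrary parameters, using Hail's built-in genotype simulation:
<code>
# simulate a small dataset: 3 populations, 100 samples, 1000 variants
mt = hl.balding_nichols_model(3, 100, 1000)

# (number of variants, number of samples), computed on the workers
print(mt.count())

# annotate per-sample QC metrics and inspect the first few samples
mt = hl.sample_qc(mt)
mt.cols().select('sample_qc').show(5)
</code>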

--- //[[christian.koehler@gwdg.de|ckoehle2]] 2018/08/03 15:21//