en:services:application_services:high_performance_computing:hail [2019/04/09 10:37]
tehlers [Submitting Spark Applications]
— (current)
Line 1: Line 1:
-====== Hail ====== 
-===== Introduction ===== 
-//Hail is an open-source, scalable framework for exploring and analyzing genomic data.// ([[https://hail.is/|hail.is]]) 
-The HPC system runs version ''0.2 beta'', which can be obtained from [[https://github.com/hail-is/hail|GitHub]]. The cluster installation mostly follows the instructions for [[https://hail.is/docs/devel/installation.html#running-on-a-spark-cluster|Running on a cluster]]. 
-===== Preparing a Spark Cluster ===== 
-Hail runs on top of an [[https://spark.apache.org/docs/latest/cluster-overview.html|Apache Spark]] cluster. Before starting an interactive Hail session, a Standalone Spark cluster, consisting of a master and several workers, needs to be prepared. 
-==== Environment Variables ==== 
-Start by loading the modules for the ''Oracle JDK 1.8.0'' and ''Spark 2.3.1'': 
-module load JAVA/jdk1.8.0_31 spark/2.3.1 
-Spark will attempt to write logs into the global installation directory, which is read-only, so please specify a log directory via the environment variable ''SPARK_LOG_DIR''. For example, to use the directory ''spark-logs'' in your home directory, enter (or add to ''~/.bashrc'') the following: 
-export SPARK_LOG_DIR=$HOME/spark-logs 
-==== Submitting Spark Applications ==== 
-<WRAP center round info 60%>
-If you're just interested in running Hail, you can safely [[en:services:application_services:high_performance_computing:hail#running_hail|skip ahead]]. 
-</WRAP>
-Applications can be submitted almost as described in the [[https://spark.apache.org/docs/latest/submitting-applications.html#submitting-applications|Spark documentation]], but the submission has to be wrapped inside a batch job. The following script uses Slurm ''#SBATCH'' directives for this: 
-#!/bin/bash
-#SBATCH -p medium 
-#SBATCH -N 4 
-#SBATCH --ntasks-per-node=1 
-#SBATCH -t 01:00:00 
-lsf-spark-submit.sh $SPARK_ARGS 
-where ''spark-submit'' has been replaced by ''lsf-spark-submit.sh'' and ''$SPARK_ARGS'' stands for the usual submit arguments without ''--master''; that argument is added automatically, depending on which cluster node the master has been launched on. ''-N 4'' requests 4 nodes in total, and ''--ntasks-per-node=1'' ensures that one worker is started per node. 
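The rule that ''$SPARK_ARGS'' must not contain ''--master'' can be checked before a job is submitted. The following is a minimal sketch, not part of the cluster installation: the helper names ''strip_master'' and ''build_job_script'' are hypothetical, and the partition, node count and time limit simply mirror the script above. It assumes that ''--master'' only appears as a submit option, so an application argument that happens to be called ''--master'' would be dropped as well.

```python
import shlex


def strip_master(args):
    """Return a copy of args without any --master option.

    Handles both '--master URL' and '--master=URL'. Hypothetical helper:
    it does not distinguish submit options from application arguments.
    """
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This is the URL that followed a bare '--master'.
            skip_next = False
            continue
        if arg == "--master":
            skip_next = True
            continue
        if arg.startswith("--master="):
            continue
        cleaned.append(arg)
    return cleaned


def build_job_script(spark_args, nodes=4, walltime="01:00:00", partition="medium"):
    """Render a batch script like the one above for the given submit arguments."""
    quoted = " ".join(shlex.quote(a) for a in strip_master(spark_args))
    lines = [
        "#!/bin/bash",
        f"#SBATCH -p {partition}",
        f"#SBATCH -N {nodes}",
        "#SBATCH --ntasks-per-node=1",  # one Spark worker per node
        f"#SBATCH -t {walltime}",
        "",
        f"lsf-spark-submit.sh {quoted}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 'org.example.App' and 'app.jar' are placeholders for a real application.
    print(build_job_script(["--master", "local[2]", "--class", "org.example.App", "app.jar"]))
```

The generated text can be written to a file and submitted like any other batch script; only the last line differs from a plain ''spark-submit'' call.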
-==== Interactive Sessions ==== 
-A Spark cluster to be used with Scala from the [[https://spark.apache.org/docs/latest/quick-start.html|interactive console]] can be spawned in a similar fashion, except that we start an interactive LSF job and use the wrapper script ''lsf-spark-shell.sh'' instead: 
-bsub -q int -n 4 -R span[ptile=1] -W 01:00 -ISs lsf-spark-shell.sh 
-===== Running Hail ===== 
-The Hail user interface requires at least ''Python 3.6'' so we load the corresponding module as well as the one for the application itself: 
-module load python/3.6.3 HAIL/0.2 
-Currently, the following Python packages are loaded by ''HAIL/0.2'' as well: 
-Package         Version 
---------------- ------- 
-bokeh           0.13.0  
-Jinja2          2.10    
-MarkupSafe      1.0     
-numpy           1.15.0  
-packaging       17.1    
-pandas          0.23.3  
-parsimonious    0.8.1   
-pip             18.0    
-pyparsing       2.2.0   
-pyspark         2.3.1   
-python-dateutil 2.7.3   
-pytz            2018.5  
-PyYAML          3.13    
-scipy           1.1.0   
-setuptools      28.8.0  
-six             1.11.0  
-tornado         5.1     
-wheel           0.29.0 
-<WRAP center round help 60%>
-Do you need additional Python packages for your Hail workflow that might also be of interest to other users? In that case, please create an [[mailto:hpc@gwdg.de|HPC support ticket]]. Alternatively, you can use ''HAIL/0.2_novenv'' instead. This module relies on user-provided virtual environments, so you can manage the environment yourself. However, at least the following Python packages are required for Hail to function correctly: ''bokeh pandas parsimonious scipy'' 
-</WRAP>
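When working with ''HAIL/0.2_novenv'', it can help to verify that the active virtual environment provides the four required packages before starting a session. The following is a minimal sketch using only the Python standard library; the function name ''missing_packages'' is hypothetical, and the check only tests whether each package can be located, not whether its version is suitable for Hail.

```python
import importlib.util

# Packages named above as the minimum set for Hail.
REQUIRED = ["bokeh", "pandas", "parsimonious", "scipy"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", " ".join(missing))
    else:
        print("All required packages were found.")
```

Run it with the Python interpreter of the virtual environment you intend to use; any package it reports can then be installed into that environment with ''pip''.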
-An LSF job running the ''pyspark''-based console for Hail can then be submitted as follows: 
-bsub -q int -n 4 -R span[ptile=1] -W 01:00 -ISs lsf-pyspark-hail.sh 
-Once the console is running, initialize Hail with the global Spark context ''sc'' in the following way: 
-import hail as hl 
-hl.init(sc)
- --- //[[christian.koehler@gwdg.de|ckoehle2]] 2018/08/03 15:21//