====== Apache Spark ======
===== Introduction =====
[[https://spark.apache.org|Apache Spark]] is an open-source framework for distributed, large-scale data processing on compute clusters.

Instead of the classic MapReduce pipeline, Spark's central concept is the resilient distributed dataset (RDD), which is operated on with the help of a central driver program making use of the parallel operations and the scheduling and I/O facilities that Spark provides. Transformations on the RDD are executed by the worker nodes in the Spark cluster. The dataset is resilient because Spark automatically handles failures of worker nodes by recomputing lost partitions of the dataset.
In the following sections, we give a short introduction on how to prepare a Spark cluster and run applications on it in the Scientific Compute Cluster (SCC).
===== Creating a Spark Cluster on the SCC =====
<WRAP center round important 60%>
We assume that you have access to the HPC system already and are logged in to one of the frontend nodes, e.g. ''gwdu102''.
</WRAP>

Apache Spark is installed in version 3.4.0, the most recent stable release at the time of this writing. Version 2.4.3 is available as well. The shell environment is prepared by extending the ''MODULEPATH'' and loading the module ''spark/3.4.0'':
<code>
gwdu102 ~ > export MODULEPATH=/
gwdu102 ~ > module load spark/3.4.0
</code>
We're now ready to deploy a Spark cluster. Since the resources of the HPC system are managed by the batch system Slurm, the cluster is set up inside a Slurm job. The provided script ''scc_spark_deploy.sh'' takes care of this and submits a batch job with the following default parameters:
<code>
#SBATCH --partition fat
#SBATCH --time=0-02:00:00
#SBATCH --qos=short
#SBATCH --nodes=4
#SBATCH --job-name=Spark
#SBATCH --output=scc_spark_job-%j.out
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
</code>
If you would like to override these default values, you can do so by passing the corresponding Slurm parameters to the script:
<code>
gwdu102 ~ > scc_spark_deploy.sh --nodes=2 --time=02:00:00
Submitted batch job 872699
</code>
In particular, if you do not want to share the nodes' resources with other jobs, you need to add ''--exclusive'' as well.

The job ID reported back upon submission (here: ''872699'') can be used to check whether the job is already running and which nodes have been allocated to it:
<code>
gwdu102 ~ > squeue --jobs=872699
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            872699
</code>
The first node reported in the //NODELIST// column hosts the Spark master. Its hostname is needed to connect applications to the cluster, for example an interactive Spark shell:
<code>
gwdu102 ~ > spark-shell --master spark://<master-node>:7077
</code>
Here, the Spark shell is started on the frontend node ''gwdu102'' and connects to the master of the Spark cluster; ''<master-node>'' has to be replaced by the hostname of the first node in the //NODELIST//, and //7077// is the default port of a Spark standalone master.

Scala code that is entered in this shell and parallelized with Spark will be automatically distributed across all nodes that have been requested initially. N.B.: The port that the application's web interface is listening on (''4040'' by default) can be forwarded to your local machine in the same way as described for the master's web interface below.
Once the Spark cluster is not needed anymore, it can be shut down gracefully by using the provided script ''scc_spark_shutdown.sh'' with the job ID as its argument:
<code>
gwdu102 ~ > scc_spark_shutdown.sh 872699
</code>
In case a single node is sufficient, Spark applications can be started inside a Slurm job without previous cluster setup - the ''local[n]'' master runs Spark with //n// worker threads on the local machine, for example:
<code>
gwdu102 ~ > spark-shell --master local[4]
</code>
===== Access and Monitoring =====
Once your Spark cluster is running, information about the master and the workers is printed to the job output file ''scc_spark_job-<jobid>.out''. Among other things, it contains the URL of the master's web interface, which listens on port //8080// of the node hosting the master and gives an overview of the workers and of the running applications.

Since the compute nodes are not directly reachable from outside the cluster, the web interface can be viewed on your local machine by forwarding port //8080// through one of the frontend nodes (e.g. ''gwdu102''):
<code>
ssh -N -L 8080:<master-node>:8080 <user>@<frontend>
</code>

The master's overview page is then available at ''http://localhost:8080'' in your local browser.
===== Example: Approximating Pi =====
To showcase the capabilities of the Spark cluster set up thus far, we enter a short Scala program into the Spark shell that approximates //π// by means of a simple Monte Carlo method.
The local dataset containing the integers from //1// to //1E9// is distributed across the executors using the ''parallelize'' function and filtered according to the rule that the random point //(x, y)// with //0 < x, y < 1//, which is sampled uniformly for each element, lies inside the unit circle. Since the quarter unit circle covers a fraction of //π/4// of the unit square, multiplying the fraction of accepted points by //4// yields an approximation of //π//.
===== Configuration =====
By default, Spark stores temporary data (for example shuffle files) under ''/tmp''. A different scratch directory can be configured via the environment variable ''SPARK_LOCAL_DIRS'':
<code>
export SPARK_LOCAL_DIRS=/
</code>
===== Further reading =====
You can find a more in-depth tour of the Spark architecture, its configuration and its programming interfaces in the [[https://spark.apache.org/docs/latest/|official Spark documentation]].
+ | |||
+ | --- // |