In this case, the ''%%--%%nodes'' parameter has been set to specify a total of two worker nodes and ''%%--%%time'' is used to request a job runtime of two hours. If you would like to request a longer runtime, add the ''%%--%%qos=normal'' parameter in addition to ''%%--%%time''. The job ID is reported back; we can use it to check whether the job is running yet and, if so, on which nodes.
  
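A minimal sketch of both steps, assuming the cluster is started through a Slurm batch job as described above (''start-spark-cluster.sh'' below is only a placeholder for the actual submission command, and the exact ''squeue'' output depends on the cluster configuration):
<code>
# Request a longer runtime, e.g. 48 hours; the allowed maximum depends on the QOS limits.
# start-spark-cluster.sh is a placeholder for the submission script used above.
sbatch --nodes=2 --time=48:00:00 --qos=normal start-spark-cluster.sh

# Check whether the job is running and on which nodes, using the reported job ID
squeue -j <jobid>
</code>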
  
===== Example: Approximating Pi =====
  
To showcase the capabilities of the Spark cluster set up thus far, we enter a short [[https://spark.apache.org/examples.html|Scala program]] into the shell we’ve started before.
{{ :en:services:application_services:high_performance_computing:spark:shell_example.png?nolink&800 |}}
  
The local dataset containing the integers from //1// to //1E9// is distributed across the executors using the ''parallelize'' function. Each element is mapped to a random point //(x,y)// with //0 < x, y < 1//, sampled from a uniform distribution, and the points are then filtered according to the rule that they lie inside the unit circle. Consequently, the ratio of the points conforming to this rule to the total number of points approximates the area of one quarter of the unit circle and allows us to extract an estimate for the number //Pi// in the last line.
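
For reference, here is a sketch of the program in Scala, closely following the linked examples page (the exact snippet shown in the screenshot may differ slightly):
<code>
// Monte Carlo estimate of Pi: sample points uniformly in the unit square
// and count how many fall inside the unit circle
val n = 1000000000  // 1E9 samples
val count = sc.parallelize(1 to n).filter { _ =>
  val x = math.random  // uniform in (0, 1)
  val y = math.random
  x * x + y * y < 1    // inside the unit circle
}.count()
println(s"Pi is roughly ${4.0 * count / n}")
</code>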
  
===== Configuration =====
By default, Spark's [[https://spark.apache.org/docs/latest/configuration.html#application-properties|scratch space]] is created in ''/tmp/$USER/spark''. If the ''2G'' size of the partition holding this directory is insufficient, you can configure a different directory for this purpose, for example in the ''scratch'' filesystem, before deploying your cluster as follows:
<code>
export SPARK_LOCAL_DIRS=/scratch/users/$USER
</code>
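
If the target directory does not exist yet, create it before deploying the cluster (this assumes ''/scratch'' is a shared filesystem and therefore visible from all worker nodes):
<code>
# Assumption: /scratch is shared, so creating the directory once is sufficient
mkdir -p /scratch/users/$USER
</code>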

===== Further reading =====
You can find a more in-depth tour of the Spark architecture, features and examples (based on Scala) in the [[https://info.gwdg.de/wiki/doku.php?id=wiki:hpc:parallel_processing_with_spark_on_gwdg_s_scientific_compute_cluster|HPC wiki]].

 --- //[[christian.koehler@gwdg.de|ckoehle2]] 2020/11/10 13:59//