Understanding Your Cluster for Spark

Get the most out of your cluster using Spark configurations

When we get hardware for a Spark project, we always wonder what the ideal configuration should be when submitting a job on that hardware. For better performance, there are two key factors we should consider.

  1. CPU/Cores

  2. Memory

Let's discuss a use case and find out what the values of the following should be (a configuration sketch follows this list):

  1. No. of executors on each node

  2. Total no. of executor instances, i.e. --num-executors or spark.executor.instances

  3. No. of cores for each executor, i.e. --executor-cores or spark.executor.cores

  4. Memory for each executor, i.e. --executor-memory or spark.executor.memory (default is 1g)
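
As a quick reference, here is a minimal PySpark sketch of where these settings live. The values shown are the ones this article derives below for the example cluster; in practice they are usually passed at submit time via the spark-submit flags above, and the app name here is just a placeholder.

```python
from pyspark.sql import SparkSession

# A minimal sketch -- the values match the example worked out later in this
# article (6 nodes x 16 cores x 64 GB). The app name is a placeholder.
spark = (
    SparkSession.builder
    .appName("cluster-sizing-demo")
    .config("spark.executor.instances", "17")   # --num-executors
    .config("spark.executor.cores", "5")        # --executor-cores
    .config("spark.executor.memory", "19g")     # --executor-memory
    .getOrCreate()
)
```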

Let's visualize it ...

Figure: relation between node, executor, core, and task

Figure: memory layout per node and per executor

Now, let's assume we have hardware with the below configuration:

  1. No. of nodes = 6

  2. No. of cores in each node = 16

  3. RAM in each node = 64 GB


Now let's calculate what the ideal configuration for your Spark job should be.

Points to remember:

  1. The YARN Application Master (AM) needs one executor

  2. Keep 1 core per node for OS/Hadoop daemon processes

  3. Ideally, the no. of cores per executor should be 5 (more than that tends to hurt HDFS I/O throughput)

  4. On each node, 1 GB of RAM will be dedicated to system-related processes

So, from above,

Applying rule 3:

There should be 5 cores per executor, i.e. --executor-cores or spark.executor.cores = 5 (answer to 3)

Let's apply rule 2:

The cluster has (16 - 1) x 6 = 90 usable cores

That means the no. of executors in the cluster = 90 / 5 = 18

Applying rule 1:

Out of these 18, one executor goes to the AM, i.e. the actual no. of executors = 18 - 1 = 17

--num-executors or spark.executor.instances = 17 (answer to 2)

Then, each node can have 17 / 6 ≈ 3 executors (answer to 1)
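
Here is a small Python sketch of the same core arithmetic, assuming the example hardware above (6 nodes, 16 cores each):

```python
# Core arithmetic for the example cluster (assumed: 6 nodes, 16 cores per node).
nodes = 6
cores_per_node = 16

cores_per_executor = 5                                # rule 3
usable_cores = (cores_per_node - 1) * nodes           # rule 2: 1 core per node for daemons -> 90
total_executors = usable_cores // cores_per_executor  # 90 / 5 = 18
num_executors = total_executors - 1                   # rule 1: one executor for the YARN AM -> 17
executors_per_node = round(num_executors / nodes)     # 17 / 6 ≈ 3

print(cores_per_executor, num_executors, executors_per_node)  # 5 17 3
```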


Let's calculate the memory

Out of the 64 GB RAM on each node, 1 GB will be reserved for system-related tasks

So the remaining RAM per node = 64 - 1 = 63 GB

There are 3 executors per node, so memory per executor = 63 / 3 = 21 GB

But there is a ~7% memory overhead (off-heap memory requested by YARN on top of the executor heap), hence the memory left for the executor = 21 x (1 - 0.07) ≈ 19 GB

--executor-memory or spark.executor.memory = 19g (answer to 4)
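
The memory side can be sketched the same way, assuming 64 GB per node, the 3 executors per node calculated above, and the ~7% overhead factor:

```python
# Memory arithmetic for the example cluster (assumed: 64 GB RAM per node).
ram_per_node_gb = 64
reserved_for_system_gb = 1        # rule 4
executors_per_node = 3            # from the core calculation above
overhead_fraction = 0.07          # ~7% overhead requested by YARN on top of the heap

usable_ram_gb = ram_per_node_gb - reserved_for_system_gb                  # 64 - 1 = 63
ram_per_executor_gb = usable_ram_gb / executors_per_node                  # 63 / 3 = 21
executor_memory_gb = int(ram_per_executor_gb * (1 - overhead_fraction))   # ~19

print(f"--executor-memory {executor_memory_gb}g")  # --executor-memory 19g
```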

This is the configuration if you run a single job in your cluster. In a practical scenario, though, you will need to run multiple jobs, so divide the resources accordingly among the number of jobs you plan to run at a time.

This article gives you an overall idea of how to calculate the resources available in your cluster.

Plan accordingly.

Hope it helps :)
