I. Introduction to Spark

II. Spark Architecture: Spark consists of three parts:

1-Driver Program: The application holds the SparkContext. Through the SparkContext we create and run RDDs (Resilient Distributed Datasets), and the RDD's data is distributed across the different Worker Nodes.

2-Cluster Manager: A daemon responsible for allocating Worker Nodes to the Driver program. The Driver divides the job into different tasks; the Cluster Manager provides the Worker Nodes on which they run and makes sure those Worker Nodes are active.

3-Worker Node: Each Worker Node runs one or more executors (depending on the number of executor instances). This node is responsible for executing the tasks.

The diagram above shows the Driver program, the Cluster Manager, and three Worker Nodes. Each Worker Node has an executor, and each executor runs two tasks.
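As a minimal sketch of these roles (the application name and master URL are illustrative placeholders, not values from the text), the Driver program creates the SparkContext that talks to the Cluster Manager:

import org.apache.spark.{SparkConf, SparkContext}

// The Driver program builds a SparkConf and a SparkContext;
// "spark://master:7077" stands in for a standalone Cluster Manager URL.
val conf = new SparkConf().setAppName("ArchitectureExample").setMaster("spark://master:7077")
val sc = new SparkContext(conf)

// RDD operations defined here are shipped as tasks to the executors on the Worker Nodes.
val data = sc.parallelize(1 to 100, 4)   // an RDD with 4 partitions
println(data.count())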

III. Spark Execution Methods

A. pyspark or spark-shell: The Spark interactive command line supports two languages: we can write the code in Python (pyspark) or Scala (spark-shell).

B. spark-submit: Submits a packaged application (for example a JAR or a Python script) to the cluster, as in the example below.
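A hedged example of a spark-submit call (the class name, jar name, master URL, and memory size are placeholders):

spark-submit \
  --class com.example.WordCount \
  --master spark://master:7077 \
  --executor-memory 2g \
  wordcount.jar input.txt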

IV. Spark components

On top of the Spark core, other components were built to give developers higher-level APIs and to make the Spark development cycle easier.

The main components of Spark:

A. Apache Spark Core: The core of Spark. It exposes the APIs that the other components communicate with, and the RDD is part of this core.

B. Spark SQL: A module for working with structured data through SQL queries and the DataFrame/Dataset API.

C. Spark Streaming: A module for processing live data streams as a series of small batches.

D. Spark ML (Machine learning): Spark's machine learning library (MLlib), providing common algorithms and pipelines.

E. GraphX: A module for graphs and graph-parallel computation.

V. Spark Memory

Spark supports two models of memory management: Static Memory Manager and Unified Memory Manager.

A. Static Memory Manager: The sizes of Storage memory, Execution memory, and other memory are fixed for the duration of the Spark application.
B. Unified Memory Manager (the default since Spark 1.6): Storage memory and Execution memory share one memory area, and each can occupy the other's free space.
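As a configuration sketch (and only for Spark 1.6–2.x, where this legacy switch still exists), the memory manager is selected as follows:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // The Unified Memory Manager is the default since Spark 1.6; setting
  // spark.memory.useLegacyMode to true switches back to the Static Memory
  // Manager (the option was removed in Spark 3.0).
  .set("spark.memory.useLegacyMode", "false")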

1-On-heap model:

It is the default in Spark. The size of the on-heap memory is configured by the --executor-memory or spark.executor.memory parameter. Objects on the heap stay inside the JVM heap and are bound by GC (garbage collection).

  • Storage Memory: used for Spark's cached data (such as the RDD cache) and broadcast variables.
  • Execution Memory: used for temporary data in the calculation process of shuffle, join, sort, and aggregation.
  • User Memory: mainly used to store the data needed for RDD transformation operations.
  • Reserved Memory: memory reserved for the system (internal use).
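A minimal configuration sketch for the on-heap regions (the size and fractions are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")          // total on-heap size per executor (same as --executor-memory 4g)
  .set("spark.memory.fraction", "0.6")         // share of (heap - reserved) given to Storage + Execution
  .set("spark.memory.storageFraction", "0.5")  // share of that unified region initially set aside for Storage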

2-Off-heap model:

Spark 1.6 introduced off-heap memory. Off-heap memory is disabled by default; it is turned on with the spark.memory.offHeap.enabled parameter, and its size is set by the spark.memory.offHeap.size parameter. Off-heap memory consists of two components: Storage memory and Execution memory. When off-heap is enabled, the system uses both the on-heap and off-heap models. The objects are allocated in memory outside the JVM by serialization, and the application has to handle the logic of memory allocation and memory release itself.
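A sketch of enabling the off-heap model (the 1 GB size is an arbitrary example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")      // disabled by default
  .set("spark.memory.offHeap.size", "1073741824")   // off-heap size in bytes (1 GB here)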

3-Dynamic occupancy mechanism:

It is enabled by default under the Unified Memory Manager: if either the Storage pool or the Execution pool runs out of space while the other has free space, it borrows the other's free space. The initial boundary between the two pools inside the unified region is controlled by the spark.memory.storageFraction parameter (spark.dynamicAllocation.enabled is a different setting that controls dynamic allocation of executors, not memory).

VI. Transformations & Actions

Transformation: An operation that builds a new RDD from an existing RDD by applying some computation. Transformations are lazy; they are not executed until an action is triggered. Examples: map and flatMap.

  • map
  • flatMap
  • filter
  • union
  • distinct
  • mapPartitions
  • mapPartitionsWithIndex
  • groupByKey
  • reduceByKey
  • sortByKey
  • join
  • coalesce

Action: When an action is triggered, the system executes the chain of transformations and either returns the final result to the driver or saves it to external storage.

  • count
  • collect
  • take(n)
  • reduce()
  • fold()
  • aggregate()
  • foreach()

We can divide the diagram above into three steps:

  • First: read the file into an RDD of lines
  • Second: apply a transformation to create a new RDD by filtering the lines that contain “Spark”
  • Third: run an action such as lines.count or lines.first to execute all the steps and display the output
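The three steps above can be written as the following sketch (the file name file.txt is a placeholder):

// First: read the file into an RDD of lines
val textFile = sc.textFile("file.txt")

// Second: transformation - a new RDD with only the lines that contain "Spark"
val lines = textFile.filter(line => line.contains("Spark"))

// Third: actions - trigger execution and return results to the driver
println(lines.count())  // number of matching lines
println(lines.first())  // first matching line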

VII. Narrow and Wide dependencies with partitions

Narrow: Each partition of the new RDD depends on a single partition of the parent, so no shuffle is needed and the number of partitions stays the same after the transformation.

  • map
  • filter
  • flatMap

Wide: Each partition of the new RDD can depend on several parent partitions, so the data is shuffled and the number of partitions can change after the transformation.

  • reduceByKey
  • groupByKey
  • repartition
  • join

Example:

sc.textFile("file.txt").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((x, y) => x + y).collect()

The diagram above shows the narrow dependencies in stage 1 and the wide dependencies in stage 2. When the partitioning changes (a shuffle), a new stage starts. The number of partitions is 4 in stage 1 and 3 in stage 2.

VIII. Tasks and stages in DAG

When we read data from a file or an array, Spark keeps the data in an RDD split into partitions, and it executes the tasks for those partitions in parallel.

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. It transforms a logical execution plan (i.e., the RDD lineage of dependencies built using RDD transformations) into a physical execution plan (using stages).

The DAG scheduler splits the graph into multiple stages; the stages are created at the shuffle boundaries introduced by wide-dependency transformations.

Example:

// narrow transformations: parse the file and build a second RDD locally
val sfiRDD = sc.textFile("input.txt").map{ x => val xi = x.toInt; (xi, xi * xi) }
val spRDD = sc.parallelize{ (0 until 1000).map{ x => (x, x * x + 1) }}
// wide transformation: the join shuffles both RDDs by key and starts a new stage
val spJoinRDD = sfiRDD.join(spRDD)
// narrow transformations on the joined RDD
val sm = spJoinRDD.mapPartitions{ iter => iter.map{ case (k, (v1, v2)) => (k, v1 + v2) }}
val sf = sm.filter{ case (k, v) => v % 10 == 0 }
// action: triggers execution of the whole DAG and writes the result
sf.saveAsTextFile("/out")

Note: mapPartitions() allows initialization to be done once per partition rather than once per element, as in the sketch below.
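A hedged sketch of that per-partition initialization (LineParser is a made-up stand-in for any expensive-to-create resource):

// Hypothetical resource we only want to build once per partition.
class LineParser {
  def parse(line: String): Int = line.trim.toInt
}

val numbers = sc.textFile("numbers.txt").mapPartitions { iter =>
  val parser = new LineParser()        // created once per partition, not once per line
  iter.map(line => parser.parse(line))
}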

IX. Caching and Persisting

Cached data is reused from memory for other transformations/actions further down the processing DAG.

cache is a special case of persist: for RDDs, the cache() API is equivalent to the persist(StorageLevel.MEMORY_ONLY) API. In fact, if the developer were to look at the source code, the cache() call simply calls persist() with the default MEMORY_ONLY storage level, as in the sketch below.
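A short sketch of both APIs (the file name and filter are arbitrary):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("logs.txt")
val errors = logs.filter(line => line.contains("ERROR"))

errors.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY) for RDDs
// errors.persist(StorageLevel.MEMORY_AND_DISK)   // persist() lets us pick an explicit storage level

println(errors.count())   // the first action materializes and caches the RDD
println(errors.count())   // later actions reuse the cached data instead of re-reading the file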

X. Checkpointing
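Checkpointing writes an RDD to reliable storage (for example HDFS) and truncates its lineage, so a long chain of transformations does not have to be recomputed after a failure. A minimal sketch (the checkpoint directory and file name are placeholders):

sc.setCheckpointDir("/tmp/spark-checkpoints")   // directory on reliable storage

val counts = sc.textFile("file.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)

counts.checkpoint()   // mark the RDD for checkpointing
counts.count()        // the action triggers both the computation and the checkpoint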

XI. Spark SQL Architecture

Spark uses Catalyst (the optimizer) to optimize all queries, whether they are written in Spark SQL or with the DataFrame API. Catalyst works on the logical plan. Spark SQL uses Catalyst rules and a Catalog object that tracks the tables in all data sources to resolve unresolved attributes.

After the SQL query is parsed, the logical plan is built. Catalyst optimizes this logical plan, and after the optimization a physical plan (the implementation code) is generated and finally executed as RDD operations.
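To see Catalyst at work, the explain API prints the plans a query goes through; a small sketch, assuming the spark SparkSession provided by spark-shell and a placeholder JSON file:

val df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
// extended = true prints the parsed, analyzed, and optimized logical plans plus the physical plan
adults.explain(true)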