Apache Spark Application Execution Modes

Apache Spark is a powerful processing platform for big data applications that supports several types of big data workloads. In this article we will look at how a Spark application can be executed in multiple modes, depending on the environment architecture and the application requirements.

Before going into the details: if you would like to set up an Apache Spark environment on your Windows machine, check our previous article.

Cluster Mode

In cluster mode, the user submits the packaged application file to the cluster. The cluster manager launches the driver process and then the executor processes on the cluster worker nodes, and it maintains the application from start to finish. When the application completes, whether successfully or with failures, the cluster manager stores the application status, which can then be checked from the cluster monitoring interface. The following steps describe the communication between the different parties in this mode.

Application execution proceeds as follows:

1- The user packages the Spark application and submits it to the cluster manager using spark-submit

2- The cluster manager launches the driver process on one of the worker nodes

3- The driver process starts executing the application code; SparkSession is the entry point of a Spark application to the cluster, so the driver starts by creating a SparkSession (see the sketch below)

4- The SparkSession communicates with the cluster manager to allocate the required resources and launch the number of executors requested in the user's submission

5- The cluster manager launches the executor processes as requested and sends the locations of all executors (worker node connection details) to the driver process

6- The driver process starts communicating with the executor processes; data begins moving around, and the submitted application is physically executed

Note: In this mode, the client's status does not affect running applications, since the application is fully managed by the cluster manager and all processes run on the cluster.
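As a minimal sketch of steps 3 and 4, the driver code typically begins by creating a SparkSession and ends by stopping it. The object name, application name, and word-count job logic below are hypothetical placeholders, not part of the original article:

import org.apache.spark.sql.SparkSession

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // Entry point of the Spark application to the cluster; the master URL and
    // deploy mode are supplied by spark-submit rather than hard-coded here.
    val spark = SparkSession.builder()
      .appName("WordCountApp") // hypothetical application name
      .getOrCreate()

    // Hypothetical job logic: count the words in a text file passed as the first argument.
    val counts = spark.read.textFile(args(0))
      .selectExpr("explode(split(value, ' ')) AS word")
      .groupBy("word")
      .count()
    counts.show()

    spark.stop()
  }
}

Once packaged as a JAR, this is the application file handed to spark-submit in step 1.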

Client Mode

In client mode, the user submits the packaged application file and the driver process is started locally, on the machine from which the application was submitted. The driver process begins by initiating a SparkSession, which communicates with the cluster manager to allocate the required resources. The following steps describe the communication between the different parties in this mode:

1- The user packages the Spark application and submits it in client mode using spark-submit

2- The driver process starts on the local machine from which the application was submitted

3- The driver process starts executing the application code; SparkSession is the entry point of a Spark application to the cluster, so the driver starts by creating a SparkSession

4- The SparkSession communicates with the cluster manager to allocate the required resources and launch the number of executors requested in the user's submission

5- The cluster manager launches the executor processes as requested and sends the locations of all executors (worker node connection details) to the driver process

6- The driver process starts communicating with the executor processes on the worker nodes; data begins moving around, and the submitted application is physically executed

Note: In this mode, the client machine is fully responsible for managing the application from start to finish. One of the important performance factors in this mode is therefore the stability and speed of the network connection between the client machine and the cluster nodes.
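To make the difference from cluster mode concrete, the following is a minimal client-mode submission sketch; the master host, class name, JAR path, and input argument are hypothetical placeholders:

./bin/spark-submit --class com.example.WordCountApp --master spark://master-host:7077 --deploy-mode client /path/to/wordcount.jar input.txt

Because the deploy mode is client, the driver process runs on the machine where this command is executed, while the executors still run on the cluster worker nodes.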

Local Mode

This mode has the same general behavior as the previous modes, except that all processes run as threads on the local machine; the entire Spark "cluster" is co-located on a single machine. This makes local mode well suited for development and testing.
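Local mode is selected through the master URL rather than the deploy mode. In the following sketch, the class name, JAR path, and input argument are hypothetical placeholders, and local[*] asks Spark to use as many worker threads as there are cores on the machine:

./bin/spark-submit --class com.example.WordCountApp --master "local[*]" /path/to/wordcount.jar input.txt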

Submit Command

The following is the command syntax for submitting your Spark application in the different modes:

./bin/spark-submit --class <main-class> --master <master-url> --deploy-mode <deploy-mode> --conf <key>=<value> <application-jar> [application-arguments]
main-class (Required): Name of the main class in your application.

master-url (Required): URL of the cluster on which your application will execute. This could be YARN, Mesos, Kubernetes, Spark Standalone, or local, for example spark://host:port or mesos://host:port.

deploy-mode (Optional): cluster or client; it defaults to client, and local execution is selected through the master URL rather than the deploy mode.

conf <key>=<value> (Optional): Extra configuration such as the number of executors, executor memory, etc.

application-jar (Required): Your application JAR file.

application-arguments (Optional): Any arguments needed as input to your application.
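Putting the attributes together, the following is an end-to-end sketch of a cluster-mode submission to a YARN cluster; the class name, JAR path, input argument, and resource settings are assumptions for illustration:

./bin/spark-submit \
  --class com.example.WordCountApp \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.executor.memory=2g \
  /path/to/wordcount.jar input.txt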

For a full description of the command attributes and other extra options, please check the Apache Spark documentation:

http://spark.apache.org/docs/latest/submitting-applications.html

Ahmed Ibrahem

Ahmed Ibrahem works as a Data Engineering Team Lead, with wide experience in data management projects and technical implementations using different technologies, delivering end-to-end projects for customers in different industries, from business analysis to data warehouse modeling and implementation to BI design.
