Set Up an Apache Spark Environment on Windows

Apache Spark is an easy-to-use, unified platform for big data processing. It comes with a rich set of APIs for different application needs: Spark DataFrame and Spark SQL for structured data processing, Spark Streaming and Structured Streaming for streaming applications, Spark MLlib for machine learning, and Spark GraphX for graph analytics. These APIs can be combined in the same application without integration overhead. Apache Spark can also be deployed on different cluster managers such as YARN, Mesos, Spark Standalone, or Kubernetes, and it provides language APIs for Java, Scala, Python, and R.

The following steps walk through setting up an Apache Spark environment on Windows.

A JDK must be installed before setting up Apache Spark with Scala, so make sure it is installed and ready to use (Spark 2.4.4 runs on Java 8). You can test the Java installation by running the command java -version in any terminal; it should print the installed Java version.


Part One: Install and Configure Apache Spark

1- First, download Apache Spark 2.4.4: go to https://spark.apache.org/downloads.html , choose Spark release 2.4.4 with the package pre-built for Hadoop 2.7, then click on the file name with the .tgz extension

2- A web page will open; click on the highlighted mirror link to start the file download

3- Extract the file into any directory you choose

4- Note the directory where you extracted Apache Spark, because we will use it in the coming steps

5- On Windows, Spark needs the Hadoop helper binary winutils.exe, so the next step is to download this file, which stands in for a Hadoop installation. You can get it from https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe , click on Download

6- Create a new directory with the following inner directories
<new folder>\winutils\bin
Then copy the file downloaded in step 5 into the bin directory

7- Next, add the environment variables SPARK_HOME and HADOOP_HOME, and add the Spark and Hadoop bin directories to the PATH variable
The following instructions are for Windows; on Linux and Mac you can set the same variables in the .bash_profile file or the user profile

8- Go to Control Panel, click on System and Security

9- Click on System

10- Click on Advanced System Settings

11- Click on Environment Variables

12- Click on New to add a new environment variable, enter SPARK_HOME as the variable name and the directory where you extracted Spark in step 4 as the value, then click OK

Repeat the same step to add HADOOP_HOME, using the winutils directory created in step 6 (the parent of bin) as the value

13- In the Environment Variables window, find the Path variable, then click on Edit

14- Click on New to add a new entry to the Path variable

15- Add two entries
%HADOOP_HOME%\bin
%SPARK_HOME%\bin

16- To test your setup so far, open a new command prompt, type spark-shell, and press Enter; a Spark shell should start. Run a test command to create a variable x with initial value 8, then type x+3 to add 3 to x; the shell should print the result 11
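The test in step 16 can be sketched as the lines below, typed at the scala> prompt once spark-shell has started. The two println lines are an optional extra, not part of the original steps; they only echo the variables set in steps 12 to 15 so you can confirm the shell sees them.

```scala
// Step 16: create a variable and add 3 to it
val x = 8
x + 3   // the shell should echo a result of 11

// Optional extra check (an addition to the original steps):
// print the environment variables as seen by the shell's JVM
println(sys.env.getOrElse("SPARK_HOME", "SPARK_HOME is not set"))
println(sys.env.getOrElse("HADOOP_HOME", "HADOOP_HOME is not set"))
```

If either variable shows as not set, close the command prompt and open a new one, since changes to environment variables are only picked up by newly started processes.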

17- The environment is now ready. The next step is to set up the IDEs we will use to run the code in this book: IntelliJ IDEA Community Edition, which is free and which we need to run Scala code, and PyCharm to run Python code. In the following steps you will learn how to install, configure, and set up a Spark project on both IDEs

Part Two: Set Up an IntelliJ Apache Spark Project

18- Go to the IntelliJ website and download IntelliJ IDEA for Windows; the Community edition is enough to set up and work with Apache Spark projects
https://www.jetbrains.com/idea/download/#section=windows

19- When the installation finishes, you will be asked to install plugins as part of the setup; choose the Scala plugin

If you already have IntelliJ installed and need to install the Scala plugin, go to the File menu from any project, then click on Settings

Click on Plugins then search for Scala Plugin under Marketplace, then click on Install

20- Click on Create New Project, choose Scala as the project type, then select sbt

21- Choose Project name, location, JDK, sbt version, Scala version, then click Finish

22- It will take a few minutes to set up the project, after which it is ready to work with

23- Next, add dependencies to the build.sbt file, which lists all the Scala dependencies the project needs. Go to the Maven repository website and copy the SBT libraryDependencies statement for each library
This project needs the Spark Core and Spark SQL libraries; choose the library dependency that matches the Spark version you installed and the Scala version you selected when creating the project
https://mvnrepository.com/artifact/org.apache.spark/spark-core
https://mvnrepository.com/artifact/org.apache.spark/spark-sql

24- Now double-click on the build.sbt file in the project and paste the library dependency statements
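After step 24, build.sbt might look roughly like the sketch below. The project name, project version, and Scala version here are assumptions for illustration; use the values you chose in step 21. The Spark version 2.4.4 matches the download in step 1, and the Scala version is assumed to be 2.11.12 because Spark 2.4.4 artifacts are published for Scala 2.11 and 2.12.

```scala
// build.sbt -- a sketch; name, version and scalaVersion are assumed values
name := "spark-first-app"
version := "0.1"
scalaVersion := "2.11.12"

// %% appends the Scala binary version (e.g. _2.11) to the artifact name,
// so the Spark libraries are resolved for the scalaVersion set above
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.4",
  "org.apache.spark" %% "spark-sql"  % "2.4.4"
)
```

After saving the file, IntelliJ usually offers to reload or import the sbt changes; accepting that lets it download the libraries.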

25- Right-click on the scala folder, then New, then Scala Class, then select Object and enter a name for the object you want to create

26- Now we are ready to write our first application and run it. This application creates a SparkSession and a simple variable for testing

Note: if you cannot find Scala Class as an option when you right-click on the scala folder, follow the next two steps to add Scala framework support to your project
1- Right-click on the project name, then choose Add Framework Support
2- Click on the Scala check box, then click on Finish
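A first application along the lines of step 26 could look like the minimal sketch below. The object name FirstSparkApp and the helper buildSession are hypothetical names chosen for illustration; use the object name you entered in step 25. The master is assumed to be local[*] so that Spark runs inside the IDE without a cluster.

```scala
import org.apache.spark.sql.SparkSession

// "FirstSparkApp" is a hypothetical name; use the object created in step 25
object FirstSparkApp {

  // Hypothetical helper: builds a session that runs Spark locally,
  // using all available cores of the machine
  def buildSession(appName: String): SparkSession =
    SparkSession.builder()
      .appName(appName)
      .master("local[*]")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession("FirstSparkApp")

    // Simple variable for testing, as in the spark-shell check
    val x = 8
    println(s"x + 3 = ${x + 3}")

    // A tiny DataFrame to confirm that the Spark SQL dependency resolves
    val df = spark.range(5).toDF("n")
    df.show()

    // Release the local Spark resources before exiting
    spark.stop()
  }
}
```

To run it, right-click inside the object in the editor and choose the Run option for it; the output should appear in the Run window together with Spark's log messages.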

Ahmed Ibrahem

Ahmed Ibrahem works as a Data Engineering Team Lead, with wide experience in data management projects and technical implementation using different technologies, delivering end-to-end projects, from business analysis to data warehouse modeling and implementation to BI design, to customers in different industries.
