Last modified: Apr 03, 2025 By Alexander Williams

How to Install PySpark in Python Step by Step

PySpark is the Python API for Apache Spark, a distributed engine for processing large datasets. This guide will show you how to install PySpark step by step.

Prerequisites for Installing PySpark

Before installing PySpark, ensure you have these:

1. Python 3.7 or higher installed (recent PySpark releases may require a newer version).

2. Java 8, 11, or 17 installed on your system (the supported versions depend on your Spark release).

3. Basic knowledge of Python and pip.

Check your Python version using python --version. For Java, use java -version.


python --version
# Output: Python 3.9.7

java -version
# Output: openjdk version "11.0.12"
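
If the java command is not found, install Java with your system's package manager first. For example, on Debian or Ubuntu (the package name may differ on other distributions):


sudo apt install openjdk-11-jdk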

Step 1: Install PySpark Using pip

The easiest way to install PySpark is via pip. Run this command:


pip install pyspark

This will download and install PySpark and its dependencies.
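
To keep your projects isolated, you can also install PySpark inside a virtual environment, or pin a specific version (the version below is just an example):


python -m venv spark-env
source spark-env/bin/activate
pip install pyspark==3.5.1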

Step 2: Verify PySpark Installation

After installation, verify it works. Open Python and try importing PySpark:

 
import pyspark
print(pyspark.__version__)
# Output: 3.3.1 (your installed version may differ)

If you see a version number, PySpark is installed correctly.
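
You can run the same check in one line from your terminal:


python -c "import pyspark; print(pyspark.__version__)"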

Step 3: Set Up Java Home (If Needed)

PySpark needs a Java runtime to start Spark. If you get Java-related errors (such as the Java gateway process failing to start), set JAVA_HOME:


export JAVA_HOME=/path/to/java

Replace "/path/to/java" with your Java installation path (the directory that contains bin/java).
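
To make the setting permanent, add the export line to your shell profile. On macOS, the built-in java_home helper can find the path for you (the version flag below is just an example):


# Linux: append to ~/.bashrc or ~/.zshrc
echo 'export JAVA_HOME=/path/to/java' >> ~/.bashrc

# macOS: locate an installed Java 11
export JAVA_HOME=$(/usr/libexec/java_home -v 11)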

Step 4: Create a Simple PySpark Application

Test PySpark with a simple script. Create a file named test_pyspark.py:

 
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session
spark = SparkSession.builder.appName("TestApp").getOrCreate()

# Create a simple DataFrame from a list of tuples
data = [("Alice", 1), ("Bob", 2)]
df = spark.createDataFrame(data, ["Name", "Value"])

# Show the DataFrame
df.show()

# Stop the session to release resources
spark.stop()

Run the script with:


spark-submit test_pyspark.py
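
Because pip installs PySpark as a regular Python package, you can also run the script directly with python test_pyspark.py. Either way, the output should look something like this:


+-----+-----+
| Name|Value|
+-----+-----+
|Alice|    1|
|  Bob|    2|
+-----+-----+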

Common Installation Issues

If you face a ModuleNotFoundError for pyspark, see our guide on solving ModuleNotFoundError.
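
A frequent cause is that pip installed PySpark for a different Python interpreter than the one you run. Installing through the interpreter itself avoids this mismatch:


python -m pip install pyspark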

Other common issues include:

1. Java not installed or wrong version.

2. Python version too old.

3. Network issues during pip install.

Alternative Installation Methods

You can also install PySpark via Conda:


conda install -c conda-forge pyspark
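
As with pip, it is a good idea to install into a dedicated environment first (the environment name and Python version below are just examples):


conda create -n spark-env python=3.10
conda activate spark-env
conda install -c conda-forge pyspark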

Or download a prebuilt release directly from the Apache Spark website.
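
If you go the manual route, extract the downloaded archive and point SPARK_HOME at it so that spark-submit is on your PATH (the file name below is an example; use the release you downloaded):


tar -xzf spark-3.5.1-bin-hadoop3.tgz -C $HOME
export SPARK_HOME=$HOME/spark-3.5.1-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH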

Conclusion

Installing PySpark is simple with pip. Always verify the installation. Set JAVA_HOME if needed. Now you're ready for big data processing with PySpark!

For more complex setups, check the official PySpark documentation. Happy coding!