PySpark setup, common errors Py4JError and StorageUtils$

Updated: 2023-04-23

A strong background in Java can be useful when working on Big Data projects in Python: PySpark in fact translates the instructions from Python into calls to Spark's Scala/Java code and runs them in a JVM.
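As a minimal sketch of what happens under the hood (the local master and the app name are just examples):

from pyspark.sql import SparkSession

# getOrCreate() starts a JVM in the background; Python talks to it
# through a Py4J gateway.
spark = SparkSession.builder.master("local[1]").appName("py4j-demo").getOrCreate()

# This count() is executed by Spark's Scala/Java code inside the JVM,
# not by the Python interpreter.
print(spark.range(5).count())  # 5

spark.stop()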

Common errors

StorageUtils$ error

If you get this error:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$ 

Your Java version is probably not supported by Spark.

Spark 3.2.0 doesn't officially support Java 17; official support only arrives with Spark 3.3.0.

Here you can see the changes made in the Spark project to become compatible with Java 17 and their status ... and even with Spark 3.3 you risk running into some issues with your program.

To avoid these runtime issues the Spark team added a launcher to use with Java 17:

JavaModuleOptions.java

In some forums, developers tried to fix the module issue by adding this instruction to the JVM options:

--add-exports java.base/sun.nio.ch=ALL-UNNAMED 
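One way to experiment with that option from a plain Python script (a sketch, assuming the script is started with python and not via spark-submit; with spark-submit you would pass --driver-java-options directly) is to set it through the PYSPARK_SUBMIT_ARGS environment variable before the session is created:

import os

# Unofficial workaround discussed in forums (not the official solution):
# pass the module export to the driver JVM before PySpark launches it.
# The variable is read when the JVM gateway is created, so it must be
# set before getOrCreate().
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    '--driver-java-options "--add-exports java.base/sun.nio.ch=ALL-UNNAMED" '
    "pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("java17-workaround").getOrCreate()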

In our case, because the application is critical, we prefer to stay on the safe side and kept Java 11 as the runtime.

We won't change until there is a widely adopted / official solution.
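If several Java installations coexist on the machine, a minimal sketch to keep PySpark on Java 11 (the path below is just an example, adjust it to your installation) is to point JAVA_HOME at it before the session is created:

import os

# Example path: PySpark uses JAVA_HOME to locate the java executable
# when it launches the driver JVM, so set it before getOrCreate().
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("java11-runtime").getOrCreate()

# Ask the driver JVM which Java version it is actually running on.
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))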

Py4JError

Another issue we ran into is a version mismatch between PySpark and Spark.

py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.SparkSession. Trace: 

In this case we had to make sure to use the same version of PySpark and Spark, 3.2.2 in our case.

To check your PySpark version you can use the command:

pyspark --version 

The result should be similar to this one:

>pyspark --version 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2 
      /_/ 
  
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.18 
Branch HEAD 
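
You can also check both sides from inside Python; as a quick sketch, the version of the installed pyspark package and the version reported by the running Spark session should match:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("version-check").getOrCreate()

# Version of the installed pyspark package (the Python side).
print(pyspark.__version__)  # e.g. 3.2.2

# Version of the Spark runtime the session is talking to (the JVM side).
print(spark.version)        # should be the same, e.g. 3.2.2

spark.stop()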

Start a standalone Spark cluster on localhost

Spark is designed as a distributed system; to test it in a local environment during development you can start a standalone master with the command:

spark-class org.apache.spark.deploy.master.Master 

If everything works correctly you should see the Spark version in the startup logs, e.g.:

23/04/23 17:10:12 INFO Utils: Successfully started service 'sparkMaster' on port 7077. 
23/04/23 17:10:12 INFO Master: Starting Spark master at spark://192.168.1.1:7077 
23/04/23 17:10:12 INFO Master: Running Spark version 3.2.2 
23/04/23 17:10:13 INFO Utils: Successfully started service 'MasterUI' on port 8080. 
23/04/23 17:10:13 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://host.docker.internal:8080 
23/04/23 17:10:13 INFO Master: I have been elected leader! New state: ALIVE 
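
Once the master is up, you can point a PySpark session at it; a minimal sketch (the master URL comes from the log line above, adjust it to your host/IP):

from pyspark.sql import SparkSession

# Connect to the standalone master started above.
# Note: to actually run jobs you also need at least one worker
# registered with this master.
spark = (
    SparkSession.builder
    .master("spark://192.168.1.1:7077")
    .appName("standalone-test")
    .getOrCreate()
)

print(spark.version)
spark.stop()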
