PySpark setup: common errors Py4JError and StorageUtils$
A strong background in Java can be useful when working on Big Data projects in Python: under the hood, PySpark translates your Python instructions into calls to Spark's Scala/Java API and runs them inside a JVM.
Common errors
StorageUtils$ error
If you get this error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
Your Java version is probably not supported by Spark.
Spark 3.2.0 doesn't officially support Java 17; official support only arrives with Spark 3.3.0.
Here you can see the changes made in the Spark project (and their status) to make it compatible with Java 17 ... and even with Spark 3.3 you risk running into issues with your program.
To avoid these runtime issues, the Spark team added a launcher to be used with Java 17.
On some forums, developers have tried to fix the module access issue by adding this instruction to the JVM options:
--add-exports java.base/sun.nio.ch=ALL-UNNAMED
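For illustration only (we did not adopt this in production), here is a minimal sketch of how such a JVM flag is typically forwarded to Spark from PySpark. The app name and variable names are ours; note that the driver JVM generally needs the flag at launch time (spark-defaults.conf or spark-submit --driver-java-options), while spark.executor.extraJavaOptions can be set from the session builder:

from pyspark.sql import SparkSession

# Sketch: forward the module-access flag to the executor JVMs.
# The driver JVM usually needs the same flag at launch time, i.e. in
# spark-defaults.conf or via spark-submit --driver-java-options.
java17_flag = "--add-exports java.base/sun.nio.ch=ALL-UNNAMED"

spark = (
    SparkSession.builder
    .appName("java17-workaround-sketch")  # hypothetical app name
    .config("spark.executor.extraJavaOptions", java17_flag)
    .getOrCreate()
)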
In our case, because of the critical importance of the application, we preferred to play it safe and kept Java 11 as the runtime. We won't change it until there is a widely adopted, official solution.
Py4JError
Another issue we ran into is a version mismatch between PySpark and Spark.
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.SparkSession. Trace:
In this case we had to make sure to use the same version of PySpark and Spark, which for us was 3.2.2.
To check your PySpark version you can use the command:
pyspark --version
The result should be similar to this one:
>pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
      /_/
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.18
Branch HEAD
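You can also compare the two versions from Python itself. A small sketch (pyspark.__version__ and SparkSession.version are both part of the public API):

import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)   # version of the installed PySpark package

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)         # version of the Spark runtime actually in use
spark.stop()

If the two values differ, align them, for example by pinning the package with pip install pyspark==3.2.2.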
Start the standalone Spark cluster on localhost
Spark is designed as a distributed system; to try it out locally during development, you can start a standalone master with the command:
spark-class org.apache.spark.deploy.master.Master
If everything works correctly, you should see the Spark version in the startup logs, e.g.:
23/04/23 17:10:12 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
23/04/23 17:10:12 INFO Master: Starting Spark master at spark://192.168.1.1:7077
23/04/23 17:10:12 INFO Master: Running Spark version 3.2.2
23/04/23 17:10:13 INFO Utils: Successfully started service 'MasterUI' on port 8080.
23/04/23 17:10:13 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://host.docker.internal:8080
23/04/23 17:10:13 INFO Master: I have been elected leader! New state: ALIVE
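Once the master is up (and at least one worker has been attached with spark-class org.apache.spark.deploy.worker.Worker spark://<master-host>:7077), you can point a PySpark session at it. A minimal sketch, assuming the master runs on localhost with the default port 7077:

from pyspark.sql import SparkSession

# Connect to the standalone master started above (default port 7077).
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("standalone-smoke-test")  # hypothetical app name
    .getOrCreate()
)

# Tiny smoke test: build a DataFrame and count its rows on the cluster.
df = spark.range(1000)
print(df.count())  # expected: 1000

spark.stop()

The application should also show up in the master web UI, which the log above reports as listening on port 8080.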