PySpark setup: common errors Py4JError and StorageUtils$
A strong background in Java can be useful when working on Big Data projects in Python: under the hood, PySpark translates your Python instructions into calls to Spark's Scala/Java API and runs them inside a JVM.
Common errors
StorageUtils$ error
If you get this error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
Your Java version is probably not supported by Spark.
Spark 3.2.0 doesn't officially support Java 17; official support only arrives with Spark 3.3.0.
Here you can see the changes made in the Spark project (and their status) to make it compatible with Java 17 ... and even with Spark 3.3 you risk running into issues with your program.
To avoid these runtime issues, the Spark team added a launcher to be used with Java 17.
On some forums, developers have tried to fix the module access issue by adding this instruction to the JVM options:
--add-exports java.base/sun.nio.ch=ALL-UNNAMED
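For illustration only (we did not adopt this in production), here is a minimal sketch of how such a JVM flag is typically forwarded to Spark from PySpark. The app name and variable names are ours; note that the driver JVM generally needs the flag at launch time (spark-defaults.conf or spark-submit --driver-java-options), while spark.executor.extraJavaOptions can be set from the session builder:

from pyspark.sql import SparkSession

# Sketch: forward the module-access flag to the executor JVMs.
# The driver JVM usually needs the same flag at launch time, i.e. in
# spark-defaults.conf or via spark-submit --driver-java-options.
java17_flag = "--add-exports java.base/sun.nio.ch=ALL-UNNAMED"

spark = (
    SparkSession.builder
    .appName("java17-workaround-sketch")  # hypothetical app name
    .config("spark.executor.extraJavaOptions", java17_flag)
    .getOrCreate()
)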
In our case, because of the critical importance of the application, we preferred to play it safe and kept Java 11 as the runtime. We won't change it until there is a widely adopted, official solution.
Py4JError
Another issue we ran into is a version mismatch between PySpark and Spark.
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.SparkSession. Trace:
In this case we had to make sure to use the same version of PySpark and Spark, which for us was 3.2.2.
To check your PySpark version you can use the command:
pyspark --version
The result should be similar to this one:
>pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
      /_/
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.18
Branch HEAD
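You can also compare the two versions from Python itself. A small sketch (pyspark.__version__ and SparkSession.version are both part of the public API):

import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)   # version of the installed PySpark package

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)         # version of the Spark runtime actually in use
spark.stop()

If the two values differ, align them, for example by pinning the package with pip install pyspark==3.2.2.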
Start the standalone Spark cluster on localhost
Spark is designed as a distributed system; to try it out locally during development, you can start a standalone master with the command:
spark-class org.apache.spark.deploy.master.Master
If everything works correctly, you should see the Spark version in the startup logs, e.g.:
23/04/23 17:10:12 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
23/04/23 17:10:12 INFO Master: Starting Spark master at spark://192.168.1.1:7077
23/04/23 17:10:12 INFO Master: Running Spark version 3.2.2
23/04/23 17:10:13 INFO Utils: Successfully started service 'MasterUI' on port 8080.
23/04/23 17:10:13 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://host.docker.internal:8080
23/04/23 17:10:13 INFO Master: I have been elected leader! New state: ALIVE
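Once the master is up (and at least one worker has been attached with spark-class org.apache.spark.deploy.worker.Worker spark://<master-host>:7077), you can point a PySpark session at it. A minimal sketch, assuming the master runs on localhost with the default port 7077:

from pyspark.sql import SparkSession

# Connect to the standalone master started above (default port 7077).
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("standalone-smoke-test")  # hypothetical app name
    .getOrCreate()
)

# Tiny smoke test: build a DataFrame and count its rows on the cluster.
df = spark.range(1000)
print(df.count())  # expected: 1000

spark.stop()

The application should also show up in the master web UI, which the log above reports as listening on port 8080.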