Python Deadlib for Deprecated Libraries like distutils
Problem
You’ve set up a local Spark 3.5.3 cluster for development using the Bitnami Docker images and compose file, you’ve hooked up your PySpark SparkSession, and then when you try to df.limit(10).toPandas().head() for the first time, you are greeted with the following disheartening traceback:
|
|
Analysis
In short, the Bitnami images ship with Python 3.12, which means you have to run Python 3.12 on your driver machine as well. Unfortunately, PySpark 3.5.3 relies on the deprecated distutils package to check whether the required dependencies, in this case pandas, meet the minimum version requirements. distutils has been deprecated since Python 3.10 and was completely removed in Python 3.12.
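To see which side of that removal your driver interpreter is on, a small diagnostic can help. This is a minimal sketch using only the standard library; the function names are made up for illustration and are not part of PySpark or python-deadlib.

```python
import importlib.util
import sys


def stdlib_ships_distutils(version_info=sys.version_info) -> bool:
    """Return True for interpreters older than 3.12, which still bundle distutils."""
    return tuple(version_info[:2]) < (3, 12)


def distutils_is_importable() -> bool:
    """Return True if any importable 'distutils' is found (stdlib or a drop-in shim)."""
    return importlib.util.find_spec("distutils") is not None


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    print("stdlib ships distutils:", stdlib_ships_distutils())
    print("distutils importable:", distutils_is_importable())
```

If the first check is False and the second is also False, any code path that imports distutils will fail on this interpreter. Note that some environments provide distutils through another installed package, so the second check can be True even on Python 3.12.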
After spending a few minutes successfully setting up a dummy distutils package in my project, I decided to check what was available online, and fortunately landed on the Python dead batteries redistribution, aka the python-deadlib project, the fashionably late star of this post.
Solution
In short, python-deadlib is a collection of modules removed from the Python standard library, repackaged so that you can install them as drop-in replacements when your situation requires it. Seeing that we will have to wait for PySpark 4.0 for the distutils dependency to be removed, this was just such a situation.
TLDR
In my case, a quick
|
|
|
|
was all that was required to resolve the Python 3.12 vs PySpark 3.5.3 distutils dependency issue.
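To confirm that a drop-in distutils is actually picked up by the driver's interpreter, a quick check along these lines may be useful. It is only a sketch: the function name and the version strings are illustrative, and it assumes the shim exposes distutils.version.LooseVersion the way the old standard-library module did. I believe the python-deadlib shim is published on PyPI as standard-distutils, but treat that name as an assumption and check the project README.

```python
def check_distutils_shim(installed: str, minimum: str):
    """Compare two version strings with LooseVersion, the kind of check
    that needs distutils to be importable.

    Returns True or False for the comparison, or None when distutils
    cannot be imported at all (i.e. the shim is not installed).
    """
    try:
        from distutils.version import LooseVersion
    except ImportError:
        return None
    return LooseVersion(installed) >= LooseVersion(minimum)


if __name__ == "__main__":
    # Illustrative version strings only, not read from a real installation.
    result = check_distutils_shim("2.0.0", "1.0.5")
    if result is None:
        print("distutils is still missing on this interpreter")
    else:
        print("distutils import works, version check passed:", result)
```

If this prints that distutils is still missing, the package was most likely installed into a different environment than the one your SparkSession driver runs in.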