Problem

You’ve set up a local Spark 3.5.3 cluster for development using the Bitnami Docker images and compose file, you’ve hooked up your PySpark SparkSession, and then, when you run df.limit(10).toPandas().head() for the first time, you are greeted with the following disheartening traceback:

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 29
     24 # display() here will just do the jupyter rendir, which is super limited
     25 #display(df)
     26 # df.show() slightly better
     27 df.show()
---> 29 df.limit(10).toPandas().head()

File ~/.virtualenvs/level3databricks/lib/python3.12/site-packages/pyspark/sql/pandas/conversion.py:86, in PandasConversionMixin.toPandas(self)
     83 from pyspark.sql.pandas.types import _create_converter_to_pandas
     84 from pyspark.sql.pandas.utils import require_minimum_pandas_version
---> 86 require_minimum_pandas_version()
     88 import pandas as pd
     90 jconf = self.sparkSession._jconf

File ~/.virtualenvs/level3databricks/lib/python3.12/site-packages/pyspark/sql/pandas/utils.py:24, in require_minimum_pandas_version()
     21 # TODO(HyukjinKwon): Relocate and deduplicate the version specification.
     22 minimum_pandas_version = "1.0.5"
---> 24 from distutils.version import LooseVersion
     26 try:
     27     import pandas

ModuleNotFoundError: No module named 'distutils'
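
For context, nothing about the session setup itself is exotic. A minimal sketch of the kind of code that triggers this, assuming the Bitnami compose file exposes the standalone master at spark://localhost:7077, looks something like:

from pyspark.sql import SparkSession

# Driver-side session pointing at the Bitnami standalone master
# (spark://localhost:7077 is an assumption -- adjust to your compose setup).
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("local-dev")
    .getOrCreate()
)

df = spark.range(100)
df.limit(10).toPandas().head()  # fails with ModuleNotFoundError on Python 3.12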

Analysis

In short, the Bitnami images ship with Python 3.12, which means you have to run Python 3.12 on your driver machine as well. Unfortunately, PySpark 3.5.3 relies on the deprecated distutils package to check whether the required dependencies, in this case pandas, meet the minimum version requirements.
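
A quick way to confirm that the driver and the workers are on the same interpreter line is to compare what each side reports; a small sketch, assuming spark is an active session against the cluster:

import sys

# Python used by the driver process
print("driver :", sys.version.split()[0])

# Python used by the executors: run a trivial one-partition job and ask a worker
worker_version = (
    spark.sparkContext.parallelize([0], 1)
    .map(lambda _: __import__("sys").version.split()[0])
    .collect()[0]
)
print("workers:", worker_version)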

distutils was deprecated in Python 3.10 and completely removed in Python 3.12.

After spending a few minutes successfully setting up a dummy distutils package in my project, I decided to check what was available online, and fortunately landed on the Python dead batteries redistribution, a.k.a. the python-deadlib project, the fashionably late star of this post.
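
For the curious, such a shim only needs to provide the one name PySpark imports. A rough sketch of what it might look like (the comparison logic here is heavily simplified compared to the real LooseVersion, and only safe for plain numeric versions like "1.0.5" or "2.2.3"):

# distutils/__init__.py is left empty; distutils/version.py contains:

class LooseVersion:
    """Minimal stand-in for distutils.version.LooseVersion."""

    def __init__(self, vstring):
        self.vstring = vstring
        # "2.2.3" -> (2, 2, 3); anything non-numeric is kept as a string.
        self.version = tuple(
            int(part) if part.isdigit() else part
            for part in vstring.split(".")
        )

    def __lt__(self, other):
        return self.version < other.version

    def __repr__(self):
        return f"LooseVersion('{self.vstring}')"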

Solution

In short, python-deadlib is a collection of deprecated standard-library packages that you can install as drop-in replacements when your situation requires them. Seeing that we will have to wait for PySpark 4.0 for the distutils dependency to be removed, this was just such a situation.

TLDR

In my case, a quick

pip install standard-distutils

or rather

uv add standard-distutils

was all that was required to resolve the Python 3.12 vs PySpark 3.5.3 distutils dependency issue.
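
If you manage your dependencies in a requirements file, you can scope the backport to the interpreters that actually need it with a standard PEP 508 environment marker, for example:

standard-distutils; python_version >= "3.12"

And a quick sanity check that the drop-in resolves the import PySpark needs:

from distutils.version import LooseVersion  # now provided by standard-distutils

assert LooseVersion("2.2.3") >= LooseVersion("1.0.5")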