Unit test pyspark code using python

Here's a lightweight way to test your function. Unlike what the accepted answer outlines, you don't need to download Spark to run PySpark tests. Downloading Spark is an option, but it isn't necessary. Here's the test:

import pysparktestingexample.stackoverflow as SO
from chispa import assert_df_equality
import pyspark.sql.functions as F

def test_column_names(spark):
    source_data = [
        ("jose", "oak", "switch")
    ]
    source_df = spark.createDataFrame(source_data, ["some first name", "some.tree.type", "a gaming.system"])

    actual_df = SO.column_names(source_df)

    expected_data = [
        ("jose", "oak", "switch")
    ]
    expected_df = spark.createDataFrame(expected_data, ["some_&first_&name", "some_$tree_$type", "a_&gaming_$system"])

    assert_df_equality(actual_df, expected_df)

The SparkSession used by the test is defined in the tests/conftest.py file:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope='session')
def spark():
    return SparkSession.builder \
      .master("local") \
      .appName("chispa") \
      .getOrCreate()

The test uses the assert_df_equality function defined in the chispa library.

Here's your code and the test in a GitHub repo.

pytest is generally preferred in the Python community over unittest. This blog post explains how to test PySpark programs and ironically has a modify_column_names function that'd let you rename these columns more elegantly.

def modify_column_names(df, fun):
    for col_name in df.columns:
        df = df.withColumnRenamed(col_name, fun(col_name))
    return df
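As a rough sketch of how that helper could produce the column names the test above expects: `rename_column` is a hypothetical name, and its mapping (spaces to `_&`, dots to `_$`) is inferred from the test's expected column names rather than taken from the original code.

```python
def rename_column(col_name):
    # Hypothetical helper: the mapping is inferred from the expected
    # column names in the test, not from the asker's actual function.
    return col_name.replace(" ", "_&").replace(".", "_$")


def modify_column_names(df, fun):
    # Same helper as shown above, repeated so this sketch is self-contained.
    for col_name in df.columns:
        df = df.withColumnRenamed(col_name, fun(col_name))
    return df


# Usage with a real DataFrame (not executed here):
#   renamed_df = modify_column_names(source_df, rename_column)
```

Keeping the string logic in a plain function means it can be unit tested without a SparkSession at all; only the DataFrame wiring needs Spark.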

Pyspark Unittests guide

1. Download a Spark distribution from the Spark site and unpack it. Or, if you already have a working distribution of Spark and Python, just install pyspark: pip install pyspark

2. Set system variables like this, if needed:

export SPARK_HOME="/home/eugene/spark-1.6.0-bin-hadoop2.6"
export PYTHONPATH="$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
export PATH="$SPARK_HOME/bin:$PATH"

I added these lines to .profile in my home directory. If you already have a working distribution of Spark, these variables may already be set.
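To confirm the variables resolve before running any tests, a small check along these lines may help. It's only a sketch: `find_py4j_zips` is a hypothetical helper, and it assumes the py4j archive sits under `$SPARK_HOME/python/lib`, as in the export lines above.

```python
import glob
import os


def find_py4j_zips(spark_home):
    # Look for the bundled py4j source archive; its version number
    # differs between Spark releases, hence the wildcard.
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home:
        print("SPARK_HOME is not set")
    else:
        print(find_py4j_zips(spark_home) or "no py4j archive found under SPARK_HOME")
```

If the list comes back empty, the py4j entry in PYTHONPATH is probably pointing at the wrong file name for your Spark version.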

3. Additionally, you may need to set:

PYSPARK_SUBMIT_ARGS="--jars path/to/hive/jars/jar.jar,path/to/other/jars/jar.jar --conf spark.driver.userClassPathFirst=true --master local[*] pyspark-shell"
PYSPARK_PYTHON="/home/eugene/anaconda3/envs/ste/bin/python3"

Python and jars? Yes. PySpark uses py4j to communicate with the Java part of Spark. If you want to handle more complicated situations, like running a Kafka server alongside tests in Python or using TestHiveContext from Scala as in the example, you need to specify jars. I did it through the environment variables of an IntelliJ IDEA run configuration.
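If you'd rather set this from the test code than from the IDE, one option is to build the string in Python and export it before pyspark launches the JVM. A sketch, where `build_submit_args` is a hypothetical helper and the jar paths are the same placeholders as above:

```python
import os


def build_submit_args(jars, master="local[*]", user_classpath_first=True):
    # Assemble the value in the same shape as the PYSPARK_SUBMIT_ARGS line above.
    parts = []
    if jars:
        parts.append("--jars " + ",".join(jars))
    if user_classpath_first:
        parts.append("--conf spark.driver.userClassPathFirst=true")
    parts.append("--master " + master)
    parts.append("pyspark-shell")  # the resource name goes after the options
    return " ".join(parts)


# Set this before the first SparkContext is created; the value is read
# when the JVM gateway is launched, so later changes have no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["path/to/hive/jars/jar.jar", "path/to/other/jars/jar.jar"]
)
```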

4. You can also use the pyspark/tests.py, pyspark/streaming/tests.py, pyspark/sql/tests.py, pyspark/ml/tests.py and pyspark/mllib/tests.py scripts, which contain various TestCase classes and examples for testing PySpark apps. In your case you could do (example from pyspark/sql/tests.py):

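# Imports this excerpt relies on (module paths as in the Spark 1.6 distribution)
import os
import shutil
import tempfile
import unittest

import py4j.protocol
from pyspark.sql import HiveContext, Row
from pyspark.tests import ReusedPySparkTestCase

class HiveContextSQLTests(ReusedPySparkTestCase):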

    @classmethod
    def setUpClass(cls):
        ReusedPySparkTestCase.setUpClass()
        cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
        try:
            cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
        except py4j.protocol.Py4JError:
            cls.tearDownClass()
            raise unittest.SkipTest("Hive is not available")
        except TypeError:
            cls.tearDownClass()
            raise unittest.SkipTest("Hive is not available")
        os.unlink(cls.tempdir.name)
        _scala_HiveContext =\
            cls.sc._jvm.org.apache.spark.sql.hive.test.TestHiveContext(cls.sc._jsc.sc())
        cls.sqlCtx = HiveContext(cls.sc, _scala_HiveContext)
        cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
        cls.df = cls.sc.parallelize(cls.testData).toDF()

    @classmethod
    def tearDownClass(cls):
        ReusedPySparkTestCase.tearDownClass()
        shutil.rmtree(cls.tempdir.name, ignore_errors=True)

For this you need to specify --jars with the Hive libs in PYSPARK_SUBMIT_ARGS, as described earlier.

Or, without Hive:

class SQLContextTests(ReusedPySparkTestCase):
    def test_get_or_create(self):
        sqlCtx = SQLContext.getOrCreate(self.sc)
        self.assertTrue(SQLContext.getOrCreate(self.sc) is sqlCtx)

As far as I know, if pyspark was installed through pip, you won't have the tests.py files described in the example. In that case, just download the distribution from the Spark site and copy the code examples.

Now you can run your TestCase as usual: python -m unittest test.py

Update: Since Spark 2.3, using HiveContext and SQLContext is deprecated. You can use the SparkSession Hive API instead.
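Following the update above, a test case built on SparkSession might look roughly like this. It's a sketch, not the Spark project's own test harness: the class name `SparkSessionTestCase` and the app name are made up for illustration, and the skip-when-unavailable guard mirrors the skip pattern in the Hive example. Adding `.enableHiveSupport()` to the builder is the SparkSession route to Hive, provided the Hive classes are on the classpath.

```python
import unittest


class SparkSessionTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Import lazily so the module can still be collected, and the
        # tests skipped, on machines without pyspark installed.
        try:
            from pyspark.sql import SparkSession
        except ImportError:
            raise unittest.SkipTest("pyspark is not installed")
        cls.spark = (
            SparkSession.builder
            .master("local[1]")
            .appName("unit-tests")  # illustrative name
            # .enableHiveSupport()  # uncomment when the Hive jars are available
            .getOrCreate()
        )

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_row_count(self):
        # Same shape of test data as the Hive example: 100 key/value rows.
        rows = [(i, str(i)) for i in range(100)]
        df = self.spark.createDataFrame(rows, ["key", "value"])
        self.assertEqual(df.count(), 100)
```

It runs the same way as the other test cases: python -m unittest test.py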