PySpark Unit Test Best Practices
In our latest Craft post, we will look at PySpark unit tests.
First, let’s go over what it’s all about. PySpark unit tests let you verify the behavior of each part of your code individually. pytest is a testing library that makes writing and running unit tests in PySpark easier. Unit tests matter because they help you catch errors quickly and ensure the code produces correct, consistent results, which in turn improves code quality and reliability.
Do You Need to be a Professional Tester?
No, don’t worry. This post is not aimed at professional testers but at all data practitioners who want to improve the quality of their code in Databricks/PySpark. It shows how to include unit tests in data projects using ready-to-use tools, rather than writing complicated code from scratch. We’ll look at this in more detail in the subsequent sections; for now, just know that several tools for unit testing already exist.
What Are the Requirements?
First, you need a separate test environment to ensure the testing phase happens upstream of the production environments.
So, we’ll assume that the test environment continuous integration (CI)/continuous delivery (CD) pipelines are already in place. Then we’ll need these tools:
- A Python and Spark integrated development environment (IDE), such as VS Code
- The pytest test library, which provides a testing framework in Python
How to Avoid Confusing Unit Tests with Debugging Databricks Notebooks?
Put simply, debugging means finding and fixing errors in the code. Unit tests are small segments of code that check whether each part of the code works correctly.
So, debugging consists of looking for errors, which we usually do with Databricks notebooks. Unit tests, on the other hand, make sure that everything is working as expected.
In our case, we’d like to test the functions coded in the notebooks and make sure they work as expected before deploying them. This will tell us whether our changes introduced a regression into what was already working.
To do this, we will convert what we build and test in the notebooks into Python scripts with functions.
Ultimately, the functions we have just converted into scripts will be put through unit tests.
Where to Start?
The first step will be to start our Python project and organize it into folders or even subfolders. This will enable us to find our way around easily throughout the rest of the steps.
- .cicd: this folder contains the definition of the CI/CD pipelines
- yml: as we’ll see in the last section of this post, this file will tell us how to run our unit tests within the CI
- yml: these are deployment templates, but they do not apply in our scenario because the unit tests will be built into the CI
- yml: in this file, we will define how to run the unit tests and their coverage, and then how to publish the results, in the CI
- Template_component: in this file, which is at the same level as the .cicd file, we’ll define the functions and data to be used in the tests and their results
The first thing you will notice is several __init__.py files in the “Template” folder. These are the files Python needs to recognize the folders as packages when generating wheels (a wheel is the Python deliverable, like a JAR in Java or a DLL in .NET).
This folder will contain the test and conversion functions we will code in the following sections of this post. It is organized by processing zone: bronze, silver, or gold. So, for example, the silver folder holds the functions and unit tests we set up for the silver processing zone.
Before we write our unit tests, we’ll create a SparkSession we can reuse in all of them. We then create pytest fixtures in the conftest.py file.
What Is a Fixture?
Fixtures are functions provided by the pytest library that let us manage state and dependencies for our tests. When explicitly requested by a test, they supply test data and other values. The simulated data a fixture creates can be shared across multiple tests, which is extremely helpful for expensive objects like the SparkSession that take a long time to create.
The code below is an example of how to use the pytest library to configure a SparkSession that can be reused.
This session is configured to store all Delta tables in a temporary directory. As mentioned earlier, we do this using pytest fixtures, which create objects once and reuse them across multiple tests.
Our first example is a fixture that creates the SparkSession for all unit tests and another fixture that retrieves the temporary directory where all the Delta tables are stored:
The last example is another fixture that looks for a folder with the same name as the test module. If it finds one, it moves all of its contents to a temporary directory for the tests to use freely:
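A sketch of that fixture is shown below. The name `test_data_dir` is illustrative; note that this version copies rather than moves the folder, so the test data committed to the repository stays untouched between runs:

```python
# conftest.py (continued) — illustrative sketch of a test-data fixture.
import pathlib
import shutil

import pytest


@pytest.fixture
def test_data_dir(request, tmp_path):
    """Copy the folder named after the current test module into a temp dir."""
    module_path = pathlib.Path(request.module.__file__)
    # e.g. a test_silver/ folder sitting next to test_silver.py
    source_dir = module_path.parent / module_path.stem
    if source_dir.is_dir():
        target = tmp_path / source_dir.name
        shutil.copytree(source_dir, target)
        return target
    return tmp_path
```

`request` and `tmp_path` are built-in pytest fixtures: `request.module` identifies the test module currently running, and `tmp_path` is a per-test temporary directory that pytest cleans up for us.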
But How Do We Use the Fixtures Once They Are Defined?
Now we come to the meat of the matter 😊. The next step is to write some unit test functions.
In the function parameters, we will call the fixtures we created earlier, such as the SparkSession. Below are some examples of functions we want to test every time we run a test case.
A first test, test_return_first_column_of_df(), is a function that takes a Spark DataFrame with multiple columns as input and returns a Spark DataFrame with only the first column. The test compares the DataFrame returned by the function with a DataFrame in the expected format:
A second example of a function that converts a string in the YYYYMMDDhhmmss format into a datetime object. The test compares the value returned by the function with an expected value.
An exception is thrown when the string is not in the expected format:
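A sketch of this pair of tests, with a hypothetical `parse_yyyymmddhhmmss` helper standing in for the project function:

```python
from datetime import datetime

import pytest


def parse_yyyymmddhhmmss(value: str) -> datetime:
    """Illustrative stand-in: parse a 'YYYYMMDDhhmmss' string into a datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


def test_parse_yyyymmddhhmmss():
    assert parse_yyyymmddhhmmss("20240131235959") == datetime(2024, 1, 31, 23, 59, 59)


def test_parse_yyyymmddhhmmss_invalid_format():
    # strptime raises ValueError when the string does not match the format,
    # and pytest.raises asserts that the exception is indeed thrown.
    with pytest.raises(ValueError):
        parse_yyyymmddhhmmss("2024-01-31")
```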
The third test is test_write_delta_table(). We test a function that writes a Spark DataFrame to a Delta table. It then reads the table and compares the result to an expected DataFrame. The fixtures we defined in the previous section provide the Delta table path and SparkSession as parameters.
But What Data Should Be Used for These Tests and in What Format?
That’s easy: we define the data directly in the tests for straightforward scenarios like the date example above.
For DataFrame-based tests, it’s best to use files with simple extensions like .csv or .json, which are easier for a non-specialist to read and edit. Remember that unit tests exist to ensure the code keeps working correctly as we make changes, so they should be as simple and quick to run as possible.
And How Do We Run These Unit Tests Now That Everything Is Set Up?
We can run them locally in the IDE and observe the unit test results and coverage. In the screenshot below, you can see four unit tests. Two passed, and two failed. When we click on the failed tests, we can see the failure logs for those tests:
The aim is to have these unit tests run automatically and add them to the CI/CD process. In this section, we’ll set up the CI to include a test stage.
We’ll create a test_component.yml file. This will be the configuration file for the project’s unit tests.
First, we declare the parameters to be used in the pipeline:
Then, we define a first job that specifies which Python version to use and where to find the source code. After that, we install the dependencies listed in “requirements.txt:”
The job then runs unit tests using the pytest library and publishes the test and coverage results:
In requirements.txt, we list the libraries we need for our project and tests:
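As an illustration, a minimal requirements.txt for the setup described in this post might contain the following (the exact packages and versions depend on your project):

```text
pyspark
delta-spark
pytest
pytest-cov
```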
Once the unit tests have been integrated into the CI, they will run automatically when the CI pipeline starts.
We can still run the tests locally in the IDE and observe the unit test results and coverage.
We can see from the execution results that there were seven tests: the first one passed, while the coverage report shows that lines 4 to 64 of the second module were not exercised by any test, and so on.
Find Out More About the Craft
Want to learn more about Craftsmanship? Read our Craft Month posts:
- Is the Craft Still Relevant?
- How to Choose the Best Software Architecture with Architectural Drivers?
- How to Build an Infrastructure with Terraform?
- Craft and PowerShell: Why Software Engineering Practices Need to Be Applied to Infrastructure
- Telemetry: Ensuring Code That Works
- How to Boost Your Apps’ Performance with Asyncio: A Practical Guide for Python Developers