Pandas, Koalas and PySpark in Python
If you landed on this page to learn more about animals, I have to disappoint you. Pandas, Koalas and PySpark are all packages that serve a similar purpose in the programming language Python.
Python has steadily gained traction in recent years, as illustrated by Stack Overflow trends. Originally designed as a general-purpose language, it is now widely used in areas such as web development (with frameworks such as Django), data science and machine learning.
In our daily operations as data consultants, we use Python for automation, testing, analysis and predictions. In Spark environments, or Databricks if you’re in a cloud environment, nearly all transformations, orchestration, streaming and algorithms are written in Python.
To support data science and data wrangling, numerous libraries and modules have been developed for Python. In this blog post, I want to explore three libraries that are used to handle data in Python: Pandas, Koalas and PySpark, with Koalas as the new kid on the block.
PySpark and Pandas
Let’s first compare Pandas and PySpark dataframes in the overview below.

Pandas:
- Dataframes are value-mutable, meaning that the values they contain can be altered. The length of a Series, however, cannot be changed, although columns can be inserted.
- Eager execution: operations are executed immediately as they are called.
- Single machine: Pandas runs in-memory on a single machine, which can become tricky with big datasets.
- Pandas is restricted to a single machine and a single thread, which can limit performance.

PySpark:
- Dataframes are immutable, meaning that when changes are made to the data, a new reference to the object is returned. Immutable data can more easily be shared across various processes and threads.
- Lazy execution/evaluation: execution is not started until an action is triggered, and data is not loaded until it is necessary.
- PySpark distributes dataframes in-memory over several nodes, which increases scalability.
- Due to its distributed nature, performance can scale with the size of the cluster.
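The eager/lazy distinction above can be illustrated in plain Python (a loose analogy, not Spark itself): a list comprehension does all its work immediately, the way Pandas does, while a generator defers work until a result is actually requested, the way Spark defers work until an action is triggered.

```python
# Loose analogy for eager vs lazy execution (plain Python, no Spark needed).

def expensive(x):
    # Stand-in for a costly transformation.
    return x * x

# Eager (Pandas-like): all the work happens right here.
eager = [expensive(x) for x in range(5)]
print(eager)  # [0, 1, 4, 9, 16]

# Lazy (Spark-like): building the pipeline does no work yet...
lazy = (expensive(x) for x in range(5))

# ...the work only happens when an "action" pulls results through it.
result = list(lazy)
print(result)  # [0, 1, 4, 9, 16]
```

In Spark, the "action" would be something like count() or collect(); everything before it merely builds an execution plan.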
From the above comparison, it is clear that PySpark is the way to go when working with big data.
When it comes to data science, Pandas is neatly integrated into the Python ecosystem, alongside numerous other libraries such as NumPy, Matplotlib and Scikit-Learn, and is able to handle a great variety of data-wrangling methods (statistical analysis, data imputation, time series, …). The counterpart of Scikit-Learn for PySpark dataframes is Spark ML. Scikit-Learn is a bit more mature than Spark ML with respect to parametrization and algorithms, but Spark ML is rapidly catching up.
For people who are familiar with Pandas, PySpark dataframes can be a bit tedious, due to the (not so) subtle differences in syntax between the two libraries. As this blog post is not about the differences between PySpark and Pandas syntax, I will not go into detail about this.
The picture below, however, already displays subtle differences between counting the number of rows in both types of dataframes (the upper examples) and filtering out certain rows (the lower examples). These differences in syntax can make it hard for people to shift from Pandas to PySpark. The shift, however, is often necessary due to increasing data volumes.
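As a concrete sketch of those differences (the dataframe and column names here are invented for illustration), here is row counting and row filtering in Pandas, with the PySpark equivalents shown as comments, since they require a running Spark session:

```python
import pandas as pd

df = pd.DataFrame({"animal": ["panda", "koala", "spark"],
                   "weight": [100, 10, 1]})

# Counting rows
n_rows = len(df)            # Pandas
# n_rows = df.count()       # PySpark: count() is an action that triggers execution

# Filtering rows
heavy = df[df["weight"] > 5]           # Pandas: boolean indexing
# heavy = df.filter(df["weight"] > 5)  # PySpark: filter() (or its alias where())

print(n_rows)                    # 3
print(heavy["animal"].tolist())  # ['panda', 'koala']
```

Small differences like these add up quickly when porting a non-trivial Pandas notebook to PySpark.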
This is where Koalas comes into play: an open-source project initiated by Databricks (but not limited to the Databricks infrastructure). Koalas implements the Pandas API on top of Apache Spark, allowing you to use Pandas syntax while still benefiting from the distributed nature of Spark. The most common Pandas functionality has been implemented in Koalas (e.g. plotting, Series, SeriesGroupBy, …).
Koalas works with an internal frame that can be seen as the bridge between a Koalas dataframe and a PySpark dataframe. This internal frame holds the current Spark DataFrame alongside immutable internal metadata, and it manages the mappings between Koalas column names and Spark column names.
One important addition compared to Pandas is that Koalas allows you to run SQL queries against your data in the same way PySpark does. For people familiar with SQL (which, I assume, most people working with data are), this is a huge plus.
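A minimal sketch of what that looks like (the data is invented; Koalas resolves Python dataframe variables inside the query via {name} interpolation, and the import is guarded because Koalas needs a Spark runtime to actually execute):

```python
# Guarded import: Koalas is typically only available on a Spark environment
# such as Databricks, so this sketch degrades gracefully elsewhere.
try:
    import databricks.koalas as ks
    HAVE_KOALAS = True
except ImportError:
    HAVE_KOALAS = False

if HAVE_KOALAS:
    kdf = ks.DataFrame({"animal": ["panda", "koala"], "weight": [100, 10]})

    # Query a dataframe with plain SQL, PySpark-style;
    # {kdf} inside the query refers to the Python variable above.
    heavy = ks.sql("SELECT animal FROM {kdf} WHERE weight > 50")
    print(heavy.to_pandas())
else:
    print("Koalas (and Spark) not installed; the query is shown for illustration.")
```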
Lastly, data science in Koalas can be done using arbitrary MLflow models, provided they implement the ‘pyfunc’ flavor. This is the case for most frameworks supported by MLflow (Scikit-Learn, PyTorch, TensorFlow, …).
Are you eager to get started with the three libraries and see the differences yourself? This can easily be done with Azure Databricks. At the time of writing, free credits are given upon registration, and you can spin up a cluster in seconds. Both Pandas and PySpark come out of the box, but to use the Koalas library in a notebook, you need to install it first. Next to creating a new notebook in Databricks, you need to create a new library. You can do so by going to Workspace => Create => Library. The parameters can be filled in as depicted below.
Once the library is created, you can see it in your workspace and attach it to the cluster. Importing the library is then done with the following command: ‘import databricks.koalas as ks’.
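The payoff is that the Pandas code you already know keeps working. In the sketch below (toy data), swapping the pandas import for ‘import databricks.koalas as ks’, and pd for ks, would run the same lines distributed on Spark:

```python
import pandas as pd  # swap for: import databricks.koalas as ks

df = pd.DataFrame({"animal": ["panda", "koala"], "weight": [100, 10]})

# These Pandas-style operations are also implemented by Koalas:
n_rows = len(df)                                    # row count
heavy = df[df["weight"] > 50]                       # boolean indexing
top = df.sort_values("weight", ascending=False).head(1)

print(n_rows)                    # 2
print(heavy["animal"].tolist())  # ['panda']
print(top["animal"].tolist())    # ['panda']
```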
Koalas is a useful addition to the Python big data ecosystem, since it allows you to seamlessly use Pandas syntax while still enjoying the distributed computing of PySpark. For the many people familiar with Pandas, this removes a hurdle on the way to big data processing. When it comes to data science, however, Pandas with Scikit-Learn is still the easiest way to go, although as data volumes increase, this is not always possible.
Big Data architect and Data Scientist
This blog is written by Tom Thevelein. Tom is an experienced Big Data architect and Data Scientist who still likes to get his hands dirty by optimizing Spark (in any language), implementing data lake architectures and training algorithms.