Anyone who has come into contact with data science has heard of the features used in models. One aspect that can become quite challenging is reusing features consistently across team members, projects and environments. In this article, I explain the most common way to address this problem: a feature store.
Catch up on the terminology used in this blog by reading:
– Things to consider when creating a Data Lake – https://lytix.be/things-to-consider-when-creating-a-data-lake/
– Kimball in a data lake? Come again? – https://lytix.be/kimball-in-a-data-lake-come-again/
– Managed Big Data: DataBricks, Spark as a Service – https://lytix.be/managed-big-data-databricks-spark-as-a-service/
Data lake and data warehouse
There are no shortcuts: before thinking about data science, data storage and collection are of vital importance. The image below depicts one possible way in which a data lake and a data warehouse can be used to store data. Note that this setup is not fixed and varies strongly with the specific needs of the company.
Having your data decently structured allows your data profiles (data analysts, data scientists) to explore the data and investigate which features can be built and which of them are useful for your model. This phase takes place before the actual industrialization of features and inevitably involves trial and error. It is not the most ‘popular’ part of a data scientist's job, but I still consider it important, as you need a good understanding of the data when you are building an ML model.
“A ‘Feature’ is an attribute/column/aggregation that can possibly improve the accuracy of your model. A Feature Store improves the reusability of features, reducing lead times and duplicate logic.”
Once features have been successfully identified and tested in a model, it is useful to think about industrialization. This allows the features to be reused in your own model, and also makes them easy to reuse in other models (those of your colleagues, for example).
Input data for a feature store
The previously mentioned data warehouse is one input of the feature store. Several operations (sum, average, …) can be executed on the data using SQL or Python (Pandas, PySpark) to create features. In addition to data coming from the data warehouse, real-time data can also be used to build features (such as interactions on your website, clicks, events, etc.). Of course, for exploration purposes, this data can also be stored in a data lake or data warehouse. The real-time dimension of this data is especially useful when it is consumed by real-time models, which are discussed further on.
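To make the aggregation step concrete, here is a minimal Pandas sketch. The `orders` table, its column names, the 90-day window and the `build_customer_features` helper are all assumptions made for illustration; they are not part of any specific feature store product.

```python
import pandas as pd

# Hypothetical order data, as it might come out of a data warehouse table.
orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 2, 3],
        "order_date": pd.to_datetime(
            ["2020-01-05", "2020-02-10", "2020-01-20",
             "2020-03-01", "2020-03-15", "2019-11-30"]
        ),
        "amount": [100.0, 50.0, 20.0, 30.0, 40.0, 75.0],
    }
)


def build_customer_features(
    orders: pd.DataFrame, as_of: str, window_days: int = 90
) -> pd.DataFrame:
    """Aggregate orders per customer over the `window_days` before `as_of`."""
    as_of_ts = pd.Timestamp(as_of)
    window_start = as_of_ts - pd.Timedelta(days=window_days)
    # Only use rows dated before the snapshot, so no future data leaks in.
    in_window = orders[
        (orders["order_date"] >= window_start) & (orders["order_date"] < as_of_ts)
    ]
    suffix = f"{window_days}d"
    features = (
        in_window.groupby("customer_id")
        .agg(
            **{
                f"order_count_{suffix}": ("amount", "count"),
                f"amount_sum_{suffix}": ("amount", "sum"),
                f"amount_avg_{suffix}": ("amount", "mean"),
            }
        )
        .reset_index()
    )
    # Keep the snapshot date so the same feature can be stored for many moments.
    features["snapshot_date"] = as_of_ts
    return features


customer_features = build_customer_features(orders, as_of="2020-04-01")
print(customer_features)
```

Putting the window length in the column name is one way to keep the timeframe of an aggregation visible to whoever reuses the feature later. Customers without any order in the window simply do not appear in the result, so a real pipeline would still have to decide how to represent them.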
Feature store guidelines
When constructing such a feature store, I see the following important aspects that should be in place:
- Cleaning data: it should be possible to use the features directly as input of the model. It is therefore necessary to handle missing data, normalize data (if necessary), perform dummy/one-hot encoding, etc.
- Documentation: indicate and describe which features are present and how they are constructed. Details such as the aggregation used or the length of the timeframe are very important. When such information is unclear or unknown, adoption of the feature store will be hard and data scientists will not know which features to use.
- Monitoring and data validation: by monitoring, I do not only mean performance monitoring or checking that the load has succeeded. I also mean monitoring several characteristics of each feature, such as its distribution, the number of missing values, the number of categories, etc. When the characteristics of a feature suddenly change, it is very well possible that the model will no longer perform as expected (i.e. data drift, which will cause model drift). Ideally, a dashboard visualizing these statistics is built so that all of this can easily be consulted.
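The cleaning and monitoring guidelines above can be sketched in a few lines of Pandas. This is only an illustration under assumptions: the `income` and `segment` columns, median imputation, min-max scaling and the `clean_features` and `profile_features` helpers are hypothetical choices, and other strategies may fit your data better.

```python
import pandas as pd

# Hypothetical raw features with one missing value in each column.
raw = pd.DataFrame(
    {
        "income": [30000.0, None, 50000.0, 70000.0],
        "segment": ["retail", "business", None, "retail"],
    }
)


def clean_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Make features model-ready: impute, normalize and one-hot encode."""
    cleaned = raw.copy()
    # Impute missing numeric values with the median (one possible strategy).
    cleaned["income"] = cleaned["income"].fillna(cleaned["income"].median())
    # Min-max normalization to [0, 1]; assumes the column is not constant.
    low, high = cleaned["income"].min(), cleaned["income"].max()
    cleaned["income"] = (cleaned["income"] - low) / (high - low)
    # One-hot encode the category; missing values get their own indicator column.
    return pd.get_dummies(cleaned, columns=["segment"], dummy_na=True)


def profile_features(df: pd.DataFrame) -> pd.DataFrame:
    """Collect per-feature statistics that can be tracked from load to load."""
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        rows.append(
            {
                "feature": name,
                "missing_count": int(col.isna().sum()),
                "distinct_count": int(col.nunique()),
                "mean": float(col.mean()) if numeric else None,
                "std": float(col.std()) if numeric else None,
            }
        )
    return pd.DataFrame(rows).set_index("feature")


cleaned = clean_features(raw)
profile = profile_features(raw)
print(cleaned)
print(profile)
```

Storing such a profile after every load and comparing it with the previous one gives a simple starting point for the dashboard mentioned above.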
Feature store types
We can distinguish between an offline and an online feature store. The offline feature store is used to serve large batches of features, both to create train/test data sets and for batch applications. The online feature store can be used to serve an online model (e.g. via a REST API). For the latter, preserving the real-time character of the data is especially important.
Offline feature store
This type of feature store consists of historical features for certain moments in the past that can be used to create training and testing datasets (e.g. training data for the years 2012-2018 and test data for 2018-2020). These features can be used ‘as is’ as input for the model. When companies have built a rich feature store, data scientists can quickly create new models, as they can skip most of the data exploration phase. In practice, however, it remains useful to check whether additional features can be created for the specific use case of the model. These new features can then in turn be industrialized. As depicted in the image below, the real-time nature of features is less important here, and the features can be read from the data warehouse/lake (if stored there).
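As a small sketch of how historical features could be turned into training and testing sets, the snippet below splits a feature table on its snapshot date. The table, its column names, the `churned` label and the `split_by_time` helper are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical offline feature table: one row per customer per snapshot date.
offline_features = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 2, 1, 2],
        "snapshot_date": pd.to_datetime(
            ["2016-01-01", "2016-01-01", "2018-01-01",
             "2018-01-01", "2019-01-01", "2019-01-01"]
        ),
        "amount_sum_90d": [150.0, 90.0, 80.0, 0.0, 20.0, 0.0],
        "churned": [0, 0, 0, 1, 1, 1],
    }
)


def split_by_time(features: pd.DataFrame, cutoff: str):
    """Split on snapshot date so the test set lies strictly after the training set."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = features[features["snapshot_date"] < cutoff_ts]
    test = features[features["snapshot_date"] >= cutoff_ts]
    return train, test


train, test = split_by_time(offline_features, cutoff="2018-01-01")
print(len(train), "training rows,", len(test), "test rows")
```

Splitting on time rather than at random keeps the evaluation closer to how the model will be used: it is trained on the past and scored on later periods.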
Online feature store
In an online feature store, the real-time nature of the features is important, as such feature stores are primarily used to serve real-time and/or online models. These online feature stores are mostly row-based key-value stores from which features can be retrieved with very low latency (e.g. with Redis or Cosmos DB).
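The key-value access pattern can be illustrated without any infrastructure. The sketch below uses a plain Python dictionary as a stand-in for a store such as Redis; the `InMemoryOnlineStore` class, the `entity:id` key layout and the feature names are hypothetical, and a real deployment would use the client library of the chosen store instead.

```python
import json
from typing import Dict, Optional


class InMemoryOnlineStore:
    """Dict-backed stand-in for a low-latency key-value store (assumption)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    @staticmethod
    def _key(entity: str, entity_id: int) -> str:
        # One key per entity instance, e.g. "customer:1".
        return f"{entity}:{entity_id}"

    def put(self, entity: str, entity_id: int, features: Dict[str, float]) -> None:
        # Store the whole feature row as one serialized value.
        self._data[self._key(entity, entity_id)] = json.dumps(features)

    def get(self, entity: str, entity_id: int) -> Optional[Dict[str, float]]:
        # A single key lookup returns all features needed for one prediction.
        raw = self._data.get(self._key(entity, entity_id))
        return json.loads(raw) if raw is not None else None


store = InMemoryOnlineStore()
store.put("customer", 1, {"order_count_90d": 2, "amount_sum_90d": 150.0})

features = store.get("customer", 1)
unknown = store.get("customer", 99)
print(features, unknown)
```

Storing one serialized row per entity means a model behind a REST API needs only one lookup per request. The model still has to decide what to do when the lookup returns nothing, for example for a brand-new customer.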
Feature stores are of vital importance for speeding up your model development and for having a mature production environment in which to deploy models. However, they should be designed with care; otherwise adoption within the company will be hard to achieve and the feature store will quickly fall out of use. If you need any help or have questions, please contact us!
Big Data architect and Data Scientist
This blog is written by Tom Thevelein. Tom is an experienced Big Data architect and Data Scientist who still likes to get his hands dirty by optimizing Spark (in any language), implementing data lake architectures and training algorithms.