What is a modern data platform?
A modern data platform is an advanced infrastructure and ecosystem designed to efficiently manage and process large volumes of data from diverse sources. It encompasses a combination of software tools, technologies, and frameworks that enable organizations to collect, store, process, analyze, and visualize data in a scalable and reliable manner.
In a modern data platform, we see the following building blocks:
- Data ingestion: loading data into the platform
- Data storage: storing data in a database or data lake
- Data processing: transforming data and extracting insights
- Advanced analytics: training machine learning and data science models
- Data serving: exposing data via an API or via visual reports
- Consumers: the actors that consume the data; this can be a report, an application, …
- Scheduler: the orchestrator that schedules all data loads
- Platform: the underlying cloud platform
Data ingestion is the process of collecting, importing, and processing raw data from various sources into a system that can make use of the data. This process typically involves retrieving data from different sources such as databases, files, APIs, sensors, and other data repositories.
Across AWS, Azure and SAP there are several tools to choose from, each with its own features. You can use tools that provide a graphical interface (Glue on AWS or Data Factory on Azure) or tools that require more coding (e.g. Docker containers running Python).
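Whatever the tool, the core of an ingestion job is the same: pull records from a source and land them in the platform. A minimal sketch in Python, where a local folder stands in for the data lake and the folder layout, file names and records are all illustrative:

```python
import json
from datetime import date
from pathlib import Path

def ingest_records(records: list[dict], landing_dir: str, source_name: str) -> Path:
    """Land raw records as newline-delimited JSON, partitioned by load date."""
    # Partition folders (source/date) mirror a common data lake layout.
    target_dir = Path(landing_dir) / source_name / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / "part-0000.jsonl"
    with target_file.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return target_file

# Example: records that could have come from an API or database extract.
path = ingest_records(
    [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 4.0}],
    landing_dir="landing",
    source_name="sales_api",
)
```

In a managed tool like Glue or Data Factory this landing step is configured visually rather than coded, but the source-to-landing-zone flow is the same.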
After data ingestion, the data is stored in either a data lake or a database. While a data lake allows more flexibility in the file format and type of files (e.g. non-tabular data, images, …), a database expects tabular data. Depending on the data you have, a choice can be made between the two. A combination of both can also be used: for example, one can first load raw data into a data lake, and then load a transformed version of it into a database.
For Azure, the most commonly used data lake is Azure Data Lake Storage Gen2. For AWS, its counterpart is S3. When it comes to databases, the most common ones (PostgreSQL, MariaDB, MySQL, …) can be hosted on both Azure and AWS.
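The raw-first pattern mentioned above can be sketched with standard-library stand-ins: a local JSON-lines file plays the role of the data lake and SQLite the role of the database (in a real platform these would be e.g. ADLS Gen2/S3 and a managed PostgreSQL; table and column names are illustrative):

```python
import json
import sqlite3
from pathlib import Path

def lake_to_database(raw_file: Path, db_path: str) -> int:
    """Read raw JSON lines from the 'lake' and load a tabular version into the 'database'."""
    rows = [json.loads(line) for line in raw_file.read_text().splitlines()]
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)")
    # Named-parameter insert keeps the mapping from raw fields to columns explicit.
    conn.executemany(
        "INSERT OR REPLACE INTO sales (id, amount) VALUES (:id, :amount)", rows
    )
    conn.commit()
    conn.close()
    return len(rows)

# Raw data already landed in the lake by the ingestion step.
raw = Path("raw_sales.jsonl")
raw.write_text('{"id": 1, "amount": 9.5}\n{"id": 2, "amount": 4.0}\n')
loaded = lake_to_database(raw, "warehouse.db")
```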
The goal of data processing is to transform raw data into data that can be easily analysed, visualized, or used in machine learning models. This process usually involves cleaning and filtering data to remove any inconsistencies, errors, or duplications. Data processing also involves combining multiple data sources to provide a more comprehensive view of the data.
This can also already partly be done in the data ingestion step, depending on the type of data and the use case. Some companies prefer to load raw data in the ingestion step without doing any actual analysis on it, because they want to have the raw data in any case.
Thanks to its ease of setup and rich feature set, Databricks is a cross-platform tool that is commonly used for data processing. It offers single-node and multi-node compute with managed Spark for distributed processing. Other commonly used tools are Synapse (Azure) and Athena (AWS).
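On Databricks such a cleaning step would typically be a Spark DataFrame transformation; the logic itself is simple enough to illustrate in plain Python (field names and validity rules are illustrative assumptions):

```python
def clean_records(records: list[dict]) -> list[dict]:
    """Remove duplicates and records with missing or invalid fields."""
    seen_ids = set()
    cleaned = []
    for record in records:
        # Filter out inconsistencies: missing keys or non-positive amounts.
        if record.get("id") is None or record.get("amount") is None:
            continue
        if record["amount"] <= 0:
            continue
        # Deduplicate on the business key.
        if record["id"] in seen_ids:
            continue
        seen_ids.add(record["id"])
        cleaned.append(record)
    return cleaned

raw = [
    {"id": 1, "amount": 9.5},
    {"id": 1, "amount": 9.5},   # duplicate
    {"id": 2, "amount": -1.0},  # invalid value
    {"id": 3},                  # incomplete record
]
cleaned = clean_records(raw)
```

At scale the same filter-and-deduplicate logic runs as distributed Spark jobs; the difference is the engine, not the idea.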
Advanced analytics refers to the use of complex analytical techniques and tools to extract insights and knowledge from data. It involves the application of statistical and mathematical models, machine learning algorithms, and data mining techniques to analyze large and complex datasets.
Here too, Databricks offers a solid solution because of its support for both R and Python and its machine learning features (feature store, AutoML, model deployment, …). Azure Machine Learning Studio (Azure) and SageMaker (AWS) both offer a cloud-native solution.
Serve and consume
Data serving is the process of providing access to data stored in a database or other data storage systems. Data serving systems provide a range of tools and features to support data access, such as query languages, APIs, and user interfaces. This can also involve caching mechanisms that improve performance by reducing the time needed to retrieve data from disk or other storage systems.
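The caching idea can be sketched with `functools.lru_cache`: repeated identical queries are answered from memory instead of hitting storage again. The query function below is a hypothetical stand-in for an expensive database call:

```python
from functools import lru_cache

CALLS = {"count": 0}  # track how often the "database" is actually hit

@lru_cache(maxsize=128)
def get_monthly_revenue(month: str) -> float:
    """Pretend to run an expensive query against the database."""
    CALLS["count"] += 1
    fake_table = {"2024-01": 120_000.0, "2024-02": 95_500.0}
    return fake_table[month]

get_monthly_revenue("2024-01")  # executes the query
get_monthly_revenue("2024-01")  # served from the cache, no second call
```

A serving API would wrap such a function behind an HTTP endpoint; the cache keeps latency low for hot queries.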
Power BI (Azure) is at this moment by far the most commonly used data visualisation tool. AWS offers QuickSight, and SAP offers SAC (SAP Analytics Cloud).
A data scheduler is used to automate the process of running data-related tasks, such as data processing, data ingestion, or data transformation, at specified intervals or according to predefined criteria. It allows data engineers and analysts to schedule and manage data workflows, reducing the need for manual intervention and improving operational efficiency.
Airflow is the most commonly used scheduler. This tool can be installed cross-cloud and in several ways (e.g. on Azure Kubernetes Service (Azure) or Elastic Kubernetes Service (AWS)). It offers a visual interface for tracking the status of workflows. Tasks can be Python functions, SQL queries, or any other kind of executable code.
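At its core, an orchestrator like Airflow executes tasks in dependency order (a DAG). A minimal sketch of that idea in plain Python using the standard library's `graphlib`; a real Airflow DAG would instead use the `airflow` package's DAG and operator classes, and the task names here are illustrative:

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks: dict, dependencies: dict[str, set[str]]) -> list[str]:
    """Execute tasks in an order that respects their dependencies."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()  # in Airflow, each of these would be an operator/task
    return order

executed = []
tasks = {
    "ingest": lambda: executed.append("ingest"),
    "process": lambda: executed.append("process"),
    "report": lambda: executed.append("report"),
}
# process depends on ingest; report depends on process.
deps = {"process": {"ingest"}, "report": {"process"}}
run_pipeline(tasks, deps)
```

Airflow adds the scheduling intervals, retries, and the visual interface on top of exactly this dependency-ordered execution.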
All of the above can be done on a single platform or a combination of multiple platforms. The existing platforms mostly offer the same capabilities, but under different names:
- Access management: user authentication is an important aspect of a (cloud) platform. Azure Active Directory is by far the most common one.
- CI/CD and infrastructure-as-code: Azure offers Azure DevOps for CI/CD and mostly uses Bicep for infrastructure-as-code. AWS offers CodePipeline for CI/CD and CloudFormation for infrastructure-as-code. Terraform can be used on both Azure and AWS for infrastructure-as-code.
- Cost management: a visual interface with alerting is also a necessity and present in most (cloud) platforms.
- Key Vault: a secure way of storing keys for access. Azure offers Azure Key Vault while AWS offers AWS Key Management Service.
- Monitoring: Monitoring the metrics and components of the platform.
- Versioning: Azure offers Azure DevOps while AWS offers CodeCommit. Independent tools such as GitHub or GitLab can be used as well.