Loading mechanisms – Part I
As there are huge amounts of data available within companies, data is also moved in increasing quantities from one data storage to another for multiple reasons. As copying data can come with a longer load time and higher costs, you want to make this process as efficient as possible. Luckily, there are multiple loading mechanisms available that can optimize this flow. In this blog, you will be introduced to different loading mechanisms. What methods are available and what are the (dis)advantages of each method? Finally, the difference between an incremental and full load will be discussed, as well as the factors that influence your decision choosing between these two options.
There are multiple ways to get your data available in another location or place of storage. In general, there are two options: Replication or virtualization. In the next part, these two methods will be discussed in detail, as well as some variations.
With a replication method, you copy the data from a source to a target storage. In this case, the data is physically stored in the target side. An advantage of this is that your data is typically faster accessible when it is physically stored there.
A first replication example is a batch load. As the name says, the data is loaded in batches. This means that not each change in the data (Insert, update, or delete) is immediately copied to the target, but that multiple changes are copied together at the same time. A batch load typically happens once a day, during the night. An advantage of a batch load is that you limit the impact on the source storage. The data is only extracted once a day, which means that the source is only impacted once during the night when the storage is already used less by users. A disadvantage of this method is that the data at the target side is only refreshed once a day. This means that users must wait for the next day until they can access new data. A batch load can be seen as a default load that is used in most cases. This is because it is easy to set up and is sufficient in most cases.
A mini batch works similar as a batch load, but here the data is copied more than once a day. The data copy interval varies typically between once a minute and once every few hours. An advantage of a mini batch load is that the data is updated more frequently at the target storage. A disadvantage is that the source storage will be impacted multiple times a day. This impact can be minimized if an incremental load is done each time.
With streaming, changes at the source side are immediately pushed to the target storage. This has as an advantage that the target is always up to date. A disadvantage is that it is more complicated to set-up. While it may seem nice to have your target side always up to date, it should be noted that not all scenarios require real-time data with streaming.
Virtualization is a special type of loading your data, as your data is not physically stored at the target side. A disadvantage of this method is that it typically takes a little bit longer to get the result of your query compared to replication where that data is physically stored at the target. However, some technologies limit this disadvantage to a minimum.
Examples of tools that allow virtualization are the Serverless SQL pool in Azure Synapse Analytics, which allows you to query files directly from your data lake, and Power BI DirectQuery. For the last one, queries are sent to the source database. In this case, an impact at the source side should be kept in mind. It should also be noted that compared to replication there is no data load interval specified. Queries are only sent to the source side when there is a request from a user. For example, when a visual is changed or filter in Power BI DirectQuery.
Full and Incremental Load
Another decision that needs to be made is what data you want to load exactly. For this, there are two options. A full load, or an incremental one.
In a full load, as the name says, all the data from the source will be extracted and copied to the target storage. An advantage of a full load is that its set-up is easy. In cases where the source does not contain a lot of data, a full load can be sufficient.
However, for larger data loads, a full load might not be the desired solution. A full copy of large tables might result in high costs and longer load times. For this, an incremental load will be the preferred solution. With an incremental load, you only copy the changes that have been made after the last load. For tables with millions of rows, where only a few thousands of records are changed per day, an incremental load is the way to go. It should be noted that this method requires a bit more configuration as compared to a full one. The source side should also be able to track the changes in some way. If the changes cannot be tracked, it is not possible to do an incremental load. Luckily, there are multiple ways to track your changes at the source side. In a next blog, the different options will be discussed in detail.
When choosing between an incremental or full load, two key factors should always be kept in mind. The first factor is the data volume. The higher the data volume, the bigger the potential improvement in cost and load time. For a table with only a few hundreds or thousands of records, it might not be worth it to implement an incremental load. For a very large table, it will be a lot more interesting to invest some time creating an incremental load. The second factor is the complexity of the incremental set-up. For some data sources, a built-in functionality can be enabled that tracks all data changes. This means a lower complexity. For other sources however, it is possible that there are no functionalities like this. These sources require self-developed tools to be able to track the changes. This results in a much higher complexity of the incremental set-up.
In this blog, different options to load your data are discussed. As there are many options, each with its specific advantages and disadvantages, it should be noted that there is no silver bullet that can be applied to each problem. Depending on the problem, some options will be more beneficial than others.