Strategies for Loading Your Data Incrementally
During your ETL, large amounts of data are loaded from a source into target storage. As this often happens multiple times a day, you want to optimize this process as much as possible to reduce load time and minimize costs. An incremental load quickly emerges as a possible solution for achieving this. Compared to a full load, an incremental load processes only the records that have changed between two load times. This avoids copying millions of rows per load when in fact only a few hundred or a few thousand rows have changed.
But is it really that easy to change from a full load to an incremental one? Is an incremental load possible on all tables? In this blog, the various possibilities for tracking data changes will be discussed, as well as the advantages and disadvantages of each method.
Tracking your data changes
In general, there are two options for tracking your data changes. One option that quickly comes to mind is tracking the data changes based on a column in the table. This column can be added specifically for tracking the changes, or it may already be present. The other option is tracking the changes via a log file or table, which records all changes made to the table's rows. Both of these options depend on 'attributes' that your source system provides to identify changes. If no such attributes are available, there is a third option: identifying the changes yourself by using a hash.
Note that it is not always possible to change the source system. Depending on whether it is a self-managed or a provider-managed data source, you should check what is already available in the system.
A quick overview of the possibilities and their (dis)advantages can be found in the table below. In the next parts of this blog, they will be covered more in detail.
| Method | Advantages | Disadvantages |
| --- | --- | --- |
| Column-Based Incremental Load – Numeric Primary Key | No changes required on source side; detects incremental inserts | Not possible to detect deletes and updates; only works with insert-only tables |
| Column-Based Incremental Load – Datetime | Detects incremental inserts and updates | More complex set-up |
| Log-Based Incremental Load | Detects incremental inserts, updates and deletes | Higher complexity to set up |
| Hash-Based Incremental Load | No identifier needs to be present in the source | A high 'cost of comparing' to identify changes |
Column-Based Incremental Load
For the first type of incremental load, the data changes are tracked based on a column's values. There are two possibilities: using a numeric primary key column, or using a column with the time of creation or last modification of the row. If there is no such datetime column and your primary key is not incremental (e.g., it is just a unique, random number), then this technique cannot be applied.
Numeric Primary Key Column
The first possibility is using a primary key column of a numeric datatype with auto-increment. Auto-increment means that a unique, increasing number is generated each time a new row is inserted into the table. This column can be used to determine which rows have been added since the last load.
An advantage is that few changes on the source side are required: most database tables already have a numeric primary key column. Unfortunately, this set-up comes with some disadvantages too. The column allows inserted records to be detected, but it is impossible to detect records that have been updated. It is also not possible to detect which records have been deleted between two load times. In theory, deleted records could be detected by identifying 'gaps' in the increasing values of this column, but this is impractical for an incremental load on a table with millions of rows.
This incremental load can be a good option for databases or tables where no updates or deletes can occur. For example, financial databases typically allow no changes to the data for auditing reasons.
There is a customer table with the customer's name and country. At time t1, there are three records in the table. At that time, all three records are loaded into the destination storage. At time t2, two records have been added, one has been deleted (Customer_ID 2) and the country of a customer has been changed (Customer_ID 1, France → India).
By only using the Customer_ID column, the rows inserted between t1 and t2 can be found. However, it is impossible to determine that the country of an existing customer has changed.
| Customer_ID | Name | Country |
| --- | --- | --- |
| 1 | Satya Nadella | France |
| 2 | Jeff Bezos | USA |
| 3 | Elon Musk | South-Africa |

Customer table, time t1
| Customer_ID | Name | Country |
| --- | --- | --- |
| 1 | Satya Nadella | India |
| 3 | Elon Musk | South-Africa |
| 4 | Tim Cook | USA |
| 5 | Bill Gates | USA |

Customer table, time t2
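The load above can be sketched in a few lines. The snippet below is a minimal illustration using SQLite as a stand-in for both source and target; in practice the two would be separate systems, and the connection details are assumptions.

```python
import sqlite3

# Source database with an auto-incrementing primary key (insert-only table).
source = sqlite3.connect(":memory:")
source.execute("""CREATE TABLE customer
    (Customer_ID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT, Country TEXT)""")
source.executemany("INSERT INTO customer (Name, Country) VALUES (?, ?)",
                   [("Satya Nadella", "France"), ("Jeff Bezos", "USA"),
                    ("Elon Musk", "South-Africa")])

# Target storage: after the first (full) load it holds the same three rows.
target = sqlite3.connect(":memory:")
target.execute("""CREATE TABLE customer
    (Customer_ID INTEGER PRIMARY KEY, Name TEXT, Country TEXT)""")
target.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                   source.execute("SELECT * FROM customer").fetchall())

# New rows arrive in the source between two load times.
source.executemany("INSERT INTO customer (Name, Country) VALUES (?, ?)",
                   [("Tim Cook", "USA"), ("Bill Gates", "USA")])

# Incremental load: copy only rows with a key above the target's maximum.
last_loaded_id = target.execute("SELECT MAX(Customer_ID) FROM customer").fetchone()[0]
new_rows = source.execute("SELECT * FROM customer WHERE Customer_ID > ?",
                          (last_loaded_id,)).fetchall()
target.executemany("INSERT INTO customer VALUES (?, ?, ?)", new_rows)
```

Note that the update to Customer_ID 1 and the delete of Customer_ID 2 would pass unnoticed by this query, exactly as described above.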
Datetime Column
The second column-based incremental load uses a datetime column. Such columns typically contain the creation and/or last modification date.
A creation datetime column shows when a record was added, which makes it possible to detect all records that have been inserted after the last load. However, it provides no information about deleted or updated records. When a last modified column is available, the updated rows can be detected too. A table can have separate creation and last modified columns, or both can be combined into one column. Detecting deleted rows remains impossible. However, depending on the case, deleted rows can still be handled, although this results in a less clean solution.
For example, suppose that in a certain table most deletes happen within x days after creation. To handle most of these deletes, all data in the source system with a creation date between x days ago and now is extracted. This can then be compared with the data in the destination storage to perform the inserts, updates, and deletes. While this method copies more than only the incremental records and increases the complexity, it can be useful in cases where deletes need to be handled and no delete logs are available (see Log-Based Incremental Load).
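The lookback-window idea can be sketched as follows. This is a simplified, in-memory illustration; the snapshot dictionaries, the 14-day window, and the dates are all assumptions for the example.

```python
from datetime import datetime, timedelta

now = datetime(2022, 3, 15)
lookback = timedelta(days=14)  # "x days": assumed window in which most deletes occur

# Hypothetical snapshots: Customer_ID -> (Name, creation date).
source = {
    1: ("Satya Nadella", datetime(2022, 3, 5)),
    4: ("Tim Cook", datetime(2022, 3, 15)),
}
target = {
    1: ("Satya Nadella", datetime(2022, 3, 5)),
    2: ("Jeff Bezos", datetime(2022, 3, 5)),  # deleted in the source after creation
}

# Extract everything created inside the lookback window from both sides.
recent_source = {k: v for k, v in source.items() if v[1] >= now - lookback}
recent_target = {k: v for k, v in target.items() if v[1] >= now - lookback}

# Reconcile: rows only in the source are inserts, rows only in the target are deletes.
to_insert = recent_source.keys() - recent_target.keys()
to_delete = recent_target.keys() - recent_source.keys()

for k in to_delete:
    del target[k]
for k in to_insert:
    target[k] = source[k]
```

Deletes of rows older than the window still go unnoticed, which is why this is only a partial workaround.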
There is a customer table with the customer's name and country. A column Last_Modified is available too. At time t1, there are three records in the table. At that time, all three records are loaded into the destination storage. At time t2, two records have been added, one has been deleted (Customer_ID 2) and the country of a customer has been changed (Customer_ID 1, France → India).
By only using the Customer_ID column, only the inserted records can be found. It is impossible to determine that the country of an existing customer has changed, or that a customer has been deleted.
By using the Last_Modified column, it is now possible to detect the inserted rows with Customer_ID 4 and 5, and to detect the changed record with Customer_ID 1. These records will be extracted from the source table and upserted into the destination table. Note that the deleted record with Customer_ID 2 will not be deleted in the target storage, as the delete remains unnoticed.
| Customer_ID | Name | Country | Last_Modified |
| --- | --- | --- | --- |
| 1 | Satya Nadella | France | 2022-03-05 13:41:20.000 |
| 2 | Jeff Bezos | USA | 2022-03-05 13:42:20.000 |
| 3 | Elon Musk | South-Africa | 2022-03-05 13:43:20.000 |

Customer table, t1

| Customer_ID | Name | Country | Last_Modified |
| --- | --- | --- | --- |
| 1 | Satya Nadella | India | 2022-03-15 08:36:43.000 |
| 3 | Elon Musk | South-Africa | 2022-03-05 13:43:20.000 |
| 4 | Tim Cook | USA | 2022-03-15 08:37:57.000 |
| 5 | Bill Gates | USA | 2022-03-15 08:38:43.000 |

Customer table, t2
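The datetime-based load boils down to a watermark query plus an upsert, as sketched below. SQLite again stands in for both systems; `INSERT OR REPLACE` is the SQLite spelling of an upsert, and other databases have their own equivalents (e.g., `MERGE`).

```python
import sqlite3

# Source table at time t2, with a Last_Modified column.
source = sqlite3.connect(":memory:")
source.execute("""CREATE TABLE customer
    (Customer_ID INTEGER PRIMARY KEY, Name TEXT, Country TEXT, Last_Modified TEXT)""")
source.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (1, "Satya Nadella", "India", "2022-03-15 08:36:43.000"),   # updated after t1
    (3, "Elon Musk", "South-Africa", "2022-03-05 13:43:20.000"),
    (4, "Tim Cook", "USA", "2022-03-15 08:37:57.000"),          # inserted after t1
    (5, "Bill Gates", "USA", "2022-03-15 08:38:43.000"),        # inserted after t1
])

# Target as it looks after the load at t1.
target = sqlite3.connect(":memory:")
target.execute("""CREATE TABLE customer
    (Customer_ID INTEGER PRIMARY KEY, Name TEXT, Country TEXT, Last_Modified TEXT)""")
target.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (1, "Satya Nadella", "France", "2022-03-05 13:41:20.000"),
    (2, "Jeff Bezos", "USA", "2022-03-05 13:42:20.000"),  # deleted in source: unnoticed
    (3, "Elon Musk", "South-Africa", "2022-03-05 13:43:20.000"),
])

# Watermark: the highest modification time already present in the target.
watermark = target.execute("SELECT MAX(Last_Modified) FROM customer").fetchone()[0]

# Extract only rows changed since the watermark and upsert them.
changed = source.execute("SELECT * FROM customer WHERE Last_Modified > ?",
                         (watermark,)).fetchall()
target.executemany("INSERT OR REPLACE INTO customer VALUES (?, ?, ?, ?)", changed)
```

The three changed rows (1, 4 and 5) are upserted, while the deleted Customer_ID 2 stays behind in the target, matching the example above.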
Log-Based Incremental Load
For a log-based incremental load, a log file or table is needed. This log tracks all changes made to the source tables. It shows where a change was made (e.g., a specific row in a certain table) and what type of change it was (insert, update or delete). Most databases have functionality that can be enabled to track the changes on a table.
The advantage of a log-based incremental load is that all types of changes are tracked. This guarantees that all inserts, updates and deletes can be executed in the target storage. A disadvantage is the more complex set-up. First, logging should be enabled on the source side. This can be easy when the database offers this as a built-in feature, but might be more complicated if you need to build it yourself. The impact on the performance of the source database should also be kept in mind, especially if logging is active on many tables on which a lot of changes happen. Second, the log should be processed in your ETL logic, so that the updates and deletes are executed correctly.
Note that there are many variations on the logging tables. Some tables capture only the action (Insert, Delete or Update), while other tables capture the action and the new values for that row.
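When the database does not ship change tracking as a feature, a log table can be maintained with triggers. The sketch below shows the idea in SQLite; the table and trigger names are made up for the example, and other databases use slightly different trigger syntax.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (Customer_ID INTEGER PRIMARY KEY, Name TEXT, Country TEXT);
CREATE TABLE customer_log
    (Customer_ID INTEGER, Action TEXT, Log_Time TEXT DEFAULT CURRENT_TIMESTAMP);

-- One trigger per action type writes a row into the log table.
CREATE TRIGGER log_insert AFTER INSERT ON customer
BEGIN INSERT INTO customer_log (Customer_ID, Action) VALUES (NEW.Customer_ID, 'Insert'); END;

CREATE TRIGGER log_update AFTER UPDATE ON customer
BEGIN INSERT INTO customer_log (Customer_ID, Action) VALUES (NEW.Customer_ID, 'Update'); END;

CREATE TRIGGER log_delete AFTER DELETE ON customer
BEGIN INSERT INTO customer_log (Customer_ID, Action) VALUES (OLD.Customer_ID, 'Delete'); END;
""")

# Every change to the customer table now leaves a trace in customer_log.
db.execute("INSERT INTO customer VALUES (1, 'Satya Nadella', 'France')")
db.execute("UPDATE customer SET Country = 'India' WHERE Customer_ID = 1")
db.execute("DELETE FROM customer WHERE Customer_ID = 1")

actions = [row[0] for row in db.execute("SELECT Action FROM customer_log ORDER BY rowid")]
```

This variant captures only the action and the key; as noted above, a richer log could also store the new column values.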
Just like in the previous examples, there is a customer table with the customer's name and country. However, now there is a customer log table as well. This log table has a column Customer_ID, referring to the Customer_ID column of the customer table. The log table also has an action and timestamp column. At time t1, there are three records in the customer table. Each time a record was added, a row with an 'Insert' action was added to the customer log table too. At time t2, two records have been added, one has been deleted (Customer_ID 2) and the country of a customer has been changed (Customer_ID 1, France → India). For the customer log table, this means that two rows are added with the 'Insert' action and one row per 'Update' and 'Delete' action.
Compared to the previous examples, both deletes and updates can now be executed in the target table, since the log table registers all the inserts, updates and deletes between the previous load time and now.
| Customer_ID | Name | Country |
| --- | --- | --- |
| 1 | Satya Nadella | France |
| 2 | Jeff Bezos | USA |
| 3 | Elon Musk | South-Africa |

Customer table, t1

| Customer_ID | Action | Log_Time |
| --- | --- | --- |
| 1 | Insert | 2022-03-05 13:41:20.000 |
| 2 | Insert | 2022-03-05 13:42:20.000 |
| 3 | Insert | 2022-03-05 13:43:20.000 |

Customer log table, t1

| Customer_ID | Name | Country |
| --- | --- | --- |
| 1 | Satya Nadella | India |
| 3 | Elon Musk | South-Africa |
| 4 | Tim Cook | USA |
| 5 | Bill Gates | USA |

Customer table, t2

| Customer_ID | Action | Log_Time |
| --- | --- | --- |
| 1 | Insert | 2022-03-05 13:41:20.000 |
| 2 | Insert | 2022-03-05 13:42:20.000 |
| 3 | Insert | 2022-03-05 13:43:20.000 |
| 1 | Update | 2022-03-15 08:36:43.000 |
| 2 | Delete | 2022-03-15 08:37:02.000 |
| 4 | Insert | 2022-03-15 08:37:57.000 |
| 5 | Insert | 2022-03-15 08:38:43.000 |

Customer log table, t2
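Processing the log into the target can be sketched as a replay of the actions. The in-memory structures below are simplifications of the tables above; in a real pipeline the new values for inserts and updates would be fetched from the source database.

```python
# Log entries captured since the last load, in chronological order: (Customer_ID, Action).
log = [(1, "Update"), (2, "Delete"), (4, "Insert"), (5, "Insert")]

# Current state of the source table: Customer_ID -> (Name, Country).
source = {
    1: ("Satya Nadella", "India"),
    3: ("Elon Musk", "South-Africa"),
    4: ("Tim Cook", "USA"),
    5: ("Bill Gates", "USA"),
}

# Target state after the load at t1.
target = {
    1: ("Satya Nadella", "France"),
    2: ("Jeff Bezos", "USA"),
    3: ("Elon Musk", "South-Africa"),
}

# Replay the log against the target: upsert for Insert/Update, remove for Delete.
for customer_id, action in log:
    if action == "Delete":
        target.pop(customer_id, None)
    else:  # Insert or Update: fetch the current values from the source
        target[customer_id] = source[customer_id]
```

After the replay, the target matches the source exactly, including the delete of Customer_ID 2 that the column-based methods missed.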
Hash-Based Incremental Load
If your source system does not support logging the inserts, updates and deletes, an incremental loading mechanism can still be set up. At the start of each load, you identify changes in your table by comparing what exists in your source with what exists at the end of your ETL pipeline. Instead of comparing individual records, you can concatenate values and compute a 'hash'. A hash is the result of a hash function; it consists of only a few bytes and has a very high chance of being unique for distinct inputs, and hashing the same values will always produce the same hash value. The advantage of comparing hashes instead of individual values is that a hash is typically many times smaller than the original values, making the comparison cheaper.
Two hashing techniques can be used. The first is creating one hash value over all values of the table on the source side and comparing it to the hash value computed over what exists at the end of the pipeline. If these values differ, you know that there is an inconsistency caused by an insert, update or delete; however, you don't know which record changed. The second option is to create a hash per row. By comparing the row hashes of the source with those of the target, you can identify inserts, updates, and deletes. This is more efficient than comparing the values of each column between the source and target.
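The per-row variant can be sketched as follows. The snapshot dictionaries are assumptions for the example; MD5 is used here purely as a compact, non-cryptographic fingerprint, and any stable hash function would do.

```python
import hashlib

def row_hash(row):
    # Concatenate the column values with a separator and hash them; the
    # separator avoids ("ab", "c") and ("a", "bc") producing the same input.
    joined = "|".join(str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Hypothetical snapshots: Customer_ID -> (Name, Country).
source = {1: ("Satya Nadella", "India"), 3: ("Elon Musk", "South-Africa"),
          4: ("Tim Cook", "USA")}
target = {1: ("Satya Nadella", "France"), 2: ("Jeff Bezos", "USA"),
          3: ("Elon Musk", "South-Africa")}

source_hashes = {k: row_hash(v) for k, v in source.items()}
target_hashes = {k: row_hash(v) for k, v in target.items()}

# Keys only on one side are inserts/deletes; shared keys with differing
# hashes are updates.
inserts = source_hashes.keys() - target_hashes.keys()
deletes = target_hashes.keys() - source_hashes.keys()
updates = {k for k in source_hashes.keys() & target_hashes.keys()
           if source_hashes[k] != target_hashes[k]}
```

Note that this still requires reading every row on both sides to compute the hashes, which is the 'cost of comparing' mentioned in the overview table.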
Conclusion
Multiple methods of incremental loading were discussed in this blog. While it is easy to say that an incremental load is the way to go for your ETL, it may not always be possible, or it can be a longer and more complex process to set up.
As covered, there are multiple options to track the changes at a source, each with its own advantages and disadvantages. Depending on the case, these need to be weighed to find the optimal solution.