Performance tuning using indexes
Do you know that feeling when you finally have your data warehouse up and running using dozens of stored procedures written in SQL? You feel satisfied when your pipelines have been running for days without throwing an error. But then you notice that the time it takes to run your pipeline is tremendous and the cause of it are the long loading times of your SQL queries…
You could have a look at your code and start taking action. There are many options out there to optimize your loading time. One of them is the usage of indexes. Let me introduce a high level overview of using indexes which will decrease the time a query has to run.
What are Indexes?
Imagine that you are reading a book and want to know on which pages they are talking about a specific subject. Then you would go to the front or back of the book and look for that subject where you would see the pages listed out. This action only takes seconds of your time. Can you imagine not having the index pages and having to read through the whole book until you find your subject? You could be lucky and it occurs in the beginning of the book, but it could also be at the end… This is exactly how your database engine behaves when there is no index active.
SQL Server operates in a similar way as your action taken to finding that subject. Indexes are the method by which SQL Server sorts and organizes the rows in tables to make it easier to find the desired row(s). In other words, SQL Server will perform a ‘scan’ over your entire table, row by row. Obviously, this operation takes some time and that’s why indexes were introduced. The action of ‘jumping’ into the right part of a table is referred to as a ‘seek’.
Microsoft’s definition: An index is an on-disk structure associated with a table or view that speeds retrieval of rows from the table or view. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.
Main benefits of using indexes:
- Overall query time reduction
- I/O reduction
- CPU time reduction
- Less cache use
In this blog we will focus on clustered and non-clustered indexes with an example of the clustered index. Be aware that multiple other types of indexes exist.
Indexes in a nutshell
Don’t have time to read the whole blog? Let me explain indexes in a few sentences.
An index is an on-disk structure associated with a table or view that speeds retrieval of rows from the table or view and has more benefits. There are multiple types of indexes, the most common are clustered indexes and non-clustered indexes. You would use the clustered index if there is only one column you would like to sort by. If you would like the ability to sort on multiple columns, the non-clustered index is recommended since the underlying structure is different from the clustered index.
The difference between a clustered and a non-clustered index
Clustered indexes sort and store the data in the table in one order because the table is stored once on one disk.
Note: there can only be one index per table which can cover multiple columns.
Non-clustered indexes have another approach of storing the data. Non-clustered indexes create a copy of a subset of the columns you have specified for your index. The table will now be sorted by that copy of columns. So the structure of storing and sorting the data differs from a clustered index.
So when to use which index?
For example, you have a table which contains the ranking of a soccer league which contains the name, club number, goal difference, … and you would like to be able to sort by Name but also to sort by ranking number. In this case, you would need a non-clustered index because a clustered index is only able to be sorted in one way.
|Return any column you request in your query||Contains all columns||Contains a subset of columns||Must perform a second query to return any column|
|Resource intensive to handle changes on column||Easier to handle changes on column|
Considerations when using indexes
- The clustered index should be unique for the column(s) you have selected.
- Try to include only columns that are unlikely to change values (immutable columns)
- For example: an identity column, an order column, a GUID, …
- The non-clustered index should be on columns which will be likely to searched
- For example: given name, family name, birthdate
WOW this works great! Let me add indexes to all of my tables!
It feels like magic, but it sure is not. As with everything in life, nothing comes for free. An index has to be created, stored, updated and maintained. This also means, for every row which is inserted, deleted, or updated in a table, all the non-clustered indexes – which are relevant – must also be updated. This action is also referred to as the write penalty. In most cases, it is worth paying this penalty but of course not when your index is never used.
Did you know?
- If you create a table with a UNIQUE constraint, SQL Server will automatically create a non-clustered index
- If you create a table with a PRIMARY KEY constraint, SQL Server will automatically create a clustered index
- Indexes are automatically ‘maintained’ when you update a column. However, it is a best practice to maintain your indexes yourself every once in a while. This can be done by using the Maintenance Plan Wizard of SMSS
- The order of the columns for sorting your indexes is important
How to use indexes?
Indexes are actually easy to set up. You only need the statement described below:
Create clustered index [your index name] on [your table] [(your column(s))]
Create nonclustered index [your index name] on [your table] [(your column(s))]
Below you can find an example of a clustered and a non-clustered index with 2 datasets from Kaggle.
Firstly, we create a simple join between 2 tables and a where statement:
There is only 1 index active on the dbo.dataset table (Primary Key Constraint).
This is the execution plan after running this piece of code:
As you can see, the database engine performed a table scan on the second table (DatasetTag), which means it searched row by row. This action took 4 seconds.
Let’s see what happens if we add a clustered index on the PK of the DatasetTag table.
And after running the query again…
We can see that the total time of loading has been decreased by half!
When you get this output, SQL Server also recommends creating an index on certain columns. The message I got, was as follows:
“Missing Index (Impact 31.8038): CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname>] ON [dbo].[dataset] ([totalDownloads]) INCLUDE (CreatorUserId, OwnerUserId, OwnerOrganizationId, CurrentDatasetVersionId, CurrentDatasourceVersionId, ForumId, Type, CreationDate, lastActivityDate, totalViews, totalVotes, totalKernels)”
Let’s follow this recommendation and see what happens …
After creating a non-clustered index on the ‘totalDownloads’ column of the Dataset table (first table) and after running the query again, we can see that instead of the total loading time being decreased, it has even increased!
Unfortunately, there’s no magic involved with indexes…
The reason why this run took longer, is because the way of sorting the data had been changed and this comes at a certain cost.
So don’t just execute the recommendations of SQL Server without checking whether this change would have a positive impact.
Of course, these queries are simple and the amount of data is not as huge as you would have in your data warehouse. But I’m sure you can imagine that the usage of indexes on your queries could have a positive impact.
Indexes are easy to set up and frequently used for writing your queries in SQL Server. Obviously there are many other tricks to speed up your query performance. Stay tuned for other performance tips!
Jani De Kort
Data & analytics consultant @ Lytix