By: Alejandro Cobar | Updated: 2020-10-21
Problem
Not long ago, I was working on a SQL Server database that was growing quickly and using a lot of disk space. After an initial check, I noticed that the number of records in the largest table was not that high (only about 150,000), yet the size of the database was a little over 100GB. The large table contained only 2 columns, an ID and a VARBINARY field that was quite wide (in terms of the amount of data stored in each row), hence the size. Eventually, I had to come up with a strategy to purge old information in chunks, because a single delete would have caused blocking and other potential issues.
I was thinking to myself, what if I run into a similar situation with another RDBMS (MySQL or PostgreSQL)? Would I see similar results if I had to delete lots of rows from tables with very wide rows? I got curious enough to give it a try and will share my results. I hope you find this as interesting as I did while doing the tests.
Solution
In this tip I will be presenting the results I compiled after performing record deletions in each respective RDBMS. The tests are done against a set of 10,000 records in a table that has only an id column and a wide string column (1,000,000 characters per row, to be precise).
Initial considerations
As you probably already guessed from the title of this article, I'm performing the tests against a SQL Server, MySQL and PostgreSQL database instance. To keep things as fair as possible, I'm performing each test in a Linux virtual machine with the exact same specs, and for the data type of the wide column I tried to choose something as equivalent as possible between them. One last detail for the RDBMS setup is that I tested them right out of the box, without any best practices, tuning, or specific parameter settings that could help improve performance.
VM Specs
- OS: Ubuntu Server 20.04 LTS
- CPU: Intel Core i7 2.7GHz (3820QM)
- RAM: 8GB DDR3 1600MHz
- Volume Size: 25GB
Information related to the RDBMSs and their respective runs
I used the latest available version of each product to keep things as fair as possible.
The scenario for the test goes like this:
1 - A simple test database is created in each DB instance, along with a table containing the wide column, with a structure similar to this:
CREATE TABLE test (
   id INT NOT NULL PRIMARY KEY AUTOINCREMENT,
   wide VARCHAR(MAX) NOT NULL
)
The syntax for the auto increment id varies across the RDBMSs, but this is just to give you an idea (see the sketch below).
The data type of the wide column is not exactly the same across all RDBMSs either, but I tried to choose one for each that lets me store the exact same data, to keep things as similar as possible.
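As a rough sketch (I'm assuming the data types listed in the results table further down; the exact options you pick may differ), the table could be created like this in each engine:

-- SQL Server
CREATE TABLE test (
   id   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
   wide VARCHAR(MAX) NOT NULL
);

-- MySQL
CREATE TABLE test (
   id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
   wide LONGTEXT NOT NULL
);

-- PostgreSQL
CREATE TABLE test (
   id   SERIAL PRIMARY KEY,
   wide VARCHAR(1000000) NOT NULL
);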
2 - A round of 10,000 records is inserted, and then separate delete rounds of 1,000, 2,000, 3,000, 4,000, 5,000, and 10,000 rows are performed, measuring the time to delete the rows as well as the size of the respective transaction log file after each round completes (a sketch of the statements used is shown after these steps).
3 - The string inserted into the wide column of each row is 1,000,000 characters long (the exact same string across all of the rows), and I'm attaching it as a sample in case you want to try it in your system.
4 - For the tests in each RDBMS, I made sure that only that RDBMS was running in the VM while the other two were completely shut down.
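The tip doesn't list the exact statements used, so the following is only a sketch of how the inserts and per-round deletes could look, assuming the tables above and generating the wide value with REPLICATE/REPEAT instead of the attached sample string:

-- SQL Server: insert 10,000 rows, each with a 1,000,000-character string
DECLARE @i INT = 0;
WHILE @i < 10000
BEGIN
   INSERT INTO test (wide) VALUES (REPLICATE(CAST('a' AS VARCHAR(MAX)), 1000000));
   SET @i += 1;
END;

-- SQL Server: delete a round of N rows (here N = 1,000)
DELETE TOP (1000) FROM test;

-- MySQL: REPEAT builds the wide string (run the INSERT 10,000 times from a loop or script)
INSERT INTO test (wide) VALUES (REPEAT('a', 1000000));
DELETE FROM test ORDER BY id LIMIT 1000;

-- PostgreSQL: DELETE has no LIMIT clause, so restrict the round by id instead
INSERT INTO test (wide) VALUES (repeat('a', 1000000));
DELETE FROM test WHERE id IN (SELECT id FROM test ORDER BY id LIMIT 1000);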
Insertion Round
These metrics were measured on a completely empty instance for each case.
RDBMS | Version | Wide Column Data Type | Time to insert 10,000 records | Size of data file after the insertion |
---|---|---|---|---|
SQL Server | 2019 CU6 | VARCHAR(MAX) | 18min 25sec | 9,864MB |
MySQL | 8.0.21 | LONGTEXT | 13min 15sec | 9,251.52MB |
PostgreSQL | 12.2 | VARCHAR(1000000) | 4min 20sec | 147MB |
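The tip doesn't show how the data file sizes were captured; one way to check them (a sketch, assuming the database/schema is named test) would be:

-- SQL Server: approximate data file size in MB for the test database
SELECT name, size * 8 / 1024 AS size_mb
FROM sys.master_files
WHERE database_id = DB_ID('test') AND type_desc = 'ROWS';

-- MySQL: approximate table sizes (data + indexes) in the test schema
SELECT table_name, ROUND((data_length + index_length) / 1024 / 1024, 2) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'test';

-- PostgreSQL: on-disk size of the test database
SELECT pg_size_pretty(pg_database_size('test'));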
Deletion Rounds
After each deletion round, the log file is cleaned/wiped before the next round takes place (one way to check the log sizes is sketched below). For each test there is a comparison chart to give a better sense of the results; note that I multiplied the time results by 10 just for visualization purposes.
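The tip doesn't specify how the log sizes were captured either; the following is a sketch of queries that could be used (for MySQL I'm assuming the binary log is the one being measured, which the article doesn't state):

-- SQL Server: log size and space used per database
DBCC SQLPERF(LOGSPACE);

-- MySQL: size of each binary log file
SHOW BINARY LOGS;

-- PostgreSQL: total size of the WAL files in pg_wal
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();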
1,000 Records Delete Test
RDBMS | Time to delete records | Size of log file after the deletion |
---|---|---|
SQL Server | 44sec | 72MB |
MySQL | 13.8sec | 954MB |
PostgreSQL | 0.095sec | 16MB |
2,000 Records Delete Test
RDBMS | Time to delete records | Size of log file after the deletion |
---|---|---|
SQL Server | 1min 2sec | 136MB |
MySQL | 29.24sec | 1.9GB |
PostgreSQL | 0.179sec | 16MB |
3,000 Records Delete Test
RDBMS | Time to delete records | Size of log file after the deletion |
---|---|---|
SQL Server | 50sec | 200MB |
MySQL | 2min 32sec | 2.8GB |
PostgreSQL | 0.264sec | 16MB |
4,000 Records Delete Test
RDBMS | Time to delete records | Size of log file after the deletion |
---|---|---|
SQL Server | 1min 36sec | 264MB |
MySQL | 3min 43sec | 3.7GB |
PostgreSQL | 0.134sec | 16MB |
5,000 Records Delete Test
RDBMS | Time to delete records | Size of log file after the deletion |
---|---|---|
SQL Server | 2min 25sec | 355MB |
MySQL | 5min 44sec | 4.6GB |
PostgreSQL | 0.314sec | 16MB |
10,000 Records Delete Test
RDBMS | Time to delete records | Size of log file after the deletion |
---|---|---|
SQL Server | 3min 53sec | 648MB |
MySQL | 10min 51sec | 12GB |
PostgreSQL | 0.639sec | 16MB |
Interesting Facts
- Without a doubt, PostgreSQL is the clear winner of this experiment. Out of the 3 RDBMSs, it performed both the inserts and the deletes in the least amount of time, while keeping the transaction log much smaller. It's very likely that its management of the pg_wal files and its checkpoint mechanism are very well implemented, or there's something else behind the curtain that I'm simply not aware of (since I'm not a PostgreSQL expert as of right now). If you have the explanation, feel free to share.
- Compared to SQL Server, MySQL has an advantage within the first 2,000 records in terms of the time to perform the deletes. However, at every stage of the process it takes a big hit on the transaction log size, which goes through the roof at the end.
- It would be interesting to perform the exact same experiment after applying best practices and properly tuning each RDBMS to see if better numbers are achieved, although I guess it would be a bit pointless for PostgreSQL as I don't think I would get better numbers than the ones I already got.
- In the world of RDBMSs, I have always heard that PostgreSQL is way better than MySQL. I have never used PostgreSQL in a highly transactional production environment, but after seeing the results from this experiment, I guess I can find some truth to that.
- I know for a fact that managing delete operations like these in SQL Server in chunks plays a whole lot better in terms of transaction log usage. Also, locks on the table are far less aggressive, which allows you to keep interacting with the table even while the deletes are taking place, something that doesn't happen with a single large delete (see the sketch after this list).
- If you have to perform something similar in a real production system and you are working with MySQL, I guess a well thought out strategy has to be applied, of course after conducting several test runs to see the results first.
- It would have been interesting to throw more RDBMSs into the mix, like Oracle or DB2, or to jump into the NoSQL space and toss in MongoDB, Cassandra, Redis, etc.
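As a minimal sketch of the chunked approach for SQL Server (not the script from the article linked in the Next Steps, and assuming the test table above with SIMPLE recovery), the idea looks like this:

-- Delete in batches of 1,000 rows until the table is empty
WHILE 1 = 1
BEGIN
   DELETE TOP (1000) FROM test;
   IF @@ROWCOUNT = 0 BREAK;
   CHECKPOINT; -- in SIMPLE recovery, lets the inactive portion of the log be reused
END;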
Next Steps
- Here's an article of mine that presents a script to delete a bunch of records in chunks and shows you useful information such as the amount of time to complete each batch, and the size of the transaction log file during the delete operation.
- Here are more articles from the community that you might find interesting on the topic of deleting data from a table:
- SQL Server Performance Issue for Single Row Delete or Update Operations – by Simon Liew
- Fastest way to Delete Large Number of Records in SQL Server – by Aaron Bertrand
- Deleting Historical Data from a Large Highly Concurrent SQL Server Database Table – by Daniel Farina
- PowerShell Script to Delete All Data in a SQL Server Database – by Jeffrey Yao