Populating a SQL Server Test Database with Random Data

By: Tibor Nagy   |   Updated: 2010-12-09   |   Comments (6)   |   Related: > Testing


Problem

If you develop new functionality for your SQL Server database, you have probably already encountered the typical problem of testing large scale databases. I have to run a series of performance and functional tests on a database with a few million rows, but I do not have the necessary test data. I have heard about some excellent commercial tools, but they are expensive and my company cannot afford them. How can I generate the test data on my own and load it into the tables of the test database?

Solution

I will show you some tricks to generate rows for the test database and multiply them. Each data type needs a different approach, as you can see below.

Numeric data

First of all, you will definitely need to generate some numeric data. The obvious idea would be to use the RAND() function. Unfortunately, RAND() is evaluated once per query, not once per row, so it returns the same value for every row your query touches. I suggest using other methods unless you want to supply a different seed value for each and every call, which can be very painful when generating a mass volume of data.
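You can see this behavior with a quick test (a minimal sketch; sys.objects is used here only as a convenient multi-row source):

--RAND() is evaluated once per query, so all five rows show the same value  
SELECT TOP (5) RAND() AS SAME_VALUE FROM sys.objects  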

If you are looking for a genuinely random value per row, you will get much better results by generating a NEWID() and calculating its checksum. The following expression sets the INTVALUE1 column to a random integer value:

UPDATE TESTTABLE  SET INTVALUE1=CHECKSUM(NEWID())  

By further modifying this expression you can generate decimal values, and by applying the ABS() function you can eliminate negative numbers:

UPDATE TESTTABLE  SET DECVALUE1=ABS(CHECKSUM(NEWID()))/100.0  
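The modulo operator lets you constrain the result to a fixed range, which is often needed for realistic test data. A minimal sketch using the same TESTTABLE columns as above:

--Random integer between 1 and 100  
UPDATE TESTTABLE SET INTVALUE1 = ABS(CHECKSUM(NEWID())) % 100 + 1  
--Random decimal between 0.00 and 999.99  
UPDATE TESTTABLE SET DECVALUE1 = (ABS(CHECKSUM(NEWID())) % 100000) / 100.0  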

Fixed range data sets

We can use the above method with some changes to generate values within a fixed data range. For example, when you need to generate a Boolean value, you can choose from the data sets (0;1) or (Y;N), etc. You can use the parity of the random integer to convert it into Boolean data. Beware: referencing CHECKSUM(NEWID()) in several branches of a CASE expression is unsafe, because the nondeterministic expression can be evaluated once per branch, so the branches may see different values. The following statements generate random yes or no values for the column BOOLVALUE1:

--Generate random 0 or 1 value for every row  
UPDATE TESTTABLE SET BOOLVALUE1 = ABS(CHECKSUM(NEWID())) % 2  
--Translate the values to Yes or No  
UPDATE TESTTABLE SET BOOLVALUE1 = 'N' WHERE BOOLVALUE1 = '0'  
UPDATE TESTTABLE SET BOOLVALUE1 = 'Y' WHERE BOOLVALUE1 = '1'  
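Alternatively, as Ray points out in the comments below, a single CASE statement is safe as long as the random expression appears only once, inside the condition:

--Single-statement version: CHECKSUM(NEWID()) is referenced only once per row  
UPDATE TESTTABLE SET BOOLVALUE1 = CASE WHEN CHECKSUM(NEWID()) % 2 = 0 THEN 'Y' ELSE 'N' END  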

Text data

Text data requires special attention if you would like your database to have some reasonable content. You have to build an initial dictionary and then use it to generate more rows. For example, names are commonly stored in various database fields. The following example shows how to create a table containing 100 different names in a few seconds.

--Create table for first names  
CREATE TABLE [NAMES1] (FIRST_NAME [varchar](20))  
--Create table for family names  
CREATE TABLE [NAMES2] (FAMILY_NAME [varchar](20))  
--Fill first names  
INSERT INTO NAMES1 VALUES ('John')  
INSERT INTO NAMES1 VALUES ('Jack')  
INSERT INTO NAMES1 VALUES ('Jill')  
INSERT INTO NAMES1 VALUES ('Bill')  
INSERT INTO NAMES1 VALUES ('Mary')  
INSERT INTO NAMES1 VALUES ('Kate')  
INSERT INTO NAMES1 VALUES ('Kevin')  
INSERT INTO NAMES1 VALUES ('Matt')  
INSERT INTO NAMES1 VALUES ('Rachel')  
INSERT INTO NAMES1 VALUES ('Tom')  
--Fill family names  
INSERT INTO NAMES2 VALUES ('Smith')  
INSERT INTO NAMES2 VALUES ('Morgan')  
INSERT INTO NAMES2 VALUES ('Simpson')  
INSERT INTO NAMES2 VALUES ('Walker')  
INSERT INTO NAMES2 VALUES ('Bauer')  
INSERT INTO NAMES2 VALUES ('Taylor')  
INSERT INTO NAMES2 VALUES ('Morris')  
INSERT INTO NAMES2 VALUES ('Elliott')  
INSERT INTO NAMES2 VALUES ('Clark')  
INSERT INTO NAMES2 VALUES ('Rock')  
--Generate 10x10=100 different names  
SELECT * INTO TESTTABLE FROM NAMES1 CROSS JOIN NAMES2  

You can use the above examples to quickly generate more and more different database records from a few small sample tables. Cross joins and self joins can easily create large datasets, so be cautious with very big tables: cross joining two tables of one thousand records each, for example, produces a result set of one million records.
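To see how the pieces fit together, here is a sketch that multiplies the 100-row TESTTABLE from above into one million rows and attaches a random numeric column (the BIGTESTTABLE name and the SALARY column are hypothetical):

--100 x 100 x 100 = 1,000,000 rows with a random decimal column  
SELECT T.FIRST_NAME, T.FAMILY_NAME,  
       ABS(CHECKSUM(NEWID())) % 100000 / 100.0 AS SALARY  
INTO BIGTESTTABLE  
FROM TESTTABLE T  
CROSS JOIN TESTTABLE A  
CROSS JOIN TESTTABLE B  

Note that each of the 100 name combinations is repeated 10,000 times; enlarge the dictionary tables first if you need more variety.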

Next Steps





About the author
Tibor Nagy is a SQL Server professional in the financial industry with experience in SQL 2000-2012, DB2 and MySQL.




Article Last Updated: 2010-12-09

Comments For This Article




Saturday, March 26, 2016 - 1:48:06 PM - saravanan

How can I generate up to 1 million random records for an EMP table consisting of columns such as middlename, firstname, lastname, emailid, salary, dob, phone and doj? Please give me a query as soon as possible.

 



Friday, September 14, 2012 - 1:56:44 PM - boris tyukin

Interesting post, but check out this cool open source project: http://dataroller.sourceforge.net/

I stumbled upon it a while ago and you can do quite a few things with it.


Tuesday, September 13, 2011 - 4:54:08 PM - Ray

I think you could use a CASE function for the Boolean data element. You only call NEWID() once per row if you code it this way:

case when checksum(newid()) % 2 = 0 Then 'Y' Else 'N' end

I also support your comments about not having/using production data. We provided customers with an application that was full of HIPAA related information. Many of them would not let us have a production database backup, and the rest should not have. Additionally, when you are working on a new application, then by definition you don't have access to production data.

Also, the exercise of generating "pseudo" data often forces developers to seriously analyze aspects of their data and its relationships that they otherwise would never consider.


Thursday, December 9, 2010 - 10:39:26 AM - Tibor Nagy

@Nas:

Thanks for the suggestion. I will correct the typing error.

@Jerry LaGrou:

I provided the tip because there are certain scenarios where SQL developers cannot use production data for testing. I have real life experience on this topic: I was working for an educational software company, but the company did not use its own software, so we did not have any real data. Another example is the financial industry, where strict security rules prohibit us from restoring production data to any test server.


Thursday, December 9, 2010 - 9:48:00 AM - Jerry LaGrou

Um, way too hard. Also, making your own test data means that it will only have the features and behaviors you understood before the testing. Use real source data.

Normally the reason for a smaller test database is so that the big scripts will run fast enough on the test machine to allow quick test cycles. So, first take a full restore of the production data. Find the big tables and delete most of the records. One of my favorite ways of doing this is to delete all the records where the ID (normally an identity int counting up) is not evenly divisible by either of two prime numbers.

Delete from <your big table> where ((ID/1129)*1129 != ID) and ((ID/1151)*1151 != ID)

This leaves a consistent 0.2% of the data with variable spacing between the remaining records. (If 0.2% is too much, pick larger primes; if it is not enough, pick smaller ones.)

If your FKs are set up well (cascading deletes), this should complete the task; if not, a few more deletes and hurray.

If obfuscating customer data is also a requirement, replace all the consonants in the sensitive data with an X and all the vowels with an A. Now you are ready for tests with realistic values on all the flags, statuses, and other anomalous artifacts in the original data set.
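A minimal sketch of that obfuscation idea in T-SQL (the CUSTOMERS table and LAST_NAME column are hypothetical):

--Loop over A-Z: vowels become 'A', consonants become 'X'  
--With a case-insensitive collation this also replaces lowercase letters  
DECLARE @c INT  
SET @c = 65  
WHILE @c <= 90  
BEGIN  
    UPDATE CUSTOMERS  
    SET LAST_NAME = REPLACE(LAST_NAME, CHAR(@c),  
        CASE WHEN CHAR(@c) IN ('A','E','I','O','U') THEN 'A' ELSE 'X' END)  
    SET @c = @c + 1  
END  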

Happy programming!

 


Thursday, December 9, 2010 - 8:29:23 AM - Nas

Good article. I like the suggestion of using joins to create mass data - a great time saver.

Just one point on the first two CREATE TABLE statements - I think a column length is needed, as the subsequent inserts fail otherwise.

--Create table for first names
CREATE TABLE [NAMES1] (FIRST_NAME [varchar](50))


--Create table for family names
CREATE TABLE [NAMES2] (FAMILY_NAME [varchar](50))

Think 50 should do the trick.

Thanks again.

 














