Generating Random Numbers in SQL Server Without Collisions

By: Aaron Bertrand   |   Updated: 2013-09-17   |   Comments (20)   |   Related: > TSQL


Problem

From time to time, I see a requirement to generate random identifiers for things like users or orders. People want to use random numbers so that the "next" identifier is not guessable, or to prevent insight into how many new users or orders are being generated in a given time frame. They could use NEWID() to solve this, but they would rather use integers due to key size and ease of troubleshooting.

Let's say we want all users to have a random number between 1,000,000 and 1,999,999 - that's a million different user IDs, all 7 digits, and all starting with the number 1. We may use one of these calculations to generate a number in this set:

SELECT 
  1000000 + (CONVERT(INT, CRYPT_GEN_RANDOM(3)) % 1000000),
  1000000 + (CONVERT(INT, RAND()*1000000) % 1000000),
  1000000 + (ABS(CHECKSUM(NEWID())) % 1000000);

(These are just quick examples - there are probably at least a dozen other ways to generate a random number in a range, and this tip isn't about which method you should use.)

These seem to work great at the beginning - until you start generating duplicates. Even when you are pulling from a pool of a million numbers, you're eventually going to pull the same number twice. And in that case, you have to try again, and sometimes try again multiple times, until you pull a number that hasn't already been used. So you have to write defensive code like this:

DECLARE @rowcount INT = 0, @NextID INT = 1000000 + (CONVERT(INT, CRYPT_GEN_RANDOM(3)) % 1000000);
WHILE @rowcount = 0
BEGIN
  IF NOT EXISTS (SELECT 1 FROM dbo.Users WHERE UserID = @NextID)
  BEGIN
    INSERT dbo.Users(UserID /* , other columns */) 
      SELECT @NextID /* , other parameters */; 
      
    SET @rowcount = 1;
  END
  ELSE
  BEGIN
    SELECT @NextID = 1000000 + (CONVERT(INT, CRYPT_GEN_RANDOM(3)) % 1000000);
  END
END

Never mind that this is really ugly and doesn't even contain any transaction or error handling; this code will, logically, take longer and longer as the number of "available" IDs left in the range diminishes.

Solution

One idea I've had to "solve" this problem is to pre-calculate a very large set of random numbers; by paying the price of storing the numbers in advance, we can guarantee that the next number we pull won't have already been used. All it requires is a table and some code to pull the next number from the set. One way to populate such a table:

CREATE TABLE dbo.RandomIDs
(
  RowNumber INT PRIMARY KEY CLUSTERED,
  NextID INT  
) WITH (DATA_COMPRESSION = PAGE); 
-- data compression used to minimize impact to disk and memory
-- if not on Enterprise or CPU is your bottleneck, don't use it
;WITH x AS 
(
  SELECT TOP (1000000) rn = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM sys.all_objects AS s1 
    CROSS JOIN sys.all_objects AS s2
    ORDER BY s1.[object_id]
)
INSERT dbo.RandomIDs(RowNumber, NextID)
  SELECT rn, ROW_NUMBER() OVER (ORDER BY NEWID()) + 1000000
  FROM x;

This took about 15 seconds to populate on my system, and occupied about 20 MB of disk space (30 MB if uncompressed). I'll assume that you have 20 MB of disk and memory to spare; if you don't, then this "problem" is likely the least of your worries. :-)
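If you want to verify the footprint on your own system, sp_spaceused is a quick way to check (the numbers will vary with your version and compression settings):

-- check the on-disk footprint of the pre-generated ID table
EXEC sys.sp_spaceused @objname = N'dbo.RandomIDs';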

Now, in order to generate the next ID, we can simply delete the row with the lowest RowNumber available, and output its NextID for use. We'll use a CTE to determine the TOP (1) row so that we don't rely on "natural" order - if you add a unique constraint to NextID, for example, the "natural" order may turn out to be based on that column rather than RowNumber. We'll also output the result into a table variable, rather than insert it directly into the Users table, because certain scenarios - such as the target table participating in a foreign key - prevent OUTPUT ... INTO from targeting it directly.

DECLARE @t TABLE(NextID INT);
;WITH NextIDGenerator AS 
(
  SELECT TOP (1) NextID FROM dbo.RandomIDs ORDER BY RowNumber
)
DELETE NextIDGenerator OUTPUT deleted.NextID INTO @t;
INSERT dbo.Users(UserID /* , other columns */)
  SELECT NextID /* , other parameters */ FROM @t;

When we come close to exhausting the first million values (likely a good problem to have), we can simply add another million rows to the table (moving on to 2,000,000 to 2,999,999), and so on. It may be wise to set up some automation to periodically check how many rows are left, so that you can re-populate well in advance of actually running out of numbers.
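As a rough sketch of what that check and re-population could look like (reusing the table and population pattern from above; NextID follows the same pattern with a +2,000,000 offset, and the scheduling itself, e.g. via a SQL Server Agent job, is left out):

-- how many pre-generated IDs are left? (run periodically)
SELECT RemainingIDs = COUNT(*) FROM dbo.RandomIDs;

-- when running low, append the next million values,
-- continuing RowNumber where the previous batch left off:
DECLARE @maxrow INT = (SELECT ISNULL(MAX(RowNumber), 0) FROM dbo.RandomIDs);
;WITH x AS 
(
  SELECT TOP (1000000) rn = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM sys.all_objects AS s1 
    CROSS JOIN sys.all_objects AS s2
    ORDER BY s1.[object_id]
)
INSERT dbo.RandomIDs(RowNumber, NextID)
  SELECT rn + @maxrow, ROW_NUMBER() OVER (ORDER BY NEWID()) + 2000000
  FROM x;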

Performance Metrics for Generating Random Values in SQL Server

I ran both methods 1,000,000 times, filling the Users table up with these random UserID values. The following chart shows that, while generating a random number at runtime is faster out of the gates, the cost of duplicate checking (and retrying in the event of a collision) quickly overtakes the read cost of the predefined table, and grows rapidly and eventually exponentially as more and more values are used up:

[Chart: average duration (in milliseconds) of the two methods of random number generation]

In the first 1,000 inserts, there were zero collisions. In the last 1,000 inserts, the average collision count was over 584,000. This, of course, is a problem that doesn't occur when you *know* that the next number you pull can't possibly be a duplicate (unless someone has populated the Users table through some other means).

Conclusion

We can trade a bit of disk space and relatively predictable (but not optimal) performance for the guarantee of no collisions, no matter how many random numbers we've already used. This doesn't seem like a good trade early on, but as the number of ID values used increases, the performance of the predefined solution does not change, while generating random numbers at runtime degrades severely as more and more collisions are encountered.


About the author
Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com, and also blogs at sqlblog.org.




Comments For This Article




Tuesday, October 9, 2018 - 12:57:03 PM - Aaron Bertrand Back To Top (77906)

@Ramon I didn't include error handling or isolation semantics but, no matter what method you choose, you'll need to protect concurrency using transactions / elevated isolation. In this case it is simply not necessary because the delete and the assignment happen in a single, isolated statement, which is an implicit transaction on its own. There is no way for someone to come and grab the same number while that is happening. Though, if you want to be really, really sure, I guess you could put WITH (HOLDLOCK) on the SELECT inside the CTE.
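For illustration, a sketch of that variation (the same statement from the tip, just with the hint added - only if you decide you want that extra guarantee):

DECLARE @t TABLE(NextID INT);
;WITH NextIDGenerator AS 
(
  SELECT TOP (1) NextID 
    FROM dbo.RandomIDs WITH (HOLDLOCK) -- serialize access to the "next" row
    ORDER BY RowNumber
)
DELETE NextIDGenerator OUTPUT deleted.NextID INTO @t;
INSERT dbo.Users(UserID /* , other columns */)
  SELECT NextID /* , other parameters */ FROM @t;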

I created a table like the one above with 5,000,000 rows, then a table with a single primary key int column. I then ran this in 5 different windows; it took quite some time, but I never got a single constraint violation or a failed insert, and the number of rows in dbo.Uniques was 5,000,000, so no duplicate violations.

DECLARE @t TABLE(NextID INT);
;WITH NextIDGenerator AS 
(
  SELECT TOP (1) NextID FROM dbo.RandomIDs ORDER BY RowNumber
)
DELETE NextIDGenerator OUTPUT deleted.NextID INTO @t;
INSERT dbo.Uniques(i) SELECT NextID FROM @t;
GO 1000000

Monday, October 8, 2018 - 5:23:53 PM - Ramon Leon Back To Top (77897)

Doesn't this contain a concurrency error?  What if two clients hit this at the same time and select the same lowest number? Won't one get it because the delete succeeds, and the other fail to get a number at all because nothing was deleted? Isn't there a race condition between the select and the delete?


Saturday, January 14, 2017 - 10:19:39 AM - Tom Back To Top (45296)

 Hi Aaron,

Very useful article indeed.  Of the three examples, I am interested in CRYPT_GEN_RANDOM.  What about applying the seed parameter with this function?  Will it reduce the possibility of collisions?  Unfortunately, Microsoft didn't elaborate more on the use of the seed, assuming most readers will have knowledge of it :(

Thanks.


Tuesday, January 20, 2015 - 1:52:56 PM - Rodolfo Reyes Back To Top (36002)

This was just what I was looking for.

It worked fine for my case.

Thank You.

 


Thursday, April 3, 2014 - 10:33:16 AM - John Grover Back To Top (29964)

You can also use

ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))

to generate random positive integers


Saturday, December 7, 2013 - 3:30:00 PM - Jeff Moden Back To Top (27729)

I agree with Aaron on the "good habits" thing even for one-off code.  With that in mind, I'll also suggest the following for the table structure, especially since one of the requirements is that NextID must be unique.

CREATE TABLE dbo.RandomIDs
        (
         RowNumber  INT     NOT NULL
        ,NextID     INT     NOT NULL
        ,CONSTRAINT PK_RandomIDs_RowNumber PRIMARY KEY CLUSTERED (RowNumber)
        ,CONSTRAINT AK_RandomIDs_NextID    UNIQUE NONCLUSTERED   (NextID)
        )
;

Also, remember that ROW_NUMBER() starts with the value of 1 and not 0.  The code needs a minor tweak if "we want all users to have a random number between 1,000,000 and 1,999,999".
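For example, one way to apply that tweak to the population query from the tip (only the constant changes, so NextID spans 1,000,000 through 1,999,999 instead of 1,000,001 through 2,000,000):

;WITH x AS 
(
  SELECT TOP (1000000) rn = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM sys.all_objects AS s1 
    CROSS JOIN sys.all_objects AS s2
    ORDER BY s1.[object_id]
)
INSERT dbo.RandomIDs(RowNumber, NextID)
  SELECT rn, ROW_NUMBER() OVER (ORDER BY NEWID()) + 999999 -- 999999 instead of 1000000
  FROM x;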

As a bit of a side bar, if you think you'll ever need to insert additional IDs, then you might want to add a check constraint for NEXTID >= 2000000 after you've populated the table with the first million rows, just in case someone forgets what the rules are.


Saturday, December 7, 2013 - 12:10:26 PM - Uma Umakanth Back To Top (27728)

Aaron, I see your point. Thanks


Friday, December 6, 2013 - 8:26:16 PM - Aaron Bertrand Back To Top (27725)

Uma, why reduce collisions when you can completely eliminate them? What benefit do you get from generating the number at runtime over generating them in advance?


Friday, December 6, 2013 - 5:36:41 PM - Uma Umakanth Back To Top (27722)

Interesting technique.

Can we just use your original duplicate checking logic, but select a random number between 1 and 9 million and add it to 1 million? This, I believe, will drastically reduce collisions at least until about halfway (about 5 million). Please let us know if you can get stats on this.


Wednesday, November 20, 2013 - 1:38:57 PM - Matt Kearns Back To Top (27557)

Select CAST(RIGHT(CAST(CAST(NEWID() AS VARBINARY(36)) AS BIGINT), 10) AS CHAR(10))

gives me a usable 10-digit random number quickly. Adjust the number 10 to any number between 1 and 19 to get a random big integer of that length. Cast as CHAR, it can be concatenated to a string, which I've used extensively in unit tests.


Wednesday, October 2, 2013 - 5:20:54 AM - Dav0id Back To Top (27007)

Interesting technique, thanks for sharing this.

One minor detail I noticed is that the ROW_NUMBER() function returns a BIGINT, but the random ID table only holds INTs. Presumably, for a large number of IDs in the long term, we'd want BIGINTs.


Friday, September 27, 2013 - 4:39:31 PM - Dave Back To Top (26978)

Nice article.

For requirements like these I prefer a pseudo-random number: a number which is unique in the database and can be used as an index, but is calculated from other factors.

In one case, I used the record ID of the CustomerID, converted it to a string and appended the string of the record ID of the order.

Worked like a charm and didn't cost a whole bunch of cycles.

But when the customer insists, this is the way to do it.


Friday, September 27, 2013 - 10:22:20 AM - Aaron Bertrand Back To Top (26974)

jhogue because of what I said in the first paragraph:

 

> They could use NEWID() to solve this, but they would rather use integers due to key size and ease of troubleshooting.


Friday, September 27, 2013 - 10:21:33 AM - Aaron Bertrand Back To Top (26973)

PhyData I understand your point, but the requirement I am addressing here is not merely picking random numbers, it is picking numbers that are *randomly ordered* and *also unique.* Picking 128467 twice doesn't help here, because the second time you pick that random number, it can't be used.


Friday, September 27, 2013 - 10:10:06 AM - jhogue Back To Top (26972)

Newbie question: why not simply use NEWID() ?


Friday, September 27, 2013 - 9:51:14 AM - PhyData DBA Back To Top (26971)

This is great code. Please do not think my comments are a reflection of your solution.  It is wonderful.  However, it is extremely poorly named. You are not generating random values if you are testing for collisions and pulling from a known set of values. A random result will have random collisions, or it is not random.  It was these kinds of NOT-random number generators that had to be replaced in thousands of systems in the 80's and 90's. There have even been movies made about this type of Random mistake. Sorry for the pun.


Thursday, September 19, 2013 - 8:27:10 PM - TimothyAWiseman Back To Top (26874)

Aaron, that makes sense. I too almost always include an ORDER BY with TOP; the few exceptions involve times where I don't care at all which row is returned. This seems to fall into that category of not caring which row is returned, but it is definitely a good habit to be in. Thanks for answering!

Thursday, September 19, 2013 - 5:40:54 PM - Aaron Bertrand Back To Top (26871)

Timothy: habit / best practice. Without ORDER BY, TOP is undefined, so while you may "always" observe the rows you get, it isn't guaranteed. I'll opt for accuracy and not promoting undefined query structures over saving 2 seconds on a query I'll typically only run once in the lifetime of a system.


Thursday, September 19, 2013 - 3:01:10 PM - TimothyAWiseman Back To Top (26870)

This is a great article and an interesting approach.  Out of curiosity, why did you include the "ORDER BY s1.[object_id]" in your CTE?  I cannot see a need for it and when I played around with the query it goes noticeably faster if that is left out.


Tuesday, September 17, 2013 - 8:37:41 AM - Junior Galvão - MVP Back To Top (26829)

Hi Aaron,

 

Great article, good work.

 

I like the way you show the comparison values in the graphic.

 

Regards.














