Forgotten SQL Server Functions - VARP, SOUNDEX and NTILE

By:   |   Updated: 2013-03-29   |   Comments (14)   |   Related: 1 | 2 | 3 | 4 | > Functions System


Problem

This article is ideally suited to SQL Server developers and those data professionals wishing to know more about SQL Server functions, and wanting to get a greater understanding of which functions are available and how they can help improve the efficiency and facility of SQL routines. Using system functions can help remove unwanted inefficiencies in code, such as loops, nested loops and string-based field manipulation. The functions available in SQL Server can tie in well with business logic, enabling the developer to write better-suited algorithms to support business requirements.

Solution

Rather than give an overview of all system functions, this article will walk through three several little-used functions in SQL Server - one for statistical aggregation (VARP), one for English phonics (SOUNDEX), and one for ranking and grouping (NTILE). It will also provide an overview of the RANK and DENSE_RANK functionality to complement the exploration of NTILE. Hopefully by seeing how useful just a few functions can be, this article will encourage you to find out more about system functions in SQL Server, and at the bottom of the article are a number of links to the Books Online documentation on non-user-defined functions.

As every data professional knows, some functions are used far more than others - commonly SUM and COUNT. However SQL Server ships with a wide variety of different functions, and these can be used in many different contexts. For those of you of a mathematical or statistical persuasion, it may interest you to know that SQL Server supports a wide variety of statistical aggregation and calculation functions, together with a number of mathematical and algebraic functions that can assist in, for example finance calculations, often taking the place of more complex and inefficient code.

VARP

Consider the following example to calculate the variance of a set of temperature values. Here we first define a test table with some temperature readings taken on different occasions for some major cities:

CREATE TABLE dbo.temperatureTable (
 uqid INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
 location VARCHAR(10),
 locationCode TINYINT,
 TemperatureA FLOAT,
 TemperatureB FLOAT,
 TemperatureC FLOAT,
 TemperatureD FLOAT )
 
INSERT INTO dbo.temperatureTable ( location, locationCode, TemperatureA, TemperatureB, TemperatureC, TemperatureD ) VALUES ( 'London', 3, 19.6, 8.2, 16.4, 11.1 ), ( 'Paris', 7, 10.5, 20.2, 21.3, 18.0 ), ( 'Madrid', 4, 31.1, 28.3, 26.3, 28.7 ), ( 'Moscow', 9, -13.5, -12.0, -16.9, -0.3), ( 'Luxembourg', 6, 8.3, 5.5, 7.1, 9.9 )

Let us now say that we wish to compute the average deviation from the mean for each city in our test table. That is, we wish to find out how far away from the average, on average, each temperature reading is within the appropriate category (location). However, we cannot simply calculate the deviation by taking each reading and subtracting or adding the mean, since this will give us a range of values that are mixed positive and negative and by definition will cancel each other out. We must instead use a different method of measuring the deviation. There are three commonly used - absolute deviation, which is the deviation from the mean with the sign (minus or plus) disregarded; the variance, in which each deviation is squared to obtain a positive number (be aware that -x^2 = x^2 for all positive natural numbers x in X) - and standard deviation, which is simply defined as the average deviation, as computed by variance (in other words, the square root of the variance).

So to be more precise, variance, mathematically, is defined as the sum of the squared distances of each member of a set of numbers from the mean (average) of that set of numbers, divided by the number of members. So, in the example above, we might wish to know the mean temperature in Moscow based on the information available, and the variance of this set of data as our chosen measure of variation in temperature. Because of the peculiarity of my example having a set of data spread across columns rather than in a single column, we have an opportunity to first UNPIVOT the data then use AVG and VARP to get this information, like so:

SELECT location, AVG(Temperature) [averageTemp], ROUND(VARP(Temperature),2) [varTemp]
FROM (
  SELECT location, TemperatureA, TemperatureB, TemperatureC, TemperatureD
  FROM dbo.temperatureTable ) p
UNPIVOT 
 ( Temperature FOR City IN 
  ( TemperatureA, TemperatureB, TemperatureC, TemperatureD ) ) AS u
GROUP BY location


data professionals wishing to know more about SQL Server functions

As you can see from this result, it might be better to use STDEV (or STDEVP) to gain a more accurate measure of the true variance, as 39.03 is clearly a poor figure to indicate variance in degrees Celsius, since it exceeds the range of values. In statistics, for precisely this reason, standard deviation is the preferred variance measure as the square root of the variance provides a linear relationship instead of an exponential one. It's worth noting that SQL Server exhibits some odd behavior here - although SQRT(VARP(expression)) is logically equivalent to STDEVP(expression), interestingly and I suspect due to rounding, truncation or arithmetic errors to do with SQRT, the two values are NOT identical when computed in SQL Server - they vary by a small amount.

Variance still has its uses in mathematical circles, for example when working out the total variance of uncorrelated variables, and other matters familiar to those versed in discrete probability theory. For example, one may wish to identify statistical outliers in the temperature information given - if we had a hundred different readings and knew that the typical verified variance is approximately 39deg for Moscow, any figure significantly different from this would raise alarm bells. Standard deviation would not do, since the differences are linear and may be missed when statistically aggregating the individual variances as the difference between the expected and actual standard deviations would be too small - whereas the exponential nature of variance, obtained using VARP, would quickly distort the 39deg figure into an unlikely number, rendering it identifiable to the statistical researcher.

SOUNDEX

Moving away from statistics, the SOUNDEX function is an interesting example of a function that exclusively implements a third-party specification, a proprietary algorithm developed and patented privately nearly a hundred years ago. SOUNDEX is a function built by Microsoft to a precise algorithmic specification. The Soundex specification is designed for use in systems where words need to be grouped by phonic sound rather than by spelling - for example, in ancestry and genealogy, the surnames Smith, Smyth and Smithe are spelled differently but could be pronounced the same. It is important for users of expert systems that deal in phonics to be able to recognize these similarities without complex and inefficient rule based systems to slow down the storage and retrieval process.

A good use of Soundex could be to assist in the automatic detection of fraud. Some time ago, I used to work as a database administrator for a small company specializing in the provision of online gambling to US customers. This company would use a number of weighted measures built as stored procedures which would take many different factors, from postcode to IP address to demographic, to make an intelligent decision on whether a new customer signup was likely a duplicate of an existing account, in order to help prevent fraud. Although SOUNDEX wasn't used in this context, there would have been a strong case for inclusion as a fuzzy matching algorithm to determine the similarity of the 'John Smith's to the 'Johann Smythe's.

In summary, Soundex is a specification originally designed in 1918 by Robert C. Russell and Margaret Odell, and has risen to prominence in a number of database specifications as a preferred system for categorizing and relating phonics. For the history of Soundex and for a more detailed description of the variants in use today, see the links at the bottom of the page for more information.

Example of using Soundex:

SELECT SOUNDEX('David'), SOUNDEX('Johnson'), SOUNDEX('Alison')


the SOUNDEX function is an interesting example of a function that exclusively implements a third-party specification

We could implement a SOUNDEX-based system to identify similarly-sounding names and write a stored procedure which will identify customer records based on matching phonics. Here's a brief example of how it could work. Below is the build code to generate a table with customer ID, surname and SOUNDEX code for 1000 common surnames (you can find the list attached to this article as sample_name_data.txt):

CREATE TABLE dbo.SoundexTest (
 CustomerID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
 Surname VARCHAR(20),
 Phonic VARCHAR(10) )
 
CREATE TABLE #surnames ( Surname VARCHAR(20) )
BULK INSERT #surnames FROM N'c:\del\mssqltips_190213_final\sample_name_data.txt' -- replace with your path & file INSERT INTO dbo.SoundexTest (Surname, Phonic) SELECT Surname, SOUNDEX(Surname) FROM #surnames
SELECT * FROM dbo.SoundexTest

the build code to generate a table with customer ID, surname and SOUNDEX code for 1000 common surnames

Now you could write a stored procedure to find names with similar sounds, like so:

CREATE PROCEDURE dbo.checkSimilarNames ( @surname VARCHAR(20) )
AS BEGIN
 DECLARE @phonic VARCHAR(10)
 SET @phonic = ( SELECT TOP 1 Phonic
     FROM dbo.SoundexTest
     WHERE Surname = @surname )
 IF @phonic IS NULL
  RAISERROR('Surname not found!',10,1)
 ELSE
 BEGIN
  SELECT Surname
  FROM dbo.SoundexTest 
  WHERE Phonic = @phonic
 END
END

We test it like this:

EXEC dbo.checkSimilarNames @surname = 'SMITH'

With this result:

write a stored procedure to find names with similar sounds

As you can see, with some development a powerful phonics comparison system can easily be built.

RANKING FUNCTIONS

The final section of this article will deal with ranking functions, which includes the RANK, DENSE_RANK and NTILE functions and in the author's humble opinion, underused in most SQL Server database implementations. As a database administrator, much of my job involves reviewing and correcting code, examining existing SQL schemas, procedures and functions, and improving performance wherever possible. Given the usefulness of RANK, DENSE_RANK and NTILE, I am surprised that (at least in my experience) ranking functions are not used more often.

RANK

I have included the table build code below for your convenience:

CREATE TABLE [dbo].[rankingTable](
 [CustomerID] [int] NOT NULL,
 [CustomerClass] [varchar](6) NULL,
 [TotalDeposited] [money] NULL,
 [TotalWithdrawn] [money] NULL,
 [CurrentBalance] [money] NULL,
UNIQUE NONCLUSTERED 
(
 [CustomerID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

INSERT INTO dbo.RankingTable (CustomerID, CustomerClass, TotalDeposited, TotalWithdrawn, CurrentBalance) SELECT 1270, 'Bronze', 300.00, 0.00, 285.00 UNION ALL SELECT 3243, 'Bronze', 15.00, 0.00, 122.00 UNION ALL SELECT 4083, 'Silver', 100.00, 25.00, 255.00 UNION ALL SELECT 2349, 'Gold', 50.00, 0.00, 47.00 UNION ALL SELECT 9343, 'Silver', 175.00, 50.00, 22.00 UNION ALL SELECT 5434, 'Bronze', 20.00, 0.00, 0.00 UNION ALL SELECT 2343, 'Silver', 1500.00, 500.00, 134.00 UNION ALL SELECT 1194, 'Gold', 0.00, 0.00, 0.00 UNION ALL SELECT 5639, 'Bronze', 40.00, 10.00, 95.00 UNION ALL SELECT 2030, 'Bronze', 120.00, 0.00, 105.00 UNION ALL SELECT 3424, 'Bronze', 20.00, 0.00, 5.00 GO

Consider the following (fictional) horrible and inefficient piece of code to rank customers in a table by £ (or $) amount gambled.

DECLARE cur_ForEachClass CURSOR LOCAL FAST_FORWARD
FOR SELECT DISTINCT CustomerClass
 FROM dbo.rankingTable rt

DECLARE @class VARCHAR(6)
OPEN cur_ForEachClass FETCH NEXT FROM cur_ForEachClass INTO @class WHILE @@FETCH_STATUS = 0 BEGIN INSERT INTO @resultsTable SELECT CustomerID, CustomerClass, TotalDeposited, ROW_NUMBER() OVER ( ORDER BY TotalDeposited DESC ) [Ranking] FROM dbo.rankingTable WHERE @class = CustomerClass FETCH NEXT FROM cur_ForEachClass INTO @class END SELECT * FROM @resultsTable

The final section of this article will deal with ranking functions

As you can see, this is inefficient and bulky. It pre-separates out the source data into the distinct classes, orders the contents of those classes by TotalDeposited, and glues them back together in a results table. The output is identical to this slimline version using RANK():

SELECT  CustomerID, CustomerClass, TotalDeposited,
  RANK() OVER (PARTITION BY CustomerClass ORDER BY TotalDeposited ) AS [Ranking]
FROM  dbo.rankingTable

As you can see, this is easier to read. It doesn't use a cursor and the sorting and ordering are taken care of internally. If you're unused to ranking functions, note that PARTITION BY x means 'partition each set of ranks by distinct x' and ORDER BY x determines the order in which the values must be ranked. So in the example above, PARTITION BY CustomerClass means 'return ranks from 1 to N for each distinct class (gold, silver, bronze)' and ORDER BY TotalDeposited DESC means '... and the ranking should run from largest to smallest.'

DENSE_RANK

DENSE_RANK is the second main ranking function and is similar to RANK except that gaps in the ranking are not allowed. In essence, when there is a tie-break in the conditions for the RANK then the ranking 'skips' one value and continues on. DENSE_RANK does not do this but continues immediately after the tie. This is illustrated by adding another line to our example:

INSERT INTO dbo.rankingTable VALUES ( 3424, 'Bronze', 20.00, 0.00, 5.00 );

This will insert another row with the TotalDeposited value at 20.00, which means it will tie-break with the row with CustomerID 5434 as seen below:

SELECT  CustomerID, CustomerClass, TotalDeposited,
  RANK() OVER (PARTITION BY CustomerClass ORDER BY TotalDeposited ) AS [Ranking]
FROM  dbo.rankingTable

We can avoid this by using DENSE_RANK as follows:

SELECT  CustomerID, CustomerClass, TotalDeposited,
  RANK() OVER (PARTITION BY CustomerClass ORDER BY TotalDeposited ) AS [Ranking]
FROM  dbo.rankingTable

DENSE_RANK

NTILE

This brings us nicely onto NTILE. Think of NTILE as 'N'-TILE where N refers to the 'segments' of the whole, in the same sense of the words QUARTile or PERCENTile. This is how to use it:

SELECT  CustomerID, TotalDeposited, 
 NTILE(4) OVER (ORDER BY TotalDeposited) AS [Quartile]
FROM dbo.rankingTable

This brings us nicely onto NTILE

In the above example, we have taken all values irrespective of player class and returned them in the specified order, arranged into evenly-distributed groups. We may therefore use this data to say, 'Players 1194, 3243 and 5434 are in the top 25% of players who have contributed the most money to our enterprise'. This function is very useful when e.g. computing performance on a bell curve (evenly-distributed curve). The definition according to BOL may make more sense here:

'Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.' - http://msdn.microsoft.com/en-us/library/ms175126.aspx

NTILE can therefore be useful when, for example, creating procedures that execute groups of statements in batch, or when you wish to aggregate financial figures.

Next Steps

I hope this short article has given you inspiration to check out some of the other 'forgotten' functions of SQL Server. Below you'll find a list of links to related resources, including the BOL articles from Microsoft on ranking functions, and other articles covering similar topics. Please feel free to leave comments, good or bad, below, and I'll try to respond as soon as possible.

SUGGESTED READING



sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Derek Colley Dr. Derek Colley is a UK-based DBA and BI Developer with more than a decade of experience working with SQL Server, Oracle and MySQL.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2013-03-29

Comments For This Article




Friday, February 12, 2016 - 12:10:15 PM - arodrigues Back To Top (40670)

Nice tips, very usefull ;)

DENSE_RANK() code example needs update: it shows RANK() instead of DENSE_RANK().

 


Wednesday, January 14, 2015 - 7:31:03 AM - Dave Back To Top (35936)

Sorry to nit-pick but when you show the DENSE_RANK sample code you actually use the same code as the RANK sample. Just thought I'd t you know so you can correct it if you wish.

Thanks for the article! :D

 

Dave


Tuesday, March 4, 2014 - 10:39:16 AM - f ornelas Back To Top (29638)

Just an aside comment on your interpretation of the ntile results. The numbers that you cite according to the table actually show the least amount of deposits. Thus ranking in descent order may have listed the customer IDs with the most deposits. Unless I am missing something here.

 


Tuesday, March 4, 2014 - 10:19:26 AM - f ornelas Back To Top (29636)

 

Nice article, now I can change my code to use varp instead of taking the root of the standard deviation. I used ntile in my code too.


Wednesday, November 13, 2013 - 9:31:52 AM - Steve Hoyer Back To Top (27482)

Nice! Read this Tuesday, used it Wednesday! I had to compare Vendors by ID and Name across two different systems. Of course a straight string compare finds no match when you have "Alda, Allen" and "Alda, Allan", but using SOUNDEX...ding!

Thanks for the reminder.


Monday, August 26, 2013 - 11:42:53 AM - Double Metaphone Back To Top (26485)

I'm going to have to agree 100% with Joe Celko; SOUNDEX is about as close to worthless as you can find in pronunciation based matching - see Jeff Kunkel's example.

Double Metaphone, on the other hand, is quite reasonable.  There's some T-SQL at

http://www.sqlservercentral.com/scripts/Miscellaneous/30219/

which could use some tweaks - I believe it has a 5 character limit, which I personally would increase for longer names/words.

 

James, all pronunciation based algorithms only support one language (and one variant - in this case, U.S. English vs. British English vs. Australian English vs. Canadian English are likely different in subtle or not so subtle ways as well), since you first need to use one alphabet, and then you need one set of pronunciation rules.


Thursday, July 18, 2013 - 2:41:40 PM - Derek Colley Back To Top (25900)

Everyone - thanks for your comments and for pointing out the errors.  If I revisit this article I'll revise it to include your suggestions.  I appreciate your feedback!


Wednesday, July 17, 2013 - 1:46:52 PM - Mike Hayes Back To Top (25882)

I found a couple of issues:

 

1:  When I tried to execute the last INSERT in the article (INSERT INTO dbo.rankingTable VALUES ( 3424, 'Bronze', 20.00, 0.00, 5.00 )),  I got the following error:  

Violation of UNIQUE KEY constraint 'UQ__rankingT__A4AE64B948BAC3E5'. Cannot insert duplicate key in object 'dbo.rankingTable'. The duplicate key value is (3424).

 

2:  The query result displayed below the last NTILE query is incorrect.  The maximum value in the result set for the Quartile column should be 4.

 

Otherwise,  a good article.


Friday, June 14, 2013 - 6:09:32 AM - Tim Parrott Back To Top (25434)

Thanks for this information. I will use RANK and will stop using ROW_NUMBER() and then sorting on it!


Thursday, April 4, 2013 - 10:42:56 PM - James Back To Top (23177)

Does SOUNDEX only support English? Sort of useless in my opinion.

 


Thursday, April 4, 2013 - 9:48:37 AM - Jeff Kunkel Back To Top (23159)

Very good job on the article! I had a friend point this out to me almost 20 years ago. The SOUNDEX function was available back in FoxPro! It is puzzling that SOUNDEX has not been changed in that long...

Have you tried the following? It is funny if you know an ill-tempered lady named Brenda.

SELECT SOUNDEX('Brenda'), SOUNDEX('Broomwitch')

Again, good job on the article!

 


Wednesday, April 3, 2013 - 4:59:06 AM - dioscoredes Back To Top (23131)

Couple of coding niggles - typos probably.  But did I learn from this - well, yes, very much so.  Asa  production DBA the refinements of development TSQL do slip away with disuse so is nice to be hauled back on to the tracks from time to time.


Friday, March 29, 2013 - 2:26:44 PM - Chris W Back To Top (23075)

You have an image of the Temp data below the "(fictional) horrible and inefficient piece of code to rank customers "  and you have an image of Soundex data below this box "

SELECT  CustomerID, CustomerClass, TotalDeposited,
  RANK() OVER (PARTITION BY CustomerClass ORDER BY TotalDeposited ) AS [Ranking]
FROM  dbo.rankingTable

"

 

Otherwise, great article.


Friday, March 29, 2013 - 9:26:35 AM - Joe Celko Back To Top (23074)

Soundex is a horrible legacy, but everyone has it. There several better phonetic matching algorithms and we nee to start using them. The original creators did an improved version later. Metaphone is much better as is NYSIIS. 















get free sql tips
agree to terms