Calculating Median Absolute Deviation with T-SQL Code in SQL Server

By:   |   Updated: 2016-09-01   |   Comments   |   Related: > TSQL


Problem

We're currently analyzing a data set and using median to find the central point of the data, but we think that based on some extreme values, this may not actually be the central point. We think that median is useful, but it may be limited for our data set; is there any way that we can limit the extreme ranges for data points and find the central point after that?

Solution

Median offers a robust measure in statistics when needing to find a precise measuring point or range for a data set, but it may not be helpful if there are extreme high points and low points and one is trying to find a central tendency. In this example, we'll look at historic natural gas future prices since it has historically had short and extreme price increases and one way to look at these data is to filter these out from the median. We'll be using the MAD - or median absolute deviation - as an additional filter to make sure that we calculate an accurate median price after removing these extreme prices. For a related tip using average, we previously looked at removing outliers from a data set (removing them on the basis of average).

If using the latest edition of SQL Server (2016) with R, you can use R to calculate the MAD, since the function is built in (the same is true with Oracle). Additionally, we'll make the same assumptions as the linked tip covering R about a normally distributed data set. This does not mean that you would want to use this with natural gas futures, but that we'll be using this assumption only for the sake of this tip.

First, we need to know the median and there's a great tip covering calculating median. In this tip, I will use a common table expression because I prefer them for readability when troubleshooting or analyzing a data set; however, we can build a function or stored procedure like the other tips mention for re-use. The median:

;WITH GetMedian AS(
 SELECT 
  ROW_NUMBER() OVER (ORDER BY NaturalGasPrice DESC) AS MdnCnt
  , CASE
   WHEN ((SELECT COUNT(*) FROM tblNaturalGas) % 2 = 0) THEN ((SELECT COUNT(*) FROM tblNaturalGas)/2)
   ELSE (((SELECT COUNT(*) FROM tblNaturalGas)/2)+1)
  END AS Median
  , NaturalGasPrice
 FROM tblNaturalGas
)
SELECT 
 NaturalGasPrice
FROM GetMedian
WHERE MdnCnt = Median

---- Result: 3.330000

If I stuck with the result here, I might be tempted to think that the central price for natural gas futures is $3.33; but what if I wanted to exclude those extreme values and then calculate the median? The steps to calculating the MAD are:

  1. Calculate the median value of the data set.
  2. Subtract the median from each value in the data set, returning the absolute value of these subtracted values.
  3. Get the new median from the absolute values of the ordered result.
  4. Multiply the median by the assumed constant relative to the distribution of our data set. In this example, we will use the normal distribution assumption of 1.4826.
  5. Like outliers, I use the assumption here of 3 median absolute deviations and since no price is negative, I exclusively multiply the result in step four by positive three.
  6. Finally, using the value in step five, filter out all values in the full data set above that threshold (since we're only using the positive value) and calculate the median without those values included in the data set.

Let's look at this step-by-step in T-SQL:

----  Variable table used for median of second result set after subtracting each value from overall median
DECLARE @mad TABLE(
 NaturalGasPrice_MAD DECIMAL(22,6)
)
-- STEP ONE
---- Variables for the median price overall, the median of second result set, and the median absolute deviation filter point
DECLARE @medianprice DECIMAL(22,6), @medianmad DECIMAL(22,6), @medianmadfilter DECIMAL(22,6)
---- Get the median price overall
;WITH GetMedian AS(
 SELECT 
  ROW_NUMBER() OVER (ORDER BY NaturalGasPrice DESC) AS MdnCnt
  , CASE
   WHEN ((SELECT COUNT(*) FROM tblNaturalGas) % 2 = 0) THEN ((SELECT COUNT(*) FROM tblNaturalGas)/2)
   ELSE (((SELECT COUNT(*) FROM tblNaturalGas)/2)+1)
  END AS Median
  , NaturalGasPrice
 FROM tblNaturalGas
)
SELECT 
 @medianprice = NaturalGasPrice
FROM GetMedian
WHERE MdnCnt = Median
---- Output that value
SELECT @medianprice AS Median_NaturalGasPrice_NoMADFilter

-- STEP TWO
---- Subtract the median price from each value and take the absolute value of each of these points
INSERT INTO @mad
SELECT 
 ABS(NaturalGasPrice - @medianprice)
FROM tblNaturalGas

-- STEP THREE
---- Get the median of the absolute value from the overall median price subtracted from each data point
;WITH GetMedianForMad AS(
 SELECT 
  ROW_NUMBER() OVER (ORDER BY NaturalGasPrice_MAD DESC) AS MdnCnt
  , CASE
   WHEN ((SELECT COUNT(*) FROM @mad) % 2 = 0) THEN ((SELECT COUNT(*) FROM @mad)/2)
   ELSE (((SELECT COUNT(*) FROM @mad)/2)+1)
  END AS Median
  , NaturalGasPrice_MAD
 FROM @mad
)
SELECT 
 @medianmad = NaturalGasPrice_MAD
FROM GetMedianForMad
WHERE MdnCnt = Median
---- Output what we have so far:
SELECT
 @medianmad AS Median_MAD 
 , (@medianmad*1.4826) AS Median_MAD_NormalDistribution  ---- Assuming normal distribution; b = 1.4826 
 , ((@medianmad*1.4826)*3) AS Median_MAD_UpperRange ---- Like outliers, I generally use 3 here; however, 2.7 may be more appropriate for natural gas futures

-- STEP FOUR and FIVE
 ---- Save the median absolute deviation filter point
SELECT
  @medianmadfilter = ((@medianmad*1.4826)*3)

-- STEP SIX
---- What is the median when we filter out the extreme values based on the MAD?
;WITH GetMedian_FilterMAD AS(
 SELECT 
  ROW_NUMBER() OVER (ORDER BY NaturalGasPrice DESC) AS MdnCnt
  , CASE
   WHEN ((SELECT COUNT(*) FROM tblNaturalGas WHERE NaturalGasPrice <= @medianmadfilter) % 2 = 0) THEN ((SELECT COUNT(*) FROM tblNaturalGas WHERE NaturalGasPrice <= @medianmadfilter)/2)
   ELSE (((SELECT COUNT(*) FROM tblNaturalGas WHERE NaturalGasPrice <= @medianmadfilter)/2)+1)
  END AS Median
  , NaturalGasPrice
 FROM tblNaturalGas
 WHERE NaturalGasPrice <= @medianmadfilter
)
SELECT 
 NaturalGasPrice AS MedianNaturalGasPrice_FilteredByMAD
FROM GetMedian_FilterMAD
WHERE MdnCnt = Median

Result from the above code:

median absolute deviation sql server

If I assume that $3.33 doesn't tell me about the price of natural gas futures based on the data set that I have and I'm aiming to find the most central price, I can look at using the above to calculate the absolute deviation surrounding the median. The main key is what I believe about whether my data set has extreme values, or whether I think it's a good distribution. In most cases, I will calculate the median of both and compare, then look at the data set and re-evaluate the extreme values.

Next Steps
  • Consider an example of median household income in the United States, which is currently $53,600 a year. If we organized all the data without the extremely expensive cities, how different or similar would the medians be?
  • Remember, that median is a precise measure already, so when we consider additional filtering we should be careful about over-analysis.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Tim Smith Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2016-09-01

Comments For This Article

















get free sql tips
agree to terms