Developer Drain Brain

September 25, 2013

Incrementally calculating Mean, Variance and Standard Deviation in T-SQL

Filed under: Uncategorized — rcomian @ 4:20 pm

4174928680_7c95b42ed1
So, today I had a performance issue. I found a piece of SQL that inserted a lot of data in to a table, one row at a time.
For each row, it inserted a random value into the table, then in to another table inserted the mean and standard deviation of all the values up to that point. In T-SQL the way to do this is quite straightforward:

INSERT INTO @value_table VALUES (@value)
INSERT INTO @stats_table VALUES ( SELECT SUM(val) / COUNT(val) AS mean, STDEV(val) as stdev FROM @value_table )

Of course, the problem with code like this is that calculating the statistics iterates over every existing value each time it’s called – and it’s called for every row that’s inserted.
For me, it started to creak after just a few hundred thousand rows – and my machine got very warm.

The way to solve this sort of problem is to calculate the stats incrementally – that is save the current state in variables and just add the next number in each time rather than going and recalculating the whole lot each time.

Incrementally calculating the mean is quite straightforward, we keep a @sum variable to which we add the @value and a @count variable which we increment by 1 each time.

This gives us something like the following:

INSERT INTO @value_table VALUES (@value)
SET @sum += @value
SET @count += 1
INSERT INTO @stats_table VALUES ( SELECT @sum / @count AS mean, STDEV(val) as stdev FROM @value_table )

That deals with the mean, so how to we calculate the standard deviation like this?
Well, first of all, I had to remind myself what the standard deviation was. Thanks to Standard Deviation and Variance page on Maths is Fun, I found that the standard deviation is just the square root of the variance.
Which is a good start. It means we can calculate the variance incrementally and then get the square root of that value as the variance.
But what’s a variance? High-school maths was a long time ago, but it is, of course, the sum of the square of differences between the value and the mean.
Well we know the mean for each point, but things are a little tricky since the mean changes for every data point we add in – and we need to calculate the difference between the each datapoint and the last mean value, which implies we need to go back to revisit each data point with the new mean each time and recalculate everything.

But there must be some hope, the new mean changes from the old mean in a measurable way, we must be able to take advantage of that to recalculate everything.

Fortunately, it turns out someone who actually knows math has already worked this out, thanks to the Maths Stack Exchange site, someone had already asked and answered that question.

Now, when I went to that link, I got a little scared. I really wasn’t sure what all those numbers actually represented, but if you read it carefully, you can parse it out.

But how do you go about coding this and checking that your calculation is correct?

Well, it took a little fiddling, but I finally managed to pull something together, so I give you – incrementally calculating the Mean and Standard Deviation in T-SQL and checking that the values are correct by also calculating it longhand:

DECLARE @generating TABLE (val REAL)

DECLARE
@vals_Count INT = 0,
@vals_Sum REAL = 0,
@vals_Mean REAL = 0,
@vals_Variance REAL = 0,
@vals_LastMean REAL = 0,
@vals_LastVariance REAL = 0,
@nextVal REAL = 0

— Define the values that we’re going to insert here
DECLARE @toinsert TABLE(id int identity(1,1), w REAL)
INSERT INTO @toinsert (w)
SELECT 600 UNION
SELECT 470 UNION
SELECT 170 UNION
SELECT 430 UNION
SELECT 300

DECLARE @i INT, @max INT
SELECT @i = 0, @max = max(id) FROM @toinsert

WHILE @i < @max
BEGIN

SET @i += 1
SELECT @nextVal = w FROM @toinsert WHERE id = @i

— Add this value to our table
INSERT INTO @generating VALUES(@nextVal)

— Save the last mean and variance values
SET @vals_LastMean = @vals_Mean
SET @vals_LastVariance = @vals_Variance

— Incrementally calculate the new mean
SET @vals_Count += 1
SET @vals_Sum += @nextVal
SET @vals_Mean = @vals_Sum / @vals_Count

— Incrementally calculate the new variance (if you do this when count = 1 you get a divide by zero error, probably because the concept is meaningless)
IF (@vals_Count > 1)
BEGIN
SET @vals_Variance = (((@vals_Count-2) * @vals_LastVariance) + ((@vals_Count-1) * POWER((@vals_LastMean – @vals_Mean), 2)) + (POWER(@nextVal – @vals_Mean, 2))) / (@vals_Count-1)
END
ELSE
BEGIN
SELECT @vals_Variance = 0
END

— Select the
SELECT @vals_Count as id, @nextVal as inserted, SUM(val) / COUNT(val) as mean, STDEV(val) as stdev, @vals_Mean as ‘incMean’, SQRT(@vals_Variance) as incStdev FROM @generating

END

Advertisements

Create a free website or blog at WordPress.com.