So, today I had a performance issue. I found a piece of SQL that inserted a lot of data in to a table, one row at a time.

For each row, it inserted a random value into the table, then in to another table inserted the mean and standard deviation of all the values up to that point. In T-SQL the way to do this is quite straightforward:

INSERT INTO @value_table VALUES (@value)

INSERT INTO @stats_table VALUES ( SELECT SUM(val) / COUNT(val) AS mean, STDEV(val) as stdev FROM @value_table )

Of course, the problem with code like this is that calculating the statistics iterates over every existing value each time it’s called – and it’s called for every row that’s inserted.

For me, it started to creak after just a few hundred thousand rows – and my machine got very warm.

The way to solve this sort of problem is to calculate the stats incrementally – that is save the current state in variables and just add the next number in each time rather than going and recalculating the whole lot each time.

Incrementally calculating the mean is quite straightforward, we keep a @sum variable to which we add the @value and a @count variable which we increment by 1 each time.

This gives us something like the following:

INSERT INTO @value_table VALUES (@value)

SET @sum += @value

SET @count += 1

INSERT INTO @stats_table VALUES ( SELECT @sum / @count AS mean, STDEV(val) as stdev FROM @value_table )

That deals with the mean, so how to we calculate the standard deviation like this?

Well, first of all, I had to remind myself what the standard deviation *was*. Thanks to Standard Deviation and Variance page on Maths is Fun, I found that the standard deviation is just the square root of the variance.

Which is a good start. It means we can calculate the variance incrementally and then get the square root of that value as the variance.

But what’s a variance? High-school maths was a long time ago, but it is, of course, the sum of the square of differences between the value and the mean.

Well we know the mean for each point, but things are a little tricky since the mean changes for every data point we add in – and we need to calculate the difference between the each datapoint and the *last* mean value, which implies we need to go back to revisit each data point with the new mean each time and recalculate everything.

But there must be some hope, the new mean changes from the old mean in a measurable way, we must be able to take advantage of that to recalculate everything.

Fortunately, it turns out someone who actually knows math has already worked this out, thanks to the Maths Stack Exchange site, someone had already asked and answered that question.

Now, when I went to that link, I got a little scared. I really wasn’t sure what all those numbers actually represented, but if you read it carefully, you can parse it out.

But how do you go about coding this and checking that your calculation is correct?

Well, it took a little fiddling, but I finally managed to pull something together, so I give you – incrementally calculating the Mean and Standard Deviation in T-SQL and checking that the values are correct by also calculating it longhand:

DECLARE @generating TABLE (val REAL)

DECLARE

@vals_Count INT = 0,

@vals_Sum REAL = 0,

@vals_Mean REAL = 0,

@vals_Variance REAL = 0,

@vals_LastMean REAL = 0,

@vals_LastVariance REAL = 0,

@nextVal REAL = 0— Define the values that we’re going to insert here

DECLARE @toinsert TABLE(id int identity(1,1), w REAL)

INSERT INTO @toinsert

SELECT 600 UNION

SELECT 470 UNION

SELECT 170 UNION

SELECT 430 UNION

SELECT 300DECLARE @i INT, @max INT

SELECT @i = 0, @max = max(id) FROM @toinsertWHILE @i < @max

BEGINSET @i += 1

SELECT @nextVal = w FROM @toinsert WHERE id = @i— Add this value to our table

INSERT INTO @generating VALUES(@nextVal)— Save the last mean and variance values

SET @vals_LastMean = @vals_Mean

SET @vals_LastVariance = @vals_Variance— Incrementally calculate the new mean

SET @vals_Count += 1

SET @vals_Sum += @nextVal

SET @vals_Mean = @vals_Sum / @vals_Count— Incrementally calculate the new variance (if you do this when count = 1 you get a divide by zero error, probably because the concept is meaningless)

IF (@vals_Count > 1)

BEGIN

SET @vals_Variance = (((@vals_Count-2) * @vals_LastVariance) + ((@vals_Count-1) * POWER((@vals_LastMean – @vals_Mean), 2)) + (POWER(@nextVal – @vals_Mean, 2))) / (@vals_Count-1)

END

ELSE

BEGIN

SELECT @vals_Variance = 0

END— Select the

SELECT @vals_Count as id, @nextVal as inserted, SUM(val) / COUNT(val) as mean, STDEV(val) as stdev, @vals_Mean as ‘incMean’, SQRT(@vals_Variance) as incStdev FROM @generatingEND

## Leave a Reply