Skip to content
June 9, 2011 / Rohit

Streaming Standard Deviation and Variance

Recently I have been doing a bit of profiling. I usually run the program through the time command. Rather than just timing it once it’s better to have the average time over many runs. Say I want to time 10 runs of my program, I would run:

for i in `seq 1 10`; do time myprogram; done

After which I want to average the running time. It’s fairly straightforward to extract, say the wall clock times from the output and pipe it to a program which computes the mean. Unfortunately I couldn’t find any Unix utility which just reads numbers from its input and prints out the average. Or in other words a program which computes the mean on streaming data. Something which would let me do:

for i in `seq 1 10`; do time myprogram; done 2>&1| grep 'real' | mean

So I had to write one. My first program was pretty straightforward and looked like:

while(scanf("%f", &x) != EOF) {
    xsum_t += x;
    t++;
}
printf("%f", xsum_t/t);

But what if I wanted to also display the variance and standard deviation? Remember the variance is given by the formula:

\sigma^2 = \frac{\sum (X_i - \bar{X})}{N}

Standard deviation is just the square root of the variance:

\sigma = \sqrt{\frac{\sum(X_i - \bar{X})}{N}}

A wasteful way of computing variance and standard deviation would be to maintain a dynamic array which would store each new number as it arrives. Once you encounter EOF, you run through the array and calculate the mean, standard deviation and variance using the above formulas.

At first glance it’s not easy to see (at least for me) how it’s possible to compute the variance for streaming data. But turns out it’s not really that difficult. As is typical of math, if you do a little bit of simplification the formula for variance can be written in another way:

\sigma^2 = \frac{\sum X_i^2}{N} - \bar{X}^2

The first term in the equation is the mean of the squares of the number and the second term is the square of the mean of the numbers. Now, we already know that computing mean is pretty straightforward. All we need to do to compute the variance (and consequently standard deviation) is to keep track of the mean of the squares of each number — which is pretty simple.

The final program looks like this:

  int t=1;
  while(scanf("%f", &x) != EOF) {
	  xsum_t += x;
	  xsqsum_t += x*x;
	  // mean of t numbers
	  xbar_t = xsum_t/t;
	  // mean of t squared numbers
	  xsqbar_t = xsqsum_t/t;
	  //variance of t numbers
	  xvar_t = xsqbar_t - square(xbar_t);
	  //standard deviation of t numbers
	  xstddev_t = sqrt(xvar_t);
	  t++;
  }
  printf("%f %f %f\n", xbar_t, xvar_t, xstddev_t);

Now that it’s so obvious, I feel a lot dumber.

Advertisements

2 Comments

Leave a Comment
  1. engineer / Nov 16 2011 2:05 pm

    thanks – helpful formula

  2. free line sticker download / Sep 29 2014 5:17 am

    I feel that is one of the such a llot significant infrormation for me.
    And i am satisfied studying your article. But wanna commentary on some general issues,
    The site taste is great, the articles iis really nice
    : D. Excellent task, cheers

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: