Recently I have been doing a bit of profiling. I usually run the program through the time command. Rather than timing it just once, it's better to take the average over many runs. Say I want to time 10 runs of my program; I would run:

for i in $(seq 1 10); do time myprogram; done

After which I want to average the running times. It's fairly straightforward to extract, say, the wall-clock times from the output and pipe them to a program that computes the mean. Unfortunately I couldn't find any Unix utility that just reads numbers from its input and prints their average; in other words, a program that computes the mean on streaming data. Something that would let me do:

for i in $(seq 1 10); do time myprogram; done 2>&1 | grep 'real' | mean

So I had to write one. My first program was pretty straightforward and looked like:

double x, xsum_t = 0;
int t = 0;
// scanf returns the number of items converted, so stop on EOF or bad input
while (scanf("%lf", &x) == 1) {
    xsum_t += x;
    t++;
}
printf("%f\n", xsum_t / t);

But what if I wanted to also display the variance and standard deviation? Remember the variance is given by the formula:

$\sigma^2 = \frac{\sum (X_i - \bar{X})^2}{N}$

Standard deviation is just the square root of the variance:

$\sigma = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}}$

A wasteful way of computing variance and standard deviation would be to maintain a dynamic array which would store each new number as it arrives. Once you encounter EOF, you run through the array and calculate the mean, standard deviation and variance using the above formulas.

At first glance it’s not easy to see (at least for me) how it’s possible to compute the variance for streaming data. But it turns out it’s not really that difficult. As is typical in math, with a little bit of simplification the formula for variance can be rewritten in another way:

$\sigma^2 = \frac{\sum X_i^2}{N} - \bar{X}^2$

The first term in the equation is the mean of the squares of the numbers, and the second term is the square of the mean of the numbers. Now, we already know that computing the mean is pretty straightforward. All we need to do to compute the variance (and consequently the standard deviation) is to additionally keep track of the mean of the squares of the numbers — which is pretty simple.
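This identity follows from expanding the square in the definition of variance and splitting the sum:

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum (X_i - \bar{X})^2 \\
         &= \frac{1}{N}\sum \left(X_i^2 - 2 X_i \bar{X} + \bar{X}^2\right) \\
         &= \frac{\sum X_i^2}{N} - 2\bar{X}\,\frac{\sum X_i}{N} + \bar{X}^2 \\
         &= \frac{\sum X_i^2}{N} - 2\bar{X}^2 + \bar{X}^2
          = \frac{\sum X_i^2}{N} - \bar{X}^2
\end{aligned}
```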

The final program looks like this:

double x, xsum_t = 0, xsqsum_t = 0;
double xbar_t = 0, xsqbar_t = 0, xvar_t = 0, xstddev_t = 0;
int t = 0;
while (scanf("%lf", &x) == 1) {
    t++;
    xsum_t += x;
    xsqsum_t += x * x;
    // mean of the t numbers seen so far
    xbar_t = xsum_t / t;
    // mean of the squares of the t numbers
    xsqbar_t = xsqsum_t / t;
    // variance of the t numbers
    xvar_t = xsqbar_t - xbar_t * xbar_t;
    // standard deviation of the t numbers
    xstddev_t = sqrt(xvar_t);
}
printf("%f %f %f\n", xbar_t, xvar_t, xstddev_t);

Now that it’s so obvious, I feel a lot dumber.