Variance is a key statistical topic. It tells us how tightly the data clusters around the mean. Consider two classes that both have a mean exam score of 40%: does that mean every student in both classes performed equally well (or badly)? Probably not – one class may have had a few extremely strong performers who pulled the average up for everyone else, while the other class may have all of its scores closely clustered around the mean.
The Range
The range is very simple to calculate: take the maximum value in the dataset and subtract the minimum value. It is also not very useful, because it isn't stable. It is not resistant to small changes in the source data – a single outlier is enough to change the range drastically.
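As a minimal sketch (the dataset here is illustrative), the range in Python is just:

```python
# The range: maximum value minus minimum value
data = [4, 8, 15, 16, 23, 42]
data_range = max(data) - min(data)
print(data_range)  # 38
```

Note how a single outlier (42) dominates the result – remove it and the range drops to 19.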
Variance and Standard Deviation
Variance and standard deviation both measure the spread of the data. It’s about measuring the variability (volatility) from the mean.
The standard deviation is calculated as the square root of the variance. This means the standard deviation is expressed in the same units as the data, while the variance is expressed in squared units.
A distribution with a mean of 10 and a standard deviation of 3 is the same as a distribution with a mean of 10 and a variance of 9, because 3² = 9.
Why is standard deviation useful?
Imagine we have two classes. One scored 10, 10, 10, 30, 30, 30 and the other scored 20, 20, 20, 20, 20, 20. Both classes have a mean of 20, but the second has a standard deviation of zero while the first does not. The standard deviation tells us how spread out the data is.
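We can check this with Python's statistics module (using population statistics, since the scores are the whole class):

```python
import statistics

class_a = [10, 10, 10, 30, 30, 30]
class_b = [20, 20, 20, 20, 20, 20]

print(statistics.mean(class_a), statistics.mean(class_b))  # 20 20 - identical means
print(statistics.pstdev(class_a))  # 10.0 - scores spread around the mean
print(statistics.pstdev(class_b))  # 0.0 - every score equals the mean
```

Identical means, completely different spreads – which is exactly what the mean alone hides.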
What is a z-score?
A z-score is how many sigmas (standard deviations) a datapoint sits above (or below) the mean: z = (x − mean) / standard deviation.
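A quick sketch of the calculation (the values here are illustrative):

```python
# z = (x - mean) / standard deviation
def z_score(x, mean, sd):
    return (x - mean) / sd

print(z_score(115, 100, 5))  # 3.0 - the point sits 3 standard deviations above the mean
print(z_score(90, 100, 5))   # -2.0 - 2 standard deviations below the mean
```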
Formulas for sample distributions
The formulas for sample variance and sample standard deviation are:

s² = Σ(xᵢ − x̄)² / (n − 1)
s = √s²

where x̄ is the sample mean and n is the sample size.
Formulas for populations:
The formulas for population variance and population standard deviation are:

σ² = Σ(xᵢ − μ)² / N
σ = √σ²

where μ is the population mean and N is the population size. Note that the sample formulas divide by n − 1 rather than N.
Let's look at an example.
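A sketch of such an example using Python's statistics module, with an illustrative dataset; note how the sample statistics (divide by n − 1) differ from the population statistics (divide by N):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

# Sample statistics divide by n - 1
print(statistics.variance(data))   # 4.571... (32 / 7)
print(statistics.stdev(data))      # 2.138...

# Population statistics divide by N
print(statistics.pvariance(data))  # 4.0 (32 / 8)
print(statistics.pstdev(data))     # 2.0
```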
Coefficient of variation
The coefficient of variation (or CV for short) shows how much the data varies from the mean, expressed as a percentage. The formula is:

CV = (standard deviation / mean) × 100
Take two datasets: one with a standard deviation of 2.58 and a mean of 4, and another with a standard deviation of 3 and a mean of 6. The higher the output percentage, the more the data deviates from the mean relative to the mean's size. The first dataset has a CV of 64.5% and the second a CV of 50%, so the first has the highest relative deviation even though its standard deviation is smaller.
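The comparison above can be sketched as:

```python
def cv_percent(sd, mean):
    # Coefficient of variation: (standard deviation / mean) * 100
    return sd / mean * 100

print(round(cv_percent(2.58, 4), 1))  # 64.5
print(round(cv_percent(3, 6), 1))     # 50.0
```

Dividing by the mean is what makes the CV useful for comparing datasets measured on different scales.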
Chebyshev's Theorem
Chebyshev's theorem states that at least 1 − 1/k² of a distribution's values are within k standard deviations of the mean.
So, what does that mean? Using the formula, we can determine the minimum percentage of values that will fall within k standard deviations of the mean.
For k = 2, at least 1 − 1/2² = 75% of all values will fall between the mean − 2 sd and the mean + 2 sd.
The percentages of data that will fall within 2, 3 and 4 standard deviations of the mean are at least 75%, 88.9% and 93.75% respectively. As an example, where the mean is 100 and the standard deviation is 5, we can say that at least 75% of values will fall between 90 and 110.
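Those bounds can be computed with a short sketch (mean and standard deviation as in the example above):

```python
# Chebyshev's bound: at least 1 - 1/k^2 of values lie within k sd of the mean
def chebyshev_bound(k):
    return 1 - 1 / k**2

mean, sd = 100, 5
for k in (2, 3, 4):
    lower, upper = mean - k * sd, mean + k * sd
    print(f"k={k}: at least {chebyshev_bound(k):.2%} of values between {lower} and {upper}")
```

This prints 75.00% between 90 and 110, 88.89% between 85 and 115, and 93.75% between 80 and 120.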
Chebyshev's theorem isn't used a whole lot in practice. However, the idea of creating upper and lower limits around the mean is used quite heavily. Note: the theorem works for any distribution, regardless of its shape.