Before we get into the detail of statistical sampling, let’s first discuss what statistics is, why it’s required and what kind of data we can deal with in statistics.
When you’re conducting any kind of analysis, you need data. That data could be the results of a survey, for example. Imagine surveying every student at a UK university: you would need to distribute, collect and analyse 15,000+ surveys. That’s time-consuming, costly and quite impractical. And if some students are off sick, you won’t get a true view of the population unless you wait for them to come back so you can survey them too.
Where it is not possible, or simply impractical, to collect and analyse an entire population’s worth of data, we use sampling. This involves questioning a subset (or sample) of the population and using the sample responses to make assertions about the opinion of the entire population.
Sampling can however open the door to bias and may not give us a representative view of the population. As such, we should follow some of the sampling methods described in this article.
When a numerical summary describes an entire population, we call it a parameter: it’s a fact, based on population data. When it’s based on a sample, we call it a statistic.
Note: N is used to denote the population, while n is used to denote the sample.
In statistics, we have a few key terms surrounding the type of data too:
Quantitative data is numerical (often continuous) data. If you can find the mean of the data, then it’s quantitative. Quantitative data can be interval or ratio.
Interval data has no true zero. For example, year of admission: a year of zero is an arbitrary point on the scale, not an absence of anything. Ratio data, on the other hand, does have a true zero – blood pressure can be zero, but you probably don’t want that…
Qualitative data is also known as categorical or discrete data & you can’t find the mean of it. It could be something like the ‘town of residence’.
Qualitative data can be nominal or ordinal. Ordinal data is data that can be ordered. For example, if you ask a customer to rate their satisfaction from 1 to 5 then those responses can be ordered. Nominal data cannot be ordered, for example country of residence has no logical order.
A few terms…
A sampling frame is a list of everyone from which your sample can be selected.
Undercoverage is where we omit population members. For example, if you ask HR for employee records for everyone in the company, but some of those people are not in the system, you will miss some people out.
Sampling error is expected error in statistics. We know and understand that the population mean will not be equal to the sample mean.
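A quick sketch of sampling error in Python (the population of exam scores here is simulated, not real data): the sample mean differs from the population mean, and that gap is the expected sampling error.

```python
import random
import statistics

random.seed(5)  # fixed seed so the example is reproducible

# Hypothetical population of 10,000 exam scores (simulated values)
population = [random.gauss(65, 10) for _ in range(10_000)]

# Draw a simple random sample of n = 100 from the population of N = 10,000
sample = random.sample(population, 100)

# The gap between these two means is sampling error - expected, not a mistake
print(round(statistics.mean(population), 1))
print(round(statistics.mean(sample), 1))
```

Re-running with different seeds gives different sample means; the population mean, being a parameter, never changes.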
Non-sampling error is bad & we should avoid it at all costs. It can be caused by a bad sampling frame (an outdated list of employees, for example), inaccurate measurement apparatus or sloppy data collection.
Simple random sampling is where every member of the population has an equal chance of being selected for the sample.
If for example, we were going to carry out a survey on all geography students in a university we would need to have a list of all students that study geography. We would then use a random number generator to randomly select X students for the survey.
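A minimal sketch of that procedure in Python’s standard library (the student list is made up for illustration): `random.sample` draws without replacement, so every member has an equal chance of selection.

```python
import random

# Hypothetical sampling frame: all geography students (names assumed)
geography_students = [f"student_{i}" for i in range(1, 501)]  # N = 500

random.seed(42)  # fixed seed so the draw is reproducible
sample = random.sample(geography_students, k=25)  # n = 25, no repeats

print(len(sample))       # 25
print(len(set(sample)))  # 25 - each student selected at most once
```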
Stratified sampling is where we divide the population into layers, or strata. For example, in our geography example, we might categorise students as first, second or third year students.
We then take a simple random sample from each stratum that we’ve defined.
Suppose our university has 10,000 students, of which only 50 are statistics students. If we were to draw a simple random sample, it’s likely that the statistics students would not be well represented.
Another example would be the different years of degree study. It’s commonplace for many people to drop out in years 1 and 2, so there are more first year students than third year students. If we took a simple random sample across the entire student population, we would get a skewed result, as more first year students would respond (there are simply more of them).
By using a stratified sample, we will get good representation from each group.
There is however a risk. In the above, if I were to take a 100 person sample from each year, those 100 students would be a greater percentage of all year 3 students (the smallest group) than of all year 1 students. This is called oversampling.
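A sketch of stratified sampling with proportional allocation (the strata and counts are invented): each stratum contributes to the sample in proportion to its share of the population, which avoids the oversampling problem described above.

```python
import random

# Hypothetical strata: students grouped by year of study (counts assumed)
strata = {
    "year_1": [f"y1_{i}" for i in range(4000)],
    "year_2": [f"y2_{i}" for i in range(3500)],
    "year_3": [f"y3_{i}" for i in range(2500)],
}

random.seed(0)
total_n = 100  # overall sample size we want
population_size = sum(len(members) for members in strata.values())

# Proportional allocation: each stratum's sample size matches its
# share of the population, then we simple-random-sample within it.
sample = []
for name, members in strata.items():
    n_stratum = round(total_n * len(members) / population_size)
    sample.extend(random.sample(members, n_stratum))

print(len(sample))  # 100 (40 + 35 + 25)
```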
Systematic sampling is where we arrange the population in order (e.g. alphabetical, or by time of arrival), we then pick a random individual to start with & then sample every kth individual.
The limitation is that if your data is ordered in a repeating pattern – MALE, FEMALE, MALE, FEMALE… – then you cannot use this sampling method. With an even interval k, you will always return the same sex, so the sample is not representative.
Systematic sampling is however useful for sampling customers as they walk into a store: you don’t need a list of the population up front and can sample customers as they arrive.
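A sketch of systematic sampling in Python (the customer list and interval k are invented): pick a random start within the first interval, then take every kth individual from there.

```python
import random

# Hypothetical frame: customers in order of arrival
customers = [f"customer_{i}" for i in range(1, 201)]  # 200 arrivals

k = 10  # sample every 10th customer
random.seed(1)
start = random.randrange(k)  # random starting point within the first interval

# Take the starting individual, then every kth individual after them
sample = [customers[i] for i in range(start, len(customers), k)]

print(len(sample))  # 20 - one customer from each block of 10
```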
Cluster sampling is where we divide the population into naturally occurring groups, or clusters – often geographic areas. For example, if we suspect that living close to a factory causes cancer, we may use cluster sampling to investigate.
We divide the area up into clusters, randomly select some of them, then analyse and compare the results. Be aware that this type of sampling can be biased towards the type of people who live in the selected areas.
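A sketch of one-stage cluster sampling (the neighbourhoods and household counts are invented): rather than sampling individuals directly, we randomly pick whole clusters and then survey every member of the chosen clusters.

```python
import random

# Hypothetical clusters: households grouped by neighbourhood
clusters = {
    "north": [f"north_house_{i}" for i in range(120)],
    "south": [f"south_house_{i}" for i in range(80)],
    "east":  [f"east_house_{i}" for i in range(100)],
    "west":  [f"west_house_{i}" for i in range(90)],
}

random.seed(7)
# One-stage cluster sampling: randomly select 2 whole clusters...
chosen = random.sample(list(clusters), 2)

# ...then survey every household in the selected clusters
sample = [household for name in chosen for household in clusters[name]]

print(chosen, len(sample))
```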
Multi-stage sampling is where we combine several of the sampling methods above, one stage after another. For example:
We may take a cluster sample of counties;
then take a simple random sample of just 2 counties from our clusters;
then take a simple random sample of some towns within those counties;
and finally, we might take a stratified sample of schools within each town (junior, senior, etc…).
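The stages above can be sketched as nested draws (all counties, towns and schools here are invented; the final stratification step is noted in a comment rather than implemented):

```python
import random

# Hypothetical geography: counties -> towns -> schools (all names assumed)
counties = {
    "county_a": {"town_1": ["junior_1", "senior_1"],
                 "town_2": ["junior_2", "senior_2"]},
    "county_b": {"town_3": ["junior_3", "senior_3"]},
    "county_c": {"town_4": ["junior_4", "senior_4"],
                 "town_5": ["junior_5", "senior_5"]},
}

random.seed(3)
# Stage 1: simple random sample of 2 counties from our clusters
chosen_counties = random.sample(list(counties), 2)

schools = []
for county in chosen_counties:
    # Stage 2: simple random sample of 1 town within each chosen county
    town = random.choice(list(counties[county]))
    # Stage 3: take the schools in the chosen town (we could also
    # stratify here by junior/senior, as described above)
    schools.extend(counties[county][town])

print(chosen_counties, schools)
```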