4.3 Probabilistic Sampling

In probabilistic sampling, every individual in the population has a known and non-zero chance of being selected.

Example 8 (continuation of Example 1): When Raheem collected data from students regarding their campus food preferences, he was collecting a sample, since he was not surveying every single student. There were many ways for him to go about collecting this sample. Here are some of the ways he considered:

He could have asked his niece Aaliyah, who is studying philosophy, to hand out surveys to her classmates. In this case, only philosophy students who are in Aaliyah’s class would have a non-zero chance of being selected. Engineering students, for example, would have a zero chance of being selected. Thus, this would be a non-probabilistic sample.
He could have asked one of his friends, John, who is a lecturer in accounting, to hand out surveys to his students. In this case, only accounting students in John’s class would have a non-zero chance of being selected. Philosophy students, for example, would have a zero chance of being selected. Thus, this would be a non-probabilistic sample.
He could have asked other owners of campus restaurants and food outlets to hand out surveys to their customers. In this case, students who buy food from food outlets on campus would be selected. If Raheem got the owners of all of the outlets on campus to hand out surveys, then all students who buy food on campus would have a non-zero chance of being selected. This would be a probabilistic sample, but NOT of the population Raheem is interested in. Recall that he wanted the opinions of students who do not buy food on campus. Those students would have zero chance of being selected.
He could have liaised with university management to ensure surveys were sent out to all students via email. In this case, all students would have had a chance to answer the survey. This would be a probabilistic sample.
He could have asked students to hand out surveys randomly to other students on campus. In this case, all students would, at least in theory, have had a chance to answer the survey. This would be a probabilistic sample.

Example 9: Thabang is a security manager at an airport. In order to reduce airport crime, he wants his staff to search travellers’ luggage. Since all travellers must pass through the security queues, and must also wait in the waiting area at their gate, he could search the luggage of everyone in the security queue, or everyone in the waiting area. However, this is not feasible, as it would take too much time and make people late for their flights. Thus, Thabang knows that he must take a sample of the travellers in the airport. He considers the following options:

Option 1: Select all travellers whose surnames begin with an A, an F or an N.
Option 2: Generate a sequence of non-repeating random numbers, e.g. 9, 24, 18, etc., and select travellers who are 9th, 24th, 18th, etc. in the security queue.
Option 3: Select travellers who are suspiciously in a hurry.
Option 4: Select travellers who have red suitcases.
Option 5: Select every 10th traveller in the security queue.
Option 6: Randomly select travellers from the waiting area at each airport gate.
Option 7: Randomly select waiting areas, and search the luggage of all travellers in that waiting area.

Exercise: Discuss each of Thabang’s proposed ways to sample travellers’ luggage, and comment on whether or not this option would constitute a probabilistic sample.

Figure 4.1: Image attribution: Designed by macrovector / Freepik

4.3.1 Simple Random Sampling

A simple random sample (SRS) is obtained if each element of the population that has not yet been included in the sample has an equal chance of being selected in the next draw.

In Example 9, Option 2 is an example of a simple random sample. Here, every person in the security queue who has not yet been selected has an equal chance of being selected.

Suppose there are 100 people in the queue, i.e. the population size is \(N=100\). Before Thabang generates a random number, each person’s chance of being selected is \[\frac{1}{N}=\frac{1}{100}.\] Now suppose Thabang wants a sample of size \(n=10\). He generates the first random number, 9. The 9th traveller’s luggage is searched, and they are excluded from being searched again. Now, the chance of every other person in the queue being selected is \(\frac{1}{99}.\) Thabang now generates another random number (excluding the number 9), and obtains the number 24. The 24th traveller’s luggage is searched, and they are again excluded from future searches. The chance of every other person in the queue being selected (i.e. everyone except the 9th and 24th travellers) is now \(\frac{1}{98}.\)

This process is repeated until Thabang has sampled as many travellers as he decided on (e.g., 10 travellers).
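
The shrinking denominators in this example can be checked with a short Python sketch. The function name `next_draw_chances` is a hypothetical helper introduced here for illustration; it simply assumes that one traveller is removed from the pool after every draw.

```python
from fractions import Fraction


def next_draw_chances(population_size, draws):
    """Chance that any one not-yet-selected element is picked at each draw.

    Assumes sampling without replacement: the pool shrinks by one per draw.
    """
    return [Fraction(1, population_size - i) for i in range(draws)]


# N = 100 travellers; chances at the first three draws.
print(next_draw_chances(100, 3))
# [Fraction(1, 100), Fraction(1, 99), Fraction(1, 98)]
```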

The procedure to collect a simple random sample is as follows:

Number all \(N\) elements in the population.
Decide on a sample size \(n\).
Select \(n\) random numbers out of the numbers belonging to the population elements.
Select the population elements corresponding to these random numbers.
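
As a minimal sketch of these four steps, assuming the population elements are simply labelled 1 to \(N\), Python's standard `random.sample` can be used, since it draws without replacement and so cannot pick the same traveller twice. The function name `simple_random_sample` and its `seed` parameter are illustrative assumptions, not part of the procedure itself.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw a simple random sample of labels from 1..population_size."""
    rng = random.Random(seed)  # seed is optional; it only makes runs repeatable
    labels = range(1, population_size + 1)  # step 1: number all N elements
    return rng.sample(labels, sample_size)  # steps 3-4: draw n distinct labels


# Step 2: decide on n = 10 out of N = 100 travellers in the queue.
queue_positions = simple_random_sample(100, 10)
print(sorted(queue_positions))
```

The output differs from run to run unless a seed is supplied.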

The procedure to select random numbers is as follows:

Select a random starting point from a table of random numbers.
Divide consecutive single digits into groups, where each group has as many digits as the population size (\(N\)). Write down each number that lies between 1 and \(N\), skipping any number that has already been written down.
Include in the sample the population elements whose numbers match these numbers.
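
The table-based procedure can be sketched in Python as follows. The digit string below is made up for illustration and is not taken from a published table of random numbers. The helper `numbers_from_digit_table` is hypothetical, and it assumes that repeated numbers and numbers outside 1 to \(N\) are skipped.

```python
def numbers_from_digit_table(digits, population_size, sample_size, start=0):
    """Read fixed-width groups of digits and keep usable, distinct numbers."""
    width = len(str(population_size))  # group size = number of digits in N
    chosen = []
    for i in range(start, len(digits) - width + 1, width):
        value = int(digits[i:i + width])
        # Keep only labels that exist in the population and are not repeats.
        if 1 <= value <= population_size and value not in chosen:
            chosen.append(value)
        if len(chosen) == sample_size:
            break
    return chosen


digits = "0918247305516288140392"  # illustrative digits only
print(numbers_from_digit_table(digits, population_size=50, sample_size=5))
# [9, 18, 24, 5, 14]
```

Here \(N=50\) has two digits, so the digits are read in pairs (09, 18, 24, 73, 05, ...), and pairs such as 73 are discarded because they exceed \(N\).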

4.3.2 Systematic Sampling

In a systematic sample, every \(k\)th element of the population is selected, after a random initial element is selected, where \(k=\frac{N}{n}\). Here, every element of the population has a \(\frac{n}{N}=\frac{1}{k}\) chance of being selected.

In the airport security example, Option 5 represents a systematic sample. Suppose there are now \(N=200\) travellers in the security queue, and that Thabang wants a sample of size \(n=20\). In order to take a systematic sample, he will first calculate \(k=\frac{N}{n}=\frac{200}{20}=10.\) He will then select a random number between 1 and \(k=10\), and select the corresponding traveller in the queue. Say the random number is 3. In this case, he will select the 3rd traveller. Thereafter, he will add \(k=10\) to this random number and select the corresponding traveller, i.e. the 13th traveller. He will repeat the process by selecting the 23rd, 33rd, etc. traveller until the 193rd traveller. He will then have his sample of size \(n=20\).

The procedure to collect a systematic sample is as follows:

Number all \(N\) elements in the population.
Decide on a sample size \(n\).
Calculate the ratio \(k=\frac{N}{n}\), also called the sampling interval.
Randomly select a number between 1 and \(k\) to determine the first individual in the sample.
From this starting point, select every \(k\)th individual from the list.
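
The steps above can be sketched in Python. This is a minimal illustration that assumes \(N\) is exactly divisible by \(n\); the function name `systematic_sample` and its `start` and `seed` parameters are assumptions made for the example.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the 1-based positions selected by systematic sampling."""
    # Sampling interval k = N / n (assumed here to be a whole number).
    k = population_size // sample_size
    if start is None:
        # Random starting point between 1 and k, inclusive.
        start = random.Random(seed).randint(1, k)
    # Take the starting element, then every k-th element after it.
    return [start + i * k for i in range(sample_size)]


# N = 200, n = 20, and a starting point of 3.
positions = systematic_sample(200, 20, start=3)
print(positions[:3], positions[-1])  # [3, 13, 23] 193
```

Passing `start=3` fixes the starting point for the illustration; leaving it out lets the function choose the start at random.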

4.3.3 Stratified Sampling

In stratified sampling, the population is divided into subgroups (strata), and a random sample is taken from each subgroup (stratum). In the airport security example, Option 6 constitutes stratified sampling. Suppose there are \(3\) waiting areas in the airport. These waiting areas represent the strata. Suppose Area 1 has \(N_1=150\) travellers, Area 2 has \(N_2=100\) travellers, and Area 3 has \(N_3=50\) travellers currently waiting. Thus, the total population size is \(N=N_1+N_2+N_3=300\). If Thabang wants to take a sample of \(n=30\) travellers, he has two different ways to select the sample size per waiting area.

His first option is called proportional stratified sampling, and involves choosing a sample of travellers from each waiting area such that the sample size for each area is proportional to its size in the population. For each waiting area, the sample size can be calculated as \(n_h=\frac{N_h}{N}\times n, h=1,2,3\). Using this formula, he would select \(n_1=\frac{150}{300}\times 30=15\) travellers from Area 1, \(n_2=\frac{100}{300}\times 30=10\) travellers from Area 2, and \(n_3=\frac{50}{300}\times 30=5\) travellers from Area 3. Note that \(n_1+n_2+n_3=30=n\).
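
The proportional allocation formula can be checked with a short Python sketch. The helper name `proportional_allocation` is hypothetical. Rounding is an assumption added for cases where \(\frac{N_h}{N}\times n\) is not a whole number, in which case the rounded sizes may not add up exactly to \(n\).

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Allocate n_h = (N_h / N) * n to each stratum, rounded to whole people."""
    total = sum(stratum_sizes.values())  # N = N_1 + N_2 + ...
    return {
        name: round(size * sample_size / total)
        for name, size in stratum_sizes.items()
    }


areas = {"Area 1": 150, "Area 2": 100, "Area 3": 50}
allocation = proportional_allocation(areas, 30)
print(allocation)  # {'Area 1': 15, 'Area 2': 10, 'Area 3': 5}
print(sum(allocation.values()))  # 30
```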

His second option is equal stratified sampling, where the same number of individuals is chosen from each stratum, regardless of its size. In this case, \(n_1=n_2=n_3=\frac{n}{3}=\frac{30}{3}=10.\) This kind of sampling is used when it is more important to select the same number of elements from each stratum than to ensure each stratum is represented in proportion to its size. In this example, it would lead to Area 1 being under-represented and Area 3 being over-represented in the sample, relative to their share of the population.
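
One way to see the under- and over-representation is to compare the fraction of each waiting area that ends up in the sample. The sketch below is illustrative only: the helper names `equal_allocation` and `sampling_fractions` are assumptions, and the code assumes \(n\) divides evenly among the strata.

```python
from fractions import Fraction


def equal_allocation(stratum_sizes, sample_size):
    """Give every stratum the same sample size, n / (number of strata)."""
    per_stratum = sample_size // len(stratum_sizes)
    return {name: per_stratum for name in stratum_sizes}


def sampling_fractions(stratum_sizes, allocation):
    """Fraction of each stratum that is sampled, n_h / N_h."""
    return {
        name: Fraction(allocation[name], stratum_sizes[name])
        for name in stratum_sizes
    }


areas = {"Area 1": 150, "Area 2": 100, "Area 3": 50}
allocation = equal_allocation(areas, 30)
print(allocation)  # {'Area 1': 10, 'Area 2': 10, 'Area 3': 10}
print(sampling_fractions(areas, allocation))
# {'Area 1': Fraction(1, 15), 'Area 2': Fraction(1, 10), 'Area 3': Fraction(1, 5)}
```

The overall sampling fraction is \(\frac{30}{300}=\frac{1}{10}\), so Area 1 (\(\frac{1}{15}\)) is sampled less heavily than the population as a whole and Area 3 (\(\frac{1}{5}\)) more heavily.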

The procedure to collect a stratified sample is as follows:

Number all \(N\) elements in the population.
Divide the population into mutually exclusive strata. Each individual should belong to one and only one stratum.
Decide on a sample size \(n\).
Decide whether to use proportional or equal stratified sampling, and consequently calculate the appropriate sample size per stratum.
Select a random sample from each stratum using simple random sampling.
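
A minimal Python sketch of the final step is given below, assuming the strata and the per-stratum sample sizes have already been decided. The function name `stratified_sample`, the traveller labels and the `seed` parameter are all illustrative assumptions.

```python
import random


def stratified_sample(strata, allocation, seed=None):
    """Take a simple random sample of the allocated size from every stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, allocation[name])
        for name, members in strata.items()
    }


# Hypothetical traveller labels for three waiting areas of 150, 100 and 50.
sizes = {"Area 1": 150, "Area 2": 100, "Area 3": 50}
strata = {
    name: [f"{name} traveller {i}" for i in range(1, size + 1)]
    for name, size in sizes.items()
}
allocation = {"Area 1": 15, "Area 2": 10, "Area 3": 5}  # proportional, n = 30

sample = stratified_sample(strata, allocation, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
# {'Area 1': 15, 'Area 2': 10, 'Area 3': 5}
```

Because each traveller label appears in exactly one list, the strata are mutually exclusive, as the procedure requires.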

Stratified sampling is useful when each stratum is homogeneous, i.e. elements within strata are similar, but there are big differences between strata.

Definition of Homogeneous Data: Homogeneous data consists of elements that are similar or even identical, exhibiting little variation.

Examples:

Demographics: All of the Grade 11 girls on the netball team at a school. These learners will have the same gender, the same sport, similar ages, weights and heights.
Environmental data: Measurements of the soil pH of one wetland. The soil pH will not vary so much within one wetland.
Medical data: All of the women in the maternity ward of a hospital in a high-income area, between the ages of 20 and 30. These women will be similar to each other in terms of income and how far along they are in their pregnancies, and will be identical in gender.
Sales data: The sales records of stationery from the stationery shops in Pretoria. The sales records across all stationery shops will be fairly similar in terms of the products sold (pencils, paper, pens, notebooks, etc.) and the periods during which most sales are made (school supplies at the start of a new term, gifts and wrapping paper during the festive season).

4.3.4 Cluster Sampling

In cluster sampling, the population is divided into groups (clusters), similarly to stratified sampling. In stratified sampling, however, individuals are selected from every group, whereas in cluster sampling, entire groups are selected at random. In the airport security example, Option 7 is an example of a cluster sample. The waiting areas represent the clusters. To perform cluster sampling, Thabang would randomly select one or two of the waiting areas. He could then perform either one-stage cluster sampling, in which case he would select all of the individuals in each selected cluster, or two-stage cluster sampling, whereby he would sample random individuals from each selected cluster using simple random sampling. In one-stage cluster sampling, it may not be possible to select a precise sample size, since the size of the selected cluster(s) will determine the size of the sample. In two-stage cluster sampling, the sample size can be enforced more easily. For example, if he wanted a sample of size \(n=20\), and selected Areas 1 and 2, he could randomly select \(10\) individuals from Area 1 and \(10\) individuals from Area 2.

The procedure to collect a cluster sample is as follows:

Number all \(N\) elements in the population.
Divide the population into mutually exclusive clusters. Each individual should belong to one and only one cluster.
Decide on the number of clusters to sample.
Decide whether to use one-stage or two-stage cluster sampling. If two-stage cluster sampling is selected, decide on a sample size \(n\).
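Randomly select the chosen number of clusters. In one-stage cluster sampling, include every individual in the selected clusters; in two-stage cluster sampling, select a random sample from each selected cluster using simple random sampling.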
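
Both variants of cluster sampling can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `cluster_sample`, the `per_cluster` parameter (left as `None` for one-stage sampling) and the traveller labels are all hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Randomly pick whole clusters, then take everyone (one-stage)
    or a simple random sample of per_cluster members (two-stage)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # pick clusters at random
    sample = {}
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample[name] = list(members)  # one-stage: take the whole cluster
        else:
            sample[name] = rng.sample(members, per_cluster)  # two-stage: SRS
    return sample


# Hypothetical traveller labels for three waiting areas of 150, 100 and 50.
sizes = {"Area 1": 150, "Area 2": 100, "Area 3": 50}
clusters = {
    name: [f"{name} traveller {i}" for i in range(1, size + 1)]
    for name, size in sizes.items()
}

one_stage = cluster_sample(clusters, n_clusters=2)
two_stage = cluster_sample(clusters, n_clusters=2, per_cluster=10)
print(sum(len(m) for m in one_stage.values()))  # depends on which areas are drawn
print(sum(len(m) for m in two_stage.values()))  # 20
```

In the one-stage case the total sample size depends on which areas happen to be selected, whereas the two-stage case always yields \(2\times 10=20\) travellers.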

Cluster sampling is useful when each cluster is heterogeneous, i.e. elements within clusters are different from each other, but there are no big differences between clusters.

Definition of Heterogeneous Data: Heterogeneous data consists of elements that are substantially different from each other, exhibiting a considerable amount of variation.

Examples:

Demographics: All of the learners in a school, from Grade 1 to Grade 12. These learners will differ substantially from each other in terms of gender, age, height, weight and the sports they prefer.
Environmental data: Soil pH measured across an entire city that has clay-like, sandy and rocky soil. The soil pH will differ substantially based on where in the city each measurement was taken.
Medical data: All of the patients in the west wing of a hospital that includes maternity wards, oncology, and an emergency room. These patients will differ from each other in terms of their health conditions, age and gender.
Sales data: Sales records of grocery shops across countries in the northern and southern hemispheres. These sales records will differ vastly in terms of the kinds of food and supplies sold, as well as when which kind of food will be sold. For example, hearty, rich food will sell better in December in the northern hemisphere, and in July in the southern hemisphere; some countries will not sell pork or alcohol products at all, whereas those same products will be very popular in other countries; some countries will sell specific foods during certain festivals, etc.

4.3.5 Probabilistic Sampling Summary

No probabilistic sampling method is necessarily always better than another. It is important to select the appropriate sampling method based on the problem you are trying to solve, and the nature of the data. The table below summarises the characteristics of each probabilistic sampling method, and lists some of their advantages and disadvantages.

Table 4.2: Probabilistic sampling summary
Sampling Method	Description	Example	Advantages	Disadvantages
Simple Random Sampling (SRS)	Every individual in the population has an equal chance to be selected.	Randomly selecting travellers in the security queue.	Selection bias is minimised; Easy to understand and implement	Difficult for large populations; Risk of underrepresenting some groups
Systematic Sampling	After a random start, every kth individual is selected.	Choosing every 10th traveller in the security queue.	Easier and quicker than SRS; Ensures even coverage of the population	May not be fully random if there is an underlying pattern in the data (e.g., if people are queueing such that every 10th person has a large suitcase, only people with large suitcases will be selected)
Stratified Sampling	The population is divided into strata, and a random sample is taken from each stratum.	Sampling a proportional number of travellers from each waiting area.	Ensures all groups are represented; Can be more reliable than SRS when strata are very different from each other	Needs a more in-depth understanding of the population to define suitable strata
Cluster Sampling	The population is divided into clusters, and entire clusters are randomly selected. In two-stage cluster sampling, samples are taken from the selected clusters.	Choosing waiting areas at random and then selecting all travellers in each selected area.	Practical and cost-effective compared to SRS and Systematic Sampling; Good to use when clusters are naturally occurring groups, e.g. different waiting areas, or schools, or companies	Clusters may not be representative; Naturally occurring clusters are not guaranteed to be internally heterogeneous and similar to one another