10. More sampling designs: Stratified random sampling and cluster sampling
Simple random sampling is the most important and widely used sampling design in vegetation science. You should be aware, however, of more sophisticated designs, so you can apply them when appropriate. (This chapter is for BOT 540 students.)
Stratified random sampling is useful when you can divide your study area into separate and relatively homogeneous sections. For example, imagine you are interested in estimating the biomass of sword fern in a 20-hectare study area. Within the area, prairie covers 15 hectares and contains just an occasional sword fern, and forest with lots of sword fern covers the remaining 5 hectares. In this example, prairie and forest would be your strata. With stratified random sampling, you split your sampling effort among defined strata and locate your sampling units at random within each stratum. As you'll see shortly, this stratification can increase the precision of your estimate of sword fern biomass.
To reiterate, there are two steps in setting up a stratified random sampling study. First, you divide your study area into relatively homogeneous strata. Second, within each stratum you locate sampling units using simple random sampling. See the material on simple random sampling if you need a review.
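First, apply the familiar formulae for mean and standard deviation to each stratum. The stratum mean is

$$\bar{x}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} x_{hi}$$

where $x_{hi}$ is the value recorded in the $i$th observation of stratum $h$ and $n_h$ is the number of observations in that stratum. The stratum standard deviation is

$$s_h = \sqrt{\frac{\sum_{i=1}^{n_h} (x_{hi} - \bar{x}_h)^2}{n_h - 1}}$$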
The next task is to combine these calculations for each stratum into an overall description of the population. To make the notation simpler, assume that you are sampling per area. (The same principles and calculations hold for sampling per individual, but the notation is a bit different.)
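Let $A$ be the area of the entire study area, and $A_h$ the area of stratum $h$. In the sword fern example, $A$ = 20 ha, the area of the prairie stratum is 15 ha, and the area of the forest stratum is 5 ha. The overall mean is then just the average of the stratum means, weighted by the relative size of each stratum:

$$\bar{x} = \sum_h \frac{A_h}{A} \, \bar{x}_h$$

Likewise, the standard error of the overall mean combines the standard errors for each stratum, weighted by relative stratum size. Because variances rather than standard errors add, the weights and the stratum standard errors are squared before summing:

$$s_{\bar{x}} = \sqrt{\sum_h \left(\frac{A_h}{A}\right)^2 \frac{s_h^2}{n_h}}$$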
Use the overall mean and its standard error to calculate confidence intervals, as usual. To determine the degrees of freedom (df) of the standard error for stratified random sampling, use this approximation:
These formulae for stratified random sampling ignore the finite population correction, which, as you will recall, is safe to do if you are sampling less than 10% of the area of each stratum.
Let's see how this works. Pretend you collected the data on sword fern biomass from the study area depicted in the illustration. You located 12 quadrats at random within the forested areas and 8 quadrats in the prairie areas.
|Stratum||Sword fern aboveground biomass (g/m²)|
|Forest||271, 105, 369, 454, 58, 251, 157, 329, 570, 401, 97, 382|
|Prairie||0, 0, 45, 0, 12, 28, 0, 5|
You then apply the calculations as described above.
|Stratum||Forest||Prairie|
|Stratum area (ha)||5.0||15.0|
|Number of observations||12||8|
|Stratum mean (g/m²)||287.0||11.3|
|Stratum standard deviation (g/m²)||159.1||16.8|
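The overall mean (in units of g/m²) is then

$$\bar{x} = \frac{5}{20}(287.0) + \frac{15}{20}(11.3) = 80.2$$

The standard error of the overall mean (also in units of g/m²) is

$$s_{\bar{x}} = \sqrt{\left(\frac{5}{20}\right)^2 \frac{159.1^2}{12} + \left(\frac{15}{20}\right)^2 \frac{16.8^2}{8}} = 12.3$$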
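The worked example can also be checked with a short script. The following is a minimal Python sketch, not the author's own code. It assumes the usual stratified estimator, in which the variance of the overall mean is the sum over strata of (A_h/A)² s_h²/n_h, and it ignores the finite population correction. The names `biomass`, `area_ha`, and `stratified_estimate` are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Quadrat data (g/m^2) and stratum areas (ha) from the sword fern example.
biomass = {
    "forest": [271, 105, 369, 454, 58, 251, 157, 329, 570, 401, 97, 382],
    "prairie": [0, 0, 45, 0, 12, 28, 0, 5],
}
area_ha = {"forest": 5.0, "prairie": 15.0}


def stratified_estimate(samples, areas):
    """Return (overall mean, standard error of the overall mean).

    Assumption: no finite population correction is applied.
    """
    total_area = sum(areas.values())
    overall_mean = 0.0
    variance = 0.0
    for stratum, values in samples.items():
        weight = areas[stratum] / total_area          # A_h / A
        overall_mean += weight * mean(values)          # weighted stratum mean
        # stdev() is the sample standard deviation (n - 1 in the denominator).
        variance += weight**2 * stdev(values)**2 / len(values)
    return overall_mean, sqrt(variance)


overall_mean, standard_error = stratified_estimate(biomass, area_ha)
print(f"overall mean = {overall_mean:.1f} g/m^2, SE = {standard_error:.1f} g/m^2")
```

Run as written, this reports an overall mean of 80.2 g/m² and a standard error of 12.3 g/m².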
OK, you might be thinking, "What's the big deal?" To estimate the population mean with stratified random sampling, you need not just the values from each quadrat, but also the relative size of each stratum. Stratum size is not a trivial matter to measure, especially when strata are irregular in shape. And the calculations are more complicated than with simple random sampling. Is stratified random sampling worth the hassle?
The advantage of stratified random sampling is that it can significantly increase the precision of your estimates. The greatest increase in precision occurs when the number of sampling units in each stratum is proportional to the contribution of that stratum to the standard error of the overall mean. This is called optimal allocation.
Stop a moment and realize how intuitive this is. Optimal allocation says to put more of your sampling units in strata with higher variability. If a stratum really has no variability, then a single observation would give you all the information you need for that stratum! The extra work of making more than one observation would give you no more information. (Of course, in practice you must make at least a few observations in a stratum in order to estimate its variability.) Likewise, if a stratum has high variability, it would take many observations to get good estimates.
Here is the formula for optimal allocation:

$$n_h = n \, \frac{A_h s_h}{\sum_k A_k s_k}$$

And here is the formula in words: You have $n$ observations to make overall. The number of observations you allocate to each stratum ($n_h$) should be proportional to $A_h s_h$, the contribution of the $h$th stratum to the overall measure of variability.
Notice that in order to determine the optimal allocation of observations, you need both the area of each stratum and the variability within each stratum. You can get estimates of stratum variability from a pilot project or from previous studies in similar vegetation.
The nice thing about stratified random sampling is that your allocation doesn't have to be exactly optimal in order to get an increase in precision. That means that astute guesses about stratum variability might be good enough. In fact, variability in vegetation is often proportional to abundance. (That is, the more of whatever it is you are measuring, the more variable the observations.) This means that allocating observations in proportion to abundance can give you very good results, even if you have no direct information on within-stratum variability.
Let's see how this works in our sword fern example. It is better to allocate more samples per area within the forest, because the variability is higher within this stratum. In fact, you can calculate the optimal allocation of your 20 quadrats to the forest stratum:

$$n_{\text{forest}} = 20 \times \frac{(5)(159.1)}{(5)(159.1) + (15)(16.8)} = 15.2$$

which rounds to 15 quadrats, leaving 5 quadrats to the less variable prairie stratum.
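The allocation rule described in words above (observations in proportion to stratum area times stratum standard deviation) can be sketched in Python as follows. This is an illustrative sketch under the same equal-cost assumption; the function name `optimal_allocation` is hypothetical, and the inputs are the stratum areas and standard deviations from the sword fern example.

```python
def optimal_allocation(total_n, areas, stdevs):
    """Split total_n observations among strata in proportion to A_h * s_h.

    Assumption: the cost per observation is the same in every stratum.
    Returns unrounded values; round them yourself to whole sampling units.
    """
    contributions = {h: areas[h] * stdevs[h] for h in areas}
    total = sum(contributions.values())
    return {h: total_n * c / total for h, c in contributions.items()}


# Stratum areas (ha) and standard deviations (g/m^2) from the sword fern example.
allocation = optimal_allocation(
    20,
    areas={"forest": 5.0, "prairie": 15.0},
    stdevs={"forest": 159.1, "prairie": 16.8},
)
for stratum, n_h in allocation.items():
    print(f"{stratum}: {n_h:.1f} quadrats")
```

The unrounded results are about 15.2 quadrats for the forest and 4.8 for the prairie, which round to 15 and 5.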
So how much better is stratified random sampling than simple random sampling? When there are big differences among strata the improvement can be sizeable. In the sword fern example, the confidence intervals for the overall mean using stratified random sampling would be about 20% tighter compared to using simple random sampling.
Technical note: These formulae for optimal allocation assume that the cost per observation is the same in each stratum, which is usually the case in vegetation science.
Vegetation scientists, being a generally hardy lot, sometimes conduct field studies under some very trying circumstances. For example, subalpine fell-fields support a fascinating vegetation, but the loose rock and steep slopes make movement treacherous. In this extreme case, the cost of moving from one sample point to another can greatly exceed the cost of actually taking measurements. In fact, in fell-fields the cost of going from one location to another includes both added time and the chance of injury!
If you find yourself in similar circumstances, you might want to use cluster sampling. The form of cluster sampling most useful to vegetation scientists is a two-level sampling design. At the first level, primary sampling areas are located at random within an overall study area. At the second level, secondary sampling units are located at random within each primary sampling area. Each group of secondary sampling units is called a cluster.
Look at the illustration to see an example of how this two-level cluster sampling works. The process is to hike in turn to the location of each of the three primary sampling areas, shown as circles. When you arrive at a primary sampling area, you take the time to make measurements within several secondary sampling units chosen at random within the primary sampling area. In the illustration, five quadrats are selected at random locations within the primary sampling circle. Each of these groups of five quadrats constitutes a cluster.
Let's make the discussion a bit simpler by making a few assumptions:
Designate the number of primary sampling areas as $n$ and the number of secondary sampling units per primary sampling area as $m$. $x_{ij}$ is the value recorded in the $j$th secondary sampling unit within the $i$th primary sampling area. In the type of cluster sampling we are considering, the best estimate of the population mean is simply the overall mean:

$$\bar{x} = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij}$$
That is, calculate the mean for each cluster, then take the average of the cluster means.
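The tricky part is calculating $s$, the standard error of the overall mean. First calculate $s_i$, the standard deviation among quadrats within each primary sampling area $i$:

$$s_i = \sqrt{\frac{\sum_{j=1}^{m} (x_{ij} - \bar{x}_i)^2}{m - 1}}$$

where $\bar{x}_i$ is the mean of cluster $i$. Then calculate $s_p$, the standard deviation of the cluster means:

$$s_p = \sqrt{\frac{\sum_{i=1}^{n} (\bar{x}_i - \bar{x})^2}{n - 1}}$$

where $\bar{x}$ is the overall mean. $s_p$ measures the variability between primary sampling areas.
Next calculate $s_s^2$, the mean of the squared standard deviations within clusters:

$$s_s^2 = \frac{1}{n} \sum_{i=1}^{n} s_i^2$$

All this finally allows you to calculate $s$, the standard error of the overall mean, which is a combination of the variability between primary sampling areas and the average variability within primary sampling areas: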
You can use the overall mean and its standard error to calculate confidence intervals, as usual. The degrees of freedom for the standard error from cluster sampling with equal observations per cluster is n(m-1).
Luetkea pectinata (partridge foot) is a low shrub of the rose family found in subalpine fell-fields. Imagine you are interested in the pollination biology of this species and you need to know its overall abundance within your study area, a subalpine fell-field in the Mt. Jefferson Wilderness Area of Oregon. Walking through this fell-field is treacherous, so you have followed a cluster-sampling design, measuring Luetkea cover in five 1-m² quadrats within each of three primary sampling units.
(Photograph courtesy of Brother Alfred Brousseau, St. Mary's College)
Here are your data:
|Primary sampling unit||Luetkea cover (%)|
|A||0, 12, 27, 8, 14|
|B||22, 0, 0, 13, 0|
|C||17, 31, 28, 0, 11|
You then apply the calculations described above.
|Primary sampling unit||A||B||C|
|Number of quadrats||5||5||5|
|Average Luetkea cover (%)||12.2||7.0||17.4|
|Cluster standard deviation (%)||9.9||10.1||12.7|
The overall mean cover (%) is then (12.2 + 7.0 + 17.4) / 3 = 12.2. This is your best estimate of the cover of Luetkea in your study area.
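The cluster summaries for the Luetkea data can be reproduced with a few lines of Python. This is a minimal sketch that computes only the quantities defined in the text: the cluster means, the within-cluster standard deviations, the overall mean, the standard deviation of the cluster means, and the mean of the squared within-cluster standard deviations. It deliberately stops short of the standard error itself. The variable names are illustrative.

```python
from statistics import mean, stdev

# Luetkea cover (%) in five quadrats within each of three primary sampling units.
cover = {
    "A": [0, 12, 27, 8, 14],
    "B": [22, 0, 0, 13, 0],
    "C": [17, 31, 28, 0, 11],
}

# Mean and sample standard deviation (n - 1 denominator) within each cluster.
cluster_means = {unit: mean(values) for unit, values in cover.items()}
cluster_sds = {unit: stdev(values) for unit, values in cover.items()}

# With equal cluster sizes, the overall mean is the average of the cluster means.
overall_mean = mean(list(cluster_means.values()))

# Variability between primary sampling areas: SD of the cluster means.
s_p = stdev(list(cluster_means.values()))

# Average variability within primary sampling areas: mean of squared cluster SDs.
s_s_squared = mean([sd**2 for sd in cluster_sds.values()])

for unit in cover:
    print(f"{unit}: mean = {cluster_means[unit]:.1f}, SD = {cluster_sds[unit]:.1f}")
print(f"overall mean = {overall_mean:.1f}, s_p = {s_p:.1f}, s_s^2 = {s_s_squared:.1f}")
```

The cluster means and standard deviations match the table above, and the standard deviation of the cluster means comes out at 5.2.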
Now calculate s, the standard error of your estimate of the overall mean. The standard deviation of the averages (you're keeping all this straight, I trust) is 5.2. So
How much of your effort should go into traveling from cluster to cluster vs. taking measurements within clusters depends, of course, on how much time it takes to travel between clusters vs. how much time it takes to take measurements. Let $c_p$ be the cost in time to travel to each primary sampling area and let $c_s$ be the cost in time to take measurements in each secondary sampling unit. Then the optimal number of secondary sampling units in each primary sampling area is

$$m_{\text{opt}} = \frac{\sqrt{c_p / c_s}}{\sqrt{s_p^2 / s_s^2 - 1/M}}$$
where M is the total number of secondary sampling units available within each primary sampling unit.
Inspect this equation for a few more moments, to see if it works the way it should. Look first at the numerator. The optimal number of secondary sampling units is lower if it is relatively cheap to move to the next primary sampling area ($c_p$) compared to the time it takes to make measurements ($c_s$). Now look at the denominator. The optimal number of secondary sampling units is lower if variability among primary sampling areas ($s_p$) is relatively high compared to the variability within sampling areas ($s_s$). This is just the behavior you want when deciding how many secondary sampling units you should have.
Let's see how this works with numbers, using the data from the Luetkea example. Say it takes 15 minutes to set up a quadrat and measure Luetkea cover, but it takes a full hour to move over the fell-field to the next cluster. That is, in hours, $c_s$ = 0.25 and $c_p$ = 1.0. So,
This calculation shows that the optimal sampling design is to have four or five quadrats per primary sampling area.
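To experiment with other costs and variabilities, the calculation can be sketched in Python. This sketch assumes the standard textbook form of the optimum for two-stage sampling, m = sqrt(c_p / c_s) / sqrt(s_p² / s_s² - 1/M), which may differ in detail from the equation intended here. The text does not give a value for M, so the two values of M used below (effectively unlimited, and 25 quadrats per primary area) are hypothetical. The function name is illustrative.

```python
from math import inf, sqrt


def optimal_secondary_units(c_p, c_s, s_p, s_s_squared, M=inf):
    """Optimal number of secondary units per primary area (assumed formula).

    c_p, c_s    : cost per primary area and per secondary unit (same time units)
    s_p         : standard deviation of the cluster means
    s_s_squared : mean of the squared within-cluster standard deviations
    M           : secondary units available per primary area (inf = very many)
    """
    variance_ratio = s_p**2 / s_s_squared - 1.0 / M
    if variance_ratio <= 0:
        raise ValueError("between-area variability too small for this formula")
    return sqrt(c_p / c_s) / sqrt(variance_ratio)


# Values from the Luetkea example: s_p = 5.2 and cluster SDs of 9.9, 10.1, 12.7.
s_s_squared = (9.9**2 + 10.1**2 + 12.7**2) / 3

m_large_M = optimal_secondary_units(1.0, 0.25, 5.2, s_s_squared)        # M unlimited
m_small_M = optimal_secondary_units(1.0, 0.25, 5.2, s_s_squared, M=25)  # hypothetical M
print(f"M unlimited: {m_large_M:.1f} quadrats; M = 25: {m_small_M:.1f} quadrats")
```

Under these assumptions both results fall between four and five quadrats per primary sampling area, with smaller M pushing the optimum upward.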
To determine how many primary sampling areas you can afford, just determine the cost per primary sampling area, which includes the cost of moving to the area and the cost of sampling from the secondary sampling units. Divide this cost per area into the total time available to you to get the number of primary sampling areas you can afford.
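The budgeting step is simple arithmetic, sketched here in Python. The costs are those of the Luetkea example (1.0 hour of travel per primary area, 0.25 hour per quadrat); the 40 hours of total field time and the function name are hypothetical, chosen only for illustration.

```python
def affordable_primary_areas(total_time, c_p, c_s, m):
    """Number of whole primary sampling areas that fit within total_time.

    Cost per primary area = travel cost + m secondary units at c_s each.
    """
    cost_per_area = c_p + m * c_s
    return int(total_time // cost_per_area)


# Hypothetical budget: 40 hours of field time, five quadrats per primary area.
n_areas = affordable_primary_areas(total_time=40.0, c_p=1.0, c_s=0.25, m=5)
print(f"primary sampling areas affordable: {n_areas}")
```

With five quadrats per area, each primary sampling area costs 2.25 hours, so a 40-hour budget covers 17 areas.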
Stratified random sampling and cluster sampling are good sampling designs to have in your ecological tool box. In many cases in vegetation science, when your study area is highly stratified or it takes much effort to move from spot to spot, these designs will give you better results: higher precision at lower cost. But don't worry too much if you use a simple random sampling design and later decide that another design would have been better. Your results from simple random sampling will still be valid, just not as efficient.
But watch out for a wrong-headed temptation. Let's say you have completed the field work using a simple random sampling design, and you realize that cluster sampling would have been better. You might be tempted to group adjacent quadrats into after-the-fact clusters and apply the calculations from this chapter. This would be not only technically incorrect but harmful, because your calculated confidence intervals would be wider than if you used the formulae for simple random sampling. The moral is to use the equations that exactly match the design you used to place your sampling units.
Apply your knowledge of simple random sampling, stratified random sampling, and cluster sampling by completing the exercise More sampling designs: Calculations and comparisons.
© 2007 Mark V. Wilson and Oregon State University