6. Simple random sampling in the field
The most common sampling design in vegetation science is simple random sampling. Simple random sampling is a type of probability sampling where each sampling location is equally likely to be selected, and the selection of one location does not influence which is selected next. In statistical terms, the sampling locations are independent and identically distributed.
Consider an example of simple random sampling (SRS) of canopy forest trees. You have determined that there are 24 canopy trees in the sampling universe of interest, and you want to take measurements from a subset of this group of 24, using simple random sampling. One way to do this is to number each tree (1-24), put numbers in a hat, and pick one. The tree corresponding to the number is now part of your sampling subset. Each number (that is, each tree) is equally likely to get picked and picking one number doesn't change the probability that another number will get picked next time.
There are two versions of random sampling: sampling with replacement and sampling without replacement. In the example of tree numbers in a hat, if you return the selected number to the hat, the corresponding tree has another chance to get selected. (And if selected, you repeat your measurements on the tree.) That is sampling with replacement. If instead you discard a number once it is selected (sampling without replacement), a tree can be selected only once. In vegetation science, SRS without replacement is much more common than SRS with replacement.
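The hat can be imitated with a few lines of Python. This is only a sketch: the function name `draw_trees`, the sample size of 6, and the seed are assumptions made for illustration, not part of the course procedure.

```python
import random

def draw_trees(n_trees, sample_size, replacement=False, seed=None):
    """Return tree numbers (1..n_trees) chosen by simple random sampling."""
    rng = random.Random(seed)
    tree_numbers = range(1, n_trees + 1)
    if replacement:
        # The number goes back in the hat, so a tree may be drawn again.
        return rng.choices(tree_numbers, k=sample_size)
    # The number is discarded once drawn, so each tree appears at most once.
    return rng.sample(tree_numbers, sample_size)

without = draw_trees(24, 6, replacement=False, seed=1)
with_repl = draw_trees(24, 6, replacement=True, seed=1)
print("without replacement:", without)
print("with replacement:   ", with_repl)
```

Without replacement, the six numbers are always distinct. With replacement, repeats are possible but not guaranteed in any one draw.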
Picking numbers out of a hat is perfectly valid, if done correctly, but there are better ways to select random numbers. Even if you are familiar with using random number tables and random number generators in calculators, review the section of the course called How to use random number tables and generators.
The general procedures for any simple random sampling study in vegetation science are about the same. First, as has been emphasized in the course, you must determine your ecological objectives. For example, you might wish to know the stand basal area of a community. Then decide on the sampling scheme, such as sampling by individual or sampling by area, as with quadrats. Then pick individuals or locations at random (this is the simple random sampling part), and take your measurements. Finally, you use the data you collected to make inferences about the whole sampling universe, coming up with statements like "the stand basal area is 49 m2/ha."
In this section of the course you will learn the procedures for locating random samples in the field and the formulas for analyzing data collected from simple random samples.
The first step is to number all the individuals in your sampling universe. In simple random sampling, each of these individuals has an equal chance of being selected. This step is a lot trickier than it might seem. For one thing, you must use an unambiguous definition of what constitutes an individual. Plants that spread vegetatively are notoriously difficult to separate into individuals. You must also enumerate all individuals in your sampling universe; if you don't, you violate the "equal chance of being selected" tenet of simple random sampling. Perhaps the most common use of sampling by individuals is with mature trees, where separate trunks define individuals and it is feasible to number all individuals. Sampling rhizomatous grasses, mosses, and much of the rest of the plant world by individual is usually not feasible.
Once your individuals are numbered, the next step is to select among those numbers at random, using a random number table or random number generator. You will make your measurements on the group of selected individuals.
It can be inefficient to pick a random number, take measurements on that individual, pick another random number, take measurements on that individual, and so forth. Much better is to pick the numbers for all the individuals to be measured ahead of time. Then you can plot a short path that visits each selected individual, and save yourself a lot of time.
The figure on the right shows how this works. You have selected four trees at random. Don't go to the first tree you selected (marked as 1), make measurements, then traipse to the second tree. Rather, pick an efficient path, as from tree 3 to tree 1 to tree 4 to tree 2.
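If you have rough map positions for the selected trees, a simple rule of thumb can suggest a visiting order: from wherever you are, walk to the nearest tree you haven't measured yet. The sketch below assumes hypothetical tree positions and a starting corner at (0, 0). The greedy rule usually gives a short path, but not necessarily the shortest one.

```python
import math

def visiting_order(start, trees):
    """Order trees by repeatedly walking to the nearest unvisited one.

    start: (x, y) where the crew begins; trees: {label: (x, y)} in meters.
    """
    remaining = dict(trees)
    here = start
    order = []
    while remaining:
        label = min(remaining, key=lambda k: math.dist(here, remaining[k]))
        here = remaining.pop(label)
        order.append(label)
    return order

# Hypothetical positions (m) for four trees, labeled in the order selected.
trees = {1: (20, 30), 2: (80, 10), 3: (5, 5), 4: (50, 40)}
order = visiting_order((0, 0), trees)
print(order)  # [3, 1, 4, 2]
```

The order in which you visit the trees has no effect on the randomness of the sample, because the selection was already made.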
As mentioned earlier, the process of selecting random individuals requires an enumeration of all individuals. This enumeration can be an exhausting task. It might be tempting to use other techniques for the random selection of individuals. One of the most tempting shortcuts is to use the coordinate system to find a random location, then select the nearest individual for measurement. Although this sounds good, it is technically invalid and can produce bad data.
Look at the diagram to see how this approach can go wrong. The illustration uses trees, but the principle holds for most any kind of plant. X marks the spot of a point selected at random; tree A is the nearest tree to this point (see left diagram). The problem with this approach is that plants are seldom uniformly distributed throughout vegetation. In the illustration, the trees are distributed in clumps. Look at the diagram on the right. The irregular polygons show all the points that are closest to the enclosed tree. That is, any random point that lands within the polygon results in that tree being selected. The polygon for tree A is much larger than the polygon for tree B, meaning that tree A is more likely to get selected than tree B. This violates the basic assumption of simple random sampling! Whenever plants are distributed in a non-uniform pattern, isolated individuals are more likely to be selected.
In the illustration, using this flawed technique for selecting trees would produce misleading data. Because of crowding, trees within the clumps tend to be stunted and trees on the edge of clumps larger. Taking measurements from trees that were selected because they are closest to random points will strongly overestimate tree size, because you are more likely to select the larger trees on the edges of clumps.
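A small simulation makes the unequal selection chances visible. The layout below is hypothetical (one isolated tree and a tight clump of three in a 100-m by 100-m area), and the function name and seed are assumptions for illustration. Under true simple random sampling, each of the four trees would be picked about one time in four.

```python
import math
import random

def nearest_tree_counts(trees, width, height, n_points, seed=0):
    """Count how often each tree is the nearest one to a random point."""
    rng = random.Random(seed)
    counts = {label: 0 for label in trees}
    for _ in range(n_points):
        point = (rng.uniform(0, width), rng.uniform(0, height))
        nearest = min(trees, key=lambda k: math.dist(point, trees[k]))
        counts[nearest] += 1
    return counts

# Hypothetical layout: "A" stands alone; "B", "C", and "D" form a clump.
trees = {"A": (25, 50), "B": (80, 50), "C": (82, 52), "D": (82, 48)}
counts = nearest_tree_counts(trees, 100, 100, n_points=10_000)
for label, count in counts.items():
    print(label, count / 10_000)
```

In this layout the isolated tree owns roughly half of the study area, so it is selected about twice as often as it should be, while the clumped trees share what is left.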
The coordinate system is easier to explain if we assume that your study area (your sampling universe) is a rectangular tract of vegetation. Later, you'll learn how to relax this requirement. So let's say your study area is 100 m by 60 m, and you want to sample with quadrats selected at random from this area.
Every point in this 100-m by 60-m rectangle corresponds to a pair of Cartesian coordinates. Call the 100-m side the X axis, and the 60-m side the Y axis. By picking a pair of random numbers, one between 0 and 100 and the other between 0 and 60, you are picking a random location within your study area. The figure shows where your quadrat would be located if you picked as your random pair of numbers X = 60.7 and Y = 36.2.
OK, but finding your quadrat in the field is not as easy as finding it on a diagram. The most efficient process is to create one axis of this coordinate system by placing a meter tape along one side of the study area, with the zero end of the tape at one corner. To locate your plot, go to the point on this axis corresponding to the first number in your random number pair. Then run a second tape out at right angles for a distance corresponding to the second number in your random number pair. To see this process in action, click here. (The coordinates have been rounded in this animation; do not round in the field.)
Repeat this process for each quadrat location. As usual, it is more efficient to select the series of random numbers first, even in the lab well before going to the field. That way you can rearrange the sequence of quadrats into an efficient order.
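Here is one way to pick all the coordinate pairs ahead of time, as a sketch. The function name, the default resolution of 0.01 m, and the choice to sort by X (so the crew can work along the baseline tape in one pass) are assumptions for illustration.

```python
import random

def random_quadrat_locations(x_length, y_length, n, resolution=0.01, seed=None):
    """Pick n random (x, y) pairs, in meters, at the given tape resolution.

    Assumes the resolution is 0.01 m or coarser.
    """
    rng = random.Random(seed)
    x_steps = round(x_length / resolution)  # tape increments along X
    y_steps = round(y_length / resolution)  # tape increments along Y
    locations = []
    for _ in range(n):
        x = rng.randint(0, x_steps) * resolution
        y = rng.randint(0, y_steps) * resolution
        locations.append((round(x, 2), round(y, 2)))
    # Sorting changes the visiting order only, not which locations were drawn.
    return sorted(locations)

locations = random_quadrat_locations(100, 60, n=10, seed=3)
for x, y in locations:
    print(f"X = {x:6.2f} m, Y = {y:5.2f} m")
```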
Once you have your random location for the quadrat, you need a system for actually placing the quadrat on the ground. You want a system that doesn't harm the vegetation and a system that is statistically valid. See the section on 'Hints for dealing with reality' for my advice.
An important note about resolution
The axes in the coordinate system represent continuous numbers from 0 to the end of the axis. When picking random numbers, however, you have to determine how many digits of resolution to use in locating your quadrats. The example in the animation used a resolution of whole meters (no decimal digits). That means that quadrats could not be located at 61.4 m or 60.9 m.
What resolution is acceptable? Use a resolution that is at least as fine as your quadrat size. For example, if you are using a 0.5 m by 0.5 m quadrat, use a resolution of 0.5 m or finer. If you use a 1-m resolution with that quadrat, as in the example, then 3/4 of your study area will not be available for sampling! (Do you see why?) Because most quadrats in vegetation science are in the range of 0.2 m to 1.0 m on a side, I recommend using a resolution of 0.01 m or sometimes 0.1 m.
Using a resolution of 0.01 m instead of a resolution of 1 m takes no more work with the random number table, except that you read two additional digits. It also takes no more work in the field. If you are using a standard tape in herbs and low shrubs, measuring to the nearest centimeter is just as fast as measuring to the nearest decimeter (0.1 m) or meter. You still find your position along the tape in the same way. So go with the finest resolution on your measuring device feasible in the vegetation you are studying. That's usually 0.01 m.
In the grid system, you divide up your study area into non-overlapping quadrat-sized rectangles. See the figure for what this looks like. These rectangles make up a grid for your study area. Do this on paper, not on the ground! Each rectangle of the resulting grid is a potential location for a quadrat. Number all the grid rectangles. Pick your quadrat locations by selecting from their numbers at random.
To actually find these quadrat locations in the field, use the procedure described for the coordinate system.
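The grid system can be sketched the same way. The function name and the numbering scheme (row by row, starting at 0 in the corner where both tapes read zero) are assumptions for illustration. The sketch also assumes that each side of the study area is a whole multiple of the quadrat side.

```python
import random

def random_grid_cells(x_length, y_length, quadrat_x, quadrat_y, n, seed=None):
    """Pick n grid cells without replacement.

    Returns (cell number, x of the cell's corner, y of the cell's corner),
    where the corner is the one nearest the origin of the coordinate system.
    """
    rng = random.Random(seed)
    n_cols = round(x_length / quadrat_x)
    n_rows = round(y_length / quadrat_y)
    cell_numbers = rng.sample(range(n_cols * n_rows), n)
    cells = []
    for number in cell_numbers:
        row, col = divmod(number, n_cols)
        cells.append((number, col * quadrat_x, row * quadrat_y))
    return cells

cells = random_grid_cells(100, 60, 1, 1, n=10, seed=5)
for number, x, y in cells:
    print(f"cell {number}: corner at X = {x} m, Y = {y} m")
```

Because the cells do not overlap and each is drawn at most once, the grid system never produces overlapping quadrats.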
(Now is a good time to visit How to use random number tables and generators, if you haven't already. This section explains some nuances about using random numbers in the coordinate and grid systems.)
Many studies in vegetation science do not have the luxury of rectangular study areas. You can still use the coordinate system, but there is some extra work involved. Basically, you pick random coordinates as before but discard any locations that fall outside your sampling universe. This process is a lot easier if you have a map of the area boundary so you can select random locations in the lab.
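If your boundary map can be written down as a list of corner coordinates, the pick-and-discard procedure can be sketched as below. The L-shaped boundary, the function names, and the use of a standard ray-casting test to decide whether a point is inside the boundary are all assumptions for illustration.

```python
import random

def inside(point, polygon):
    """Ray-casting test: is the point inside the polygon of (x, y) corners?"""
    x, y = point
    hit = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # this edge spans the point's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                hit = not hit
    return hit

def random_points_in_area(polygon, n, seed=None):
    """Pick points in the bounding rectangle; keep only those inside."""
    rng = random.Random(seed)
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < n:
        candidate = (rng.uniform(min(xs), max(xs)),
                     rng.uniform(min(ys), max(ys)))
        if inside(candidate, polygon):  # discard anything outside
            points.append(candidate)
    return points

# Hypothetical L-shaped study area, corner coordinates in meters.
boundary = [(0, 0), (100, 0), (100, 30), (40, 30), (40, 60), (0, 60)]
points = random_points_in_area(boundary, 5, seed=2)
```

Discarding the outside points does not favor any part of the study area, so every location inside the boundary remains equally likely.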
The grid system for selecting sample locations does not work well for non-rectangular study areas because the study area usually cannot be broken up into equal-sized rectangles.
The Global Positioning System (GPS), coupled with Geographical Information Systems (GIS), provides an efficient way to locate points in the field. Modern, affordable GPS units can take you to a defined location within 2-5 meters. For locating sites or for locating large sampling plots, GPS can save a lot of effort. For intensive sampling with quadrats less than 200 m2 in area, GPS is usually too coarse and you need to stick with measuring tapes. A good procedure is to use a GPS unit to establish the boundaries of your study area, then use tape and stakes to locate sampling quadrats.
Tips on using GPS in vegetation science:
Sometimes a procedure that sounds straightforward gets tricky in the application. This section presents some hints on dealing with the details of locating your samples.
If your next random sample location is filled with gopher holes or tire ruts, what do you do? If a deer trail runs through it or if the field crew had lunch at that spot, what do you do? There are two important questions involved. First, what is the cause of the damage? Second, what is your sampling universe? If the damage was caused by the process of sampling, skip that location and select another spot (using your procedure for randomization). If the damage was by another agent, like gophers, you then need to decide if locations disturbed by gophers are part of your sampling universe. It is legitimate to exclude damaged locations from your study, but only if you exclude them from your inferences. To see why this is important, think back to our familiar sword-fern example. Originally, the objective was to estimate sword-fern production in the entire tract. If you decide to exclude from your sample locations damaged by skid trails and gravel pits, you must state explicitly that your inferences are only for parts of the forest undamaged by skid trails and gravel pits. After all, if 50% of your forest is damaged by skid trails and gravel pits, extrapolating your measurements from pristine samples to the whole forest would be misleading and wrong.
It is unavoidable. You have to walk through your study area as you establish its boundaries, as you find your sampling locations, and as you shift from side to side as you collect data. If a plot ends up where your boots have ripped up the vegetation, what do you do? (See the previous paragraph.) Best minimize the damage that you and your crew-mates inflict on your study area. Walk on animal trails when you can. Know where your future plots will be, so you can avoid walking through those locations. Eat your lunch outside the study area.
When using the coordinate system, you need to decide if the selected coordinates designate the center of the quadrat or one of its corners. You also need to pick a plot orientation. Just pick a system (like "put the plot center at the selected coordinate and orient the long dimension of the quadrat north to south") and stick with it. The point of the system is to eliminate any subconscious bias in placing the quadrat frame. For example, in my experience, folks tend to move the frame away from poison oak but toward pretty flowers! Having a system protects your data from your subconscious biases.
The coordinate system has problems along the boundaries. Let's say you are using coordinates to locate the center of a 1-m by 1-m quadrat. Then a coordinate of 0.3 m would place part of the quadrat outside the sampling area. In this system, any coordinate value less than half the length of the quadrat will put part of the quadrat outside the study area. (The same for coordinates near the end.) This is no good! You could just skip quadrat locations that extend beyond the edge of the study area. Or you could shrink the size of the quadrat, cutting it off at the boundary of the study area. Or you could fugedaboudit and ignore any quadrat that extends beyond your study area. None of these solutions is completely satisfactory because in different ways they violate the rules of simple random sampling. But the techniques for dealing with this problem correctly are much too difficult to use in the field. What to do?! I recommend that you skip quadrat locations that put the quadrat beyond the edge of the study area. Realize that this isn't quite legitimate, but it is probably the least bad. Besides, in practice, quadrats are much smaller than the study area and these difficulties with boundaries have little impact. But I thought you should know.
Sometimes the selection of random locations leads to quadrats that overlap each other. This is statistically acceptable and goes by the technical name of "sampling with replacement." But overlapping quadrats are hardly ever used in vegetation science. For one thing, the vegetation around the previous quadrat is usually disturbed by the process of sampling. The second, overlapping quadrat would then be damaged and give false data. (See above.) The standard procedure in vegetation science is to drop any random locations that would produce an overlap with a previous quadrat.
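The two skipping rules (skip a location if the quadrat would cross the edge of the study area, and drop it if the quadrat would overlap an earlier one) can be applied while you pick locations in the lab. This sketch assumes square quadrats, coordinates that mark the quadrat center, and quadrat sides parallel to the axes. The function name and the `max_tries` safety limit are hypothetical.

```python
import random

def place_quadrats(x_length, y_length, side, n, seed=None, max_tries=10_000):
    """Pick up to n quadrat centers, skipping edge-crossers and overlaps."""
    rng = random.Random(seed)
    half = side / 2
    centers = []
    for _ in range(max_tries):
        if len(centers) == n:
            break
        x = round(rng.uniform(0, x_length), 2)
        y = round(rng.uniform(0, y_length), 2)
        crosses_edge = (x < half or x > x_length - half or
                        y < half or y > y_length - half)
        # Equal-sized, axis-aligned squares overlap when their centers are
        # less than one side length apart in both X and Y.
        overlaps = any(abs(x - cx) < side and abs(y - cy) < side
                       for cx, cy in centers)
        if not crosses_edge and not overlaps:
            centers.append((x, y))
    return centers

centers = place_quadrats(100, 60, side=1.0, n=20, seed=4)
```

With 1-m quadrats in a 100-m by 60-m area, very few candidate locations get skipped, which is why these boundary and overlap problems have little impact in practice.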
An important purpose of these guidelines for locating samples is to take the process out of our subjective hands and into an objective set of procedures. So it is important to follow the objective procedure precisely. But it is also important to recognize which parts of your procedures are crucial for maintaining objective, representative, and independent observations -- and which parts are not. Imagine yourself at the end of a hard morning of sampling, when you discover that all your quadrat locations are off by half a meter because the tape establishing one Cartesian axis wasn't pulled quite tight enough. Do you throw away your data from the morning and start over? Not if you're on my crew! As long as the mistake didn't push a location outside your study boundary, everything is OK. The mistake was unintentional, so it couldn't impose a subjective choice on the location of quadrats. The locations are still random and independent of each other. Therefore the data collected from those locations are completely valid. Note the corrected locations, and get ready for the afternoon.
The process of locating lines involves selecting a starting point and a direction. The coordinate system described for locating quadrats also works for locating random starting points. You can then pick a random direction by, for example, picking a random number between 1 and 360 and going in that compass direction. This system has two problems with the boundaries of the study area that are similar to the problems with locating quadrats. The issue is more severe, though, because lines are long and are more likely to extend beyond the edge of the study area.
If you have a rectangular study area, there is a better way to locate lines. Say you have a 50 m by 100 m study area, and you want to locate 8 lines that are 25 m long. Picking a random starting point along the 100-m axis of the study area, and then picking left or right at random (as by flipping a coin), is an efficient way to find line locations. This is valid simple random sampling, because every part of the study area is equally likely to be sampled and the location of one line does not affect the location of any other line.
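The line-location procedure for a rectangular area is easy to sketch. The function name is hypothetical. The sketch also assumes the 100-m axis runs down the middle of the 50-m width, so that a 25-m line to either side stays inside the study area.

```python
import random

def random_lines(axis_length, n_lines, seed=None):
    """Return (start position along the axis in m, side) for each line."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_lines):
        start = round(rng.uniform(0, axis_length), 2)
        side = rng.choice(["left", "right"])  # the coin flip
        lines.append((start, side))
    # Sorted by position so the crew can walk the axis once.
    return sorted(lines)

lines = random_lines(100, 8, seed=6)
for start, side in lines:
    print(f"start at {start:6.2f} m, run 25 m to the {side}")
```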
With simple random sampling without replacement, the best estimate of the population mean ($\mu$) is usually the sample mean, the mean of your n measurements:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
The best estimate of the population variability is usually the standard deviation of your data:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
There are separate formulas for $\bar{x}$ and $s^2$ for other sampling designs, like stratified random sampling and cluster sampling. Refer to the course references for details.
Be sure to keep in mind your scientific objective: You want to make statements about the population mean and about your confidence in that mean. That is, you need to know the variability of your estimate of the mean, not the variability of the data. Lucky for us, statistical theory provides a way to convert from describing data to describing the behavior of your estimates of the mean:
$$s_{\bar{x}} = \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N}}$$
where n is the size of your sample, N is the size of the entire population, and $s_{\bar{x}}$ is the amount you expect your estimates of the mean to vary. $s_{\bar{x}}$ is often called the standard error.
But what about the factor on the right in the equation? This factor is called the finite population correction, or fpc. It is necessary because statistical distributions describe infinite populations, but sampling is from a carefully delimited (finite) population. (Reminder: The step of defining your study area / statistical population / sampling universe is the step that makes the sampling population finite.) You can see the effect of sample size on fpc at the extremes. When N is very large and n very small, fpc approaches 1 and the formula reduces to that of the familiar standard error. When n = N, fpc = 0, which makes the estimate of variability = 0! But this makes sense because you have measured every member of the population and you now have a census, not a sample. Because you know the whole population, you know the mean exactly and there is no sampling error.
Most studies in vegetation science ignore the finite population correction. Although technically incorrect, in practice it has little effect because sampling intensity in vegetation science is typically very low. For example, the sampling intensity of a study using 20 1-m2 quadrats per hectare is only 20/10000, so the fpc is
$$\sqrt{\frac{N - n}{N}} = \sqrt{\frac{10000 - 20}{10000}} = \sqrt{0.998} \approx 0.999$$
which is very close to 1.0. For the rest of the course, we will follow this grand tradition and usually not bother with the finite population correction factor unless sampling intensity goes above 10%.
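To check the arithmetic yourself, here is a sketch of the standard error with an optional finite population correction. It assumes the fpc takes the common form sqrt(1 - n/N), which equals 0 when n = N and approaches 1 when n is small relative to N. The function name and the sword-fern numbers are made up for illustration.

```python
import math
import statistics

def standard_error(data, population_size=None):
    """Standard error of the mean; applies the fpc if N is supplied."""
    n = len(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    se = s / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt(1 - n / population_size)  # assumed fpc form
    return se

# The example from the text: 20 quadrats of 1 m2 in one hectare.
fpc = math.sqrt(1 - 20 / 10_000)
print(round(fpc, 3))  # 0.999

# Hypothetical biomass values (g) from 8 quadrats out of N = 40 possible.
biomass = [31, 45, 28, 52, 39, 44, 36, 41]
print(standard_error(biomass))                      # fpc ignored
print(standard_error(biomass, population_size=40))  # fpc applied
```

With 8 of 40 possible quadrats sampled (20% intensity), the corrected standard error is noticeably smaller than the uncorrected one.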
The next step is to convert your estimates of the population mean and its variability into confidence intervals.
The statistical formula for the confidence interval with simple random sampling is the same as the standard formula (see the Statistical Background chapter and the Confidence Interval primer):
$$\bar{x} \pm t \, s_{\bar{x}}$$
As before, $\bar{x}$ is usually the best estimate of the population mean; t, the t-statistic, reflects both the number of samples and the level of confidence you have set (like 90%); and $s_{\bar{x}}$, the standard error, reflects the variability in the data.
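A confidence interval of the form mean ± t × standard error can be computed as sketched below. The cover values are invented, the function name is hypothetical, and the t-value is passed in by hand from a t-table (1.833 is the two-sided 90% value for 9 degrees of freedom). The sketch ignores the finite population correction, following the usual practice in vegetation science.

```python
import math
import statistics

def confidence_interval(data, t_value):
    """Return (lower, upper) limits of mean +/- t * standard error."""
    mean = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))
    return mean - t_value * se, mean + t_value * se

# Hypothetical percent cover from 10 quadrats (n - 1 = 9 degrees of freedom).
cover = [12, 18, 9, 22, 15, 11, 19, 14, 16, 14]
low, high = confidence_interval(cover, t_value=1.833)
print(f"90% confidence interval: {low:.1f} to {high:.1f} percent cover")
```

Collecting more quadrats narrows the interval in two ways: the standard error shrinks as n grows, and the tabled t-value gets smaller as the degrees of freedom increase.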
Before you use your carefully calculated values of central tendency and variability, pause a while to reflect on what contributes to the variability you measure. If your technique of vegetation measurement varied from one time to another (and you know it did), then this measurement variability contributes to overall variability.
If the vegetation itself varied from one sample location to another (and it always does), then this spatial heterogeneity or sampling variability contributes to overall variability.
Here's the important part. You can reduce the effect of sampling variability just by collecting more statistically valid samples. But the only way to reduce measurement variability is to get better at conducting the measurements themselves. That is what a lot of Chapter 3 was about, and what Chapter 9 will state again.
At this point, go to Assignments in Blackboard and select the quiz called "Using random numbers," if you haven't already. Then test your understanding of locating simple random samples with the exercise Locating quadrats.
© 2005 Mark V. Wilson and Oregon State University