Statistics plays an important role in the design and analysis of field studies in vegetation science. At the design stage, you must decide how many samples to use and where to place them. At the analysis stage, you use statistics to extrapolate from your samples to the whole community, to compare different data sets, and to determine your confidence in your conclusions.
Make sure you understand the following important statistical concepts and terms. If you do, decisions you make in your projects (in class and out) will make a lot more sense.
Special note: Go through this section carefully, even if you remember everything in your past statistics courses, because sometimes the core concepts get lost in the onslaught of new statistical tests and number crunching.
Statistical population: The whole group of items or individuals under investigation. This is sometimes called the sampling universe. This course sometimes uses one term, sometimes another.
First decide the research question, then identify the statistical population for which you are asking the research question. You can think of the statistical population as your target, the group of individuals or the area for which you are seeking answers. Identifying the statistical population helps determine the proper sampling scheme and how widely you can apply your conclusions.
Think about these concepts with an example from vegetation science. Imagine you are interested in the production of sword fern for sale to the florist industry. Your research question is "what is the annual production per hectare of sword fern?" If you identify your statistical population as tract 3B in Oregon State University's McDonald Forest, you would sample within tract 3B and make inferences only back to tract 3B. If your research questions concern sword fern production in all of McDonald Forest, your field sampling should be from all of McDonald Forest as well. (Photograph courtesy of Gerald and Buff Corsi, California Academy of Sciences)
Sample: A part of a statistical population. The sample is usually made up of a series of observations.
Replication: The repetition of equivalent observations.
Replicate observations are most useful if they are independent of one another and representative of the statistical population (that is, the study area). Good field methods in vegetation science are efficient in locating replicate sample units that are independent and representative. Bad field methods are slow, or don't represent what you are trying to measure, or violate the statistical requirement of independence.
Parameter: A quantitative characteristic of the statistical population.
For statistical populations in nature, the parameter of interest is unknown. Often a field study is designed to come up with a good estimate of the true value of the parameter.
Statistic: A quantitative characteristic of the sample.
Estimate: A statistic that is used as a guess of the value of a parameter.
Consider an example, first in words and numbers. You wish to estimate the annual production of sword fern in tract 3B of McDonald Forest. You measure sword fern production from 13 quadrats properly located within tract 3B. (Don't worry, you will soon learn how to "properly locate" quadrats.) These 13 quadrats are your sample; your measurements of production within these quadrats constitute the 13 observations. In this example, the statistical population consists of all the possible quadrats within tract 3B. From your 13 values for production, you calculate an average of 272 g·m⁻². This statistic is a good estimate of the parameter of interest, the true average annual production of sword fern in tract 3B.
(Terminology can be a barrier to understanding, especially if different fields use different terms for the same concept! Click here to see distinctions between terms used by vegetation scientists and statisticians.)
Now consider the same example in mathematical terms. Let $\mu$ be the parameter of interest, the true average annual production of sword fern. Your measured values of sword fern production in field plots are $x_i$, where $i$ is the plot number. A good estimate of $\mu$ is the average of your $n$ plot values:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

(Statistically good estimates are often shown with a "hat", like $\hat{\mu}$.)
A word or two about "units." Ah, English! For a language renowned for its multiplicity of words (an inheritance from Greek, Latin and Anglo-Saxon, and a borrowing from every modern language), some words have too darn many meanings. "Unit" is an example, for it is used in vegetation science in at least two quite different ways. It is important to understand these meanings so the underlying concepts remain clear.

Unit of measurement: This meaning of "unit," of course, is not unique to vegetation science. If you measure length in meters, the unit of measurement is meters. If you measure cover in percent, the unit of measurement is percent. If you measure cover in square centimeters, the unit of measurement is square centimeters.

Observational unit or sampling unit: The observational unit is the entity on which you take (or observe) measurements. It goes like this: "I observe that the {attribute} of this {observational unit} is {measurement result}." For example, "I observe that the height of this student is 1.63 m." The student is the observational unit, the attribute you are measuring is height, and the unit of measurement is meters. The number of students you observe (by measuring their heights) is the level of replication.

Consider that last statement a bit more carefully. To determine the number of replicate observations, you must know what the observational unit is. If you measure the heights of 14 students, you have 14 replicate observations. But what if you measure the heights of 14 students in each of 20 randomly selected classrooms? What is the number of replicate observations? It depends on what the observational unit is. In the last example, the observational unit is probably the classroom, so n = 20 replicate observations. But you can't know for sure without understanding the objectives and sampling design of the study. Objectives and sampling design are covered in detail in later chapters.

As if that wasn't complicated enough, statistically minded vegetation scientists tend to use the term "sampling unit" as a synonym for "observational unit." This course will use both terms, but they mean essentially the same thing.
And now...
A cranky aside about "percent cover": Somehow the field of ecology has renamed the attribute "cover" to "percent cover." This annoys me, although it does not seem to bother anyone else. (Hence, this is a cranky note.) Methods state that measurements included "percent cover" and table headings list "percent cover." But if you wanted to weigh yourself, would you announce that you were going to measure "pound weight"? If you were going to measure the height of a plant, would your measurements be of "centimeter height"? "Percent" is a unit, just like pound and centimeter. I contend that the proper style is to state that you will measure just plain "cover." If you indeed measure cover as a percentage (and not as area), then label the column of cover values in a table as "Cover (%)", just as you would list height as "Height (cm)."
Think again about measuring the heights of 14 students in each of 20 classrooms. You want the replicate observations to be independent of one another. Are the heights of students within a classroom independent of one another? Probably not. Classes with first-year students might have students who are shorter because they are still growing. Classes popular with athletes might tend to have taller students. Classes with heavy enrollment of women might tend to have shorter students. Therefore, it is a mistake to think that you have 14 times 20 = 280 replicate observations. If you selected the classrooms at random across campus, then the classrooms will be representative of the campus as a whole and independent of one another. The 14 students within each classroom are subsamples within the replicate observational unit. Your study has n = 20 replicates.
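To make the bookkeeping concrete, here is a minimal sketch of the classroom example. The heights are simulated, and the typical height and spreads are invented purely for illustration. The point is only the order of operations: average the subsamples within each classroom first, then treat the 20 classroom means as the replicate observations.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical data: 20 classrooms, each with its own typical height (m),
# and 14 students measured per classroom. All numbers are invented.
classrooms = []
for _ in range(20):
    classroom_typical = random.gauss(1.70, 0.05)  # classrooms differ from one another
    heights = [random.gauss(classroom_typical, 0.07) for _ in range(14)]
    classrooms.append(heights)

# Subsamples are averaged within each observational unit (the classroom).
classroom_means = [statistics.mean(heights) for heights in classrooms]

n = len(classroom_means)                 # number of replicates: 20, not 280
campus_mean = statistics.mean(classroom_means)
s = statistics.stdev(classroom_means)    # variability among classrooms

print(f"n = {n} replicate observations")
print(f"mean height = {campus_mean:.2f} m, s = {s:.2f} m")
```

Whether the classroom really is the observational unit still depends on the objectives and sampling design of the study; the sketch simply assumes that it is.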
Central tendency is a general term for statistics that describe the location of the center of a distribution. This is a lot simpler than it sounds.
The most common measure of central tendency is the familiar arithmetic mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

In our example of sword fern production, the average of all 13 observations is 272 g·m⁻².
Finally consider a graphical display of the data, showing the frequency of different levels of sword-fern production in the sample quadrats.
[Figure: frequency of different levels of sword fern production in the sample quadrats]

This graph of sword fern production shows some characteristics common in vegetation science. First, there are several samples with no production, because sword fern happened not to grow at those sample locations. Second, the average, although in the middle of the distribution, is not the most common level of production! Third, the distribution of the data is not "bell-shaped." Fourth, the data are quite variable, a topic coming up soon.
Accuracy: How close an estimate is to the true value of the parameter being estimated.
When the sample mean ($\bar{x}$) is being used to estimate the population, or true, mean ($\mu$), values of $\bar{x}$ closer to $\mu$ are more accurate. What you will learn in this course will increase the accuracy of measurements in your field studies.
A term related to accuracy is error, the difference between an estimated value and the true value:

$$\text{error} = \text{estimated value} - \text{true value}$$
An estimate of high accuracy, then, has low error. One possible measure of accuracy, useful when comparing various estimates of the same true value, is
.
Although accuracy is an important statistical concept, in practice with field data you cannot calculate accuracy in a single data set because you do not know the true value of the parameter. (After all, if you already knew the true value, there would be no need to take field measurements!)
Bias: Systematic error.
A method that tends to be either higher or lower than the true value is said to have bias. For example, most applications of the point intercept method of estimating cover (which you'll learn in a later chapter) tend to over-estimate true cover, so this method has a positive bias. The other techniques you'll learn (and even the point-intercept method, if done correctly) are unbiased; that is, on the average estimates will converge on the true value.
You should, of course, use field methods that are unbiased. But you should also be careful in the application of field methods. For example, the biggest danger of bias in vegetation science is the subjective choice of sampling location. In western Oregon, where I work, I have witnessed many examples of field crews somehow — consciously or unconsciously — avoiding poison oak or somehow placing quadrats to include pretty flowers. The more this happens, the more you cannot trust the results.
Likewise, decreasing variability is a good thing ONLY if the method is unbiased. A precise estimate that is always wrong does little good. A large part of good field methods is eliminating the biases.
Note the important difference between the systematic error caused by bias and random, unbiased error. Even unbiased methods can produce individual values that are inaccurate. The differences between individual measurements lead to variability in your observations. Only by collecting many unbiased observations can you start to see the underlying true values.
The variability of observations is another key characteristic of a data set.
Perhaps the simplest measure of variability is the range, which is the difference between the largest value and the smallest. In our sword fern example, the range of observations is 803 g·m⁻².
But the most common measure of variability is the so-called standard deviation.
The further that observations are from the mean, the larger the value of the standard deviation (s):

$$s = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n - 1}}$$
In our sword fern example, the standard deviation is 239 g·m⁻².
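If you would like to see the mean, range, and standard deviation computed side by side, here is a minimal sketch. The 13 production values are hypothetical, invented for illustration, and are not the course data set, so the results will not match the 272, 803, and 239 g·m⁻² quoted in the text.

```python
import math
import statistics

# Hypothetical production values (g per square meter) from 13 quadrats.
# These are illustration values, not the course data set.
production = [0, 0, 0, 60, 110, 150, 210, 250, 330, 400, 520, 610, 750]

n = len(production)
mean = sum(production) / n
data_range = max(production) - min(production)

# Sample standard deviation, written out with the usual n - 1 divisor.
sum_sq = sum((x - mean) ** 2 for x in production)
s = math.sqrt(sum_sq / (n - 1))

print(f"n = {n}")
print(f"mean = {mean:.1f}")
print(f"range = {data_range}")
print(f"s = {s:.1f}")
```

The standard library function `statistics.stdev` uses the same n - 1 divisor, so it is a convenient check on a hand calculation.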
Precision: The closeness of repeated measurements of the same quantity.
Precision is the opposite of variability. That is, the more variable the data, the farther away the measurements are from one another and the lower the precision.
Precision has a technical meaning that complements but is not the same as accuracy. The classic illustration of the difference is throwing darts at a dart board. If all the darts go in the center (the target), they have both high accuracy and high precision. But what if all the darts go in the same spot that is far from the center? This is a case of high precision but low accuracy. Finally, darts that are thrown to widely different locations that are centered at the target have low precision but high accuracy.
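The dart board can be imitated with a short simulation. This is only a sketch under invented assumptions: the true value, the size of the systematic error, and the spreads are all hypothetical, and the "throws" are drawn from a normal distribution for convenience.

```python
import random
import statistics

random.seed(42)     # fixed seed so the sketch is repeatable
TRUE_VALUE = 100.0  # hypothetical true value (the center of the dart board)

def simulate(bias, spread, n=1000):
    """Return n hypothetical estimates with a given systematic error and random spread."""
    return [random.gauss(TRUE_VALUE + bias, spread) for _ in range(n)]

scenarios = {
    "accurate and precise": simulate(bias=0.0, spread=2.0),
    "precise, not accurate": simulate(bias=25.0, spread=2.0),
    "accurate, not precise": simulate(bias=0.0, spread=25.0),
}

for name, estimates in scenarios.items():
    avg_error = statistics.mean(estimates) - TRUE_VALUE  # systematic error (bias)
    s = statistics.stdev(estimates)                      # variability (low s = high precision)
    print(f"{name:24s} average error = {avg_error:6.1f}, s = {s:5.1f}")
```

In a real field study you could calculate s but not the average error, because the true value is unknown.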
Other uses of the term precision: Sometimes precision is used to describe how carefully something is measured. For example, a measurement of tree diameter to the nearest millimeter is sometimes said to be more precise than a measurement to the nearest centimeter. (Sound familiar?) This is more properly called resolution and differs entirely from the statistical concept of precision just described. Sometimes precision is implied by the number of digits displayed past the decimal point: 3.14159265 vs. 3.14 (or even 3, as one legislator tried to define pi). This is mostly just a question of proper display. A good practice is to carry as many digits as you can while doing calculations but display in the final product only the resolution that is ecologically meaningful: 31% vs. 32% cover might be meaningful, but 31.6% vs. 31.7% cover is certainly not.
The variability of observations causes problems in two ways. First, high variability among observations means that your conclusions are not reliable. With all that variability, it seems likely that if you were to do the study again you might just get an entirely different answer. Second, high variability among observations means that the average of that particular set of observations is likely to be far from the true value. This second phenomenon is called sampling error because the variability among sampling units can lead to inaccuracy (error) in your estimation.
Sampling error vs. measurement error: Sometimes accuracy is used to describe how carefully something is measured. For example, a measurement of tree diameter to the nearest millimeter is sometimes said to be more accurate than a measurement to the nearest centimeter. This is a valid way to use the term accuracy. The difference between the measured value of a single item and the true value of the item is measurement error. For example, a tree might have a true height of 12.5 m, but you measure the height as 13.1 m. The tendency to mis-measure is also called measurement error. For example, in a sample of 20 trees, you might typically be off by ±0.4 m. A tendency to overestimate or underestimate will lead to a biased estimate. Measurement error and sampling error combine to throw off your results, and it is always a good idea to try to minimize both. You'll learn techniques to accomplish this in later chapters. It is important to realize, however, that in vegetation science, sampling error is almost always larger than measurement error. (You might even keep this in mind while doing some of the class exercises and projects ;-)
Efficiency is the value of the data collected versus the cost of collecting. Vegetation scientists want field methods that yield data that lead to accurate and precise estimates that represent the statistical population in the study area. When comparing alternative field methods, this concept of value can often be simplified. For example, nearly all the methods you will learn in this course are unbiased, and besides, in field application we cannot know the true accuracy of an individual estimate. As a result, accuracy can be dropped as a standard of comparison. With few exceptions, you should use only sampling designs that fairly represent the statistical population. You will learn more on this later; in the meantime, consider most alternative field methods to produce data that are similar in the tendency to represent the statistical population. That leaves precision as the best measure of value, with better methods producing more precise estimates. When you're comparing studies with different sampling designs, perhaps the best indication of precision is the width of the confidence interval.
On the cost side, the cost of collecting data in vegetation science is largely one of time in the field. So, efficiency is largely precision versus time:
When comparing different field methods, the more efficient method is the one that produces more precise estimates in less time.
You already know that you use a statistic to estimate the value of a parameter, for example using $\bar{x}$ to estimate $\mu$. This process of using sample observations to draw conclusions about a statistical population is called statistical inference. Accuracy and precision are two hallmarks of good inference. Related to these qualities is representativeness. That is, the more the sample observations are representative of the statistical population, the more valid the inference.
Vegetation is often characterized by spatial pattern, with different parts of the plant community containing different plants at different abundances. As a result, observations from a single part of the community will not be representative of the entire community. There are several ways to increase representativeness in sampling when there is spatial pattern. The most straightforward approach is to make sure your sample locations are interspersed through the community. You'll encounter these concepts of representativeness and interspersion throughout the course.
Classical statistical inference (the kind taught in most introductory statistics courses) works because it makes certain assumptions about the statistical population. One of these assumptions is that the distribution of values in the population follows the so-called normal distribution. This can be a good assumption in vegetation science, but it should always be checked. (Recognizing when the assumption of normality is invalid, and knowing what to do then, is important for every vegetation scientist. Unfortunately, these topics, which include diagnostics, transformations, non-parametric statistics, and alternative statistical models, are beyond the scope of this course.)
The normal distribution has two parameters, $\mu$ and $\sigma^2$. Although statistical parameters do not have to mean anything, it turns out that these two parameters are the population mean ($\mu$) and the population variance ($\sigma^2$). What is even more convenient is that the best estimates of these two parameters are usually the sample mean ($\bar{x}$) and the squared sample standard deviation ($s^2$).
Confidence interval: A range of values intended to include a parameter.
The confidence interval is a statement of inference: You use the sampling data you collect to make inferences about the sampling universe (the whole study area). Statements of confidence intervals usually take the form "I estimate that the average annual production of sword fern in tract 3B is between 127 g·m⁻² and 417 g·m⁻², with 95% confidence." That is, if you were to take samples from tract 3B over and over again, you expect that 95% of the calculated confidence intervals would include the true value for the average annual production of sword fern. "95%" is the confidence level.

Another, compact way to state a confidence interval is "I estimate with 95% confidence that the annual production of sword fern in tract 3B is 272 ± 145 g·m⁻²."

Confidence interval and confidence level are related: The wider the confidence interval, the more confident you are that it contains the true value. But very wide confidence intervals are of little use. For example, estimating sword fern production as 272 ± 250 g·m⁻² is much less useful than estimating it as 272 ± 25 g·m⁻². The narrower the confidence interval, the better. What you will learn in this course will allow you to design and conduct field studies with estimates that have higher confidence levels and tighter confidence intervals.
Here is the standard formula for confidence intervals:

$$\bar{x} \pm t\,\frac{s}{\sqrt{n}},$$

where $\bar{x}$ is the sample average, s is the sample standard deviation, n is the size of the sample, and t is the t-statistic. The t-statistic varies with confidence level and sample size. t gets larger as you set your confidence level higher. You can see from the formula that setting a higher level of confidence (as reflected in t) requires a wider confidence interval.
Now look at the formula to see how the size of the confidence interval varies with your data. The more variable your sample data, the larger the value of s, and the wider your confidence interval must be. The larger the sample size (n), the narrower your confidence interval will be. This means that anything that improves the reliability of your data (hence producing a lower s) is good. But even with variable data, you can always make a strong statement if you increase your sampling intensity (that is, increase n). That is, you can always get a good estimate if you work hard enough.
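As a worked check, the sketch below computes a confidence interval as the sample average plus or minus t times s divided by the square root of n, using the summary values quoted for the sword fern example. The t value of 2.179 (two-sided 95%, 12 degrees of freedom) is taken from a standard t table and hard-coded; that lookup is an assumption of this sketch and is not stated in the text.

```python
import math

# Summary statistics quoted in the sword fern example.
mean = 272.0  # sample average, g per square meter
s = 239.0     # sample standard deviation, g per square meter
n = 13        # number of quadrats

# Two-sided 95% t value for n - 1 = 12 degrees of freedom, from a
# standard t table (hard-coded here as an assumption of this sketch).
t = 2.179

half_width = t * s / math.sqrt(n)
lower, upper = mean - half_width, mean + half_width

print(f"{mean:.0f} +/- {half_width:.1f} g per square meter")
print(f"interval: {lower:.1f} to {upper:.1f}")
```

The half-width comes out close to the ± 145 g·m⁻² quoted earlier; the small difference is what you would expect when the inputs have already been rounded.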
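Sometimes the formula for confidence intervals is written differently:

$$\bar{x} - t\,s_{\bar{x}} \quad \text{to} \quad \bar{x} + t\,s_{\bar{x}},$$

where

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}.$$

$s_{\bar{x}}$ is the standard deviation of the mean. That is, the more you expect your estimates of $\mu$ to vary, the higher will be the value of $s_{\bar{x}}$. $s_{\bar{x}}$ is often called the standard error.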
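To get a feel for how the standard error responds to sampling intensity, here is a small sketch. It holds s fixed at the 239 g·m⁻² from the sword fern example and tries a few hypothetical sample sizes; in a real study s would itself change somewhat as you add quadrats.

```python
import math

s = 239.0  # sample standard deviation from the sword fern example

def standard_error(s, n):
    """Standard deviation of the mean: s divided by the square root of n."""
    return s / math.sqrt(n)

for n in (13, 26, 52, 104):  # hypothetical sample sizes
    print(f"n = {n:3d}  standard error = {standard_error(s, n):5.1f}")
```

Because n sits under a square root, you need four times as many quadrats to cut the standard error in half.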
Think some more about this whole statistical inference thing. As a vegetation scientist, you want to make statements about your study system. Because you don't know the true answer, you must make inferences about your study system from data you collect in the field. The statistical part of inference involves estimating the parameters of the population. One of these estimates, $\bar{x}$, is a good estimate of the population's central tendency. The other one, s, is important in determining the confidence interval. There are three parallel components here: scientific objectives (what statements do you want to make?), field sampling (gathering information), and statistical analysis (making inferences from your observations). One of the goals of this course is to give you the tools to make these three components work well together.
Another common measure of central tendency is the median. A median is the value for which there are an equal number of observations larger than the value and smaller than the value. In the example of sword fern production, the median is 202 g·m⁻². If there are an even number of observations, then the median is the average of the two observations nearest the 50th percentile.
Notice that the value of the median (202 g·m⁻²) differs from the value of the mean (272 g·m⁻²). The mean equals the median when the frequency distribution is symmetrical. The frequency distribution in the sword fern example is asymmetrical.
The mean is sensitive to extreme values (sometimes called outliers). Because the median is less sensitive to extreme values, it is robust. When extreme values are relatively unimportant (either for ecological reasons or because you don't trust the numbers!), the median is a good choice for describing the central tendency of a set of data.
Consider this example. What if the true cover values (in percent) were 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 20? The true mean of this set is about 2.3%. Your sample of four yields 1, 1, 1, 20, which has a mean of 5.75% and a median of 1%. Which is the more accurate estimate?
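You can check this example yourself with a few lines of code. The sketch below uses the cover values given in the example and judges each estimate by its error, taken as the estimate minus the true mean. Using the true mean as the target is a choice made for this sketch; if the median of the whole set were your parameter of interest, the comparison would be different.

```python
import statistics

population = [1] * 14 + [20]  # true cover values (%) from the example
sample = [1, 1, 1, 20]        # the sample of four

true_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)

# Error = estimate - true value, judged here against the true mean.
error_of_mean = sample_mean - true_mean
error_of_median = sample_median - true_mean

print(f"true mean       = {true_mean:.2f}%")
print(f"sample mean     = {sample_mean:.2f}%  (error {error_of_mean:+.2f})")
print(f"sample median   = {sample_median:.2f}%  (error {error_of_median:+.2f})")
```

For this particular sample, the median lands closer to the true mean than the sample mean does, because the single extreme value of 20 pulls the mean of a small sample a long way.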
Just as the median is a robust alternative to the mean, a robust alternative to the standard deviation is the interquartile range. The interquartile range (IQR) of a set of data is the difference between the 25th percentile (25% of the data are less than this value) and the 75th percentile (75% of the data are less than this value). (For comparison, the median is the 50th percentile.) The more variable the data, the wider the interquartile range.
In the example of sword fern production, the interquartile range is 273 g·m⁻².
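Here is a minimal sketch of an interquartile range calculation. The 13 production values are hypothetical, invented for illustration, and are not the course data set, so the result will not match the 273 g·m⁻² quoted above. Be aware that software packages use slightly different rules for interpolating percentiles in small samples; this sketch relies on the default ("exclusive") method of Python's `statistics.quantiles`, and other tools may give somewhat different quartiles for the same data.

```python
import statistics

# Hypothetical production values (g per square meter) from 13 quadrats.
# These are illustration values, not the course data set.
production = [0, 0, 0, 60, 110, 150, 210, 250, 330, 400, 520, 610, 750]

# quantiles(..., n=4) returns the three cut points that split the data
# into quarters: the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(production, n=4)
iqr = q3 - q1

print(f"25th percentile = {q1}")
print(f"median          = {q2}")
print(f"75th percentile = {q3}")
print(f"IQR             = {iqr}")
```

Note that the middle cut point is the median, which ties the interquartile range back to the robust measure of central tendency described earlier.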
Now that you've reviewed some basic statistical concepts, it is time to put your knowledge to the test.
Under the Assignments tab of the class site, select the quizzes "Statistical terms I" and "Statistical terms II."
After acing the quizzes, try out the concepts in a graded exercise, Leaf lengths.
Some of the topics in these quizzes and exercises might seem lightweight, but they mirror just the kind of issues faced by vegetation scientists. Clarity in the fundamental concepts in the simple situations here will make the more complicated parts of the course and your future work in vegetation science much, much easier.
© 2007 Mark V. Wilson and Oregon State University