Additional Data Problems for the Statistical Sleuth

(Data stories and Excel data files)

Chapter 2. Two-Sample Problems

Speed Limits and Traffic Fatalities.

Chapter 3. A Closer Look at Assumptions

Chapter 4. Alternatives to the t-tools
Therapeutic marijuana

Chapter 7.  Simple Linear Regression

Chapter 8. A Closer Look at Assumptions for Simple Linear Regression

Chapter 9. Multiple Regression

Winning speeds at the Kentucky Derby

Chapter 10. Inferential Tools for Multiple Regression

Chapter 11. Model Checking and Refinement

Chapter 12. Strategies for Variable Selection

Chapter 13. The Analysis of Variance for Two-Way Classifications

Chapter 14. Multifactor Studies Without Replication
Tennessee corn yield trials

Chapter 15. Adjustment for Serial Correlation

Chapter 16. Repeated Measures

Chapter 18. Comparisons of Proportions or Odds

Chapter 19. More Tools for Tables of Counts

Chapter 20. Logistic Regression for Binary Response Variables

Fatal car accidents involving tire failure on Ford Explorers

Chapter 21. Logistic Regression for Binomial Counts

Chapter 22. Log-Linear Regression for Poisson Counts

2.23. Speed Limits and Traffic Fatalities. The National Highway System Designation Act was signed into law in the United States on November 28, 1995. Among other things, the act abolished the federal mandate of 55-mile-per-hour maximum speed limits on roads in the U.S. and permitted states to establish their own limits. Of the 50 states (plus the District of Columbia), 32 increased their speed limits either at the beginning of 1996 or sometime during 1996. Shown below are the percentage changes in interstate highway traffic fatalities from 1995 to 1996. What evidence is there that the percentage change was greater in states that increased their speed limits? How much of a difference is there? Write a brief statistical report detailing the answers to these questions. (Data from “Report to Congress: The Effect of Increased Speed Limits in the Post-NMSL Era,” National Highway Traffic Safety Administration, February, 1998; available in the reports library at http://www-fars.nhtsa.dot.gov/.)

Data: ex0223.xls

3.22. Umpire life lengths. When an umpire collapsed and died soon after the beginning of the 1996 U.S. major league baseball season, there was speculation that the stress associated with that job poses a health risk. Researchers subsequently collected historical and current data on umpires to investigate their life expectancies (Cohen, et al., 2000, “Life expectancy of major league baseball umpires,” The Physician and Sportsmedicine, 28, 5, 83-89). From an original list of 441 umpires, data were found for 227 who had died or had retired and were still living. Of these, dates of birth and death were available for 195. Shown below are several rows of a generated data set based on the study.

a) Use a t-test and confidence interval (possibly after transformation) to investigate whether umpires had smaller observed life lengths than expected, using only those with known life lengths (i.e., those for whom Censored = 0).

b) What are the potential consequences of ignoring those 214 of the 441 umpires on the original list for whom data were unavailable?

c) What are the potential consequences of ignoring those 32 umpires in the data set who had not yet died at the time of the study? (Note: procedures are available, and are appropriate, for answering the question of interest using both the censored and uncensored lifetimes. See, for example, the survival analysis techniques in Anderson, S. et al. 1980, Statistical Methods for Comparative Studies, Wiley.)

Data: ex0322.xls

4.32. Therapeutic marijuana. Nausea and vomiting are frequent side effects of cancer chemotherapy, which can contribute to the decreased ability of patients to undergo long-term chemotherapy schedules. To investigate the capacity of marijuana to reduce these side effects, researchers performed a double-blind, randomized, cross-over trial. Fifteen cancer patients on chemotherapy schedules were randomly assigned to receive either a marijuana treatment or a placebo treatment after their first three chemotherapy sessions, and then “crossed over” to the opposite treatment after their next three sessions. The treatments, which involved both cigarettes and pills, were made to appear the same whether in active or placebo form. Shown below is the number of vomiting and retching episodes for the 15 subjects. Does marijuana treatment reduce the frequency of episodes? By how much? Analyze the data and write a statistical summary of conclusions. (Data from Chang, A. E., et al., “Delta-9-Tetrahydrocannabinol as an Antiemetic in Cancer Patients Receiving High-Dose Methotrexate,” The Science of Medical Marijuana, Dec. 1979. The order of the treatments is unavailable.)

Data: ex0432.xls

7.26. Decline in Male Births. Display 7.16 shows the proportion of male births in Denmark, The Netherlands, Canada, and the United States for a number of years. (Data read from graphs in Davis, et al., 1998, “Reduced ratio of male to female births in several industrial countries,” Journal of the American Medical Association, 279, 1018-1023.) Notice that the proportions for Canada and the United States are only provided for the years 1970 to 1990, while Denmark and The Netherlands have data listed for 1950 to 1994. Display 7.17 shows the results of least squares fitting to the simple linear regression of proportion of males on year, separately for each country, with standard errors of estimated coefficients in parentheses.

a) With a statistical computer package obtain the least squares fits to the four simple regressions, individually, to confirm the estimates and standard errors presented in Display 7.17.

b) Obtain the t-statistic for the test that the slopes of the regressions are zero, for each of the four countries. Is there evidence that the proportion of male births is truly declining?

c) Explain why the United States can have the largest of the four t-statistics (in absolute value) even though its slope is only the third largest (in absolute value).

d) Explain why the standard error of the estimated slope is smaller for the United States than for Canada, even though the sample size is the same.

e) Can you think of any reason why the standard deviations about the regression line might be different for the four countries? (Hint: the proportion of males is a kind of average, i.e. the average number of births that are male.)

Data: ex0726.xls

7.29. Male Displays. Black wheatears, Oenanthe leucura, are small birds of Spain and Morocco. Males of the species demonstrate an exaggerated sexual display by carrying many heavy stones to nesting cavities. This 35-gram bird transports, on average, 3.1 kg of stones per nesting season! Different males carry somewhat different sized stones, prompting a study of whether larger stones may be a signal of higher health status. M. Soler, et al. [“Weight lifting and health status in the black wheatear,” 1999, Behavioral Ecology 10(3):281-6] calculated the average stone mass (g) carried by each of 21 male black wheatears, along with T-cell response measurements reflecting their immune systems’ strengths. The data in Display 7.16 were taken from their Figure 1. Analyze the data and write a statistical report summarizing the evidence on whether health, as measured by T-cell response, is associated with stone mass, and quantifying the association.

Data: ex0729.xls

7.30. Brain activity in violin and string players. Studies over the past two decades have shown that activity can effect the reorganization of the human central nervous system. For example, it is known that the part of the brain associated with activity of a finger or limb is taken over for other purposes in individuals whose limb or finger has been lost. In one study, psychologists used magnetic source imaging (MSI) to measure neuronal activity in the brains of 9 string players (6 violinists, 2 cellists, and 1 guitarist) and 6 controls who had never played a musical instrument, when the thumb and fifth finger of the left hand were exposed to mild stimulation. The researchers felt that stringed instrument players, who use the fingers of their left hand extensively, might show different behavior in the brain—as a result of this extensive physical activity—than individuals who did not play stringed instruments. Shown below is a neuron activity index from the MSI and the years that the individual had been playing a stringed instrument (zero for the controls). (Data based on a graph in Elbert, T., et al., 1995, “Increased cortical representation of the fingers of the left hand in string players,” Science, 270, 13 October, 305-307.) Is the neuron activity different in the stringed musicians and the controls? Is the amount of activity associated with the number of years the individual has been playing the instrument?

Data: ex0730.xls

8.23. Respiratory Rates for Children. A high respiratory rate is a potential diagnostic indicator of respiratory infection in children. To judge whether a respiratory rate is truly “high,” however, a physician must have a clear picture of the distribution of normal respiratory rates. To this end, Italian researchers measured the respiratory rates of 618 children between the ages of 15 days and 3 years. The display below shows a few rows of the data set. Analyze the data and provide a statistical summary. Include a useful plot or chart that a physician could use to assess a normal range of respiratory rate for children of any age between 0 and 3. (Data read from a graph in Rusconi, et al., 1994, “Reference values for respiratory rate in the first 3 years of life,” Pediatrics, 94, 350-355.)

Data: ex0823.xls

8.24. Butterfly ballots in Palm Beach County, Florida. The U.S. presidential election of November 7, 2000 was one of the closest in history. As returns were counted on election night it became clear that the outcome in the state of Florida would determine the next president. At one point in the evening, television networks projected that the state was carried by the Democratic nominee, Al Gore, but a retraction of the projection followed a few hours later. Then, early in the morning of November 8, the networks projected that the Republican nominee, George W. Bush, had carried Florida and won the presidency. Gore called Bush to concede. While en route to his concession speech, though, the Florida count changed rapidly in his favor. The networks once again reversed their projection, and Gore called Bush to retract his concession. When the roughly six million Florida votes had been counted, Bush was shown to be leading by only 1,738, and the narrow margin triggered an automatic recount. The recount, completed in the evening of November 9, showed Bush’s lead to be less than 500.

Meanwhile, angry Democratic voters in Palm Beach County complained that a confusing “butterfly” ballot layout caused them to accidentally vote for the Reform Party candidate Pat Buchanan instead of Gore. The ballot, as illustrated in Display 8.22, listed presidential candidates on both a left-hand and a right-hand page. Voters were to register their vote by punching the circle corresponding to their choice, from the column of circles between the pages. It was suspected that since Bush’s name was listed first on the left-hand page, Bush voters likely selected the first circle. Since Gore’s name was listed second on the left-hand side, many voters (who already knew who they wished to vote for) did not bother examining the right-hand side and consequently selected the second circle in the column, the one actually corresponding to Buchanan. Two pieces of evidence supported this claim: Buchanan had an unusually high percentage of the vote in that county, and an unusually large number of ballots (19,000) were discarded because voters had marked two circles (possibly by inadvertently voting for Buchanan and then trying to correct the mistake by voting for Gore).

Display 8.23 shows the first few rows of a data set containing the numbers of votes for Buchanan and Bush in all 67 counties in Florida. What evidence is there in the scatterplot of Display 8.24 that Buchanan received more votes than expected in Palm Beach County? Analyze the data without Palm Beach County results to obtain an equation for predicting Buchanan votes from Bush votes. Obtain a 95% prediction interval for the number of Buchanan votes in Palm Beach from this result, assuming the relationship is the same in this county as in the others. If it is assumed that Buchanan’s actual count contains a number of votes intended for Gore, what can be said about the likely size of this number from the prediction interval? (Consider transformation.)

Data: ex0824.xls

9.18. Speed of Evolution. How fast can evolution occur in nature? Are evolutionary trajectories predictable or idiosyncratic? To answer these questions, R.B. Huey et al. (“Rapid evolution of a geographic cline in size in an introduced fly”, Science 287:308-9, 2000) studied the development of a fly, Drosophila subobscura, that had accidentally been introduced from the Old World into North America (NA) around 1980. In Europe (EU), characteristics of the flies’ wings follow a “cline”, a steady change with latitude. One decade after introduction, the NA population had spread throughout the continent, but no such cline could be found. After two decades, Huey and his team collected flies from 11 locations in western NA and native flies from 10 locations in EU at latitudes ranging from 35-55 degrees N. They maintained all samples in uniform conditions through several generations to isolate genetic differences from environmental differences. Then they measured about 20 adults from each group. Display 9.19 shows average wing size in millimeters, on a logarithmic scale, and average ratios of basal lengths to wing size.

a) Construct a scatter plot of average wing size against latitude, in which the four groups defined by continent and sex are coded differently. Does the plot suggest that the wing sizes of the NA flies have evolved toward the same cline as in EU?

b) Construct a multiple linear regression model with wing size as the response, with latitude as a linear explanatory variable, and with indicator variables to distinguish the sexes and continents. Construct the model in such a way that one parameter measures the difference between the slopes of the wing size versus latitude regressions of NA and EU for females, one measures the same difference for males, one measures the difference between the intercepts of the regressions of NA and EU for females, and one measures the same difference for males.

Data: ex0918.xls

9.20. Winning speeds at the Kentucky Derby. The Kentucky Derby is a 1.25-mile horse race held annually at the Churchill Downs racetrack in Louisville, Kentucky. Shown below are some sample rows of a data set containing the year of the race, the winning horse, the condition of the track, and the average speed (in feet per second) of the winner, for years 1896-2000. The track conditions have been grouped into three categories: fast, good (which includes the official designations “good” and “dusty”), and slow (which includes the designations “slow”, “heavy”, “muddy”, and “sloppy”). Use a statistical computer program to fit a model for the mean winning speed as a function of year and the track condition factor. The data are from www.kentuckyderby.com.

Data: ex0920.xls

10.23. Speed of Evolution. Refer back to Exercise 9.18. The authors of that study concluded that although the wing size of North American flies was converging rapidly to the same cline as exhibited by the European flies, the means by which the cline is achieved is different in the North American population.

a)      As evidence that the means of convergence is different, they concluded that there was a marked difference between the NA and the EU patterns of the basal length-to-wing size ratios versus latitude (in females). Fit a multiple linear regression, which allows for different slopes and different intercepts. In a single F-test, evaluate the evidence against there being a single straight line that describes the cline on both continents. If you conclude there is a difference, is the difference one of slope alone? of intercept alone? or of both?

b)      Return to the basic question of whether the wing sizes in NA flies have established a cline similar to their EU ancestors. Using the model developed in Exercise 9.18, answer these questions: (i) Is there a non-zero slope to the cline of NA females? (ii) Is there a non-zero slope to the cline of NA males? (iii) Is there a difference between the clines of NA and EU females, and if so, what is its nature? and (iv) repeat (iii) for males?

10.24. Speed of Evolution. (Refer again to Exercise 9.18 and also to Exercise 10.23.) Many software systems allow the user to perform weighted regression, in which different squared residuals from regression receive different weights in deciding which set of parameter estimates provides the smallest sum of squared residuals. If each individual response has an independent estimate of its likely error, the weight given to each residual is usually taken to be the reciprocal of the square of that likely error. The st.err. values of wing sizes are standard errors of the averages of around 20 individual (log) wing sizes. If your software allows for weights, construct a weight variable as the inverse square of the standard errors. Then repeat both parts of Exercise 10.23 using weighted regression. Do the results differ? Why is this preferable to using each fly as a separate case?
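The weighting rule described in Exercise 10.24 can be sketched in a few lines of Python. This is only an illustration for a single explanatory variable, not the full model of Exercise 9.18; the function name weighted_line_fit and the numbers are made up for the example and are not taken from ex0918.xls.

```python
def weighted_line_fit(x, y, se):
    """Fit y = b0 + b1 * x by least squares, weighting each case by 1 / se**2."""
    w = [1.0 / s ** 2 for s in se]  # weight = inverse square of the standard error
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1, w


# Made-up values for illustration only (not from ex0918.xls).
latitude = [36.0, 40.0, 44.0, 48.0, 52.0]
log_wing = [-0.10, -0.09, -0.07, -0.06, -0.04]
st_err = [0.010, 0.020, 0.010, 0.015, 0.010]

b0, b1, weights = weighted_line_fit(latitude, log_wing, st_err)
print("intercept:", b0, "slope:", b1)
```

Averages with small standard errors pull the fitted line toward themselves more strongly than averages with large standard errors. A statistical package's weighted regression option should give the same estimates when supplied with the same weight variable.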

10.25. Potato Yields. Nitrogen and water are important factors influencing potato production. One study of their roles was conducted at sites in the St. John River Valley of New Brunswick. (Belanger, G., et al. 2000. “Yield response of two potato cultivars to supplemental irrigation and N fertilization in New Brunswick.” Amer. J. of Potato Res. 77:11-21.) Nitrogen fertilizer was applied at six different levels in combination with two water conditions: irrigated or non-irrigated. This design was repeated at four different sites in 1996, with the resulting yields depicted in Display 10.21. Notice that the patterns of responses against nitrogen level are fit reasonably well by quadratic curves.

Each quadratic requires 3 parameters, so a model that would allow for separate quadratic curves for each site-by-irrigation combination would have 24 parameters. (a) Using indicator functions for sites and for irrigation, construct a multiple linear regression model with 23 variables that will allow for completely different quadratic curves. Interpret the parameters in this model, if possible. (b) Describe how you would answer the following questions: (i) Is there evidence that the manner in which the quadratic terms differ by water condition changes from site to site (or is the difference the same at all four sites)? (ii) If the quadratic term differences are the same at all sites, is there strong evidence of a difference by water condition? (iii) If there is no difference between quadratic terms by water or by site, is there evidence of any quadratic term at all? (iv), (v), and (vi) repeat (i), (ii), and (iii) for the linear terms, if there is no evidence of any quadratic terms. (c) Why are the questions in (b) ordered as they are?
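The count of 23 variables in part (a) can be checked by listing one possible set of terms. The sketch below assumes one particular coding (site 1 and non-irrigated as reference levels); the term names are hypothetical labels, and other codings that span the same model are equally valid.

```python
from itertools import product

# Hypothetical indicator names; site 1 and non-irrigated are the reference levels.
sites = ["site2", "site3", "site4"]
irrigation = ["irrigated"]

# Terms that shift the intercept: main effects and site-by-irrigation products (7).
group_terms = sites + irrigation + [f"{s}:{w}" for s, w in product(sites, irrigation)]

# Linear and quadratic nitrogen terms for the reference curve (2).
curve_terms = ["N", "N^2"]

# Each group term may also shift the linear and the quadratic coefficient (14).
interaction_terms = [f"{g}:{c}" for g, c in product(group_terms, curve_terms)]

variables = group_terms + curve_terms + interaction_terms
print(len(variables))  # 23 explanatory variables; with the intercept, 24 parameters
for name in variables:
    print(name)
```

With this coding, the intercept and the coefficients of N and N^2 describe the quadratic for the reference combination, and every other coefficient is a difference from that reference curve.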

10.28. El Nino and Hurricanes. Shown below are the first few rows of a data set with the numbers of Atlantic Basin tropical storms and hurricanes for each year from 1950 to 1997. The variable storm index is an index of overall intensity of the hurricane season. (It is the average of the number of tropical storms, the number of hurricanes, the number of days of tropical storms, the number of days of hurricanes, the total number of intense hurricanes, and the number of days they last, when each of these is expressed as a percentage of the average value for that variable. A storm index score of 100, therefore, represents, essentially, an average hurricane year.) Also listed are whether the year was a cold, warm, or neutral El Nino year; a constructed numerical variable temperature that takes on the values -1, 0, and 1 according to whether the El Nino temperature is cold, neutral, or warm; and a variable indicating whether West Africa was wet or dry that year. It is thought that the warm phase of El Nino suppresses hurricanes while a cold phase encourages them. It is also thought that wet years in West Africa often bring more hurricanes. Analyze the data to describe the effect of El Nino on (a) the number of tropical storms, (b) the number of hurricanes, and (c) the storm index, after accounting for the effects of West African wetness and for any time trends, if appropriate. (These data were gathered by William Gray of Colorado State University, and reported on the USA Today weather page: www.usatoday.com/weather/whurnum.htm)

Data: ex1028.xls
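The definition of the storm index in Exercise 10.28 amounts to a short calculation, sketched here in Python. The component names and all numbers are hypothetical and are not taken from ex1028.xls; the data set already supplies the index, so this only illustrates how such an index is formed.

```python
def storm_index(season, averages):
    """Average of the components, each expressed as a percentage of its average value."""
    percents = [100.0 * season[name] / averages[name] for name in averages]
    return sum(percents) / len(percents)


# Hypothetical long-run averages for the six components.
averages = {
    "storms": 10, "hurricanes": 6, "storm_days": 50,
    "hurricane_days": 25, "intense_hurricanes": 2, "intense_days": 5,
}

# A hypothetical active season.
season = {
    "storms": 12, "hurricanes": 9, "storm_days": 60,
    "hurricane_days": 30, "intense_hurricanes": 3, "intense_days": 10,
}

print(storm_index(season, averages))
```

A season that matches every average exactly scores 100, which is why 100 corresponds to an essentially average hurricane year.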

10.29. Wage and race. Shown below are the first few rows of a data set from the 1988 March U.S. Current Population Survey. The set contains weekly wages in 1987 (in 1992 dollars) for a sample of 25,632 males between the ages of 18 and 70 who worked full-time, along with their years of education, years of experience, an indicator variable for whether they were black, an indicator variable for whether they worked in a standard metropolitan statistical area (i.e. in or near a city), and a code for the region in the US where they worked (northeast, midwest, south, and west). Analyze the data and write a brief statistical report to see whether and to what extent black males were paid less than non-black males in the same region and with the same levels of education and experience. Realize that the extent to which blacks were paid differently than non-blacks may depend on region. (Suggestion: refrain from looking at interactive effects, except for the one implied by the previous sentence.) (These data were discussed in the paper, Bierens, H. J. and D. K. Ginther (2000) “Integrated Conditional Moment Testing of Quantile Regression Models,” to appear in a special issue of Empirical Economics on Economic Applications of Quantile Regression; and made available at the web site http://econ.la.psu.edu/~hbierens/MEDIAN.HTM associated with the software EasyReg.)

Data: ex1029.xls. WARNING: large data set.

11.24. Natal dispersal distances of mammals. Natal dispersal distances are the distances that juvenile animals travel from their birthplace to their adult home. An assessment of the factors affecting dispersal distances is important for understanding population spread, recolonization, and gene flow—which are central issues for conservation of many vertebrate species. For example, an understanding of dispersal distances will help to identify which species in a community are vulnerable to the loss of connectedness of habitat. To further the understanding of determinants of natal dispersal distances, researchers gathered data on body weight, diet type, and maximum natal dispersal distance for various animals. Shown below are the first 6 of 64 rows of data on mammals. (Data from Sutherland, G.D., et al., 2000, “Scaling of natal dispersal distances in terrestrial birds and mammals,” Conservation Ecology 4(1): 16.) Analyze the data to describe the distribution of maximum dispersal distance as a function of body mass and diet type. Write a summary of statistical findings.

Data: ex1124.xls

11.xx.  Acorn. The acorn data set in DASL is very nice as a problem with issues of influential observations.  Students may need some guidance though.

12.22. Bush-Gore ballot controversy. Review the Palm Beach County ballot controversy description in Exercise 8.24. To estimate how much of Pat Buchanan’s vote count might have been intended for Al Gore in Palm Beach County, Florida, that exercise required the fitting of a model for predicting Buchanan’s count from Bush’s count from all other counties in Florida (excluding Palm Beach), followed by the comparison of Buchanan’s actual count in Palm Beach to a prediction interval. One might suspect that the prediction interval can be narrowed and the validity of the procedure strengthened by incorporating other relevant predictor variables. Display 12.19 shows the first few rows of a data set containing the vote counts by county in Florida for Buchanan and for four other presidential candidates in 2000, along with the total vote counts in 2000, the presidential vote counts for three presidential candidates in 1996, the vote count for Buchanan in his only other campaign in Florida (the 1996 Republican primary), the registration in Buchanan’s Reform Party, and the total registration in the county. Analyze the data and write a statistical summary predicting the number of Buchanan votes that were not intended for him. It would be appropriate to describe any unverifiable assumptions used in applying the prediction equation for this purpose. (Suggestion: find a model for predicting Buchanan’s 2000 vote from other variables, excluding Palm Beach County, which is listed last in the data set. Consider a transformation of all counts.)

Data: ex1222.xls

13.18. El Nino and Hurricanes. Reconsider the El Nino and Hurricane data set from exercise 10.28 above. (a) Regress the log of the storm index on West African wetness (treated as a categorical factor with 2 levels) and El Nino temperature (treated as a categorical factor with 3 levels); retain the sum of squared residuals and the residual degrees of freedom. (b) Regress the log of the storm index on West African wetness (treated as categorical with 2 levels), El Nino temperature (treated as numerical), and the square of El Nino temperature. Retain the sum of squared residuals and the residual degrees of freedom. (c) Explain why the answers to (a) and (b) are the same. (d) Explain why a test that the coefficient of the temperature-squared term is zero can be used to help decide whether to treat temperature as numerical or categorical.
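The agreement noted in part (c) can be seen numerically before it is explained. The sketch below uses NumPy and made-up data (not ex1028.xls): temperature takes the values -1, 0, and 1, wetness is a 0/1 indicator, and the response is random noise standing in for the log storm index. The helper name rss_and_df is hypothetical.

```python
import numpy as np

n = 48
temp = np.tile([-1, 0, 1], n // 3)   # made-up El Nino temperature codes
wet = np.repeat([0, 1], n // 2)      # made-up West African wetness indicator
rng = np.random.default_rng(0)
y = rng.normal(size=n)               # stand-in for the log of the storm index


def rss_and_df(X, y):
    """Residual sum of squares and residual degrees of freedom from least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), len(y) - int(np.linalg.matrix_rank(X))


ones = np.ones(n)
# (a) temperature as a 3-level factor: indicators for cold and warm, neutral as reference
X_a = np.column_stack([ones, wet, temp == -1, temp == 1]).astype(float)
# (b) temperature as a number, plus its square
X_b = np.column_stack([ones, wet, temp, temp ** 2]).astype(float)

rss_a, df_a = rss_and_df(X_a, y)
rss_b, df_b = rss_and_df(X_b, y)
print(rss_a, df_a)
print(rss_b, df_b)
```

The two printed lines agree up to rounding error whatever response is used, because with only three temperature values the columns of one design matrix are linear combinations of the columns of the other.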

13.22. Gender differences in performance on mathematics achievement test scores. Display 13.25 shows the first few rows of a data set on 861 ACT Assessment Mathematics Usage Test scores from 1987. The test was given to a sample of high school seniors who met one of three profiles of high school mathematics course work: a: Algebra I only, b: two Algebra courses and Geometry, and c: two Algebra courses, Geometry, Trigonometry, Advanced Mathematics, and Beginning Calculus. Analyze the data and write a brief statistical report to determine whether male scores are distributed differently than female scores, after accounting for coursework profile, and whether the difference is the same for all profiles. (These data were generated from summary statistics for one particular form of the test, as reported in Doolittle, A. E., 1987, Gender differences in performance on mathematics achievement items, ACT Research Report Series, 87-16.)

Data: ex1322.xls

14.17. Tennessee corn yield trials. Corn yield trials were performed at four locations in Tennessee in 1999. Shown in Display 14.22 are the average yields, in bushels per acre, for 6 hybrids at each of the four locations. Notice that at the Ames Plantation there were two trials, one unirrigated and one irrigated. Do any of the hybrids have mean yields that are higher than the others? Do yellow corn hybrids have means that differ from the white corn hybrids? (Data from the University of Tennessee Agricultural Experiment Station web site, http://web.utk.edu/~taescomm/research/corn1999.html).

Data: ex1417.xls

15.11. El Nino and Hurricanes. Reconsider the El Nino and Hurricane data set from exercise 10.28. Regress the log of the storm index on temperature and the indicator variable for West African wetness and retain the residuals. (a) Construct a lag plot of the residuals, as in Display 15.5. (b) Construct a partial autocorrelation function plot of the residuals. (c) Is there any evidence of autocorrelation? How many lags?
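For part (a) of Exercise 15.11, a lag plot pairs each residual with the one before it, and the strength of the pattern can be summarized by a sample autocorrelation. The following Python sketch uses made-up residuals in time order; the function names are hypothetical, and the formula is one common estimate whose divisor may differ slightly from the one your software uses.

```python
def lag_pairs(resid, lag=1):
    """Pairs (res[t - lag], res[t]), the points that a lag plot displays."""
    return list(zip(resid[:-lag], resid[lag:]))


def autocorrelation(resid, lag=1):
    """Sample autocorrelation of the residual series at the given lag."""
    mean = sum(resid) / len(resid)
    dev = [r - mean for r in resid]
    num = sum(a * b for a, b in zip(dev[:-lag], dev[lag:]))
    den = sum(d * d for d in dev)
    return num / den


# Made-up residuals in time order, for illustration only.
residuals = [0.30, 0.10, -0.20, -0.40, -0.10, 0.20, 0.35, 0.05, -0.15, -0.15]

print(lag_pairs(residuals)[:3])
print(autocorrelation(residuals, lag=1))
print(autocorrelation(residuals, lag=2))
```

At lag 1 the partial autocorrelation equals the ordinary autocorrelation; partial autocorrelations at higher lags require an additional step (for example, fitting successive autoregressions), which statistical packages perform automatically.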

15.14. Trends in firearm and motor vehicle deaths in the U.S. Display 15.16 shows the number of deaths due to firearms and the number due to motor vehicle accidents in the United States between 1968 and 1993. Is there evidence of an increasing or decreasing trend in firearm deaths over this period? What is the rate of increase or decrease? Is there evidence of an increasing or decreasing trend in motor vehicle deaths over this period? What is the rate of increase or decrease? (The data were read from a Centers for Disease Control and Prevention graph reported in The Oregonian, June 17, 1997.)

Data: ex1514.xls

15.18. S & P 500. The Standard and Poors 500 stock index (S&P 500) is a benchmark of stock market performance, based on the values of 400 industrial firms, 40 financial stocks, 40 utilities and 20 transportation stocks. Display 15.15 shows the value of a $1 investment in 1871 at the end of each year from 1870 to 1999, according to the S&P 500, assuming all dividends are reinvested. Describe the distribution of the S&P value as a function of year.

Data: ex1518.xls

16.15. Trends in SAT scores. Display 16.17 shows a partial listing of a data set with ratios of average Math to Verbal SAT scores in the 50 U.S. states plus the District of Columbia for 1989 and 1996-1999. Is the mean of the ratios different in 1999 than in 1989? Is there an increasing trend in the ratios over the period from 1996 to 1999? Analyze the data and write a brief statistical report of the findings.

Data: ex1617.xls

18.18. Hale-Bopp and handedness. It is known that left-handed people tend to recall orientations of human heads or figures differently than right-handed people. To investigate whether there is a similar systematic difference in recollection of inanimate object orientation, researchers quizzed University of Oxford undergraduates on the orientation of the tail of the Hale-Bopp comet. The students were shown eight photographs with the comet in different orientations (head of the comet facing left down, left level, left up, center up, right up, right level, right down, or center down) six months after the comet was visible in 1997. The students were asked to select the correct orientation. (The comet faced to the left and downward.) Shown below are the responses categorized as correct or not, shown separately for left- and right-handed students. Is there evidence that left- or right-handedness is associated with correct recollection of the orientation? If so, quantify the association. Write a brief statistical report of the findings. (Data from Martin and Jones, 1999, Hale-Bopp and handedness: individual difference in memory for orientation, American Psychological Society.)

19.19. Tire-related fatal accidents and Ford sports utility vehicles. The table in Display 19.13 shows the numbers of compact sports utility vehicles involved in fatal accidents in the U.S. between 1995 and 1999, categorized according to travel speed, make of car (Ford or other), and cause of accident (tire-related or other). From this table, test whether the odds of a tire-related fatal accident depend on whether the sports utility vehicle is a Ford, after accounting for travel speed. For this subset of fatal accidents, estimate the excess number of Ford tire-related accidents. (This is a subset of data described more fully in exercise 20.18.)

Data: ex1919.xls
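One way to approach Exercise 19.19 is the Mantel-Haenszel procedure, which treats each travel-speed category as a separate 2-by-2 table and sums the observed-minus-expected counts across tables. The Python sketch below is an illustration under that assumption; the function name is hypothetical and the counts are made up, not those of Display 19.13.

```python
import math


def mantel_haenszel(tables):
    """Excess and z-statistic from a list of 2x2 tables, one per speed category.

    Each table is (a, b, c, d) with
      a = Ford, tire-related        b = Ford, other cause
      c = other make, tire-related  d = other make, other cause
    """
    excess = 0.0
    variance = 0.0
    for a, b, c, d in tables:
        total = a + b + c + d
        row1, row2 = a + b, c + d   # Ford total, other-make total
        col1, col2 = a + c, b + d   # tire-related total, other-cause total
        excess += a - row1 * col1 / total  # observed minus expected under equal odds
        variance += row1 * row2 * col1 * col2 / (total * total * (total - 1))
    return excess, excess / math.sqrt(variance)


# Made-up counts for three speed categories (not from ex1919.xls).
tables = [(5, 95, 2, 198), (8, 142, 4, 246), (6, 64, 3, 127)]

excess, z = mantel_haenszel(tables)
print("excess Ford tire-related accidents:", excess)
print("z-statistic:", z)
```

Under the hypothesis that the odds are the same for Fords and other makes within every speed category, the z-statistic is compared with a standard normal distribution.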

20.18. Fatal car accidents involving tire failure on Ford Explorers. The Ford Explorer is a popular sports utility vehicle made in the U.S. and sold throughout the world. Early in its production concern arose over a potential accident risk associated with tires of the prescribed size when the vehicle was carrying heavy loads, but the risk was thought to be acceptable if a low tire pressure was recommended. The problem was apparently exacerbated by a particular type of Firestone tire that was overly prone to tread separation, especially in warm temperatures. This type of tire was a common one used on Explorers in model years 1995 and later. By the end of 1999 more than 30 lawsuits had been filed over accidents that were thought to be associated with this problem. U.S. federal data on fatal car accidents were analyzed at that time, showing that the odds of a fatal accident being associated with tire failure were three times as great for Explorers as for other sports utility vehicles. Additional data from 1999 and additional variables may be used to further explore the odds ratio. Display 20.19 lists data on 1995 and later model compact sports utility vehicles involved in fatal accidents in the U.S. between 1995 and 1999, excluding those that were struck by another car and excluding accidents that, according to police reports, involved alcohol. It is of interest to see whether the odds that a fatal accident is tire-related depend on whether the vehicle is a Ford, after accounting for age of the car and number of passengers. Since the Ford tire problem may be due to the load carried, there is some interest in seeing whether the odds associated with a Ford depend on the number of passengers. (Suggestions: (i) Presumably, older tires are more likely to fail than newer ones. Although tire age is not available, vehicle age is an approximate substitute for it. Since many car owners replace their tires after the car is 3 to 5 years old, however, we may expect the odds of tire failure to increase with age up to some number of years, and then to perhaps decrease after that. (ii) If there is an interactive effect of Ford and the number of passengers, it may be worthwhile to present an odds ratio separately for 0, 1, 2, 3, and 4 passengers.) The data are from the National Highway Traffic Safety Administration, Fatality Analysis Reporting System (http://www-fars.nhtsa.dot.gov/).

Data: ex2018.xls
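
Suggestion (ii) in exercise 20.18 can be illustrated with a short calculation. The sketch below assumes a logistic regression in which the log odds of a tire-related cause include a Ford indicator, the number of passengers, and their product; under that assumption the Ford odds ratio at a given passenger count is exp(b_ford + b_interaction x passengers). The coefficient values are hypothetical placeholders, not estimates from ex2018.xls, and the function name is illustrative only.

```python
import math

def ford_odds_ratio(b_ford, b_interaction, passengers):
    """Odds ratio (Ford vs. other make) at a given number of passengers,
    assuming log odds = ... + b_ford*ford + b_interaction*ford*passengers."""
    return math.exp(b_ford + b_interaction * passengers)

# HYPOTHETICAL coefficients; replace with estimates from a fitted model.
b_ford = 1.0
b_ford_x_passengers = 0.25

for passengers in range(5):  # 0, 1, 2, 3, and 4 passengers
    ratio = ford_odds_ratio(b_ford, b_ford_x_passengers, passengers)
    print(passengers, "passengers: odds ratio", round(ratio, 2))
```

A confidence interval for each odds ratio would also require the estimated variances and covariance of the two coefficients, which a fitted model would supply.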

21.14. Spock conspiracy trial. Reconsider the proportion of women on venires in the Boston U.S. district courts (case study 5.2 in the book). Analyze the data by treating the number of women out of 30 people on a venire as a binomial response. (a) Do the odds of a female on a venire differ for the different judges? Answer this with a drop-in-deviance chi-square test, comparing the full model with judge as a factor to the reduced model with only an intercept. (b) Do judges A-F differ in their probabilities of selecting females on the venire? Answer this with a drop-in-deviance chi-square test by comparing the full model with judge as a factor to the reduced model that has an intercept and an indicator variable for Spock’s judge. (c) How different are the odds of a woman on Spock’s judge’s venires from the odds on the other judges’ venires? Answer this by interpreting the coefficients in the binomial logistic regression model with an intercept and an indicator variable for Spock’s judge.
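
The drop-in-deviance comparisons in parts (a) and (b) can be sketched without a regression routine, because each of the three models is equivalent to pooling venires into groups and fitting one proportion per group. The code below is a minimal illustration under that observation. The function names and the venire counts are hypothetical; the real data are those of case study 5.2.

```python
import math
from collections import defaultdict

def fitted_loglik(records, group_of):
    """Binomial log-likelihood (constant terms dropped) when every venire
    in the same group shares one fitted probability. Returns the
    log-likelihood and the number of groups (model parameters)."""
    women, size = defaultdict(int), defaultdict(int)
    for judge, w, m in records:
        women[group_of(judge)] += w
        size[group_of(judge)] += m
    loglik = 0.0
    for judge, w, m in records:
        g = group_of(judge)
        p = women[g] / size[g]  # maximum likelihood estimate for the group
        loglik += w * math.log(p) + (m - w) * math.log(1 - p)
    return loglik, len(women)

def drop_in_deviance(records, reduced_group, full_group):
    """Drop in deviance from the reduced to the full model, with its
    degrees of freedom (difference in number of parameters)."""
    ll_reduced, k_reduced = fitted_loglik(records, reduced_group)
    ll_full, k_full = fitted_loglik(records, full_group)
    return 2 * (ll_full - ll_reduced), k_full - k_reduced

# HYPOTHETICAL venires: (judge, women on venire, venire size).
records = [("Spock", 5, 30), ("Spock", 4, 30), ("A", 10, 30),
           ("A", 11, 30), ("B", 8, 30), ("B", 9, 30)]

# (a) judge as a factor vs. intercept only
drop_a, df_a = drop_in_deviance(records, lambda j: "all", lambda j: j)
# (b) judge as a factor vs. intercept plus indicator for Spock's judge
drop_b, df_b = drop_in_deviance(records, lambda j: j == "Spock", lambda j: j)
print(round(drop_a, 2), df_a, round(drop_b, 2), df_b)
```

Each drop is compared to a chi-square distribution with the stated degrees of freedom; with SciPy installed, `scipy.stats.chi2.sf(drop, df)` gives the p-value.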

21.17.    Effect of Stress During Conception on Odds of a Male Birth. The probability of a male birth in humans is about 0.51. It has previously been noticed that lower proportions of male births are observed when offspring are conceived at times of exposure to smog, floods, or earthquakes. Danish researchers hypothesized that sources of stress associated with severe life events may also have some bearing on the sex ratio. To investigate this theory they obtained the sexes of all 3,072 children who were born in Denmark between 1 January 1980 and 31 December 1992 to women who experienced the following kinds of severe life events in the year of the birth or the year prior to the birth: death or admission to hospital for cancer or heart attack of their partner or of their other children. They also obtained sexes on a sample of 20,337 births for mothers who did not experience these life stress episodes. Shown in the table below are the percentages of boys among the births, grouped according to when the severe life event took place. Notice that for one group the exposure is listed as taking place during the first trimester of pregnancy. The rationale for this is that the stress associated with the cancer or heart attack of a family member may well have started before the recorded time of death or hospital admission. Analyze the data to investigate the researchers’ hypothesis. Write a summary of statistical findings. (Source: Hansen, et al., 1999, “Severe periconceptional life events and the sex ratio in offspring: follow up study based on five national registers,” British Medical Journal, 319: 548-549.)

Data: ex2117.xls

21.18.    HIV and circumcision. Researchers in Kenya identified a cohort of over 1000 prostitutes who were known to be a major reservoir of sexually transmitted diseases in 1985. It was determined that over 85% of them were infected with human immunodeficiency virus (HIV) in February, 1986. The researchers then identified men who acquired a sexually transmitted disease from this group of women after the men sought treatment at a free clinic. The table below shows the subset of those men who did not test positive for HIV on their first visit and who agreed to participate in the study. The men are categorized according to whether they later tested positive for HIV during the study period, whether they had one or multiple sexual contacts with the prostitutes, and whether they were circumcised. Describe how the odds of testing positive are associated with number of contacts and with whether the male was circumcised. (Data from Cameron, et al., 1989, “Female to male transmission of human immunodeficiency virus type 1: risk factors for seroconversion in men,” The Lancet.)

Data: ex2118.xls

21.19.    Meta-analysis of breast cancer and lactation studies. Meta-analysis refers to the analysis of analyses. When the main results of studies can be cast into a two-by-two table of counts, it is natural to combine individual odds ratios with a logistic regression model that includes a factor to account for different odds from the different studies. In addition, the odds ratio itself might differ slightly among studies because of different effects on different populations or different research techniques. One approach for dealing with this is to suppose an underlying common odds ratio and to model between-study variability as extra-binomial variation. The table below shows the results of ten separate case-control studies on the association of breast cancer and whether a woman had breast fed children. How much greater are the odds of breast cancer for those who did not breast feed than for those who did breast feed? (Data gathered from various sources by Karolyn Kolassa as part of a Master’s project, Oregon State University.)

Data: ex2119.xls

22.25.  El Nino and Hurricanes. Reconsider the El Nino and Hurricane data set from exercise 10.28. Use Poisson log-linear regression to describe the distribution of (a) number of storms and (b) number of hurricanes as a function of El Nino temperature and West African wetness.

22.29. Body size and reproductive success in a population of male bullfrogs. As an example of field observation in evidence of theories of sexual selection, Arnold and Wade (1984, “On the measurement of natural and sexual selection: applications,” Evolution, 38, p. 720-734) presented the following data set on size and number of mates observed in 38 male bullfrogs. Is there evidence that the distribution of number of mates in this population is related to body size? If so, supply a quantitative description of that relationship, along with an appropriate measure of uncertainty. Write a brief summary of statistical findings.

Data: ex2229.xls
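
For exercises such as 22.29, a Poisson log-linear regression with a single explanatory variable, log(mean) = b0 + b1 x, can be fit by Newton-Raphson iteration. The following is a minimal sketch of that calculation, assuming one predictor and no step-size control; a statistical package would normally be used instead. The function name and the data values are hypothetical and are not the bullfrog measurements in ex2229.xls.

```python
import math

def fit_poisson_loglinear(x, y, tol=1e-10, max_iter=50):
    """Fit log(mean of y) = b0 + b1*x by Newton-Raphson.
    Returns (intercept, slope, standard error of slope)."""
    n = len(x)
    x_mean = sum(x) / n
    xc = [xi - x_mean for xi in x]          # center x for numerical stability
    b0, b1 = math.log(sum(y) / n), 0.0      # start at the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in xc]
        # Score vector (first derivatives of the log-likelihood).
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(xc, y, mu))
        # Information matrix entries (2x2, symmetric).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(xc, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(xc, mu))
        det = i00 * i11 - i01 * i01
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (i00 * u1 - i01 * u0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    se_slope = math.sqrt(i00 / det)         # from the inverse information
    intercept = b0 - b1 * x_mean            # undo the centering
    return intercept, b1, se_slope

# HYPOTHETICAL body sizes and mate counts (not the ex2229.xls data).
size = [110, 120, 125, 130, 135, 140, 145, 150, 155, 160]
mates = [0, 0, 1, 0, 1, 1, 2, 1, 3, 2]
intercept, slope, se_slope = fit_poisson_loglinear(size, mates)
print("slope:", round(slope, 4), "SE:", round(se_slope, 4))
print("multiplicative change in mean per unit of size:", round(math.exp(slope), 4))
```

Exponentiating the slope, and the endpoints of slope plus or minus about two standard errors, gives the multiplicative change in the mean count per unit of the explanatory variable together with an approximate confidence interval.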