December 18, 2008
How a Lurking Variable can Confuse Data Analysis
"When the data don’t make sense, it’s usually because you have an erroneous preconception about how the system works." Ernest Beutler
When you are unaware of the presence of a confounding variable, that variable is said to be lurking. This example illustrates the problem of lurking variables and the quotation above. I got the idea for this from the text by Freedman (reference below), but have extended it far beyond his example. The example uses synthetic data, so it seems a bit silly, but it makes an important point. Everyone knows what determines the area of a rectangle. But let’s pretend we don’t know. Furthermore, let’s pretend that we don’t know the height and width of each rectangle, but only know each rectangle’s perimeter and area. Our goal is to find a model that predicts the area of a rectangle from its perimeter. This graph shows that rectangles with a larger perimeter generally also have a larger area. Clearly, it seems, two outliers get in the way of seeing a clear relationship between perimeter and area. So let’s remove those two “outliers” and fit the remaining points to possible models. The straight line model (left panel) might be adequate, but the sigmoid shaped model (right panel) fits the data better.
If these were real data, you might think you were on the right track. After removing two outliers, we found a clear relationship and fit some models that seem useful. Now let’s collect data from more rectangles so we can refine the model.
Now it seems that those two “outliers” were really not so unusual. Instead it seems that there might be two distinct categories of rectangles. The right side of that figure tentatively identifies the two types of rectangles with open and closed circles and fits each to a different model. Definite progress, it seems. This process sort of feels like real science, and it seems as though we are moving forward. In fact, of course, it is all nonsense. Two rectangles with the same perimeter can have vastly different areas, depending on their shape. Predicting the area of a rectangle from its perimeter is simply impossible. We didn’t need better statistics or a better model. Hiring a statistical consultant wouldn’t have helped. The data only made sense once we understood the problem better, and realized that an important variable was missing.
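The point that perimeter alone cannot determine area is easy to check numerically. The short Python sketch below uses a handful of hypothetical rectangles (the widths and heights are made up for illustration, not taken from the graphs above) that all share the same perimeter but have very different areas. Here the lurking variable is the shape of the rectangle.

```python
def perimeter(width, height):
    """Perimeter of a rectangle."""
    return 2 * (width + height)


def area(width, height):
    """Area of a rectangle."""
    return width * height


# Hypothetical rectangles chosen for illustration; every one has perimeter 20.
rectangles = [(1, 9), (2, 8), (4, 6), (5, 5)]

for w, h in rectangles:
    print(f"width={w} height={h} perimeter={perimeter(w, h)} area={area(w, h)}")
```

The perimeter column is constant while the area column ranges from 9 to 25, so no model based on perimeter alone could predict the area.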
December 16, 2008
The Pros and Cons of Using Excel for Statistical Calculations
Microsoft Excel is widely used, and is a great program for managing and wrangling data sets. Excel has some statistical capabilities, and many people also use it to do some statistical calculations. Because these statistics functions are poorly documented, this chapter clarifies their use. The excellent book by Pace (2008) gives many more details (it can be purchased as a printed book, or as a pdf download). Use of Excel for statistics is somewhat controversial, and some recommend that Excel not be used for statistics. One problem is that Excel is far from a complete statistics program. It lacks nonparametric tests, post tests following ANOVA, and many other tests. Another problem is that Excel reports statistical results without all the supporting details other programs provide. More seriously, Excel uses some poor algorithms for computing statistics, which can lead to incorrect results (McCullough and Wilson, 2005; Knusel, 2005). Microsoft responded to these criticisms, and fixed many issues in Excel 2003. There really is no point in using earlier versions of Excel for statistical work. Unfortunately, some errors remain in Excel 2007 for Windows and Excel 2008 for Mac. McCullough (2008) pointed out many erroneous results produced by Excel 2007 (especially its Solver) and concludes, "Microsoft has repeatedly proved itself incapable of providing reliable statistical functionality.” Yalta (2008) reached a similar conclusion, “the accuracy of various statistical functions in Excel 2007 range from unacceptably bad to acceptable but inferior.” In contrast, Pace (2008) concludes that the statistical errors produced by Excel 2007 are all trivial or obscure. He concludes that Excel 2007 is a reasonable choice for analyzing the kinds of data most academics and professionals collect.
Given these problems, you should use another program to check important calculations, especially if your data seem unusual or include missing values.
References:
Knusel, L. 2005. On the accuracy of statistical distributions in Microsoft Excel 2003. Computational Statistics & Data Analysis 48, (3): 445.
McCullough, B. D., and D. A. Heiser. 2008. On the accuracy of statistical procedures in Microsoft Excel 2007. Computational Statistics & Data Analysis 52, (10): 4570.
McCullough, B. D., and B. Wilson. 2005. On the accuracy of statistical procedures in Microsoft Excel 2003. Computational Statistics & Data Analysis 49, (4): 1244.
Pace, L. A. 2008. The Excel 2007 data & statistics cookbook. 2nd ed. TwoPaces LLC.
Yalta, A. T. 2008. The accuracy of statistical distributions in Microsoft® Excel 2007. Computational Statistics & Data Analysis 52, (10): 4579.
I just discovered a paper by Donald Berry (1) that does a great job of explaining how pervasive the problem of multiple comparisons is. It goes way beyond post tests in ANOVA.
His first paragraph: Most scientists are oblivious to the problems of multiplicities. Yet they are everywhere. In one or more of its forms, multiplicities are present in every statistical application. They may be out in the open or hidden. And even if they are out in the open, recognizing them is but the first step in a difficult process of inference. Problems of multiplicities are the most difficult that we statisticians face. They threaten the validity of every statistical conclusion. 1. Berry. The difficult and ubiquitous problems of multiplicities. Pharmaceutical Statistics, 6: 155–160 (2007).
December 8, 2008
Scientific Solutions web forum
While at the Neurosciences meeting in Washington DC, I learned about an interesting free web discussion board for scientists. Check out Scientist Solutions.
The site contains forty discussion forums (by scientific discipline), where you may post questions, answers, comments, ideas, and protocols, and where you may develop collaborative relationships.
December 5, 2008
Two great statistical encyclopedias
These are the two statistical reference books I refer to most often, and recommend very highly. Sheskin has written a comprehensive (1736 pages) statistical encyclopedia. It gives every variation on every test, with detailed examples and tables. It has plenty of equations for those who want to do calculations themselves. But it does not derive any equations or prove any theorems. It explains concepts in words (with examples), not equations. This makes it quite understandable by scientists (as well as statisticians). It is extremely well written. Although it purports to be comprehensive, no book really can be. It makes no mention of nonlinear regression, or model comparisons. Its coverage of survival curves is a bit weak (compared to the rest of the book), as is its coverage of modern (computer intensive) statistical methods. If you analyze data, you should have access to this book as a reference. No other book is so comprehensive and yet readable. Per page, it is a bargain.
Maxwell and Delaney have written a great book on ANOVA. The title makes it seem like the book is more general, but it really is an advanced text on ANOVA, written in clear accessible language with plenty of examples. They emphasize the perspective of using ANOVA to compare models (rather than divide variation into its components). That perspective makes lots of sense to me, and often matches the scientific question the experiment was designed to answer. Some books leave you thinking that ANOVA is no more than a mathematical stunt. This one really approaches ANOVA as a way of thinking, used to answer experimental questions.
November 17, 2008
Data torture and multiple comparisons.
It is hard enough to interpret one statistical result. Interpreting multiple comparisons at once is harder, but necessary. Picking and choosing among many results makes all conclusions invalid. For statistical analyses to be interpretable, it is essential that all analyses be planned, and that all planned analyses be conducted and reported. These simple and sensible rules can be violated in many situations.
Data Torture
“Data torture” occurs when investigators, without a clear plan, analyze their data in many ways, desperately seeking “statistical significance” (1). Vickers told this story:
Statistician: "Oh, so you have already calculated the P value?"
Surgeon: "Yes, I used multinomial logistic regression."
Statistician: "Really? How did you come up with that?"
Surgeon: "Well, I tried each analysis on the SPSS drop-down menus, and that was the one that gave the smallest P value".
Investigators have found many ways to torture data. Change the definition of the outcome. Use a different time scale. Try different criteria for including or excluding a subject. Arbitrarily decide which points to remove as outliers. Try different ways to clump or separate subgroups. Try different algorithms for computing statistical tests. Try different statistical tests. Fitting a multiple regression model provides even more opportunities for data torture. Include or exclude possible confounding variables. Include or exclude interactions. Change the definition of the outcome variable. If you try hard enough, eventually ‘statistically significant’ findings will emerge from any reasonably complicated data set. Since the number of possible comparisons is not defined in advance, and is almost unlimited, results from data torture cannot be interpreted except perhaps as a method to generate hypotheses to be tested in future studies.
Torture by Editors -- Publication Bias
Editors prefer to publish papers that report results that are statistically significant. Interpreting published results becomes problematic when studies with “not significant” conclusions are abandoned, while the ones with “statistically significant” results get published. This means that the chance of observing a ‘significant’ result in a published study can be much greater than 5% even if the null hypotheses are all true. Turner demonstrated this kind of selectivity -- called publication bias -- in industry-sponsored investigations of the efficacy of antidepressant drugs (2). Between 1987 and 2004, the Food and Drug Administration (FDA) reviewed 74 such studies, and categorized them as “positive”, “negative” or “questionable”. The FDA reviewers found that 38 studies showed a positive result (the antidepressant worked). All but one of these studies was published. The FDA reviewers found that the remaining 36 studies had negative or questionable results. Of these, 22 were not published, 11 were published with a ‘spin’ that made the results seem somewhat positive, and only 3 of these negative studies were published with clear negative findings. Studies that show ‘positive’ results are far more likely to be published than ones that reach negative or ambiguous conclusions. Selective publication makes it impossible to properly interpret the published literature.
Multiple Time Points -- Sequential Analyses
To properly interpret a P value, the experimental protocol has to be set in advance. Usually this means choosing a sample size, collecting data, and then analyzing it. But what if the results aren’t quite statistically significant? It is tempting to run the experiment a few more times (or add a few more subjects), and then analyze the data again, with the larger sample size. If the results still aren’t “significant”, then do the experiment a few more times (or add more subjects) and reanalyze once again.
When data are analyzed in this way, it is impossible to interpret the results. This informal sequential approach should not be used. If the null hypothesis of no difference is in fact true, the chance of obtaining a “statistically significant” result using that informal sequential approach is far higher than 5%. In fact, if you carry on that approach long enough, then every single experiment will eventually reach a “significant” conclusion, even if the null hypothesis is true. Of course, “long enough” might be very long indeed and exceed your budget or even your lifespan. The problem is that the experiment continues when the result is not “significant”, but stops when the result is “significant”. If the experiment were continued after reaching “significance”, adding more data might then result in a “not significant” conclusion. But you’d never know this, because the experiment would have been terminated once “significance” was reached. If you keep running the experiment when you don’t like the results, but stop the experiment when you like the results, the results are impossible to interpret. Statisticians have developed rigorous ways to handle sequential data analysis. These methods use much more stringent criteria to define “significance” to make up for the multiple comparisons. Without these special methods, you can’t interpret the results unless the sample size is set in advance.
Multiple Subgroups
Analyzing multiple subgroups of data is a form of multiple comparisons. When a treatment appears to work in some subgroups but not others, it is easy to be fooled. A simulated study by Lee and coworkers points out the problem. They pretended to compare survival following two “treatments” for coronary artery disease. They studied a group of real patients with coronary artery disease whom they randomly divided into two groups. In a real study, they would give the two groups different treatments, and compare survival.
In this simulated study, they treated the subjects identically but analyzed the data as if the two random groups actually represented two distinct treatments. As expected, the survival of the two groups was indistinguishable (3). They then divided the patients into six groups depending on whether they had disease in one, two, or three coronary arteries, and depending on whether the heart ventricle contracted normally or not. Since these are variables that are expected to affect survival of the patients, it made sense to evaluate the response to “treatment” separately in each of the six subgroups. Whereas they found no substantial difference in five of the subgroups, they found a striking result among the sickest patients. The patients with three-vessel disease who also had impaired ventricular contraction had much better survival under treatment B than treatment A. The difference between the two survival curves was statistically significant with a P value less than 0.025. If this were an actual study, it would be tempting to conclude that treatment B is superior for the sickest patients, and to recommend treatment B to those patients in the future. But this was not a real study, and the two “treatments” reflected only random assignment of patients. The two treatments were identical, so the observed difference was absolutely positively due to chance. It is not surprising that the authors found one low P value out of six comparisons. There is a 26% chance that one of six independent comparisons will have a P value less than 0.05, even if all null hypotheses are true. If all the subgroup comparisons are defined in advance, it is possible to correct for many comparisons – either as part of the analysis or informally while interpreting the results. But when this kind of subgroup analysis is not defined in advance, it becomes a form of “data torture”. Multiple PredictionsIn 2000, the Intergovernmental Panel on Climate Change made predictions about future climate. 
Pielke asked what seemed like a straightforward question: How accurate were those predictions over the next seven years? That’s not long enough to seriously assess predictions of global warming, but it is a necessary first step. Answering this question proved to be impossible. The problems are that the report contained numerous predictions, and didn’t specify which sources of climate data should be used. Did the predictions come true? The answer depends on the choice of which prediction to test and which data set you test it against -- “a feast for cherry pickers”. You can only evaluate the accuracy of predictions or diagnoses when the prediction, and the method or data source to compare it with, is unambiguous.
Combining Groups
When comparing two groups, the groups must be defined as part of the study design. If the groups are defined by the data, many comparisons are being made implicitly and the results cannot be interpreted. Austin and Goldwasser demonstrated this problem (4). They looked at the incidence of hospitalization for heart failure in Ontario (Canada) in twelve groups of patients defined by their astrological sign (based on their birthday). People born under the sign of Pisces happened to have the highest incidence of heart failure. They then did a simple statistical test to compare the incidence of heart failure among people born under Pisces with the incidence of heart failure among all others (born under all the other eleven signs, combined into one group). Taken at face value, this comparison showed that the difference in incidence rates is very unlikely to be due to chance (the P value was 0.026). Pisces have a “statistically significant” higher incidence of heart failure than do people born in the other eleven signs. The problem is that the investigators didn’t really test one hypothesis; they tested twelve. They only focused on Pisces after looking at the incidence of heart failure for people born under all twelve astrological signs.
So it isn’t fair to compare that one group against the others, without considering the other eleven implicit comparisons. After correcting for those multiple comparisons, there was no significant association between astrological sign and heart failure.
Summary
Multiple comparisons can be interpreted correctly only when all comparisons are planned, and all planned comparisons are published. These simple ideas are violated in many ways in common statistical practice.
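The claim in the sequential analyses section, that checking for "significance" after every batch of new data inflates the false positive rate, can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not a description of any study cited here. Both groups are drawn from the same Gaussian population (so the null hypothesis is true), the SD is treated as known so a simple z test can be used, and the batch sizes (start with 10 per group, add 5 at a time, stop at 100) are arbitrary choices of mine.

```python
import math
import random


def two_sided_p(z):
    """Two-sided P value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def peeking_finds_significance(rng, start_n=10, step=5, max_n=100, alpha=0.05):
    """Simulate one experiment in which both groups come from the same
    population. Test after every batch; return True if any look gives P < alpha."""
    a = [rng.gauss(0, 1) for _ in range(start_n)]
    b = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        n = len(a)
        # z test for a difference of means, assuming a known SD of 1 per group
        z = (sum(a) / n - sum(b) / n) / math.sqrt(2 / n)
        if two_sided_p(z) < alpha:
            return True   # stop as soon as the result looks "significant"
        if n >= max_n:
            return False  # out of budget, still "not significant"
        a.extend(rng.gauss(0, 1) for _ in range(step))
        b.extend(rng.gauss(0, 1) for _ in range(step))


def false_positive_rate(n_experiments=2000, seed=1, **kwargs):
    """Fraction of simulated null experiments that end up 'significant'."""
    rng = random.Random(seed)
    hits = sum(peeking_finds_significance(rng, **kwargs) for _ in range(n_experiments))
    return hits / n_experiments


# One planned analysis at n=100 per group versus repeated looks along the way.
fixed = false_positive_rate(start_n=100, max_n=100)
peeking = false_positive_rate()
print(f"single planned analysis: {fixed:.3f}")
print(f"test after every batch:  {peeking:.3f}")
```

With a single planned analysis the rate should land near the nominal 5%, while the stop-when-significant version comes out several times higher, even though no real difference exists.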
References:
1. Mills, J. L. 1993. Data torturing. New England Journal of Medicine 329, (16): 1196.
2. Turner, E. H., A. M. Matthews, E. Linardatos, R. A. Tell, and R. Rosenthal. 2008. Selective publication of antidepressant trials and its influence on apparent efficacy. The New England Journal of Medicine 358, (3) (Jan 17): 252-60.
3. Lee, K. L., J. F. McNeer, C. F. Starmer, P. J. Harris, and R. A. Rosati. 1980. Clinical judgment and statistics. Lessons from a simulated randomized trial in coronary artery disease. Circulation 61, (3) (Mar): 508-15.
4. Austin, P. C., and M. A. Goldwasser. 2008. Pisces did not have increased heart failure: Data-driven comparisons of binary proportions between levels of a categorical variable can result in incorrect statistical significance levels. Journal of Clinical Epidemiology 61, (3) (Mar): 295-300.
November 14, 2008
The joy of EPS
When submitting Prism 5 graphs or layouts to a journal, Prism offers many export formats. One format that is often overlooked, but should be your first choice, is EPS. Compared to TIF files, EPS files are compact and crisp.
EPS files contain the same PostScript information as PDF files, but with some headers that make them more compatible with the systems journals use to lay out pages. EPS files are based on vectors and fonts (not bitmaps), so they scale to any size. Of course, different journals use different systems and have different requirements. Many biological journals are produced by Cadmus, and we have heard that they accept EPS files from Prism. After creating an EPS file, and before sending it to a journal, you probably want to preview it. With a Mac, that is no problem. The Mac Preview program will let you view the EPS file (actually it converts it to PDF, and lets you preview it). With Windows, however, you won't be able to preview EPS files with standard software. However, EPS and PDF files are very similar (and are created by the same software module), so the solution is to also export in PDF format, and preview those files. The ability to export in EPS format is new to Prism 5.
November 13, 2008
When to not correct for multiple comparisons
Multiple comparisons can be accounted for with Bonferroni and other corrections, or by the approach of calculating the False Discovery Rate. But these approaches are not always needed. Here are three situations where special calculations are not needed.
Account for multiple comparisons when interpreting the results rather than in the calculations
Some statisticians recommend never correcting for multiple comparisons while analyzing data (1). Instead report all of the individual P values and confidence intervals, and make it clear that no mathematical correction was made for multiple comparisons. This approach requires that all comparisons be reported. When you interpret these results, you need to informally account for multiple comparisons. If all the null hypotheses are true, you’d expect 5% of the comparisons to have uncorrected P values less than 0.05. Compare this number to the actual number of small P values.
Corrections for multiple comparisons may not be needed if you make only a few planned comparisons
Other statisticians recommend not doing any formal corrections for multiple comparisons when the study focuses on only a few scientifically sensible comparisons, rather than every possible comparison. The term planned comparison is used to describe this situation. These comparisons must be designed into the experiment, and cannot be decided upon after inspecting the data.
Corrections for multiple comparisons are not needed when the comparisons are complementary
Ridker and colleagues (2) asked whether lowering LDL cholesterol would prevent heart disease in patients who did not have high LDL concentrations and did not have a prior history of heart disease (but did have an abnormal blood test suggesting the presence of some inflammatory disease). The study included almost 18,000 people. Half received a statin drug to lower LDL cholesterol and half received placebo.
The investigators' primary goal (planned as part of the protocol) was to compare the number of “end points” that occurred in the two groups, including deaths from a heart attack or stroke, nonfatal heart attacks or strokes, and hospitalization for chest pain. These events happened about half as often to people treated with the drug compared to people taking placebo. The drug worked. The investigators also analyzed each of the endpoints. Those taking the drug (compared to those taking placebo) had fewer deaths, and fewer heart attacks, and fewer strokes, and fewer hospitalizations for chest pain. The data from various demographic groups were then analyzed separately. Separate analyses were done for men and women, old and young, smokers and nonsmokers, people with hypertension and without, people with a family history of heart disease and those without. In each of 25 subgroups, patients receiving the drug experienced fewer primary endpoints than those taking placebo, and all these effects were statistically significant. The investigators made no correction for multiple comparisons for all these separate analyses of outcomes and subgroups. No corrections were needed, because the results are so consistent. The multiple comparisons each ask the same basic question a different way, and all the comparisons point to the same conclusion: people taking the drug had less cardiovascular disease than those taking placebo.
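The informal accounting suggested above (compare the number of small P values you observe with the number you would expect by chance) takes only a line or two of arithmetic. The Python sketch below is a minimal illustration. The function names are mine, and the second calculation assumes the comparisons are independent, which subgroup analyses of the same patients generally are not, so treat it as a rough guide only.

```python
def expected_false_positives(n_comparisons, alpha=0.05):
    """Expected number of P values below alpha if every null hypothesis is true."""
    return n_comparisons * alpha


def chance_of_at_least_one(n_comparisons, alpha=0.05):
    """Chance that at least one P value falls below alpha, assuming all null
    hypotheses are true and the comparisons are independent (an assumption)."""
    return 1 - (1 - alpha) ** n_comparisons


for m in (1, 6, 12, 25):
    print(f"{m:>2} comparisons: expect {expected_false_positives(m):.2f} small P values, "
          f"chance of at least one = {chance_of_at_least_one(m):.0%}")
```

If all 25 of 25 comparisons give small P values pointing in the same direction, that is far more than the one or two that chance alone would be expected to produce.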
References
1. Rothman, K.J. (1990). No adjustments are needed for multiple comparisons. Epidemiology, 1: 43-46.
2. Ridker. Rosuvastatin to Prevent Vascular Events in Men and Women with Elevated C-Reactive Protein. N Engl J Med (2008) vol. 359 pp. 3195
November 12, 2008
Using InStat's help with Windows Vista.
The Windows versions of InStat and StatMate were written before Vista, and use the older .HLP style of online help. Viewing this help requires the Windows Help Viewer, and this is no longer a standard part of Windows Vista. If you try to access help, Windows presents an error message with a link to a page on microsoft.com. Follow the instructions on that page to download and install the Windows Help program (WinHlp32.exe) for Windows Vista. Once that program is installed, the Help for InStat and StatMate will work just fine. Do note that you must be logged into Windows as an administrator to install that program. If you don't want to fuss with installing Windows components, all the same information that is in the Help system is also in this free InStat pdf manual.
October 20, 2008
Is it better to plot graphs with SD or SEM error bars?
Neither! There are better alternatives, depending on your goal.
If you want to show the variation in your data:
If each value represents a different individual, you probably want to show the variation among values. Even if each value represents a different lab experiment, it often makes sense to show the variation. With fewer than 100 or so values, create a scatter plot that shows every value. What better way to show the variation among values than to show every value? If your data set has more than 100 or so values, a scatter plot becomes messy. Alternatives are to show a box-and-whiskers plot, a frequency distribution (histogram), or a cumulative frequency distribution. What about plotting mean and SD? The SD does quantify variability, so this is indeed one way to graph variability. But a SD is only one value, so is a pretty limited way to show variation. A graph showing mean and SD error bar is less informative than any of the other alternatives, but takes no less space and is no easier to interpret. I see no advantage to plotting a mean and SD rather than a column scatter graph, box-and-whiskers plot, or a frequency distribution. Of course, if you do decide to show SD error bars, be sure to say so in the figure legend so no one will think it is a SEM.
If you want to show how precisely you have determined the mean:
If your goal is to compare means with a t test or ANOVA, or to show how closely your data come to the predictions of a model, you may be more interested in showing how precisely the data define the mean than in showing the variability. In this case, the best approach is to plot the 95% confidence interval of the mean (or perhaps a 90% or 99% confidence interval). What about the standard error of the mean (SEM)?
Graphing the mean with SEM error bars is a commonly used method to show how well you know the mean. The only advantage of SEM error bars is that they are shorter, but SEM error bars are harder to interpret than a confidence interval. Whatever error bars you choose to show, be sure to state your choice.
If you want to create persuasive propaganda:
If your goal is to emphasize small and unimportant differences in your data, show your error bars as SEM, and hope that your readers think they are SD. If your goal is to cover up large differences, show the error bars as the standard deviations for the groups, and hope that your readers think they are standard errors. This approach was advocated by Steve Simon in his excellent weblog. Of course he meant it as a joke. If you don't understand the joke, review the differences between SD and SEM.
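For anyone who wants to review that difference, here is a minimal Python sketch. The nine values are hypothetical, made up for illustration. The SD describes the scatter among the values, while the SEM is the SD divided by the square root of n, so it is always smaller and shrinks as the sample grows.

```python
import math
import statistics

# Hypothetical measurements, made up for illustration (n = 9).
values = [94, 101, 87, 110, 98, 105, 92, 113, 100]

n = len(values)
mean = statistics.mean(values)
sd = statistics.stdev(values)   # sample SD, n-1 in the denominator
sem = sd / math.sqrt(n)         # standard error of the mean

print(f"n={n} mean={mean:.1f} SD={sd:.2f} SEM={sem:.2f}")
```

With n=9 the SEM error bar is one third the length of the SD error bar, which is why mislabeled error bars can make the same data look either tidy or noisy.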
The confidence interval of a standard deviation.
The SD of a sample is not the same as the SD of the population
It is straightforward to calculate the standard deviation from a sample of values. But how accurate is that standard deviation? Just by chance you may have happened to obtain data that are closely bunched together, making the SD low. Or you may have randomly obtained values that are far more scattered than the overall population, making the SD high. The SD of your sample does not equal, and may be quite far from, the SD of the population.
Confidence intervals are not just for means
You are probably already familiar with a confidence interval of a mean. The idea of a confidence interval is very general, and you can express the precision of any computed value as a 95% confidence interval (CI). Another example is a confidence interval of a best-fit value from regression, for example a confidence interval of a slope.
The 95% CI of the SD
The SD is just a value you compute from data. It's not done often, but it is certainly possible to compute a CI for a SD. A free GraphPad QuickCalc does the work for you. Interpreting the CI of the SD is straightforward. If you assume that your data were randomly and independently sampled from a Gaussian distribution, you can be 95% sure that the CI computed from the sample SD contains the true population SD. How wide is the CI of the SD? Of course the answer depends on sample size (N). With small samples, the interval is quite wide as shown in the table below.
N       95% CI of SD
2       0.45*SD to 31.9*SD
3       0.52*SD to 6.29*SD
5       0.60*SD to 2.87*SD
10      0.69*SD to 1.83*SD
25      0.78*SD to 1.39*SD
50      0.84*SD to 1.25*SD
100     0.88*SD to 1.16*SD
500     0.94*SD to 1.07*SD
1000    0.96*SD to 1.05*SD
Example
The sample standard deviation computed from the five values shown in the graph above is 18.0. But the true standard deviation of the population from which the values were sampled might be quite different. From the n=5 row of the table, the 95% confidence interval extends from 0.60 times the SD to 2.87 times the SD. Thus the 95% confidence interval ranges from 0.60*18.0 to 2.87*18.0, from 10.8 to 51.7. When you compute a SD from only five values, the upper 95% confidence limit for the SD is almost five times the lower limit. Most people are surprised that small samples define the SD so poorly. Random sampling can have a huge impact with small data sets, resulting in a calculated standard deviation quite far from the true population standard deviation. Note that the confidence intervals are not symmetrical. Why? Since the SD is always a positive number, the lower confidence limit can't be less than zero. This means that the upper confidence limit usually extends further above the sample SD than the lower limit extends below the sample SD. With small samples, this asymmetry is quite noticeable.
Computing the CI of a SD with Excel
These Excel equations compute the confidence interval of a SD. N is sample size; alpha is 0.05 for 95% confidence, 0.01 for 99% confidence, etc.:
Lower limit: =SD*SQRT((N-1)/CHIINV((alpha/2), N-1))
Upper limit: =SD*SQRT((N-1)/CHIINV(1-(alpha/2), N-1))
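One way to convince yourself that these intervals behave as advertised is to simulate. The Python sketch below is an illustration of mine, not a GraphPad calculation. It draws many samples of n=5 from a Gaussian population whose SD is known (the population mean of 100, SD of 15, number of simulations, and random seed are arbitrary assumptions), builds the interval from the rounded n=5 multipliers in the table (0.60 and 2.87), and counts how often the interval contains the true SD.

```python
import random
import statistics


def sd_interval_coverage(n, lower_mult, upper_mult,
                         true_sd=15.0, n_sims=20000, seed=1):
    """Fraction of simulated samples whose interval
    (lower_mult*SD to upper_mult*SD) contains the true population SD."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_sims):
        sample = [rng.gauss(100.0, true_sd) for _ in range(n)]
        sd = statistics.stdev(sample)  # sample SD, n-1 in the denominator
        if lower_mult * sd <= true_sd <= upper_mult * sd:
            covered += 1
    return covered / n_sims


# Worked example from the text: a sample SD of 18.0 with n=5.
print(f"95% CI of SD: {0.60 * 18.0:.1f} to {2.87 * 18.0:.1f}")

coverage = sd_interval_coverage(5, 0.60, 2.87)
print(f"fraction of intervals containing the true SD: {coverage:.3f}")
```

The coverage should come out close to 0.95. It will not be exact, both because of random sampling and because the table multipliers are rounded to two decimal places.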
Statistics with n=2
A first step towards analyzing data is often to compute the SD, SEM and confidence interval of the mean. It seems to be common lab folklore that these calculations are not valid for n=2. This page explains that this folklore is wrong.

With only two values, there really is not much point in displaying a mean with SD or SEM, as you can display the actual data in the same amount of space. In fact, there are better alternatives to plotting either the SD or the SEM. But if you do want to show a SD or SEM, the equations that calculate the SD, SEM and CI all work just fine when you have only duplicate (n=2) data.

Are the results valid? It is known that the sample SD computed from small samples underestimates, on average, the true population SD. But the discrepancy is small compared to the random variability inherent in collecting tiny data sets. The discrepancy only applies to the SD. The variance, which is the SD squared, is unbiased even for n=2.

To demonstrate the validity of n=2 calculations, I simulated 10,000 data sets with n=2, with each value randomly chosen from a Gaussian distribution (GraphPad QuickCalcs can do this, as can Excel). First I computed the 95% confidence interval of the mean for each data set and asked whether the interval included the true value. When analyzing real data, you can't answer this question. But here the data are simulated from a known population, so we know what the true population mean is. In 95.02% of these simulations, the confidence interval of the mean included the true population mean. So a confidence interval of a mean computed from an n=2 sample can be interpreted as it usually is. The only problem with having only duplicate data is that the confidence interval is so very wide.

Using the simulated data, I started to ask whether the calculated sample SD was a good estimate of the true SD. But it is known that the sample SD, on average, is too small when n is small. That doesn't really matter, since all statistical tests (t test, ANOVA) are actually based on the variance (the square of the SD). For these reasons, I used simulations to ask whether the sample variance from an n=2 sample is unbiased. For each of the 10,000 simulated data sets I computed the variance from the two values. The average of these 10,000 variances was within 1% of the true variance of the population from which the data were simulated. This shows that the variance computed from n=2 data is a valid assessment of the scatter in your data, no less valid than one computed from data with larger n.
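A simulation along these lines can be sketched in Python. This is not the original simulation, only a minimal reconstruction under stated assumptions: the population mean of 100 and SD of 15, the seed, and the function name simulate_duplicates are all chosen here for illustration. The 95% critical value of t with one degree of freedom is obtained from the Cauchy quantile function, since the t distribution with 1 df is the Cauchy distribution.

```python
import math
import random
import statistics

def simulate_duplicates(n_sims=10_000, mu=100.0, sigma=15.0, seed=1):
    """Simulate n=2 Gaussian samples (mu, sigma and seed are assumed values).

    Returns the fraction of 95% CIs of the mean that contain mu, and the
    ratio of the average sample variance to the true variance.
    """
    rng = random.Random(seed)
    # 97.5th percentile of t with 1 df (Cauchy quantile), about 12.71
    t_crit = math.tan(math.pi * 0.475)
    covered = 0
    variances = []
    for _ in range(n_sims):
        pair = [rng.gauss(mu, sigma), rng.gauss(mu, sigma)]
        mean = statistics.fmean(pair)
        var = statistics.variance(pair)      # n-1 in the denominator
        sem = math.sqrt(var / 2)
        if abs(mean - mu) <= t_crit * sem:   # does the CI include the true mean?
            covered += 1
        variances.append(var)
    return covered / n_sims, statistics.fmean(variances) / sigma ** 2

if __name__ == "__main__":
    coverage, variance_ratio = simulate_duplicates()
    print(f"CI coverage: {coverage:.2%}; mean variance / true variance: {variance_ratio:.3f}")
```

The exact numbers depend on the seed, but the coverage should land close to 95% and the variance ratio close to 1.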
The SD computed from tiny samples underestimates the population SD (but not by much)
The standard deviation (SD) quantifies scatter. The equation used to compute the sample SD (which uses n-1 in the denominator) underestimates the true population SD by a small amount. The following simulation demonstrates this.

The graph shows the results of 400 simulations. Each simulation randomly sampled from a Gaussian population with mean=100 and SD=15. One hundred samples had only duplicate values (n=2, left panel of graph). Another 100 each had n=3, n=10 and n=50. Each dot on the graph shows the SD of one randomly generated sample. The long horizontal line shows the true population SD, which is 15.0. The shorter horizontal lines show the mean of the SDs from the 100 simulated samples for each sample size. You can see that the mean SD is a bit too low for the n=2 and n=3 samples. This is not just a glitch due to random sampling, but rather is a consistent finding.

The SD is the square root of variance. The equation that computes variance (with n-1 in the denominator) is correct. The average of the variances of these simulated samples is indeed very close to the true population variance. Taking the square root of the variances to compute the SD reduces the large variances more than the small, so the mean of the SDs underestimates the true population SD.

An unbiased estimate of the population SD equals the computed sample SD divided by a quantity known as c4 (the c, I think, is for control chart; I don't know why it is called c4). The value of c4, of course, depends on sample size. It is computed with this Excel formula:
=EXP(GAMMALN(N/2)+LN(SQRT(2/(N-1)))-GAMMALN((N-1)/2))
With n=2, the computed SD is too low by about 20%. With n=10, the discrepancy is only about 3%. Other values are tabulated below:
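Since c4 depends only on sample size, it can also be computed outside Excel. The sketch below translates the Excel formula into Python using math.lgamma, and adds a small simulation to check that the mean of many sample SDs is close to c4 times the population SD. The function names, the number of simulated samples and the seed are illustrative assumptions; the population (mean 100, SD 15) follows the simulation described above.

```python
import math
import random
import statistics

def c4(n):
    """Expected value of (sample SD / population SD) for Gaussian samples of size n."""
    return math.exp(math.lgamma(n / 2) - math.lgamma((n - 1) / 2)) * math.sqrt(2 / (n - 1))

def mean_sample_sd(n, n_samples=20_000, mu=100.0, sigma=15.0, seed=1):
    """Average the sample SD (n-1 denominator) over many simulated Gaussian samples."""
    rng = random.Random(seed)
    sds = [
        statistics.stdev([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(n_samples)
    ]
    return statistics.fmean(sds)

if __name__ == "__main__":
    for n in (2, 3, 10, 50):
        print(f"n={n:2d}  c4={c4(n):.4f}  "
              f"expected mean SD={15 * c4(n):.2f}  simulated mean SD={mean_sample_sd(n):.2f}")
```

Dividing a sample SD by c4(n) gives the corrected estimate described above. The simulated values vary with the seed, so only approximate agreement with 15*c4 should be expected.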
Prism and InStat compute the sample standard deviation without the correction detailed above. They don't even offer the option of including the c4 correction. Few programs do. Why is this correction commonly ignored?
October 16, 2008
How does Prism compute the % of total variation in two-way ANOVA?
As part of two-way ANOVA, Prism reports the % of total variation accounted for by the interaction, the column factor and the row factor. These values are computed by dividing each sum-of-squares from the ANOVA table by the total sum-of-squares. The three values do not total 100% because Prism does not report the % of total variation accounted for by the residual (or error) part of the ANOVA table. If that were included too, the percentages would add to 100.

These values (% of total variation) are called standard omega squared by Sheskin (equations 27.51 - 27.53), and R2 by Maxwell and Delaney (page 295). Others call them eta squared or the correlation ratio.

Prism simply reports how the total sum of squares is partitioned into the various components in your particular sample of data. Like R2 in linear regression, this is simply a description of your data and not a best guess of a parameter in the population. It is possible to compute the best guess for the population value. This is called omega squared (to be distinguished from the standard omega squared), but Prism does not compute it.
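The arithmetic described above can be sketched in a few lines of Python. This is an illustration, not Prism's code: the function name and the sums-of-squares in the example are hypothetical, and the total sum-of-squares is assumed to equal the sum of the four components, as it does in a balanced design.

```python
def percent_of_total_variation(ss_interaction, ss_row, ss_column, ss_residual):
    """Divide each sum-of-squares by the total and express it as a percentage.

    Assumes the total sum-of-squares is the sum of the four components.
    """
    components = {
        "interaction": ss_interaction,
        "row factor": ss_row,
        "column factor": ss_column,
        "residual": ss_residual,
    }
    ss_total = sum(components.values())
    return {name: 100.0 * ss / ss_total for name, ss in components.items()}

if __name__ == "__main__":
    # Hypothetical sums-of-squares, chosen only to illustrate the calculation
    percents = percent_of_total_variation(12.5, 40.0, 30.0, 17.5)
    for name, pct in percents.items():
        print(f"{name}: {pct:.1f}%")
```

The first three percentages correspond to the values Prism reports. They fall short of 100% by exactly the residual share.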