Statistical Test Assumptions & Technical Details
Stats iQ selects statistical tests with the goal of making statistical testing intuitive and error-free.
This page describes the overarching themes of Stats iQ’s approach, and the sections that follow describe specific decisions for specific tests.
Basic Assumptions
Whenever possible, Stats iQ defaults to tests that have fewer assumptions. For example, independent samples t-tests can be calculated in several ways, depending on whether equal sample sizes or equal variances are assumed. Stats iQ runs the test with the fewest assumptions.
In addition, Stats iQ intelligently mitigates violations of the assumptions of statistical tests. For example, t-tests on relatively small samples require normally distributed data to be accurate; outliers or non-normal distributions can create misleading results. Every datapoint in
[1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10]
is lower than every datapoint in
[11, 12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 17, 17, 18, 19, 2000]
but an independent samples t-test on those groups does not yield a statistically significant difference because the outlier 2000 violates t-test assumptions. Stats iQ notices the outlier and recommends a ranked t-test instead, which does yield a very clear difference between the groups.
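That behavior can be reproduced with a short Python sketch using SciPy (illustrative only; this is not Stats iQ’s own code):

```python
# Sketch of the example above using SciPy (not Stats iQ's own code).
from scipy import stats

a = [1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10]
b = [11, 12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 17, 17, 18, 19, 2000]

# Welch's t-test on the raw values: the outlier 2000 inflates group b's
# variance, so the p-value is not significant despite the obvious separation.
print(stats.ttest_ind(a, b, equal_var=False))

# Rank-transform the pooled data, then run the same Welch's t-test on the
# ranks (a "ranked t-test"): the difference is now clearly significant.
ranks = stats.rankdata(a + b)
print(stats.ttest_ind(ranks[:len(a)], ranks[len(a):], equal_var=False))
```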
Rank Transformations
Stats iQ frequently uses the rank transform method for running nonparametric tests when violations of parametric test assumptions are detected. Stats iQ’s rank transformation replaces values with their rank ordering—for example
[86, 95, 40] is transformed to [2, 3, 1]
—then runs the typical parametric test on the transformed data. Tied values are given the average rank of the tied values, so
[11, 35, 35, 52] becomes [1, 2.5, 2.5, 4].
Most commonly encountered in the difference between Pearson and Spearman correlations, rank-transformed tests are robust to non-normal distributions and outliers, and they are conceptually simpler than the slightly more common dedicated nonparametric tests.
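For reference, the same average-rank handling of ties can be reproduced with SciPy’s rankdata function (an illustration, not Stats iQ’s code):

```python
# Average-rank handling of ties, as described above (illustration only).
from scipy.stats import rankdata

print(rankdata([86, 95, 40]))      # -> [2. 3. 1.]
print(rankdata([11, 35, 35, 52]))  # -> [1.  2.5 2.5 4. ]
```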
ANOVA
When users select 1 categorical variable with 3 or more groups and 1 continuous or discrete variable, Stats iQ runs a one-way ANOVA (Welch’s F test) and a series of pairwise “post hoc” tests (Games-Howell tests). The one-way ANOVA tests for an overall relationship between the 2 variables, and the pairwise tests compare each possible pair of groups to see if one group tends to have higher values than the other.
Assumptions of Welch’s F Test ANOVA
Stats iQ recommends an unranked Welch’s F test if several assumptions about the data hold:
- The sample size is greater than 10 times the number of groups in the calculation (groups with only 1 value are excluded), and therefore the Central Limit Theorem satisfies the requirement for normally distributed data.
- There are few or no outliers in the continuous/discrete data.
Unlike the slightly more common F test for equal variances, Welch’s F test does not assume that the variances of the groups being compared are equal. Assuming equal variances leads to less accurate results when variances are not in fact equal, and the two tests give very similar results when variances are actually equal (Tomarken and Serlin, 1986).
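For readers who want to see the mechanics, below is a minimal sketch of Welch’s F test; the helper function and the group data are illustrative, not Stats iQ’s implementation:

```python
# Minimal sketch of Welch's F test (one-way ANOVA without the equal-variance
# assumption). Illustrative only; not Stats iQ's implementation.
import numpy as np
from scipy import stats

def welch_anova(groups):
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    w = n / variances                      # per-group weights
    grand_mean = np.sum(w * means) / np.sum(w)

    numerator = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    lam = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1, df2 = k - 1, (k ** 2 - 1) / (3 * lam)
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, df1, df2, p_value

# Hypothetical example data with unequal variances.
groups = [[4, 5, 6, 5, 4, 6], [7, 9, 8, 10, 7, 9], [5, 12, 3, 14, 6, 11]]
print(welch_anova(groups))
```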
Ranked ANOVA
When assumptions are violated, the unranked ANOVA may no longer be valid. In that case, Stats iQ recommends the ranked ANOVA (also called “ANOVA on ranks”); Stats iQ rank-transforms the data (replaces values with their rank ordering) and then runs the same ANOVA on that transformed data.
The ranked ANOVA is robust to outliers and non-normally distributed data. Rank transformation is a well-established method for protecting against assumption violation (a “nonparametric” method), and is most commonly seen in the difference between Pearson and Spearman correlation. Rank transformation followed by Welch’s F test is similar in effect to the Kruskal-Wallis Test (Zimmerman, 2012).
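As a rough illustration of that similarity (hypothetical data; not Stats iQ’s code):

```python
# Illustration: rank-transforming and rerunning the ANOVA behaves much like
# the Kruskal-Wallis test (hypothetical data, not Stats iQ's code).
from scipy import stats

g1 = [1, 2, 3, 4, 5, 6]
g2 = [7, 8, 9, 10, 11, 12]
g3 = [13, 14, 15, 16, 17, 5000]   # outlier violates the unranked assumptions

# Classical one-way ANOVA on the raw values is distorted by the outlier.
print(stats.f_oneway(g1, g2, g3))

# ANOVA on the pooled ranks (the classical F is used here for brevity;
# Stats iQ uses Welch's F) recovers the clear ordering, as does Kruskal-Wallis.
ranks = stats.rankdata(g1 + g2 + g3)
print(stats.f_oneway(ranks[:6], ranks[6:12], ranks[12:]))
print(stats.kruskal(g1, g2, g3))
```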
The effect size indicates whether the difference between the groups’ averages is large enough to have practical meaning, regardless of whether it is statistically significant. Note that Stats iQ’s ranked and unranked ANOVA effect sizes (Cohen’s f) are calculated using the F value from the F test for equal variances.
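One standard way to derive Cohen’s f from an equal-variance F statistic and its degrees of freedom is sketched below; the numbers are hypothetical, and this is not necessarily Stats iQ’s exact formula:

```python
# One standard conversion from an equal-variance F statistic to Cohen's f
# (hypothetical numbers; not necessarily Stats iQ's exact formula).
import math

def cohens_f_from_F(f_stat, df_between, df_within):
    eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)
    return math.sqrt(eta_squared / (1 - eta_squared))

# e.g. F = 4.2 with 2 and 57 degrees of freedom
print(cohens_f_from_F(4.2, 2, 57))   # roughly 0.38
```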
Assumptions of Games-Howell Pairwise Test
Stats iQ runs Games-Howell tests regardless of the outcome of the ANOVA test (as per Zimmerman, 2010). Stats iQ shows unranked or ranked Games-Howell pairwise tests based on the same criteria as those used for ranked vs. unranked ANOVA; so if you see “Ranked ANOVA” in the advanced output, the pairwise tests will also be ranked.
The Games-Howell test is essentially a t-test for unequal variances that accounts for the heightened likelihood of finding statistically significant results by chance when running many pairwise tests. Unlike the slightly more common Tukey’s b test, the Games-Howell test does not assume that the variances of the groups being compared are equal. Assuming equal variances leads to less accurate results when variances are not in fact equal, and the two tests give very similar results when variances are actually equal (Howell, 2012).
Note that while the unranked pairwise test tests for the equality of the means of the 2 groups, the ranked pairwise test does not explicitly test for differences between the groups’ means or medians. Rather, it tests for a general tendency of one group to have larger values than the other.
Additionally, while Stats iQ does not show results of pairwise tests for any group with fewer than 4 values, those groups are included in calculating the degrees of freedom for the other pairwise tests.
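Below is a minimal sketch of a single Games-Howell comparison using the studentized range distribution; the group data are hypothetical, and this is not Stats iQ’s implementation:

```python
# Minimal sketch of one Games-Howell pairwise comparison (illustrative only).
import numpy as np
from scipy import stats

def games_howell_pair(x, y, k):
    """Compare two groups out of k total groups."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    vx, vy = x.var(ddof=1) / nx, y.var(ddof=1) / ny

    # Studentized-range statistic with a Welch-Satterthwaite df correction.
    q = abs(x.mean() - y.mean()) / np.sqrt((vx + vy) / 2)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    p = stats.studentized_range.sf(q, k, df)
    return q, df, p

a = [4, 5, 6, 5, 4, 6, 5]
b = [8, 9, 10, 9, 11, 8, 10]
print(games_howell_pair(a, b, k=3))   # k = total number of groups in the ANOVA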
Additional ANOVA Considerations
- With smaller sample sizes, data can still be visually inspected to determine whether it is in fact normally distributed; if it is, unranked ANOVA results are still valid even for small samples. In practice, this assessment can be difficult to make, so Stats iQ recommends the ranked ANOVA by default for small samples.
- With larger sample sizes, outliers are less likely to negatively affect results. Stats iQ uses Tukey’s “outer fence” to define outliers as points more than 3 times the interquartile range above the 75th or below the 25th percentile point (see the sketch after this list).
- Data like “Highest level of education completed” or “Finishing order in a marathon” are unambiguously ordinal. Though Likert scales (like a 1 to 7 scale where 1 is Very dissatisfied and 7 is Very satisfied) are technically ordinal, it is common practice in social sciences to treat them as though they are continuous (i.e., with an unranked ANOVA).
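As referenced in the list above, a sketch of the outer-fence rule on hypothetical data:

```python
# Sketch of Tukey's "outer fence" rule described above (hypothetical data).
import numpy as np

values = np.array([3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 250])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 3 * iqr, q3 + 3 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)   # the value 250 falls outside the outer fences
```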
Stats iQ Contingency Tables
When users select 2 categorical variables, Stats iQ assesses whether those 2 variables are statistically related. Stats iQ runs Fisher’s exact test when possible, and otherwise runs Pearson’s chi-squared test (typically just called “chi-squared”).
Chi-squared vs. Fisher’s Exact Test
Fisher’s exact test is unbiased whenever it can be run, but it is computationally difficult to run if the table is larger than 2 x 2 or the sample size is greater than 10,000 (even with modern computing). Chi-squared tests can have biased results when sample sizes are low (technically, when expected cell counts are below 5).
Fortunately, the 2 tests are complementary: Fisher’s exact test is typically easy to calculate when chi-squared tests are biased (small samples), and chi-squared tends to be unbiased when Fisher’s exact test is difficult to calculate (large samples). Because larger tables with small samples can still create issues (and Stats iQ cannot run Fisher’s exact test in those cases), Stats iQ alerts users to potential complications.
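On a small, hypothetical 2 x 2 table, both tests are easy to run with SciPy and illustrate the trade-off (illustrative only; not Stats iQ’s code):

```python
# Hypothetical 2 x 2 table: rows are groups, columns are Yes/No responses.
from scipy.stats import fisher_exact, chi2_contingency

table = [[8, 2],
         [1, 9]]

# Fisher's exact test: unbiased even at this small sample size.
odds_ratio, p_fisher = fisher_exact(table)
print(p_fisher)

# Pearson's chi-squared test: some expected cell counts here are below 5,
# which is exactly when its results can be biased.
chi2, p_chi2, dof, expected = chi2_contingency(table)
print(p_chi2, expected)
```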
Adjusted Residuals
Like other statistical software, Stats iQ uses adjusted residuals to assess whether or not an individual cell is statistically significantly above or below expectations. Essentially the adjusted residual asks, “Does this cell have more values in it than I’d expect if there were no relationship between these 2 variables?”
If you have the data displayed such that each column sums to 100%, you can say “The proportion of Finance/Banking respondents who said they ‘Love their job’ is lower than typical, relative to respondents from other industries.”
Stats iQ shows up to 3 arrows, depending on the p-value calculated from the adjusted residual. Specifically, Stats iQ shows 1 arrow if the p-value is less than alpha (1 – confidence level), 2 arrows if the p-value is less than alpha/5, and 3 arrows if the p-value is less than alpha/50. For example, if your confidence level is set to 95%:
- p-value <= .05: 1 arrow
- p-value <= .01: 2 arrows
- p-value <= .001: 3 arrows
The calculation of the adjusted residual, and its comparison to specific alpha levels, can be labeled a “z-test” or a “z-test for a sample percentage.” The literature more typically says simply that conclusions were based on adjusted residuals.
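Below is a sketch of how adjusted residuals and their p-values can be computed; the table counts are hypothetical, and this is not Stats iQ’s code:

```python
# Adjusted residuals for a contingency table (hypothetical counts).
import numpy as np
from scipy import stats

observed = np.array([[30, 20, 10],
                     [20, 25, 35]], dtype=float)

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / n

# Adjusted (standardized) residual for each cell.
adjusted = (observed - expected) / np.sqrt(
    expected * (1 - row_totals / n) * (1 - col_totals / n)
)

# Two-sided p-values from the standard normal ("z-test"), which drive the
# 1/2/3-arrow display described above.
p_values = 2 * stats.norm.sf(np.abs(adjusted))
print(np.round(adjusted, 2))
print(np.round(p_values, 4))
```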
Confidence Intervals
For all binomial confidence intervals, including contingency tables and in Category Describe bar charts, Stats iQ calculates the confidence interval using the Wilson Score Interval.
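A minimal sketch of the Wilson score interval for a single proportion (the counts are hypothetical; this is not Stats iQ’s code):

```python
# Wilson score interval for a binomial proportion (illustrative only).
import math

def wilson_interval(successes, n, z=1.96):
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width

# e.g. 40 "Yes" answers out of 150 respondents, 95% confidence
print(wilson_interval(40, 150))
```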
Stats iQ Correlations
When users select 2 continuous or discrete variables, Stats iQ runs a correlation to assess whether those 2 groups are statistically related. Stats iQ defaults to calculating Pearson’s r, the most common type of correlation; if the assumptions of that test are not met, Stats iQ recommends a ranked version of the same test, calculating Spearman’s rho. Additionally, Stats iQ uses the Fisher Transformation to calculate confidence intervals for the correlation coefficient.
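The pieces mentioned above can be sketched as follows on hypothetical data (not Stats iQ’s code): Pearson’s r, Spearman’s rho, and a Fisher-transformation confidence interval for r.

```python
# Pearson's r, Spearman's rho, and a Fisher-transformation confidence
# interval for r (hypothetical data; not Stats iQ's code).
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.7, 9.9, 11.3])

r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)

# 95% confidence interval for r via the Fisher transformation.
z = np.arctanh(r)
se = 1 / np.sqrt(len(x) - 3)
ci = np.tanh([z - 1.96 * se, z + 1.96 * se])

print(r, rho, ci)
```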
Assumptions of Pearson’s r
Stats iQ recommends Pearson’s r as a valid measure of correlation if certain assumptions about the data are met:
- There are no outliers in the continuous/discrete data.
- The relationship between the variables is linear (e.g., y = 2x, not y = x^2).
Stats iQ does not display a line of best fit when it detects a violation of these assumptions.
Ranked Correlation (Spearman’s Rho)
When assumptions are violated, the Pearson’s r may no longer be a valid measure of correlation. In that case, Stats iQ recommends Spearman’s rho; Stats iQ rank-transforms the data (replaces values with their rank ordering) then runs the typical correlation. Rank transformation is a well-established method for protecting against assumption violation (a “nonparametric” method), and the rank transformation from Pearson to Spearman is the most common (Conover and Iman, 1981). Note that Spearman’s rho still assumes that the relationship between the variables is monotonic.
Additional Considerations for Correlations
- With larger sample sizes, outliers are less likely to negatively affect results. Stats iQ uses Tukey’s “outer fence” to define outliers as points more than 3 times the interquartile range above the 75th or below the 25th percentile point.
- Stats iQ identifies a relationship as nonlinear when Spearman’s rho > 1.1 * Pearson’s r and Spearman’s rho is statistically significant (see the sketch after this list).
- Though Likert scales (like a 1 to 7 scale where 1 is Very dissatisfied and 7 is Very satisfied) are technically ordinal, it is common practice in social sciences to treat them as though they are continuous (i.e., using Pearson’s r).
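As referenced in the list above, a sketch of that nonlinearity heuristic on hypothetical data (not Stats iQ’s code):

```python
# Sketch of the nonlinearity heuristic described above (hypothetical data).
import numpy as np
from scipy import stats

x = np.arange(1, 21, dtype=float)
y = np.exp(x)                         # monotonic but strongly nonlinear

r, _ = stats.pearsonr(x, y)           # roughly 0.5 here
rho, p_rho = stats.spearmanr(x, y)    # exactly 1.0 here

# Flag the relationship as nonlinear per the rule above.
is_nonlinear = (rho > 1.1 * r) and (p_rho < 0.05)
print(round(r, 3), round(rho, 3), is_nonlinear)   # the flag is True here
```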
Independent Samples T-Test
This unranked t-test is the most common form of t-test. A t-test’s statistical significance indicates whether or not the difference between 2 groups’ averages most likely reflects a “real” difference in the population from which the groups were sampled.
A statistically significant t-test result is one in which the difference between 2 groups is unlikely to have occurred by chance. Statistical significance is determined by the size of the difference between the group averages, the sample size, and the standard deviations of the groups. For practical purposes, statistical significance suggests that the 2 populations from which we sample are actually different.
Example: Let’s say you’re interested in whether the average American spends more than the average Canadian per month on movies. You ask a sample of 3 people from each country about their movie spending. You may observe a difference in those averages, but that difference is not statistically significant; it could simply be the luck of who you happened to sample that makes one group appear to spend more money than the other. If instead you ask 300 Americans and 300 Canadians and still see a big difference, that difference is less likely to be caused by an unrepresentative sample.
Note that if you asked 300,000 Americans and 300,000 Canadians, the result would likely be statistically significant even if the difference between the groups was only a penny. The t-test’s effect size complements its statistical significance by describing the magnitude of the difference, whether or not the difference is statistically significant.
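One common effect size for this situation is Cohen’s d, sketched below on hypothetical spending figures (an illustration, not Stats iQ’s implementation):

```python
# Cohen's d as a measure of the magnitude of a difference between two groups
# (hypothetical spending figures; not Stats iQ's code).
import math
import statistics

def cohens_d(a, b):
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

americans = [22, 30, 18, 25, 27, 35, 20, 24]
canadians = [15, 19, 22, 17, 20, 14, 18, 21]
print(cohens_d(americans, canadians))
```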
Welch’s T-Test
When users want to relate a binary variable to a continuous or discrete variable, Stats iQ runs a two-tailed t-test (all statistical testing in Qualtrics is two-tailed, where applicable) to assess whether either of the 2 groups tends to have higher values than the other for the continuous/discrete variable. Stats iQ defaults to the Welch’s t-test, also known as the t-test for unequal variances; if the assumptions of that test are not met, Stats iQ recommends a ranked version of the same test.
Assumptions of Welch’s T-Test
Stats iQ recommends Welch’s t-test (hereafter “t-test”) if several assumptions about the data hold:
- The sample size of each group is above 15 (and therefore the Central Limit Theorem satisfies the requirement for normally distributed data).
- There are few or no outliers in the continuous/discrete data.
Unlike the slightly more common t-test for equal variances, Welch’s t-test does not assume that the variances of the 2 groups being compared are equal. Modern computing has made that assumption unnecessary. Furthermore, assuming equal variances leads to less accurate results when variances are not equal, and its results are no more accurate when variances are actually equal (Ruxton, 2006).
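For readers who want the mechanics, here is a minimal sketch of the Welch’s t statistic and its Welch-Satterthwaite degrees of freedom, checked against SciPy (hypothetical data; not Stats iQ’s implementation):

```python
# Welch's t statistic and Welch-Satterthwaite degrees of freedom, checked
# against SciPy (hypothetical data; not Stats iQ's implementation).
import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 6.2, 5.5, 5.9, 4.4, 6.1, 5.0, 5.6, 4.9,
              5.3, 6.0, 5.7, 4.7, 5.2, 5.8])
b = np.array([6.4, 7.1, 5.9, 6.8, 7.4, 6.2, 7.0, 6.6, 6.1, 7.3,
              6.9, 6.5, 7.2, 6.0, 6.7, 7.5])

va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
t = (a.mean() - b.mean()) / np.sqrt(va + vb)
df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
p = 2 * stats.t.sf(abs(t), df)

print(t, df, p)
print(stats.ttest_ind(a, b, equal_var=False))   # same t and p
```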
Ranked T-Test
When assumptions are violated, the t-test may no longer be valid. In that case, Stats iQ recommends the ranked t-test; Stats iQ rank-transforms the data (replaces values with their rank ordering) and then runs the same Welch’s t-test on that transformed data. The ranked t-test is robust to outliers and non-normally distributed data. Rank transformation is a well-established method for protecting against assumption violation (a “nonparametric” method), and is most commonly seen in the difference between Pearson and Spearman correlation (Conover and Iman, 1981). Rank transformation followed by Welch’s t-test is similar in effect to the Mann-Whitney U Test, but somewhat more efficient (Ruxton, 2006; Zimmerman, 2012).
Note that while the t-test tests for the equality of the means of the 2 groups, the ranked t-test does not explicitly test for differences between the groups’ means or medians. Rather, it tests for a general tendency of one group to have larger values than the other.
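As a rough check on the similarity noted above (hypothetical data; not Stats iQ’s code), both approaches reach the same conclusion here:

```python
# Ranked Welch's t-test compared with the Mann-Whitney U test
# (hypothetical data; not Stats iQ's code).
from scipy import stats

a = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
b = [18, 22, 19, 25, 21, 20, 24, 23, 19, 500]   # outlier in group b

ranks = stats.rankdata(a + b)
print(stats.ttest_ind(ranks[:len(a)], ranks[len(a):], equal_var=False))
print(stats.mannwhitneyu(a, b, alternative='two-sided'))
```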
Other Considerations for T-Tests
- With sample sizes below 15, data can still be visually inspected to determine if it is normally distributed; if it is, unranked t-test results are still valid even for small samples. In practice, this assessment can be difficult to make, so Stats iQ recommends ranked t-tests by default for small samples.
- With larger sample sizes, outliers are less likely to negatively affect results. Stats iQ uses Tukey’s “outer fence” to define outliers as points more than 3 times the interquartile range above the 75th or below the 25th percentile point.
- Data like “Highest level of education completed” or “Finishing order in a marathon” are unambiguously ordinal. Though Likert scales (like a 1 to 7 scale where 1 is Very dissatisfied and 7 is Very satisfied) are technically ordinal, it is common practice in social sciences to treat them as though they are continuous (i.e., with an unranked t-test).
Regression
There are 2 main types of regression run in Stats iQ. If the output variable is a numbers variable, Stats iQ will run a linear regression. If the output variable is a categories variable, Stats iQ will run a logistic regression. The default output for a linear regression is a combination of Relative Importance (specifically, Johnson’s Relative Weights) and Ordinary Least Squares. When running an “Ordinary Least Squares” regression, Stats iQ uses the variation called “M-estimation,” which is a more modern technique that dampens the effect of outliers, leading to more accurate results.
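A rough sketch of the two regression flavors using statsmodels, with robust (“M-estimation”) fitting via a Huber norm standing in for the general idea; the data and model choices are hypothetical, and this is not Stats iQ’s implementation:

```python
# Robust ("M-estimation") linear regression and logistic regression sketches
# using statsmodels (hypothetical data; not Stats iQ's implementation).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = sm.add_constant(x)

# Linear outcome with an injected outlier; RLM with a Huber norm dampens it.
y_linear = 2.0 * x + rng.normal(scale=0.5, size=200)
y_linear[0] = 50                                  # outlier
robust_fit = sm.RLM(y_linear, X, M=sm.robust.norms.HuberT()).fit()
print(robust_fit.params)

# Binary outcome: logistic regression.
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y_binary = rng.binomial(1, p)
logit_fit = sm.Logit(y_binary, X).fit(disp=0)
print(logit_fit.params)
```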
See more at Regression & Relative Importance.