Creating a Data Analysis Plan: What to Consider When Choosing Statistics for a Study
Scot H Simpson
Scot H Simpson, BSP, PharmD, MSc, is Professor and Associate Dean, Research and Graduate Studies, Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, Alberta. He is also an Associate Editor with the CJHP.
Address correspondence to: Scot H Simpson, Faculty of Pharmacy and Pharmaceutical Sciences, 3126 Dentistry/Pharmacy, University of Alberta, Edmonton AB T6G 2N8, email: ac.atreblau@tocs
Copyright 2015 Canadian Society of Hospital Pharmacists. All content in the Canadian Journal of Hospital Pharmacy is copyrighted by the Canadian Society of Hospital Pharmacy. In submitting their manuscripts, the authors transfer, assign, and otherwise convey all copyright ownership to CSHP.
Can J Hosp Pharm. 2015 Jul–Aug; 68(4): 311–317.
There are three kinds of lies: lies, damned lies, and statistics.
– Mark Twain^{1}
INTRODUCTION
Statistics represent an essential part of a study because, regardless of the study design, investigators need to summarize the collected information for interpretation and presentation to others. It is therefore important for us to heed Mr Twain’s concern when creating the data analysis plan. In fact, even before data collection begins, we need to have a clear analysis plan that will guide us from the initial stages of summarizing and describing the data through to testing our hypotheses.
The purpose of this article is to help you create a data analysis plan for a quantitative study. For those interested in conducting qualitative research, previous articles in this Research Primer series have provided information on the design and analysis of such studies.^{2}^{,}^{3} Information in the current article is divided into 3 main sections: an overview of terms and concepts used in data analysis, a review of common methods used to summarize study data, and a process to help identify relevant statistical tests. My intention here is to introduce the main elements of data analysis and provide a place for you to start when planning this part of your study. Biostatistical experts, textbooks, statistical software packages, and other resources can certainly add more breadth and depth to this topic when you need additional information and advice.
TERMS AND CONCEPTS USED IN DATA ANALYSIS
When analyzing information from a quantitative study, we are often dealing with numbers; therefore, it is important to begin with an understanding of the source of the numbers. Let us start with the term variable, which defines a specific item of information collected in a study. Examples of variables include age, sex or gender, ethnicity, exercise frequency, weight, treatment group, and blood glucose. Each variable will have a group of categories, which are referred to as values, to help describe the characteristic of an individual study participant. For example, the variable “sex” would have values of “male” and “female”.
Although variables can be defined or grouped in various ways, I will focus on 2 methods at this introductory stage. First, variables can be defined according to the level of measurement. The categories in a nominal variable are names, for example, male and female for the variable “sex”; white, Aboriginal, black, Latin American, South Asian, and East Asian for the variable “ethnicity”; and intervention and control for the variable “treatment group”. Nominal variables with only 2 categories are also referred to as dichotomous variables because the study group can be divided into 2 subgroups based on information in the variable. For example, a study sample can be split into 2 groups (patients receiving the intervention and controls) using the dichotomous variable “treatment group”. An ordinal variable implies that the categories can be placed in a meaningful order, as would be the case for exercise frequency (never, sometimes, often, or always). Nominal-level and ordinal-level variables are also referred to as categorical variables, because each category in the variable can be completely separated from the others. The categories for an interval variable can be placed in a meaningful order, with the interval between consecutive categories also having meaning. Age, weight, and blood glucose can be considered as interval variables, but also as ratio variables, because the ratio between values has meaning (e.g., a 15-year-old is half the age of a 30-year-old). Interval-level and ratio-level variables are also referred to as continuous variables because of the underlying continuity among categories.
As we progress through the levels of measurement from nominal to ratio variables, we gather more information about the study participant. The amount of information that a variable provides will become important in the analysis stage, because we lose information when variables are reduced or aggregated, a common practice that is not recommended.^{4} For example, if age is reduced from a ratio-level variable (measured in years) to an ordinal variable (categories of < 65 and ≥ 65 years) we lose the ability to make comparisons across the entire age range and introduce error into the data analysis.^{4}
A second method of defining variables is to consider them as either dependent or independent. As the terms imply, the value of a dependent variable depends on the value of other variables, whereas the value of an independent variable does not rely on other variables. In addition, an investigator can influence the value of an independent variable, such as treatment-group assignment. Independent variables are also referred to as predictors because we can use information from these variables to predict the value of a dependent variable. Building on the group of variables listed in the first paragraph of this section, blood glucose could be considered a dependent variable, because its value may depend on values of the independent variables age, sex, ethnicity, exercise frequency, weight, and treatment group.
Statistics are mathematical formulae that are used to organize and interpret the information that is collected through variables. There are 2 general categories of statistics, descriptive and inferential. Descriptive statistics are used to describe the collected information, such as the range of values, their average, and the most common category. Knowledge gained from descriptive statistics helps investigators learn more about the study sample. Inferential statistics are used to make comparisons and draw conclusions from the study data. Knowledge gained from inferential statistics allows investigators to make inferences and generalize beyond their study sample to other groups.
Before we move on to specific descriptive and inferential statistics, there are 2 more definitions to review. Parametric statistics are generally used when values in an interval-level or ratio-level variable are normally distributed (i.e., the entire group of values has a bell-shaped curve when plotted by frequency). These statistics are used because we can define parameters of the data, such as the centre and width of the normally distributed curve. In contrast, interval-level and ratio-level variables with values that are not normally distributed, as well as nominal-level and ordinal-level variables, are generally analyzed using nonparametric statistics.
METHODS FOR SUMMARIZING STUDY DATA: DESCRIPTIVE STATISTICS
The first step in a data analysis plan is to describe the data collected in the study. This can be done using figures to give a visual presentation of the data and statistics to generate numeric descriptions of the data.
Selection of an appropriate figure to represent a particular set of data depends on the measurement level of the variable. Data for nominal-level and ordinal-level variables may be interpreted using a pie graph or bar graph. Both options allow us to examine the relative number of participants within each category (by reporting the percentages within each category), whereas a bar graph can also be used to examine absolute numbers. For example, we could create a pie graph to illustrate the proportions of men and women in a study sample and a bar graph to illustrate the number of people who report exercising at each level of frequency (never, sometimes, often, or always).
Interval-level and ratio-level variables may also be interpreted using a pie graph or bar graph; however, these types of variables often have too many categories for such graphs to provide meaningful information. Instead, these variables may be better interpreted using a histogram. Unlike a bar graph, which displays the frequency for each distinct category, a histogram displays the frequency within a range of continuous categories. Information from this type of figure allows us to determine whether the data are normally distributed. In addition to pie graphs, bar graphs, and histograms, many other types of figures are available for the visual representation of data. Interested readers can find additional types of figures in the books recommended in the “Further Reading” section.
Figures are also useful for visualizing comparisons between variables or between subgroups within a variable (for example, the distribution of blood glucose according to sex). Box plots are useful for summarizing information for a variable that does not follow a normal distribution. The lower and upper limits of the box identify the interquartile range (or 25th and 75th percentiles), while the midline indicates the median value (or 50th percentile). Scatter plots provide information on how the categories for one continuous variable relate to categories in a second variable; they are often helpful in the analysis of correlations.
In addition to using figures to present a visual description of the data, investigators can use statistics to provide a numeric description. Regardless of the measurement level, we can find the mode by identifying the most frequent category within a variable. When summarizing nominal-level and ordinal-level variables, the simplest method is to report the proportion of participants within each category.
The choice of the most appropriate descriptive statistic for interval-level and ratio-level variables will depend on how the values are distributed. If the values are normally distributed, we can summarize the information using the parametric statistics of mean and standard deviation. The mean is the arithmetic average of all values within the variable, and the standard deviation tells us how widely the values are dispersed around the mean. When values of interval-level and ratio-level variables are not normally distributed, or we are summarizing information from an ordinal-level variable, it may be more appropriate to use the nonparametric statistics of median and range. The first step in identifying these descriptive statistics is to arrange study participants according to the variable categories from lowest value to highest value. The range is used to report the lowest and highest values. The median or 50th percentile is located by dividing the number of participants into 2 groups, such that half (50%) of the participants have values above the median and the other half (50%) have values below the median. Similarly, the 25th percentile is the value with 25% of the participants having values below and 75% of the participants having values above, and the 75th percentile is the value with 75% of participants having values below and 25% of participants having values above. Together, the 25th and 75th percentiles define the interquartile range.
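As a minimal sketch of these descriptive statistics, the following Python snippet uses only the standard library. The blood glucose values are hypothetical and were invented for illustration. The "inclusive" percentile method is only one of several conventions, so other software may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical blood glucose values (mmol/L); the last value is an outlier
glucose = [4.8, 5.1, 5.3, 5.6, 5.9, 6.2, 6.8, 7.4, 11.9]

# Parametric summary: mean and (sample) standard deviation
mean = statistics.mean(glucose)
sd = statistics.stdev(glucose)

# Nonparametric summary: median, range, and interquartile range
median = statistics.median(glucose)
lowest, highest = min(glucose), max(glucose)
# quantiles(n=4) returns the 25th, 50th, and 75th percentiles
q1, q2, q3 = statistics.quantiles(glucose, n=4, method="inclusive")

print(f"mean {mean:.2f}, SD {sd:.2f}")
print(f"median {median}, range {lowest} to {highest}, IQR {q1} to {q3}")
```

In this made-up sample the single high value pulls the mean above the median, which is the kind of skew that would favour reporting the median and interquartile range.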
PROCESS TO IDENTIFY RELEVANT STATISTICAL TESTS: INFERENTIAL STATISTICS
One caveat about the information provided in this section: selecting the most appropriate inferential statistic for a specific study should be a combination of following these suggestions, seeking advice from experts, and discussing with your coinvestigators. My intention here is to give you a place to start a conversation with your colleagues about the options available as you develop your data analysis plan.
There are 3 key questions to consider when selecting an appropriate inferential statistic for a study: What is the research question? What is the study design? and What is the level of measurement? It is important for investigators to carefully consider these questions when developing the study protocol and creating the analysis plan. The figures that accompany these questions show decision trees that will help you to narrow down the list of inferential statistics that would be relevant to a particular study. Appendix 1 provides brief definitions of the inferential statistics named in these figures. Additional information, such as the formulae for various inferential statistics, can be obtained from textbooks, statistical software packages, and biostatisticians.
What Is the Research Question?
The first step in identifying relevant inferential statistics for a study is to consider the type of research question being asked. You can find more details about the different types of research questions in a previous article in this Research Primer series that covered questions and hypotheses.^{5} A relational question seeks information about the relationship among variables; in this situation, investigators will be interested in determining whether there is an association (Figure 1). A causal question seeks information about the effect of an intervention on an outcome; in this situation, the investigator will be interested in determining whether there is a difference (Figure 2).
Figure 1.
Decision tree to identify inferential statistics for an association.
Figure 2.
Decision tree to identify inferential statistics for measuring a difference.
What Is the Study Design?
When considering a question of association, investigators will be interested in measuring the relationship between variables (Figure 1). A study designed to determine whether there is consensus among different raters will be measuring agreement. For example, an investigator may be interested in determining whether 2 raters, using the same assessment tool, arrive at the same score. Correlation analyses examine the strength of a relationship or connection between 2 variables, like age and blood glucose. Regression analyses also examine the strength of a relationship or connection; however, in this type of analysis, one variable is considered an outcome (or dependent variable) and the other variable is considered a predictor (or independent variable). Regression analyses often consider the influence of multiple predictors on an outcome at the same time. For example, an investigator may be interested in examining the association between a treatment and blood glucose, while also considering other factors, like age, sex, ethnicity, exercise frequency, and weight.
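To make the idea of a correlation analysis concrete, here is a small sketch that computes a Pearson correlation and, by applying the same formula to ranks, a Spearman rank correlation. The age and blood glucose values are hypothetical, and the function names (`pearson_r`, `average_ranks`, `spearman_rho`) are illustrative rather than taken from any statistical package.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 (lowest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: age (years) and blood glucose (mmol/L)
age = [25, 32, 41, 48, 56, 63, 70]
glucose = [4.9, 5.0, 5.4, 5.3, 6.1, 6.0, 9.8]

r = pearson_r(age, glucose)
rho = spearman_rho(age, glucose)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because the rank-based coefficient ignores how far the highest glucose value sits from the rest, the two coefficients can differ noticeably when one variable is skewed.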
When considering a question of difference, investigators must first determine how many groups they will be comparing. In some cases, investigators may be interested in comparing the characteristic of one group with that of an external reference group. For example, is the mean age of study participants similar to the mean age of all people in the target group? If more than one group is involved, then investigators must also determine whether there is an underlying connection between the sets of values (or samples) to be compared. Samples are considered independent or unpaired when the information is taken from different groups. For example, we could use an unpaired t test to compare the mean age between 2 independent samples, such as the intervention and control groups in a study. Samples are considered related or paired if the information is taken from the same group of people, for example, measurement of blood glucose at the beginning and end of a study. Because blood glucose is measured in the same people at both time points, we could use a paired t test to determine whether there has been a significant change in blood glucose.
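The distinction between unpaired and paired samples can be sketched in a few lines of Python. The blood glucose values below are hypothetical, and the helper names `unpaired_t` and `paired_t` are illustrative. The sketch returns only the t statistic and its degrees of freedom; converting these to a p value requires the t distribution, which statistical software provides.

```python
import math
import statistics

def unpaired_t(a, b):
    """Student t statistic for 2 independent samples (pooled variance)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2

def paired_t(first, second):
    """Paired t statistic, based on within-person differences."""
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1

# Hypothetical blood glucose (mmol/L) in the same 5 people
baseline = [8.0, 7.5, 9.0, 8.5, 7.0]
end_of_study = [7.0, 7.0, 8.0, 7.5, 6.5]

t_paired, df_paired = paired_t(baseline, end_of_study)
# For contrast only: the same numbers treated as if from 2 different groups
t_unpaired, df_unpaired = unpaired_t(end_of_study, baseline)

print(f"paired:   t = {t_paired:.2f}, df = {df_paired}")
print(f"unpaired: t = {t_unpaired:.2f}, df = {df_unpaired}")
```

With these invented numbers the paired analysis yields a much larger t statistic, because it removes the variation between people and examines only the change within each person.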
What Is the Level of Measurement?
As described in the first section of this article, variables can be grouped according to the level of measurement (nominal, ordinal, or interval). In most cases, the independent variable in an inferential statistic will be nominal; therefore, investigators need to know the level of measurement for the dependent variable before they can select the relevant inferential statistic. Two exceptions to this consideration are correlation analyses and regression analyses (Figure 1). Because a correlation analysis measures the strength of association between 2 variables, we need to consider the level of measurement for both variables. Regression analyses can consider multiple independent variables, often with a variety of measurement levels. However, for these analyses, investigators still need to consider the level of measurement for the dependent variable.
Selection of inferential statistics to test interval-level variables must include consideration of how the data are distributed. An underlying assumption for parametric tests is that the data approximate a normal distribution. When the data are not normally distributed, information derived from a parametric test may be wrong.^{6} When the assumption of normality is violated (for example, when the data are skewed), then investigators should use a nonparametric test. If the data are normally distributed, then investigators can use a parametric test.
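One way to picture how the 3 key questions narrow the choice of a test for a difference is a simple lookup table. The sketch below is assembled from the definitions in Appendix 1; it is not a reproduction of the decision tree in Figure 2, and the key labels and the function name `suggest_test` are hypothetical. It is a starting point for discussion, not a substitute for advice from a biostatistician.

```python
# Keys: (samples being compared, type of dependent variable)
TESTS_FOR_A_DIFFERENCE = {
    ("1 sample", "interval, normal"): "1-sample t test",
    ("1 sample", "nominal"): "binomial test",
    ("2 independent", "interval, normal"): "independent-samples t test",
    ("2 independent", "ordinal or non-normal"): "Mann-Whitney U test",
    ("2 independent", "nominal"): "chi-square test (or Fisher exact test)",
    ("2 paired", "interval, normal"): "paired t test",
    ("2 paired", "ordinal or non-normal"): "Wilcoxon signed-rank test",
    ("2 paired", "nominal"): "McNemar test",
    ("3+ independent", "interval, normal"): "1-way ANOVA",
    ("3+ independent", "ordinal or non-normal"): "Kruskal-Wallis 1-way ANOVA",
    ("3+ related", "interval, normal"): "repeated-measures ANOVA",
    ("3+ related", "ordinal or non-normal"): "Friedman ANOVA",
    ("3+ related", "nominal"): "Cochran Q test",
}

def suggest_test(samples, outcome):
    """Return a candidate test, or a reminder to seek advice if none is listed."""
    return TESTS_FOR_A_DIFFERENCE.get(
        (samples, outcome), "no entry: discuss with a biostatistician")

print(suggest_test("2 independent", "interval, normal"))
print(suggest_test("2 paired", "ordinal or non-normal"))
```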
ADDITIONAL CONSIDERATIONS
What Is the Level of Significance?
An inferential statistic is used to calculate a p value, the probability of obtaining results at least as extreme as those observed if chance alone were operating (that is, if the null hypothesis were true). Investigators can then compare this p value against a prespecified level of significance, which is often chosen to be 0.05. This level of significance means accepting a 1 in 20 chance of declaring a difference or association when none truly exists (a type I error), which is conventionally considered an acceptable level of error.
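A short sketch may help show where a p value comes from. It uses the binomial test described in Appendix 1 and computes an exact two-sided p value with the Python standard library. The scenario (16 of 20 participants responding, compared with a hypothesized proportion of 0.5) is invented, and the function name `binomial_two_sided_p` is hypothetical. Summing the probabilities of all outcomes no more likely than the observed one is one common definition of a two-sided exact p value; other definitions exist.

```python
from math import comb

def binomial_two_sided_p(successes, n, p0):
    """Exact two-sided p value under H0: true proportion equals p0."""
    probs = [comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = probs[successes]
    # Add up every outcome that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties being missed
    return sum(p for p in probs if p <= observed * (1 + 1e-9))

alpha = 0.05  # prespecified level of significance

p_16 = binomial_two_sided_p(16, 20, 0.5)
p_12 = binomial_two_sided_p(12, 20, 0.5)

print(f"16 of 20: p = {p_16:.4f}, significant: {p_16 < alpha}")
print(f"12 of 20: p = {p_12:.4f}, significant: {p_12 < alpha}")
```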
What Are the Most Commonly Used Statistics?
In 1983, Emerson and Colditz^{7} reported the first review of statistics used in original research articles published in the New England Journal of Medicine. This review of statistics used in the journal was updated in 1989 and 2005,^{8} and this type of analysis has been replicated in many other journals.^{9}^{–}^{13} Collectively, these reviews have identified 2 important observations. First, the overall sophistication of statistical methodology used and reported in studies has grown over time, with survival analyses and multivariable regression analyses becoming much more common. The second observation is that, despite this trend, 1 in 4 articles describe no statistical methods or report only simple descriptive statistics. When inferential statistics are used, the most common are t tests, contingency table tests (for example, χ^{2} test and Fisher exact test), and simple correlation and regression analyses. This information is important for educators, investigators, reviewers, and readers because it suggests that a good foundational knowledge of descriptive statistics and common inferential statistics will enable us to correctly evaluate the majority of research articles.^{11}^{–}^{13} However, to fully take advantage of all research published in high-impact journals, we need to become acquainted with some of the more complex methods, such as multivariable regression analyses.^{8}^{,}^{13}
What Are Some Additional Resources?
As an investigator and Associate Editor with CJHP, I have often relied on the advice of colleagues to help create my own analysis plans and review the plans of others. Biostatisticians have a wealth of knowledge in the field of statistical analysis and can provide advice on the correct selection, application, and interpretation of these methods. Colleagues who have “been there and done that” with their own data analysis plans are also valuable sources of information. Identify these individuals and consult with them early and often as you develop your analysis plan.
Another important resource to consider when creating your analysis plan is textbooks. Numerous statistical textbooks are available, differing in levels of complexity and scope. The titles listed in the “Further Reading” section are just a few suggestions. I encourage interested readers to look through these and other books to find resources that best fit their needs. However, one crucial book that I highly recommend to anyone wanting to be an investigator or peer reviewer is Lang and Secic’s How to Report Statistics in Medicine (see “Further Reading”). As the title implies, this book covers a wide range of statistics used in medical research and provides numerous examples of how to correctly report the results.
CONCLUSIONS
When it comes to creating an analysis plan for your project, I recommend following the sage advice of Douglas Adams in The Hitchhiker’s Guide to the Galaxy: Don’t panic!^{14} Begin with simple methods to summarize and visualize your data, then use the key questions and decision trees provided in this article to identify relevant statistical tests. Information in this article will give you and your coinvestigators a place to start discussing the elements necessary for developing an analysis plan. But do not stop there! Use advice from biostatisticians and more experienced colleagues, as well as information in textbooks, to help create your analysis plan and choose the most appropriate statistics for your study. Making careful, informed decisions about the statistics to use in your study should reduce the risk of confirming Mr Twain’s concern.
Appendix 1. Glossary of statistical terms^{*}
 ANOVA (analysis of variance):
 Parametric statistic used to compare the means of 3 or more groups that are defined by 1 or more variables.
1-way ANOVA: Uses 1 variable to define the groups for comparing means. This is similar to the Student t test when comparing the means of 2 groups.
Kruskal–Wallis 1-way ANOVA: Nonparametric alternative for the 1-way ANOVA. Used to determine the difference in medians between 3 or more groups.
n-way ANOVA: Uses 2 or more variables to define groups when comparing means. Also called a “between-subjects factorial ANOVA”.
Repeated-measures ANOVA: A method for analyzing whether the means of 3 or more measures from the same group of participants are different.
Friedman ANOVA: Nonparametric alternative for the repeated-measures ANOVA. It is often used to compare rankings and preferences that are measured 3 or more times.
 Binomial test:
 Used to determine whether the observed proportion is significantly different from a known or hypothesized proportion. The variable is dichotomous (nominal-level data with 2 options).
 Biserial correlation (rank or point):
 Correlation technique when one of the variables is dichotomous (or measured at the nominal level).
 Chi-square (χ^{2}) test:
 Nonparametric test used to determine whether a statistically significant association exists between rows and columns in a contingency table.
Fisher exact: Variation of chi-square that accounts for expected cell counts < 5.
McNemar: Variation of chi-square that tests statistical significance of changes in 2 paired measurements of dichotomous variables.
Cochran Q: An extension of the McNemar test that provides a method for testing for differences between 3 or more matched sets of frequencies or proportions. Often used as a measure of heterogeneity in meta-analyses.
 Descriptive statistics:
 Numeric or graphic summaries (or descriptions) of a variable.
 Inferential statistics:
 Measures the difference between 2 variables or subgroups of a variable. Allows the investigator to make inferences about another group on the basis of information generated from the study data.
 Kappa (κ):
 Measures the degree of nonrandom agreement between observers or measurements for the same nominal-level variable.
 Kendall tau (τ):
 Nonparametric measure of association that can be used as an alternative to the Spearman rank correlation. Used when measuring the relationship between 2 ranked (ordinal-level) variables.
 Mann–Whitney U test:
 Nonparametric alternative for the independent t test. One variable is dichotomous (e.g., group A versus group B) and the other variable is either ordinal or interval.
 Pearson correlation:
 Parametric test used to determine whether an association exists between 2 variables measured at the interval or ratio level.
 Phi (ϕ):
 Used when both variables in a correlation analysis are dichotomous.
 Runs test:
 Used to determine whether a series of data occurs from a random process.
 Spearman rank correlation:
 Nonparametric alternative for the Pearson correlation coefficient. Used when the assumptions for Pearson correlation are violated (e.g., data are not normally distributed) or one of the variables is measured at the ordinal level.
 t test:
 Parametric statistical test for comparing means (1 sample against a reference value, 2 independent groups, or 2 paired measurements).
1-sample: Used to determine whether the mean of a sample is significantly different from a known or hypothesized value.
Independent-samples t test (also referred to as the Student t test): Used when the independent variable is a nominal-level variable that identifies 2 groups and the dependent variable is an interval-level variable.
Paired: Used to compare 2 sets of scores from the same participants or from matched pairs (e.g., baseline and follow-up blood pressure in the same group of people).
 Wilcoxon rank–sum test:
 Nonparametric alternative to the independent t test based solely on the order in which observations from the 2 samples fall. Similar to the Mann–Whitney U test.
 Wilcoxon signed-rank test:
 Nonparametric alternative to the paired t test. The differences between matched pairs are computed and ranked. This test compares the sum of the ranks of the negative differences and the sum of the ranks of the positive differences.
^{*}Sources
Lang TA, Secic M. How to report statistics in medicine: annotated guidelines for authors, editors, and reviewers. 2nd ed. Philadelphia (PA): American College of Physicians; 2006.
Norman GR, Streiner DL. PDQ statistics. 3rd ed. Hamilton (ON): B.C. Decker; 2003.
Plichta SB, Kelvin E. Munro’s statistical methods for health care research. 6th ed. Philadelphia (PA): Wolters Kluwer Health/ Lippincott, Williams & Wilkins; 2013.
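As an illustration of the chi-square entry in this glossary, the sketch below computes the test statistic for a 2 × 2 contingency table using only the Python standard library. The counts are hypothetical and the function name `chi_square_2x2` is illustrative. No continuity correction is applied, and the p value relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

def chi_square_2x2(table):
    """Chi-square statistic, p value (1 df), and expected counts for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
                for i in range(2)]
    stat = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected

# Hypothetical counts: rows = intervention, control; columns = improved, not improved
counts = [[30, 20],
          [18, 32]]

stat, p_value, expected = chi_square_2x2(counts)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
print("smallest expected count:", min(min(row) for row in expected))
```

Checking the smallest expected count is a quick way to judge whether the Fisher exact test would be the safer choice for a given table.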
Notes
This article is the 12th in the CJHP Research Primer Series, an initiative of the CJHP Editorial Board and the CSHP Research Committee. The planned 2-year series is intended to appeal to relatively inexperienced researchers, with the goal of building research capacity among practising pharmacists. The articles, presenting simple but rigorous guidance to encourage and support novice researchers, are being solicited from authors with appropriate expertise.
Previous articles in this series:
Bond CM. The research jigsaw: how to get started. Can J Hosp Pharm. 2014;67(1):28–30.
Tully MP. Research: articulating questions, generating hypotheses, and choosing study designs. Can J Hosp Pharm. 2014;67(1):31–4.
Loewen P. Ethical issues in pharmacy practice research: an introductory guide. Can J Hosp Pharm. 2014;67(2):133–7.
Tsuyuki RT. Designing pharmacy practice research trials. Can J Hosp Pharm. 2014;67(3):226–9.
Bresee LC. An introduction to developing surveys for pharmacy practice research. Can J Hosp Pharm. 2014;67(4):286–91.
Gamble JM. An introduction to the fundamentals of cohort and case–control studies. Can J Hosp Pharm. 2014;67(5):366–72.
Austin Z, Sutton J. Qualitative research: getting started. Can J Hosp Pharm. 2014;67(6):436–40.
Houle S. An introduction to the fundamentals of randomized controlled trials in pharmacy research. Can J Hosp Pharm. 2015;68(1):28–32.
Charrois TL. Systematic reviews: What do you need to know to get started? Can J Hosp Pharm. 2015;68(2):144–8.
Sutton J, Austin Z. Qualitative research: data collection, analysis, and management. Can J Hosp Pharm. 2015;68(3):226–31.
Cadarette SM, Wong L. An introduction to health care administrative data. Can J Hosp Pharm. 2015;68(3):232–7.
Footnotes
Competing interests: None declared.
References
1. Twain M. In: Mark Twain’s own autobiography: the chapters from the North American review. 2nd ed. Kiskis MJ, editor. Madison (WI): University of Wisconsin Press; 2010. p. 318.
2. Austin Z, Sutton J. Qualitative research: getting started. Can J Hosp Pharm. 2014;67(6):436–40.
3. Sutton J, Austin Z. Qualitative research: data collection, analysis, and management. Can J Hosp Pharm. 2015;68(3):226–31.
4. Dawson NV, Weiss R. Dichotomizing continuous variables in statistical analysis: a practice to avoid. Med Decis Making. 2012;32(2):225–6. doi: 10.1177/0272989X12437605.
5. Tully MP. Research: articulating questions, generating hypotheses, and choosing study designs. Can J Hosp Pharm. 2014;67(1):31–4.
6. Harwell MR. Choosing between parametric and nonparametric tests. J Couns Dev. 1988;67(1):35–8. doi: 10.1002/j.1556-6676.1988.tb02007.x.
7. Emerson JD, Colditz GA. Use of statistical analysis in the New England Journal of Medicine. N Engl J Med. 1983;309(12):709–13. doi: 10.1056/NEJM198309223091206.
8. Horton NJ, Switzer SS. Statistical methods in the journal. N Engl J Med. 2005;353(18):1977–9. doi: 10.1056/NEJM200511033531823.
9. Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S. Basic statistics for clinicians: 1. Hypothesis testing. CMAJ. 1995;152(1):27–32.
10. Goldin J, Zhu W, Sayre JW. A review of the statistical analysis used in papers published in Clinical Radiology and British Journal of Radiology. Clin Radiol. 1996;51(1):47–50. doi: 10.1016/S0009-9260(96)80219-4.
11. Reed JF, 3rd, Salen P, Bagher P. Methodological and statistical techniques: what do residents really need to know about statistics? J Med Syst. 2003;27(3):233–8. doi: 10.1023/A:1022519227039.
12. Hellems MA, Gurka MJ, Hayden GF. Statistical literacy for readers of Pediatrics: a moving target. Pediatrics. 2007;119(6):1083–8. doi: 10.1542/peds.2006-2330.
13. Taback N, Krzyzanowska MK. A survey of abstracts of high-impact clinical journals indicated most statistical methods presented are summary statistics. J Clin Epidemiol. 2008;61(3):277–81. doi: 10.1016/j.jclinepi.2007.05.003.
14. Adams D. The hitchhiker’s guide to the galaxy. London (UK): Pan Books; 1979.
Further Reading
 Devor J, Peck R. Statistics: the exploration and analysis of data. 7th ed. Boston (MA): Brooks/Cole Cengage Learning; 2012.
 Lang TA, Secic M. How to report statistics in medicine: annotated guidelines for authors, editors, and reviewers. 2nd ed. Philadelphia (PA): American College of Physicians; 2006.
 Mendenhall W, Beaver RJ, Beaver BM. Introduction to probability and statistics. 13th ed. Belmont (CA): Brooks/Cole Cengage Learning; 2009.
 Norman GR, Streiner DL. PDQ statistics. 3rd ed. Hamilton (ON): B.C. Decker; 2003.
 Plichta SB, Kelvin E. Munro’s statistical methods for health care research. 6th ed. Philadelphia (PA): Wolters Kluwer Health/Lippincott, Williams & Wilkins; 2013.
Articles from The Canadian Journal of Hospital Pharmacy are provided here courtesy of Canadian Society Of Hospital Pharmacists
Nonparametric Tests
Author:
Lisa Sullivan, PhD
Professor of Biostatistics
Boston University School of Public Health
Introduction
The three modules on hypothesis testing presented a number of tests of hypothesis for continuous, dichotomous and discrete outcomes. Tests for continuous outcomes focused on comparing means, while tests for dichotomous and discrete outcomes focused on comparing proportions. All of the tests presented in the modules on hypothesis testing are called parametric tests and are based on certain assumptions. For example, when running tests of hypothesis for means of continuous outcomes, all parametric tests assume that the outcome is approximately normally distributed in the population. This does not mean that the data in the observed sample follows a normal distribution, but rather that the outcome follows a normal distribution in the full population which is not observed. For many outcomes, investigators are comfortable with the normality assumption (i.e., most of the observations are in the center of the distribution while fewer are at either extreme). It also turns out that many statistical tests are robust, which means that they maintain their statistical properties even when assumptions are not entirely met. Tests are robust in the presence of violations of the normality assumption when the sample size is large based on the Central Limit Theorem (see page 11 in the module on Probability). When the sample size is small and the distribution of the outcome is not known and cannot be assumed to be approximately normally distributed, then alternative tests called nonparametric tests are appropriate.
Learning Objectives
After completing this module, the student will be able to:
 Compare and contrast parametric and nonparametric tests
 Identify multiple applications where nonparametric approaches are appropriate
 Perform and interpret the Mann Whitney U Test
 Perform and interpret the Sign test and Wilcoxon Signed Rank Test
 Compare and contrast the Sign test and Wilcoxon Signed Rank Test
 Perform and interpret the Kruskal Wallis test
 Identify the appropriate nonparametric hypothesis testing procedure based on type of outcome variable and number of samples
When to Use a Nonparametric Test
Nonparametric tests are sometimes called distribution-free tests because they are based on fewer assumptions (e.g., they do not assume that the outcome is approximately normally distributed). Parametric tests involve specific probability distributions (e.g., the normal distribution) and the tests involve estimation of the key parameters of that distribution (e.g., the mean or difference in means) from the sample data. The cost of fewer assumptions is that nonparametric tests are generally less powerful than their parametric counterparts (i.e., when the alternative is true, they may be less likely to reject H_{0}).
It can sometimes be difficult to assess whether a continuous outcome follows a normal distribution and, thus, whether a parametric or nonparametric test is appropriate. There are several statistical tests that can be used to assess whether data are likely from a normal distribution. The most popular are the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Shapiro-Wilk test^{1}. Each test is essentially a goodness of fit test and compares observed data to quantiles of the normal (or other specified) distribution. The null hypothesis for each test is H_{0}: Data follow a normal distribution versus H_{1}: Data do not follow a normal distribution. If the test is statistically significant (e.g., p<0.05), we reject H_{0} and conclude that the data do not follow a normal distribution, and a nonparametric test is warranted. It should be noted that these tests for normality can be subject to low power. Specifically, the tests may fail to reject H_{0}: Data follow a normal distribution when in fact the data do not follow a normal distribution. Low power is a major issue when the sample size is small, which unfortunately is often when we wish to employ these tests. The most practical approach to assessing normality involves investigating the distributional form of the outcome in the sample using a histogram and augmenting that with data from other studies, if available, that may indicate the likely distribution of the outcome in the population.
There are some situations when it is clear that the outcome does not follow a normal distribution. These include situations:
 when the outcome is an ordinal variable or a rank,
 when there are definite outliers or
 when the outcome has clear limits of detection.
Using an Ordinal Scale
Consider a clinical trial where study participants are asked to rate their symptom severity following 6 weeks on the assigned treatment. Symptom severity might be measured on a 5 point ordinal scale with response options: Symptoms got much worse, slightly worse, no change, slightly improved, or much improved. Suppose there are a total of n=20 participants in the trial, randomized to an experimental treatment or placebo, and the outcome data are distributed as shown in the figure below.
Distribution of Symptom Severity in Total Sample
The distribution of the outcome (symptom severity) does not appear to be normal as more participants report improvement in symptoms as opposed to worsening of symptoms.
When the Outcome is a Rank
In some studies, the outcome is a rank. For example, in obstetrical studies an APGAR score is often used to assess the health of a newborn. The score, which ranges from 0 to 10, is the sum of five component scores based on the infant's condition at birth. APGAR scores generally do not follow a normal distribution, since most newborns have scores of 7 or higher (normal range).
When There Are Outliers
In some studies, the outcome is continuous but subject to outliers or extreme values. For example, days in the hospital following a particular surgical procedure is an outcome that is often subject to outliers. Suppose in an observational study investigators wish to assess whether there is a difference in the days patients spend in the hospital following liver transplant in for-profit versus nonprofit hospitals. Suppose we measure days in the hospital following transplant in n=100 participants, 50 from for-profit and 50 from nonprofit hospitals. The number of days in the hospital are summarized by the box-whisker plot below.
Distribution of Days in the Hospital Following Transplant
Note that 75% of the participants stay at most 16 days in the hospital following transplant, while at least 1 stays 35 days, which would be considered an outlier. Recall from page 8 in the module on Summarizing Data that we used Q_{1} - 1.5(Q_{3} - Q_{1}) as a lower limit and Q_{3} + 1.5(Q_{3} - Q_{1}) as an upper limit to detect outliers. In the box-whisker plot above, Q_{1}=12 and Q_{3}=16; thus, outliers are values below 12 - 1.5(16 - 12) = 6 or above 16 + 1.5(16 - 12) = 22.
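The outlier limits described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original module; the function name `outlier_fences` and its multiplier argument `k` are hypothetical names chosen for the sketch.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits; values outside them are flagged as outliers."""
    iqr = q3 - q1  # interquartile range Q3 - Q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles quoted in the text for days in the hospital following transplant
lower, upper = outlier_fences(12, 16)

# The 35-day stay mentioned in the text lies above the upper limit
is_outlier = 35 < lower or 35 > upper
print(lower, upper, is_outlier)
```

With Q_{1}=12 and Q_{3}=16 the limits are 6 and 22, so a stay of 35 days is flagged.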
Limits of Detection
In some studies, the outcome is a continuous variable that is measured with some imprecision (e.g., with clear limits of detection). For example, some instruments or assays cannot measure presence of specific quantities above or below certain limits. HIV viral load is a measure of the amount of virus in the body and is measured as the amount of virus per a certain volume of blood. It can range from "not detected" or "below the limit of detection" to hundreds of millions of copies. Thus, in a sample some participants may have measures like 1,254,000 or 874,050 copies and others are measured as "not detected." If a substantial number of participants have undetectable levels, viral load is not normally distributed.
Hypothesis Testing with Nonparametric Tests
In nonparametric tests, the hypotheses are not about population parameters (e.g., μ=50 or μ_{1}=μ_{2}). Instead, the null hypothesis is more general. For example, when comparing two independent groups in terms of a continuous outcome, the null hypothesis in a parametric test is H_{0}: μ_{1} =μ_{2}. In a nonparametric test the null hypothesis is that the two populations are equal; this is often interpreted as the two populations being equal in terms of their central tendency.

Advantages of Nonparametric Tests
Nonparametric tests have some distinct advantages. With outcomes such as those described above, nonparametric tests may be the only way to analyze these data. Outcomes that are ordinal, ranked, subject to outliers or measured imprecisely are difficult to analyze with parametric methods without making major assumptions about their distributions as well as decisions about coding some values (e.g., "not detected"). As described here, nonparametric tests can also be relatively simple to conduct.
Introduction to Nonparametric Testing
This module will describe some popular nonparametric tests for continuous outcomes. Interested readers should see Conover^{3} for a more comprehensive coverage of nonparametric tests.
Key Concept:
Parametric tests are generally more powerful and can test a wider range of alternative hypotheses. It is worth repeating that if data are approximately normally distributed then parametric tests (as in the modules on hypothesis testing) are more appropriate. However, there are situations in which assumptions for a parametric test are violated and a nonparametric test is more appropriate. 
The techniques described here apply to outcomes that are ordinal, ranked, or continuous outcome variables that are not normally distributed. Recall that continuous outcomes are quantitative measures based on a specific measurement scale (e.g., weight in pounds, height in inches). Some investigators make the distinction between continuous, interval and ordinal scaled data. Interval data are like continuous data in that they are measured on a constant scale (i.e., there exists the same difference between adjacent scale scores across the entire spectrum of scores). Differences between interval scores are interpretable, but ratios are not. Temperature in Celsius or Fahrenheit is an example of an interval scale outcome. The difference between 30º and 40º is the same as the difference between 70º and 80º, yet 80º is not twice as warm as 40º. Ordinal outcomes can be less specific as the ordered categories need not be equally spaced. Symptom severity is an example of an ordinal outcome and it is not clear whether the difference between much worse and slightly worse is the same as the difference between no change and slightly improved. Some studies use visual scales to assess participants' self-reported signs and symptoms. Pain is often measured in this way, from 0 to 10 with 0 representing no pain and 10 representing agonizing pain. Participants are sometimes shown a visual scale such as that shown in the upper portion of the figure below and asked to choose the number that best represents their pain state. Sometimes pain scales use visual anchors as shown in the lower portion of the figure below.
Visual Pain Scale
In the upper portion of the figure, certainly 10 is worse than 9, which is worse than 8; however, the difference between adjacent scores may not necessarily be the same. It is important to understand how outcomes are measured to make appropriate inferences based on statistical analysis and, in particular, not to overstate precision.
Assigning Ranks
The nonparametric procedures that we describe here follow the same general procedure. The outcome variable (ordinal, interval or continuous) is ranked from lowest to highest and the analysis focuses on the ranks as opposed to the measured or raw values. For example, suppose we measure self-reported pain using a visual analog scale with anchors at 0 (no pain) and 10 (agonizing pain) and record the following in a sample of n=6 participants:
7 5 9 3 0 2
The ranks, which are used to perform a nonparametric test, are assigned as follows: First, the data are ordered from smallest to largest. The lowest value is then assigned a rank of 1, the next lowest a rank of 2 and so on. The largest value is assigned a rank of n (in this example, n=6). The observed data and corresponding ranks are shown below:
Ordered Observed Data:  0  2  3  5  7  9 
Ranks:  1  2  3  4  5  6 
A complicating issue that arises when assigning ranks occurs when there are ties in the sample (i.e., the same values are measured in two or more participants). For example, suppose that the following data are observed in our sample of n=6:
Observed Data: 7 7 9 3 0 2
The 4^{th} and 5^{th} ordered values are both equal to 7. When assigning ranks, the recommended procedure is to assign the mean rank of 4.5 to each (i.e. the mean of 4 and 5), as follows:
Ordered Observed Data:  0  2  3  7  7  9 
Ranks:  1  2  3  4.5  4.5  6 
Suppose that there are three values of 7. In this case, we assign a rank of 5 (the mean of 4, 5 and 6) to the 4^{th}, 5^{th} and 6^{th} values, as follows:
Ordered Observed Data:  0  2  3  7  7  7 
Ranks:  1  2  3  5  5  5 
Using this approach of assigning the mean rank when there are ties ensures that the sum of the ranks is the same in each sample (for example, 1+2+3+4+5+6=21, 1+2+3+4.5+4.5+6=21 and 1+2+3+5+5+5=21). Using this approach, the sum of the ranks will always equal n(n+1)/2. When conducting nonparametric tests, it is useful to check the sum of the ranks before proceeding with the analysis.
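The mean-rank rule can be sketched in Python as follows. This is only an illustration of the procedure described above; `mean_ranks` is a hypothetical helper name, and the three inputs are the samples used in the text.

```python
def mean_ranks(values):
    """Rank values from lowest (1) to highest (n); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next ordered value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


no_ties = mean_ranks([7, 5, 9, 3, 0, 2])
two_tied = mean_ranks([7, 7, 9, 3, 0, 2])
three_tied = mean_ranks([0, 2, 3, 7, 7, 7])

# Check suggested in the text: the ranks always sum to n(n+1)/2
n = 6
for ranks in (no_ties, two_tied, three_tied):
    assert sum(ranks) == n * (n + 1) / 2
```

The ranks are returned in the order in which the observations were recorded, so each rank stays attached to its participant.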
To conduct nonparametric tests, we again follow the five-step approach outlined in the modules on hypothesis testing.
 Set up hypotheses and select the level of significance α. Analogous to parametric testing, the research hypothesis can be one- or two-sided (one- or two-tailed), depending on the research question of interest.
 Select the appropriate test statistic. The test statistic is a single number that summarizes the sample information. In nonparametric tests, the observed data is converted into ranks and then the ranks are summarized into a test statistic.
 Set up decision rule. The decision rule is a statement that tells under what circumstances to reject the null hypothesis. Note that in some nonparametric tests we reject H_{0} if the test statistic is large, while in others we reject H_{0} if the test statistic is small. We make the distinction as we describe the different tests.
 Compute the test statistic. Here we compute the test statistic by summarizing the ranks into the test statistic identified in Step 2.
 Conclusion. The final conclusion is made by comparing the test statistic (which is a summary of the information observed in the sample) to the decision rule. The final conclusion is either to reject the null hypothesis (because it is very unlikely to observe the sample data if the null hypothesis is true) or not to reject the null hypothesis (because the sample data are not very unlikely if the null hypothesis is true).
Mann Whitney U Test (Wilcoxon Rank Sum Test)
The modules on hypothesis testing presented techniques for testing the equality of means in two independent samples. An underlying assumption for appropriate use of the tests described was that the continuous outcome was approximately normally distributed or that the samples were sufficiently large (usually n_{1}> 30 and n_{2}> 30) to justify their use based on the Central Limit Theorem. When comparing two independent samples when the outcome is not normally distributed and the samples are small, a nonparametric test is appropriate.
A popular nonparametric test to compare outcomes between two independent groups is the Mann Whitney U test. The Mann Whitney U test, sometimes called the Mann Whitney Wilcoxon Test or the Wilcoxon Rank Sum Test, is used to test whether two samples are likely to derive from the same population (i.e., that the two populations have the same shape). Some investigators interpret this test as comparing the medians between the two populations. Recall that the parametric test compares the means (H_{0}: μ_{1}=μ_{2}) between independent groups.
In contrast, the null and two-sided research hypotheses for the nonparametric test are stated as follows:
H_{0}: The two populations are equal versus
H_{1}: The two populations are not equal.
This test is often performed as a two-sided test and, thus, the research hypothesis indicates that the populations are not equal as opposed to specifying directionality. A one-sided research hypothesis is used if interest lies in detecting a positive or negative shift in one population as compared to the other. The procedure for the test involves pooling the observations from the two samples into one combined sample, keeping track of which sample each observation comes from, and then ranking lowest to highest from 1 to n_{1}+n_{2}, respectively.
Example:
Consider a Phase II clinical trial designed to investigate the effectiveness of a new drug to reduce symptoms of asthma in children. A total of n=10 participants are randomized to receive either the new drug or a placebo. Participants are asked to record the number of episodes of shortness of breath over a 1 week period following receipt of the assigned treatment. The data are shown below.
Placebo  7  5  6  4  12 
New Drug  3  6  4  2  1 
Is there a difference in the number of episodes of shortness of breath over a 1 week period in participants receiving the new drug as compared to those receiving the placebo? By inspection, it appears that participants receiving the placebo have more episodes of shortness of breath, but is this statistically significant?
In this example, the outcome is a count and in this sample the data do not follow a normal distribution.
Frequency Histogram of Number of Episodes of Shortness of Breath
In addition, the sample size is small (n_{1}=n_{2}=5), so a nonparametric test is appropriate. The hypothesis is given below, and we run the test at the 5% level of significance (i.e., α=0.05).
H_{0}: The two populations are equal versus
H_{1}: The two populations are not equal.
Note that if the null hypothesis is true (i.e., the two populations are equal), we expect to see similar numbers of episodes of shortness of breath in each of the two treatment groups, and we would expect to see some participants reporting few episodes and some reporting more episodes in each group. This does not appear to be the case with the observed data. A test of hypothesis is needed to determine whether the observed data is evidence of a statistically significant difference in populations.
The first step is to assign ranks and to do so we order the data from smallest to largest. This is done on the combined or total sample (i.e., pooling the data from the two treatment groups, n=10), assigning ranks from 1 to 10, as follows. We also need to keep track of the group assignments in the total sample.

Observed Data           Total Sample (Ordered)  Ranks
Placebo     New Drug    Placebo     New Drug    Placebo     New Drug
7           3                       1                       1
5           6                       2                       2
6           4                       3                       3
4           2           4           4           4.5         4.5
12          1           5                       6
                        6           6           7.5         7.5
                        7                       9
                        12                      10

Note that the lower ranks (e.g., 1, 2 and 3) are assigned to responses in the new drug group while the higher ranks (e.g., 9, 10) are assigned to responses in the placebo group. Again, the goal of the test is to determine whether the observed data support a difference in the populations of responses. Recall that in parametric tests (discussed in the modules on hypothesis testing), when comparing means between two groups, we analyzed the difference in the sample means relative to their variability and summarized the sample information in a test statistic. A similar approach is employed here. Specifically, we produce a test statistic based on the ranks.
First, we sum the ranks in each group. In the placebo group, the sum of the ranks is 37; in the new drug group, the sum of the ranks is 18. Recall that the sum of the ranks will always equal n(n+1)/2. As a check on our assignment of ranks, we have n(n+1)/2 = 10(11)/2=55 which is equal to 37+18 = 55.
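The pooling, ranking and summing steps above can be checked with a short Python sketch. It relies on a shortcut that is equivalent to the mean-rank rule: the rank of a value equals the number of pooled values smaller than it, plus half of one more than the number of pooled values equal to it. The helper name `mean_rank` is hypothetical.

```python
placebo = [7, 5, 6, 4, 12]
new_drug = [3, 6, 4, 2, 1]
pooled = placebo + new_drug  # combined sample, n = 10


def mean_rank(value, sample):
    """Rank of `value` within `sample`, using the mean rank for ties."""
    less = sum(1 for v in sample if v < value)
    equal = sum(1 for v in sample if v == value)  # counts `value` itself
    return less + (equal + 1) / 2


r1 = sum(mean_rank(v, pooled) for v in placebo)   # sum of ranks, placebo group
r2 = sum(mean_rank(v, pooled) for v in new_drug)  # sum of ranks, new drug group

n = len(pooled)
assert r1 + r2 == n * (n + 1) / 2  # the check described in the text
print(r1, r2)
```

Here the two tied values of 4 each receive rank 4.5 and the two tied values of 6 each receive rank 7.5, which reproduces the rank sums of 37 and 18.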
For the test, we call the placebo group 1 and the new drug group 2 (assignment of groups 1 and 2 is arbitrary). We let R_{1} denote the sum of the ranks in group 1 (i.e., R_{1}=37), and R_{2} denote the sum of the ranks in group 2 (i.e., R_{2}=18). If the null hypothesis is true (i.e., if the two populations are equal), we expect R_{1} and R_{2} to be similar. In this example, the lower values (lower ranks) are clustered in the new drug group (group 2), while the higher values (higher ranks) are clustered in the placebo group (group 1). This is suggestive, but is the observed difference in the sums of the ranks simply due to chance? To answer this we will compute a test statistic to summarize the sample information and look up the corresponding value in a probability distribution.
Test Statistic for the Mann Whitney U Test
The test statistic for the Mann Whitney U Test is denoted U and is the smaller of U_{1} and U_{2}, defined below.
U_{1} = n_{1}n_{2} + n_{1}(n_{1}+1)/2 - R_{1}
U_{2} = n_{1}n_{2} + n_{2}(n_{2}+1)/2 - R_{2}
where R_{1} = sum of the ranks for group 1 and R_{2} = sum of the ranks for group 2.
For this example,
U_{1} = 5(5) + 5(6)/2 - 37 = 3
U_{2} = 5(5) + 5(6)/2 - 18 = 22
In our example, U=3. Is this evidence in support of the null or research hypothesis? Before we address this question, we consider the range of the test statistic U in two different situations.
Situation #1
Consider the situation where there is complete separation of the groups, supporting the research hypothesis that the two populations are not equal. If all of the higher numbers of episodes of shortness of breath (and thus all of the higher ranks) are in the placebo group, all of the lower numbers of episodes (and ranks) are in the new drug group, and there are no ties, then:
R_{1} = 6+7+8+9+10 = 40 and R_{2} = 1+2+3+4+5 = 15,
and
U_{1} = 5(5) + 5(6)/2 - 40 = 0 and U_{2} = 5(5) + 5(6)/2 - 15 = 25.
Therefore, when there is clearly a difference in the populations, U=0.
Situation #2
Consider a second situation where low and high scores are approximately evenly distributed in the two groups, supporting the null hypothesis that the groups are equal. If ranks of 2, 4, 6, 8 and 10 are assigned to the numbers of episodes of shortness of breath reported in the placebo group and ranks of 1, 3, 5, 7 and 9 are assigned to the numbers of episodes of shortness of breath reported in the new drug group, then:
R_{1} = 2+4+6+8+10 = 30 and R_{2} = 1+3+5+7+9 = 25,
and
U_{1} = 5(5) + 5(6)/2 - 30 = 10 and U_{2} = 5(5) + 5(6)/2 - 25 = 15.
When there is clearly no difference between populations, then U=10.
Thus, smaller values of U support the research hypothesis, and larger values of U support the null hypothesis.
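The worked example and the two situations above can be reproduced with a small Python sketch. It assumes the usual rank-sum definitions U_{1} = n_{1}n_{2} + n_{1}(n_{1}+1)/2 - R_{1} and U_{2} = n_{1}n_{2} + n_{2}(n_{2}+1)/2 - R_{2}; the function name `mann_whitney_u` is a hypothetical label for this illustration.

```python
def mann_whitney_u(r1, r2, n1, n2):
    """Return (U1, U2, U) from the rank sums, where U is the smaller of U1 and U2."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2, min(u1, u2)


# Asthma example: R1 = 37 (placebo), R2 = 18 (new drug)
observed = mann_whitney_u(37, 18, 5, 5)

# Situation #1: complete separation, ranks 6..10 versus 1..5
separated = mann_whitney_u(sum(range(6, 11)), sum(range(1, 6)), 5, 5)

# Situation #2: alternating ranks, 2,4,6,8,10 versus 1,3,5,7,9
interleaved = mann_whitney_u(sum([2, 4, 6, 8, 10]), sum([1, 3, 5, 7, 9]), 5, 5)

print(observed, separated, interleaved)
```

The observed U of 3 sits much closer to the complete-separation value of 0 than to the alternating-ranks value of 10.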
Key Concept: For any Mann-Whitney U test, U_{1} and U_{2} each range from 0 to n_{1}*n_{2}. Because U is the smaller of the two, its theoretical range is from 0 (complete separation between groups, H_{0} most likely false and H_{1} most likely true) to n_{1}*n_{2}/2 (little evidence in support of H_{1}).
In every test, U_{1}+U_{2} is always equal to n_{1}*n_{2}. In the example above, U_{1} and U_{2} can each range from 0 to 25, so U can range from 0 to 12.5, and smaller values of U support the research hypothesis (i.e., we reject H_{0} if U is small). The procedure for determining exactly when to reject H_{0} is described below. 
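The identity U_{1}+U_{2} = n_{1}*n_{2} follows from the fact that the ranks always sum to n(n+1)/2 with n = n_{1}+n_{2}. A short derivation, assuming the usual rank-sum definitions of U_{1} and U_{2} (each equal to n_{1}n_{2} plus n_{i}(n_{i}+1)/2 minus R_{i}), is sketched below.

```latex
\begin{align*}
U_1 + U_2 &= 2 n_1 n_2 + \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} - (R_1 + R_2) \\
R_1 + R_2 &= \frac{(n_1+n_2)(n_1+n_2+1)}{2}
           = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1 n_2 \\
U_1 + U_2 &= 2 n_1 n_2 - n_1 n_2 = n_1 n_2
\end{align*}
```

In the asthma example this gives 3 + 22 = 25 = 5(5), which is a useful check on the arithmetic.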
In every test, we must determine whether the observed U supports the null or research hypothesis. This is done following the same approach used in parametric testing. Specifically, we determine a critical value of U such that if the observed value of U is less than or equal to the critical value, we reject H_{0} in favor of H_{1} and if the observed value of U exceeds the critical value we do not reject H_{0}.
The critical value of U can be found in the table below. To determine the appropriate critical value we need the sample sizes (in this example, n_{1}=n_{2}=5) and our two-sided level of significance (α=0.05). For this example the critical value is 2, and the decision rule is to reject H_{0} if U ≤ 2. We do not reject H_{0} because 3 > 2. We do not have statistically significant evidence at α=0.05 to show that the two populations of numbers of episodes of shortness of breath are not equal. However, in this example, the failure to reach statistical significance may be due to low power. The sample data suggest a difference, but the sample sizes are too small to conclude that there is a statistically significant difference.
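For small samples, a critical value like the one used above can be reproduced by enumerating every way the ranks 1 to n_{1}+n_{2} could be split between the two groups, each split being equally likely when H_{0} is true. The sketch below assumes no ties and assumes that the published table is built from this exact distribution; `u_critical_value` is a hypothetical helper name.

```python
from itertools import combinations


def u_critical_value(n1, n2, alpha=0.05):
    """Largest c with P(U <= c) <= alpha (two-sided) under H0, assuming no ties.

    Returns None when even the smallest value of U is not rare enough.
    """
    n = n1 + n2
    counts = {}
    total = 0
    for group1 in combinations(range(1, n + 1), n1):
        u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(group1)
        u = min(u1, n1 * n2 - u1)  # U2 = n1*n2 - U1
        counts[u] = counts.get(u, 0) + 1
        total += 1
    critical, cumulative = None, 0
    for u in sorted(counts):
        cumulative += counts[u]
        if cumulative / total > alpha:
            break
        critical = u
    return critical


print(u_critical_value(5, 5))  # sample sizes from the asthma example
```

For n_{1}=n_{2}=5 there are 252 possible splits and the enumeration returns 2, matching the critical value used above. The number of splits grows quickly with sample size, so this approach is practical only for small samples.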
Table of Critical Values for U
Example:
A new approach to prenatal care is proposed for pregnant women living in a rural community. The new program involves in-home visits during the course of pregnancy in addition to the usual or regularly scheduled visits. A pilot randomized trial with 15 pregnant women is designed to evaluate whether women who participate in the program deliver healthier babies than women receiving usual care. The outcome is the APGAR score measured 5 minutes after birth. Recall that APGAR scores range from 0 to 10 with scores of 7 or higher considered normal (healthy), 4 to 6 low and 0 to 3 critically low. The data are shown below.
Usual Care  8  7  6  2  5  8  7  3 
New Program  9  8  7  8  10  9  6 

Is there statistical evidence of a difference in APGAR scores in women receiving the new and enhanced versus usual prenatal care? We run the test using the five-step approach.
 Step 1. Set up hypotheses and determine level of significance.
H_{0}: The two populations are equal versus
H_{1}: The two populations are not equal. α =0.05
 Step 2. Select the appropriate test statistic.
Because APGAR scores are not normally distributed and the samples are small (n_{1}=8 and n_{2}=7), we use the Mann Whitney U test. The test statistic is U, the smaller of
U_{1} = n_{1}n_{2} + n_{1}(n_{1}+1)/2 - R_{1} and U_{2} = n_{1}n_{2} + n_{2}(n_{2}+1)/2 - R_{2},
where R_{1} and R_{2} are the sums of the ranks in groups 1 and 2, respectively.
 Step 3. Set up decision rule.
The appropriate critical value can be found in the table above. To determine the appropriate critical value we need sample sizes (n_{1}=8 and n_{2}=7) and our two-sided level of significance (α=0.05). The critical value for this test with n_{1}=8, n_{2}=7 and α=0.05 is 10 and the decision rule is as follows: Reject H_{0} if U ≤ 10.
 Step 4. Compute the test statistic.
The first step is to assign ranks of 1 through 15 to the smallest through largest values in the total sample, as follows:

Observed Data             Total Sample (Ordered)    Ranks
Usual Care   New Program  Usual Care   New Program  Usual Care   New Program
8            9            2                         1
7            8            3                         2
6            7            5                         3
2            8            6            6            4.5          4.5
5            10           7            7            7            7
8            9            7                         7
7            6            8            8            10.5         10.5
3                         8            8            10.5         10.5
                                       9                         13.5
                                       9                         13.5
                                       10                        15
                                                    R_{1}=45.5   R_{2}=74.5

Next, we sum the ranks in each group. In the usual care group, the sum of the ranks is R_{1}=45.5 and in the new program group, the sum of the ranks is R_{2}=74.5. Recall that the sum of the ranks will always equal n(n+1)/2. As a check on our assignment of ranks, we have n(n+1)/2 = 15(16)/2=120 which is equal to 45.5+74.5 = 120.
We now compute U_{1} and U_{2}, as follows:
U_{1} = 8(7) + 8(9)/2 - 45.5 = 46.5
U_{2} = 8(7) + 7(8)/2 - 74.5 = 9.5
Thus, the test statistic is U=9.5.
 Step 5. Conclusion:
We reject H_{0} because 9.5 < 10. We have statistically significant evidence at α =0.05 to show that the populations of APGAR scores are not equal in women receiving usual prenatal care as compared to the new program of prenatal care.
Example:
A clinical trial is run to assess the effectiveness of a new antiretroviral therapy for patients with HIV. Patients are randomized to receive a standard antiretroviral therapy (usual care) or the new antiretroviral therapy and are monitored for 3 months. The primary outcome is viral load which represents the number of HIV copies per milliliter of blood. A total of 30 participants are randomized and the data are shown below.
Standard Therapy  7500  8000  2000  550  1250  1000  2250  6800  3400  6300  9100  970  1040  670  400 
New Therapy  400  250  800  1400  8000  7400  1020  6000  920  1420  2700  4200  5200  4100  undetectable 
Is there statistical evidence of a difference in viral load in patients receiving the standard versus the new antiretroviral therapy?
 Step 1. Set up hypotheses and determine level of significance.
H_{0}: The two populations are equal versus
H_{1}: The two populations are not equal. α=0.05
 Step 2. Select the appropriate test statistic.
Because viral load measures are not normally distributed (with outliers as well as limits of detection (e.g., "undetectable")), we use the Mann-Whitney U test. The test statistic is U, the smaller of
U_{1} = n_{1}n_{2} + n_{1}(n_{1}+1)/2 - R_{1} and U_{2} = n_{1}n_{2} + n_{2}(n_{2}+1)/2 - R_{2},
where R_{1} and R_{2} are the sums of the ranks in groups 1 and 2, respectively.
 Step 3. Set up the decision rule.
The critical value can be found in the table of critical values based on sample sizes (n_{1}=n_{2}=15) and a two-sided level of significance (α=0.05). The critical value is 64 and the decision rule is as follows: Reject H_{0} if U ≤ 64.
 Step 4. Compute the test statistic.
The first step is to assign ranks of 1 through 30 to the smallest through largest values in the total sample. Note in the table below that the "undetectable" measurement is listed first in the ordered values (smallest) and assigned a rank of 1. In the table, Standard and New denote the standard and new antiretroviral therapy groups.

Observed Data               Total Sample (Ordered)      Ranks
Standard      New           Standard      New           Standard      New
7500          400                         undetectable                1
8000          250                         250                         2
2000          800           400           400           3.5           3.5
550           1400          550                         5
1250          8000          670                         6
1000          7400                        800                         7
2250          1020                        920                         8
6800          6000          970                         9
3400          920           1000                        10
6300          1420                        1020                        11
9100          2700          1040                        12
970           4200          1250                        13
1040          5200                        1400                        14
670           4100                        1420                        15
400           undetectable  2000                        16
                            2250                        17
                                          2700                        18
                            3400                        19
                                          4100                        20
                                          4200                        21
                                          5200                        22
                                          6000                        23
                            6300                        24
                            6800                        25
                                          7400                        26
                            7500                        27
                            8000          8000          28.5          28.5
                            9100                        30
                                                        R_{1} = 245   R_{2} = 220

Next, we sum the ranks in each group. In the standard antiretroviral therapy group, the sum of the ranks is R_{1}=245; in the new antiretroviral therapy group, the sum of the ranks is R_{2}=220. Recall that the sum of the ranks will always equal n(n+1)/2. As a check on our assignment of ranks, we have n(n+1)/2 = 30(31)/2=465 which is equal to 245+220 = 465. We now compute U_{1} and U_{2}, as follows,
U_{1} = 15(15) + 15(16)/2 - 245 = 100
U_{2} = 15(15) + 15(16)/2 - 220 = 125
Thus, the test statistic is U=100.
 Step 5. Conclusion.
We do not reject H_{0} because 100 > 64. We do not have sufficient evidence to conclude that the treatment groups differ in viral load.
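One way to handle the "undetectable" measurement in software is to code it as a value that sorts below every measured viral load, so that it receives rank 1 as in the table. The Python sketch below does this with negative infinity, which is a coding choice made for this illustration rather than something prescribed by the module; the names `UNDETECTABLE` and `mean_rank` are hypothetical.

```python
UNDETECTABLE = float("-inf")  # assumption: sorts below every measured value

standard = [7500, 8000, 2000, 550, 1250, 1000, 2250, 6800,
            3400, 6300, 9100, 970, 1040, 670, 400]
new_therapy = [400, 250, 800, 1400, 8000, 7400, 1020, 6000,
               920, 1420, 2700, 4200, 5200, 4100, UNDETECTABLE]
pooled = standard + new_therapy


def mean_rank(value, sample):
    """Rank of `value` within `sample`, using the mean rank for ties."""
    less = sum(1 for v in sample if v < value)
    equal = sum(1 for v in sample if v == value)  # counts `value` itself
    return less + (equal + 1) / 2


n1, n2 = len(standard), len(new_therapy)
r1 = sum(mean_rank(v, pooled) for v in standard)
r2 = sum(mean_rank(v, pooled) for v in new_therapy)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u = min(u1, u2)
print(r1, r2, u)
```

Because the test uses only ranks, the exact value substituted for "undetectable" does not matter as long as it is smaller than every measured value.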
Tests with Matched Samples
This section describes nonparametric tests to compare two groups with respect to a continuous outcome when the data are collected on matched or paired samples. The parametric procedure for doing this was presented in the modules on hypothesis testing for the situation in which the continuous outcome was normally distributed. This section describes procedures that should be used when the outcome cannot be assumed to follow a normal distribution. There are two popular nonparametric tests to compare outcomes between two matched or paired groups. The first is called the Sign Test and the second the Wilcoxon Signed Rank Test.
Recall that when data are matched or paired, we compute difference scores for each individual and analyze difference scores. The same approach is followed in nonparametric tests. In parametric tests, the null hypothesis is that the mean difference (μ_{d}) is zero. In nonparametric tests, the null hypothesis is that the median difference is zero.
Example:
Consider a clinical investigation to assess the effectiveness of a new drug designed to reduce repetitive behaviors in children affected with autism. If the drug is effective, children will exhibit fewer repetitive behaviors on treatment as compared to when they are untreated. A total of 8 children with autism enroll in the study. Each child is observed by the study psychologist for a period of 3 hours both before treatment and then again after taking the new drug for 1 week. The time that each child is engaged in repetitive behavior during each 3 hour observation period is measured. Repetitive behavior is scored on a scale of 0 to 100 and scores represent the percent of the observation time in which the child is engaged in repetitive behavior. For example, a score of 0 indicates that during the entire observation period the child did not engage in repetitive behavior while a score of 100 indicates that the child was constantly engaged in repetitive behavior. The data are shown below.
Child  Before Treatment  After 1 Week of Treatment 

1  85  75 
2  70  50 
3  40  50 
4  65  40 
5  80  20 
6  75  65 
7  55  40 
8  20  25 
Looking at the data, it appears that some children improve (e.g., Child 5 scored 80 before treatment and 20 after treatment), but some got worse (e.g., Child 3 scored 40 before treatment and 50 after treatment). Is there statistically significant improvement in repetitive behavior after 1 week of treatment?
Because the before and after treatment measures are paired, we compute difference scores for each child. In this example, we subtract the assessment of repetitive behaviors after treatment from that measured before treatment so that difference scores represent improvement in repetitive behavior. The question of interest is whether there is significant improvement after treatment.
Child  Before Treatment  After 1 Week of Treatment  Difference (Before-After) 

1  85  75  10 
2  70  50  20 
3  40  50  -10 
4  65  40  25 
5  80  20  60 
6  75  65  10 
7  55  40  15 
8  20  25  -5 
In this small sample, the observed difference (or improvement) scores vary widely and are subject to extremes (e.g., the observed difference of 60 is an outlier). Thus, a nonparametric test is appropriate to test whether there is significant improvement in repetitive behavior before versus after treatment. The hypotheses are given below.
H_{0}: The median difference is zero versus
H_{1}: The median difference is positive. α=0.05
In this example, the null hypothesis is that there is no difference in scores before versus after treatment. If the null hypothesis is true, we expect to see some positive differences (improvement) and some negative differences (worsening). If the research hypothesis is true, we expect to see more positive differences than negative differences.
The Sign Test
The Sign Test is the simplest nonparametric test for matched or paired data. The approach is to analyze only the signs of the difference scores, as shown below:
Child  Before Treatment  After 1 Week of Treatment  Difference (Before - After)  Sign 

1  85  75  10  + 
2  70  50  20  + 
3  40  50  -10  - 
4  65  40  25  + 
5  80  20  60  + 
6  75  65  10  + 
7  55  40  15  + 
8  20  25  -5  - 
If the null hypothesis is true (i.e., if the median difference is zero) then we expect to see approximately half of the differences as positive and half of the differences as negative. If the research hypothesis is true, we expect to see more positive differences.
Test Statistic for the Sign Test
The test statistic for the Sign Test is the number of positive signs or number of negative signs, whichever is smaller. In this example, we observe 2 negative and 6 positive signs. Is this evidence of significant improvement or simply due to chance?
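As a rough sketch, the sign counts and the test statistic for this example could be computed as follows. The variable names are illustrative assumptions, and the data are the before and after scores from the tables above.

```python
# Scores for children 1-8 from the example.
before = [85, 70, 40, 65, 80, 75, 55, 20]
after = [75, 50, 50, 40, 20, 65, 40, 25]

# Positive difference (before - after) means improvement.
differences = [b - a for b, a in zip(before, after)]

n_pos = sum(d > 0 for d in differences)  # number of "+" signs
n_neg = sum(d < 0 for d in differences)  # number of "-" signs

# The Sign Test statistic is the smaller of the two counts.
test_statistic = min(n_pos, n_neg)

print(n_pos, n_neg, test_statistic)  # 6 2 2
```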
Determining whether the observed test statistic supports the null or research hypothesis is done following the same approach used in parametric testing. Specifically, we determine a critical value such that if the smaller of the number of positive or negative signs is less than or equal to that critical value, then we reject H_{0} in favor of H_{1}; if it is greater than the critical value, then we do not reject H_{0}. Notice that this is a one-sided decision rule corresponding to our one-sided research hypothesis (the two-sided situation is discussed in the next example).
Table of Critical Values for the Sign Test
The critical values for the Sign Test are in the table below.
To determine the appropriate critical value we need the sample size, which is equal to the number of matched pairs (n=8), and our one-sided level of significance α=0.05. For this example, the critical value is 1, and the decision rule is to reject H_{0} if the smaller of the number of positive or negative signs is ≤ 1. We do not reject H_{0} because 2 > 1. We do not have sufficient evidence at α=0.05 to show that there is improvement in repetitive behavior after taking the drug as compared to before. Thus, the critical value lets us decide whether to reject the null hypothesis. An alternative is to calculate the p-value, as described below.
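One way such a one-sided critical value can be derived, assuming it is defined as the largest count c for which the cumulative binomial probability P(X ≤ c) under H_{0} does not exceed α, is sketched below. The function name `sign_test_critical_value` is hypothetical, and this is not necessarily how the referenced table was constructed, although it agrees with the critical value of 1 used in this example.

```python
from math import comb


def sign_test_critical_value(n, alpha):
    """Largest c with P(X <= c) <= alpha for X ~ Binomial(n, 0.5).

    Returns None when even c = 0 exceeds alpha (no rejection is possible).
    """
    cumulative = 0.0
    critical = None
    for c in range(n + 1):
        cumulative += comb(n, c) * 0.5 ** n
        if cumulative > alpha:
            break
        critical = c
    return critical


observed = 2  # smaller of the number of positive or negative signs
critical = sign_test_critical_value(n=8, alpha=0.05)

# Reject H0 only if the observed statistic is at or below the critical value.
reject = critical is not None and observed <= critical
print(critical, reject)  # 1 False
```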
Computing P-values for the Sign Test
With the Sign Test we can readily compute a p-value based on our observed test statistic. The test statistic for the Sign Test is the smaller of the number of positive or negative signs, and it follows a binomial distribution with n = the number of subjects in the study and p=0.5 (see the module on Probability for details on the binomial distribution). In the example above, n=8 and p=0.5 (the probability of success under H_{0}).
By using the binomial distribution formula:
P(x successes) = \frac{n!}{x!(n-x)!} p^{x} (1-p)^{n-x}
we can compute the probability of observing different numbers of successes during 8 trials. These are shown in the table below.
x=Number of Successes  P(x successes) 

0  0.0039 
1  0.0313 
2  0.1094 
3  0.2188 
4  0.2734 
5  0.2188 
6  0.1094 
7  0.0313 
8  0.0039 
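The probabilities in this table can be checked with a short script. This is a minimal sketch using only the Python standard library; the names `n`, `p`, and `pmf` are illustrative.

```python
from math import comb

n, p = 8, 0.5  # 8 matched pairs; probability of a "success" under H0

# Binomial probability of exactly x successes in n trials.
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

for x, prob in enumerate(pmf):
    print(f"{x}  {prob:.4f}")
```

The printed values should agree with the table up to rounding in the fourth decimal place, and the distribution is symmetric because p=0.5.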
Recall that a p-value is the probability of observing a test statistic as or more extreme than that observed. We observed 2 negative signs. Thus, the p-value for the test is: p-value = P(x ≤ 2). Using the table above,
p-value = P(x ≤ 2) = P(0) + P(1) + P(2) = 0.0039 + 0.0313 + 0.1094 = 0.1446.
Because the p-value = 0.1446 exceeds the level of significance α=0.05, we do not have statistically significant evidence that there is improvement in repetitive behaviors after taking the drug as compared to before. Notice in the table of binomial probabilities above that we would have had to observe at most 1 negative sign to declare statistical significance using a 5% level of significance. Recall that the critical value for our test was 1, based on the table of critical values for the Sign Test (above).
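The one-sided calculation can be sketched in Python as a check. The variable names are illustrative. Note that the exact tail probability is 37/256 ≈ 0.1445; the 0.1446 quoted in the text comes from adding table entries that were already rounded to four decimals.

```python
from math import comb

n = 8          # number of matched pairs
observed = 2   # number of negative signs (the smaller count)
alpha = 0.05

# One-sided p-value: P(X <= observed) for X ~ Binomial(n, 0.5).
p_value = sum(comb(n, x) for x in range(observed + 1)) / 2 ** n

print(round(p_value, 4))
print("reject H0" if p_value <= alpha else "do not reject H0")
```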
One-Sided versus Two-Sided Test
In the example looking for differences in repetitive behaviors in children with autism, we used a one-sided test (i.e., we hypothesized improvement after taking the drug). A two-sided test can be used if we hypothesize a difference in repetitive behavior after taking the drug as compared to before. From the table of critical values for the Sign Test, we can determine a two-sided critical value and again reject H_{0} if the smaller of the number of positive or negative signs is less than or equal to that two-sided critical value. Alternatively, we can compute a two-sided p-value. If the research hypothesis is a two-sided alternative (i.e., H_{1}: The median difference is not zero), then the p-value is computed as: p-value = 2*P(x ≤ 2). Notice that this is equivalent to p-value = P(x ≤ 2) + P(x ≥ 6), representing the situation of few or many successes. Recall that in two-sided tests, we reject the null hypothesis if the test statistic is extreme in either direction. Thus, in the Sign Test, a two-sided p-value is the probability of observing few or many positive or negative signs. Here we observe 2 negative signs (and thus 6 positive signs). The opposite situation would be 6 negative signs (and thus 2 positive signs, as n=8). The two-sided p-value is the probability of observing a test statistic as or more extreme in either direction (i.e., P(x ≤ 2) + P(x ≥ 6) = 0.1446 + 0.1446 = 0.2892).
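A two-sided p-value for this example could be computed along the following lines. This is a sketch under the assumption that both tails are summed directly; the names `lower_tail`, `upper_tail`, and `two_sided` are illustrative. The exact result is 74/256 ≈ 0.2891, which differs slightly from a figure obtained by adding rounded table entries.

```python
from math import comb

n = 8
n_neg = 2                   # observed number of negative signs
k = min(n_neg, n - n_neg)   # smaller of the two sign counts

# Few successes: P(X <= k); many successes: P(X >= n - k).
lower_tail = sum(comb(n, x) for x in range(k + 1)) / 2 ** n
upper_tail = sum(comb(n, x) for x in range(n - k, n + 1)) / 2 ** n

# Cap at 1 because the two tails overlap when k equals n / 2.
two_sided = min(1.0, lower_tail + upper_tail)

print(round(two_sided, 4))
```

Because the binomial distribution with p=0.5 is symmetric, the two tails are equal, which is why doubling the one-sided p-value gives the same answer.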
When Difference Scores are Zero
There is a special circumstance that needs attention when implementing the Sign Test, which arises when one or more participants have difference scores of zero (i.e., their paired measurements are identical). If there is just one difference score of zero, some investigators drop that observation and reduce the sample size by 1 (i.e., the sample size for the binomial distribution would be n-1). This is a reasonable approach if there is just one zero. However, if there are two or more zeros, an alternative approach is preferred.
 If there is an even number of zeros, we randomly assign them positive or negative signs.
 If there is an odd number of zeros, we randomly drop one and reduce the sample size by 1, and then randomly assign the remaining observations positive or negative signs.
The following example illustrates the approach.
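Before turning to that example, the rule for zeros can be sketched in code. The function name `resolve_zero_differences`, the fixed seed, and the difference scores below are all hypothetical and chosen only for illustration; they are not data from any study described here.

```python
import random


def resolve_zero_differences(differences, rng):
    """Return (n_pos, n_neg) after handling zero differences.

    An odd number of zeros loses one observation; each remaining zero
    is randomly counted as a positive or a negative sign.
    """
    nonzero = [d for d in differences if d != 0]
    n_zeros = len(differences) - len(nonzero)
    if n_zeros % 2 == 1:
        n_zeros -= 1  # drop one zero, reducing the sample size by 1

    n_pos = sum(d > 0 for d in nonzero)
    n_neg = sum(d < 0 for d in nonzero)
    for _ in range(n_zeros):
        if rng.random() < 0.5:
            n_pos += 1
        else:
            n_neg += 1
    return n_pos, n_neg


# Hypothetical difference scores: 3 positive, 1 negative, 3 zeros.
example = [1, -1, 0, 2, 0, 0, 1]
n_pos, n_neg = resolve_zero_differences(example, random.Random(0))
n_effective = n_pos + n_neg  # sample size used for the binomial distribution
print(n_pos, n_neg, n_effective)
```

With three zeros, one is dropped and two are assigned signs at random, so the effective sample size falls from 7 to 6. Because the assignment is random, the sign counts can differ from run to run unless a seed is fixed.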
Example:
A new chemotherapy treatment is proposed for patients with breast cancer. Investigators are concerned with patients' ability to tolerate the treatment and assess their quality of life both before and after receiving the new chemotherapy treatment. Quality of life (QOL) is measured on an ordinal scale and, for analysis purposes, numbers are assigned to each response category as follows: 1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent. The data are shown below.
Patient  QOL Before Chemotherapy Treatment  QOL After Chemotherapy Treatment 

1  3  2 
2  2  3 
3  3  4 
4  2  4 
5  1  1 
6  3  4 
7  2  4 
8  3  3 
9 