Variability for Categorical Variables

Gary D. Kader
Appalachian State University

Mike Perry
Appalachian State University

Journal of Statistics Education Volume 15, Number 2 (2007), http://www.amstat.org/publications/jse/v15n2/kader.html

Copyright © 2007 by Gary D. Kader and Mike Perry all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.

Keywords: Variability, Categorical Variable, Unalikeability

Abstract

Introductory statistics textbooks rarely discuss the concept of variability for a categorical variable and thus, in this case, do not provide a measure of variability. The impression is thus given that there is no measurement of variability for a categorical variable. A measure of variability depends on the concept of variability. Research has shown that “unalikeability” is a more natural concept than “variation about the mean” for many students. A “coefficient of unalikeability” can be used to measure this type of variability. Variability in categorical data is different from variability in quantitative data. This paper develops the coefficient of unalikeability as a measure of categorical variability.

1. Introduction

Introductory statistics textbooks give considerable attention, as they should, to the distribution of quantitative variables and measures of their variability. Discussions of categorical variables, however, typically do not. The treatment of categorical data analysis usually moves immediately to the more interesting questions formulated in terms of contingency tables, with the focus of the analysis on variability among counts in the
table. There is usually no discussion of the concept of variability for a categorical variable, and thus no mention of a measure of variability that plays the role that standard deviation plays in the quantitative case. The impression is thus given that there is no concept of variability for a categorical variable, or, if there is one, there is no known way of measuring it. This impression is incorrect. There is a concept of variability for a categorical variable, and there are ways of measuring it. We suspect that a significant percentage of the teachers of introductory statistics are unaware of these ideas, and readily admit that we were not until we investigated the ideas presented in this paper.

1.1 Objectives

The purposes of this paper are several, including:

1. Describe a concept of variability for a categorical variable, and provide a method for its measurement. This is done at an elementary level which requires no probability or statistics background and thus is appropriate for an introductory course.
2. Show how these ideas evolved from research results on students’ concepts of variability for quantitative variables.
3. Although our development is done independently of previous ideas, we point out that the underlying ideas have been around for at least ninety years. The early uses were for specialized applications or in statistically sophisticated settings and thus not presented in a fashion appropriate for a student’s first exposure to variability.

1.2 Students’ Concepts of Variability

Intuitive concepts of variation might differ among our students; we may be talking about one concept of variation in our classes while our students are thinking about another! In a study by Loosen, Lioen, and Lacante (1985), students were shown two sets of blocks, referred to here as set I and set II (see Figure 1). In the original study the blocks in set I were painted red and were 10, 20, 30, 40, 50, 60 cm high.
The blocks of set II were painted yellow and were 10, 10, 10 and 60, 60, 60 cm high. Note that for quantitative data the height of each block in this physical representation indicates the magnitude of the corresponding value.

The students were instructed as follows: “These are two sets of blocks: a set of red blocks and a set of yellow ones. In which set do the blocks have the greater variation among themselves?” Fifty percent selected set I for the greater variation, 36% selected II and 14% said there was no difference.

The 50% who selected set I are making their judgment on the observation that no two blocks have the same length. These students are basing their choice on an intuitive concept of variability – unalikeability – the lack of bars of the same size or the lack of clusters of bars of the same size. These learners do not think of variation as “how much the values differ from the mean.” Their perception has to do with “how often the
observations differ from one another.” The authors point out that this can be an important part of a classroom lesson. The teacher can show the students that the standard deviation would indicate that set II has the greater variation because its standard deviation is larger than that of set I, and that the standard deviation is not measuring the concept of variation for students who selected set I.

1.3 The Coefficient of Unalikeability

Unalikeability is defined to mean how often observations differ from one another. The concept of unalikeability focuses on how often observations differ, not how much. The incidence of differences for the six blocks of set I and set II is indicated in Table 1. Each table gives all possible pairings of the sizes of the bars, and table entries are either 0 or 1 to indicate whether the block sizes are equal or different, respectively. Note that all pairs are indicated twice, once in each half of the table. Comparisons of a block with itself are not of interest and are indicated with an asterisk.
Table 1. Incidence of Differences

Set I
      10   20   30   40   50   60
10     *    1    1    1    1    1
20     1    *    1    1    1    1
30     1    1    *    1    1    1
40     1    1    1    *    1    1
50     1    1    1    1    *    1
60     1    1    1    1    1    *

Set II
      10   10   10   60   60   60
10     *    0    0    1    1    1
10     0    *    0    1    1    1
10     0    0    *    1    1    1
60     1    1    1    *    0    0
60     1    1    1    0    *    0
60     1    1    1    0    0    *

If the 1’s in a table are added up, we obtain the number of differences that occur when all possible comparisons are made, one observation with another. If we divide by 36 − 6 = 30, the number of comparisons, then we get the proportion of differences that occur. For set I, where all of the data differ from one another, this proportion is 30/30 = 1. For set II, the proportion is 18/30 = 0.60. Note that since all pairs appear twice, only half of the entries need to be counted. In the case of set II, there would be 15 comparisons, and the proportion would be 9/15 = 0.60. Also note that if all of the data are equal in value, this proportion is 0.

This provides a coefficient of unalikeability on a scale from 0 to 1. The higher the value, the more unalike the data are. If x1, x2, …, xn are n observations on a quantitative variable, x, Perry and Kader (2005) give a general definition for the coefficient of unalikeability as:

$$u = \frac{\sum_{i \neq j} c(x_i, x_j)}{n^2 - n}$$

where

$$c(x_i, x_j) = \begin{cases} 1, & x_i \neq x_j \\ 0, & x_i = x_j \end{cases}$$
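The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `unalikeability` is hypothetical, and it assumes the coefficient is the number of differing ordered pairs divided by n² − n, as in the 18/30 calculation for set II.

```python
from itertools import permutations

def unalikeability(values):
    """Proportion of ordered pairs (i != j) whose values differ.

    Hypothetical helper: assumes the coefficient is the count of 1's in the
    incidence table divided by n^2 - n, the number of off-diagonal cells.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    # permutations(values, 2) pairs every position with every other position,
    # so equal values in different positions are still compared.
    differing = sum(1 for a, b in permutations(values, 2) if a != b)
    return differing / (n * n - n)

set_i = [10, 20, 30, 40, 50, 60]    # red blocks
set_ii = [10, 10, 10, 60, 60, 60]   # yellow blocks

u_set_i = unalikeability(set_i)
u_set_ii = unalikeability(set_ii)
print(u_set_i, u_set_ii)
```

Under these assumptions the two results correspond to the proportions 30/30 and 18/30 worked out in the text.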
This coefficient was suggested by the idea of a “within data” variance. Gordon (1986) reminds us that standard deviation and variance can be defined independently of the mean by taking the average of the squares of the differences between each pair of values:

$$W = \frac{\sum_{i \neq j} (x_i - x_j)^2}{n^2 - n}$$

The coefficient of unalikeability mimics this idea by replacing the squares of distances with the 0 – 1 indicator of differences. Gordon points out that

$$W = 2s^2$$

2. ANOTHER LOOK AT UNALIKEABILITY

We were recently examining some of the ideas underlying the coefficient of unalikeability and in doing so took a look at the coefficient from the perspective of categorical variables. Although the length of the bars in Figure 1 is a quantitative variable, the students who think of variability as unalikeability are forming categories. A category consists of all bars of the same length; once the categories are formed, the actual lengths are ignored. Note that in the case of a categorical variable, x, each observation is classified into one of m distinct categories. In this case, the definition for the quantity $c(x_i, x_j)$ becomes:

$$c(x_i, x_j) = \begin{cases} 1, & \text{if } x_i \text{ and } x_j \text{ are in different categories} \\ 0, & \text{if } x_i \text{ and } x_j \text{ are in the same category} \end{cases}$$

2.1 Visualizing Variability in Categorical Data

Variability in categorical data is somewhat different from variability in numerical data. Let’s begin by examining three groups of data with ten responses on a variable with two possible outcomes – Category A or Category B.

Group 1: Seven responses in Category A; three responses in Category B
Group 2: Five responses in Category A; five responses in Category B
Group 3: One response in Category A; nine responses in Category B

Figure 2 provides a physical representation for these three different situations. Note that, unlike numerical data, the bar height in this representation for categorical data does not indicate the magnitude of a response; it indicates only whether the response was in Category A or Category B.
The incidence of differences for the ten responses of Groups 1, 2, and 3 is shown in Table 2. Each table gives all possible pairings of responses, and table entries are either 1 or 0 to indicate whether the responses are unalike or alike, respectively. The corresponding values for u2 are indicated in Table 3.

Figure 2. Physical Representations for Three Groups of Categorical Data (panels: Group 1, Group 2, Group 3)
Table 2. Incidence of Differences for Three Groups of Categorical Data

Group 1
     A  A  A  A  A  A  A  B  B  B
A    0  0  0  0  0  0  0  1  1  1
A    0  0  0  0  0  0  0  1  1  1
A    0  0  0  0  0  0  0  1  1  1
A    0  0  0  0  0  0  0  1  1  1
A    0  0  0  0  0  0  0  1  1  1
A    0  0  0  0  0  0  0  1  1  1
A    0  0  0  0  0  0  0  1  1  1
B    1  1  1  1  1  1  1  0  0  0
B    1  1  1  1  1  1  1  0  0  0
B    1  1  1  1  1  1  1  0  0  0

Group 2
     A  A  A  A  A  B  B  B  B  B
A    0  0  0  0  0  1  1  1  1  1
A    0  0  0  0  0  1  1  1  1  1
A    0  0  0  0  0  1  1  1  1  1
A    0  0  0  0  0  1  1  1  1  1
A    0  0  0  0  0  1  1  1  1  1
B    1  1  1  1  1  0  0  0  0  0
B    1  1  1  1  1  0  0  0  0  0
B    1  1  1  1  1  0  0  0  0  0
B    1  1  1  1  1  0  0  0  0  0
B    1  1  1  1  1  0  0  0  0  0

Group 3
     A  B  B  B  B  B  B  B  B  B
A    0  1  1  1  1  1  1  1  1  1
B    1  0  0  0  0  0  0  0  0  0
B    1  0  0  0  0  0  0  0  0  0
B    1  0  0  0  0  0  0  0  0  0
B    1  0  0  0  0  0  0  0  0  0
B    1  0  0  0  0  0  0  0  0  0
B    1  0  0  0  0  0  0  0  0  0
B    1  0  0  0  0  0  0  0  0  0
B    1  0  0  0  0  0  0  0  0  0
B    1  0  0  0  0  0  0  0  0  0
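As an illustration of how such incidence tables can be generated and tallied, here is a short Python sketch. The helper names `incidence_table` and `count_ones` are hypothetical, and the sketch assumes, as in Table 2, that a response compared with itself is recorded as 0 rather than excluded.

```python
def incidence_table(responses):
    """n x n table of 0/1 entries: 1 where the two responses are unalike."""
    return [[int(a != b) for b in responses] for a in responses]

def count_ones(table):
    """Total number of unalike pairings recorded in the table."""
    return sum(sum(row) for row in table)

# The three groups of ten responses described in Section 2.1.
groups = {
    "Group 1": ["A"] * 7 + ["B"] * 3,
    "Group 2": ["A"] * 5 + ["B"] * 5,
    "Group 3": ["A"] * 1 + ["B"] * 9,
}

ones = {name: count_ones(incidence_table(r)) for name, r in groups.items()}
print(ones)
```

Each total counts every unalike pair twice, once in each half of the table, which matches the way the tables are laid out.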
Table 3. Value of u2 for Three Groups of Categorical Data

Group   u2
1       42/100 = .42
2       50/100 = .50
3       18/100 = .18

The values for u2 indicate that the data in Group 3 are most alike and the data in Group 2 are most unalike. That is, Group 3 has the least variation and Group 2 has the most variation.

A second look at the table of incidences for Group 1 (Table 2) reveals that the 1’s occur in the array in blocks. The sum of the 1’s can be determined by:

$$(7)(3) + (3)(7) = 42$$

Thus

$$u_2 = \frac{(7)(3) + (3)(7)}{100} = 2(.7)(.3) = .42$$

Note that here u2 has the form:

$$u_2 = 2 p_A p_B \qquad (1)$$

where $p_A$, $p_B$ are the proportion of responses in categories A, B respectively.

The sum of the 1’s can also be determined by:

Thus

Note that here u2 has the form

The sum of the 1’s can also be determined by:

Thus

Note that here u2 has the form
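A small numerical check may help here. The sketch below assumes three common ways of counting the 1’s in a block-structured incidence table: the cross-category blocks, the complement of the within-category blocks, and a row-by-row count. These may not be the exact forms the authors display, and all function names are hypothetical.

```python
from collections import Counter

def category_counts(responses):
    """Number of responses in each category."""
    return list(Counter(responses).values())

def u2_cross_blocks(responses):
    # Sum the areas of the off-diagonal blocks: n_j * n_k for every j != k.
    counts = category_counts(responses)
    n = len(responses)
    ones = sum(cj * ck
               for j, cj in enumerate(counts)
               for k, ck in enumerate(counts)
               if j != k)
    return ones / n**2

def u2_complement(responses):
    # All n^2 cells minus the within-category (all-zero) blocks of area n_k^2.
    counts = category_counts(responses)
    n = len(responses)
    ones = n**2 - sum(c * c for c in counts)
    return ones / n**2

def u2_row_wise(responses):
    # Each of the n_k rows in category k contains n - n_k ones.
    counts = category_counts(responses)
    n = len(responses)
    ones = sum(c * (n - c) for c in counts)
    return ones / n**2

group_1 = ["A"] * 7 + ["B"] * 3
results = (u2_cross_blocks(group_1), u2_complement(group_1), u2_row_wise(group_1))
print(results)
```

In proportion terms, these three counts correspond to the sum of $p_j p_k$ over $j \neq k$, to $1 - \sum p_k^2$, and to $\sum p_k (1 - p_k)$, which are algebraically equal because the proportions sum to 1.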
In each case we get .42, the proportion of possible pairings which are unalike. Note that the three formulas for finding u2 work for Groups 2 and 3 as well.

2.3 Connections to a Bernoulli Variable

The widely used Bernoulli variable codes responses for a two-outcome categorical variable as 1 (Category A) or 0 (Category B). With $p_A$ = the proportion of 1’s, or the proportion of responses in Category A, and $p_B = 1 - p_A$ = the proportion of 0’s, or the proportion of responses in Category B, it is well known that the mean of a Bernoulli variable is $p_A$ and the variance, V, is $p_A p_B$. So, like the second form of Gordon’s within variance, the coefficient of unalikeability as described in Equation (1) can be expressed as:

$$u_2 = 2 p_A p_B = 2V$$

2.4 Quantifying Variability with Three Categories

Consider the following data on ten responses for a variable with three possible outcomes – Category A, Category B or Category C:

Group 4: Two responses in Category A; three responses in Category B; and five responses in Category C

The table of incidences for Group 4 (Table 4) reveals that the 1’s again occur in the array in blocks.

Table 4. Incidence of Differences for Three Outcome Categorical Variable

Group 4
     A  A  B  B  B  C  C  C  C  C
A    0  0  1  1  1  1  1  1  1  1
A    0  0  1  1  1  1  1  1  1  1
B    1  1  0  0  0  1  1  1  1  1
B    1  1  0  0  0  1  1  1  1  1
B    1  1  0  0  0  1  1  1  1  1
C    1  1  1  1  1  0  0  0  0  0
C    1  1  1  1  1  0  0  0  0  0
C    1  1  1  1  1  0  0  0  0  0
C    1  1  1  1  1  0  0  0  0  0
C    1  1  1  1  1  0  0  0  0  0

The sum of the 1’s in Table 4 can be determined by:
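One way to carry out this count is sketched below. It is an illustration only, assuming the same block reasoning used for Group 1 (each pair of distinct categories contributes a block of 1’s) and the same divisor of 100 used in Table 3; the authors’ own display may be arranged differently.

```latex
% Blocks of 1's in Table 4: one block for each ordered pair of distinct categories
% (category sizes: A = 2, B = 3, C = 5; n = 10)
\[
(2)(3) + (2)(5) + (3)(2) + (3)(5) + (5)(2) + (5)(3) = 62
\]
% Equivalent count: all 100 cells minus the three all-zero diagonal blocks
\[
100 - 2^2 - 3^2 - 5^2 = 62
\]
% Resulting proportion of unalike pairings
\[
u_2 = \frac{62}{100} = .62
\]
```

Both counts agree with a direct tally of Table 4, in which the two A rows hold 8 ones each, the three B rows hold 7 each, and the five C rows hold 5 each.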