]]>

]]>

In the statistical literature, for ordinal types of data, are known lots of indicators to measure the degree of the polarization phenomenon. Typically, many of the widely used measures of distributional variability are defined as a function of a reference point, which in some “sense” could be considered representative for the entire population. This function indicates how much all the values differ from the point that is considered “typical”.Of all measures of variability, the variance is a well-known example that use the mean as a reference point. However, mean-based measures depend to the scale applied to the categories (Allison & Foster, 2004) and are highly sensitive to outliers. An alternative approach is to compare the distribution of an ordinal variable with that of a maximum dispersion, that is the two-point extreme distribution (i.e A distribution in which half of the population is concentrated in the lowest category and half in the top category). Using this procedure, three measures of variation for ordinal categorical data have been suggested, the Linear Order of Variation - LOV (Berry & Mielke, 1992), the Index Order of Variation - IOV (Leik, 1966) and the COV (Kvalseth, Coefficients of variations for nominal and ordinal categorical data., 1990). All these indices are based on the cumulative relative frequency distribution (CDF), since this contains all the distributional information of any ordinal variable (Blair & Lacy, 1996). Consequently, none of these measures rely on ordinal assumptions about distances between categories.At this point, the reader might wonder if this approach to dispersion is adequate to define the functional form of a polarization measure. “Why not measure the dispersion of the observed distribution as the distance from a point of minimal dispersion?”. This question has been addressed by Blair and Lacy in the article Statistics of ordinal variation (Blair & Lacy, 2000). They argue that this approach is impractical since there are as many one-point distribution as the number of categories. Therefore, it would not be clear, from which one we should calculate the distance.Then, is there another way to compare the dispersion of a distribution that does not depend on its location? Is the space of the cumulative frequency vector the only way to represent all possible distributions?To address this challenge, I propose a new representation of probability measures, the Bilateral Cumulative Distribution Function (BCDF), which derives from a generalization of the CDF. Basically, it is an extended CDF that can be easily obtained by folding its upper part, commonly known as survival function or complementary CDF. Unlike the CDF, this functional has a finite constant area independently of the probability distribution (pdf) and, therefore, more convenient for any distribution comparison. For the definition, properties and computation of the BCDF see Appendix A3:(Pinzari et all 2019)On this basis, to capture the amount of fluctuations about the mean and simultaneously the local variation around the median, we completely defined the shape of a probability distribution by its BCDF autocorrelation function (BCDFA). The BCDFA is a symmetric function that attains the MAD (Median Absolute Deviation) as its maximum and preserved the variance of the pdf. It follows that the maximum extent to which a BCDFA is stretched occurs when the mass probability is evenly concentrated at the end points of the distribution. In such a configuration, the variance of any bounded probability distribution is maximum (Bathia & Chander, 2000). On the other hand, the minimum support is attained when all the values fall in a single category. The main advantage of this representation is that it is invariant to the location of a distribution and therefore is only sensitive to the distribution shape. For example, any singleton distribution will have the same BCDFA curve. Similarly, distributions with same shape but different means and medians can be represented with a unique curve.For instance, the figure above shows the BCDFA for a family of two -point distributions over ten categories. The curve tau_0 represents the nine distributions uniformly distributed over two contiguous categories (The histogram on the left bottom corner illustrates a member of this bimodal class of distributions). On the other hand, the curve tau_8 represent the two-point extreme distribution. Lastly, the red and light blue curves represent the singletons and uniform distribution, respectively. In this way, to quantify the distance between a pdf and the singleton distribution, it is only a matter of choosing an appropriate metric, which would assign larger values to distributions with more dispersed BCDFA than the singleton. For the definition and computation of the BCDFA see Appendix A3.The selection of a measure to compare probability distribution is not a trivial matter and usually depends on the objectives. In this work, I propose the use of the Jensen-Shannon Divergence (Lin, 1991) . Since its definition is based on the BCDFA as opposed to density functions or CDF, is more regular than the variance, COV, IOV and LOV. In addition, unlike the other dispersion indices, this measure does not need to be normalized since is a bounded value in the unit interval. Another crucial characteristic of this metric is that it is infinite differentiable and its derivates are slowly decreasing than any power metric. A function with this characteristic is called tempered distribution or dual Schwartz functions (Stein & Shakarchi, 2003 p.134). Clearly, there are other functions that belong to this functional space (Taneja, 2001) (Jenssen, Principe, & Erologmus, 2006). However, the JSD is a well-known divergence and its square root is a metric (Endres & Schindelin, 2003). This last property will allow us to increase or reduce the magnitude of the divergence. Without loss of generality I call this class of functions, Divergence Index (DI).Thank you for reading this technical post.All the code required to compute the BCDF, ABCDF and DI is available on my GitHub account:https://github.com/lpinzari/homogeneity-location-indexSee More

In the statistical literature, for ordinal types of data, are known lots of indicators to measure the degree of the polarization phenomenon. Typically, many of the widely used measures of distributional variability are defined as a function of a reference point, which in some “sense” could be considered representative for the entire population. This function indicates how much all the values differ from the point that is considered “typical”.Of all measures of variability, the variance is a well-known example that use the mean as a reference point. However, mean-based measures depend to the scale applied to the categories (Allison & Foster, 2004) and are highly sensitive to outliers. An alternative approach is to compare the distribution of an ordinal variable with that of a maximum dispersion, that is the two-point extreme distribution (i.e A distribution in which half of the population is concentrated in the lowest category and half in the top category). Using this procedure, three measures of variation for ordinal categorical data have been suggested, the Linear Order of Variation - LOV (Berry & Mielke, 1992), the Index Order of Variation - IOV (Leik, 1966) and the COV (Kvalseth, Coefficients of variations for nominal and ordinal categorical data., 1990). All these indices are based on the cumulative relative frequency distribution (CDF), since this contains all the distributional information of any ordinal variable (Blair & Lacy, 1996). Consequently, none of these measures rely on ordinal assumptions about distances between categories.At this point, the reader might wonder if this approach to dispersion is adequate to define the functional form of a polarization measure. “Why not measure the dispersion of the observed distribution as the distance from a point of minimal dispersion?”. This question has been addressed by Blair and Lacy in the article Statistics of ordinal variation (Blair & Lacy, 2000). They argue that this approach is impractical since there are as many one-point distribution as the number of categories. Therefore, it would not be clear, from which one we should calculate the distance.Then, is there another way to compare the dispersion of a distribution that does not depend on its location? Is the space of the cumulative frequency vector the only way to represent all possible distributions?To address this challenge, I propose a new representation of probability measures, the Bilateral Cumulative Distribution Function (BCDF), which derives from a generalization of the CDF. Basically, it is an extended CDF that can be easily obtained by folding its upper part, commonly known as survival function or complementary CDF. Unlike the CDF, this functional has a finite constant area independently of the probability distribution (pdf) and, therefore, more convenient for any distribution comparison. For the definition, properties and computation of the BCDF see Appendix A3:(Pinzari et all 2019)On this basis, to capture the amount of fluctuations about the mean and simultaneously the local variation around the median, we completely defined the shape of a probability distribution by its BCDF autocorrelation function (BCDFA). The BCDFA is a symmetric function that attains the MAD (Median Absolute Deviation) as its maximum and preserved the variance of the pdf. It follows that the maximum extent to which a BCDFA is stretched occurs when the mass probability is evenly concentrated at the end points of the distribution. In such a configuration, the variance of any bounded probability distribution is maximum (Bathia & Chander, 2000). On the other hand, the minimum support is attained when all the values fall in a single category. The main advantage of this representation is that it is invariant to the location of a distribution and therefore is only sensitive to the distribution shape. For example, any singleton distribution will have the same BCDFA curve. Similarly, distributions with same shape but different means and medians can be represented with a unique curve.For instance, the figure above shows the BCDFA for a family of two -point distributions over ten categories. The curve tau_0 represents the nine distributions uniformly distributed over two contiguous categories (The histogram on the left bottom corner illustrates a member of this bimodal class of distributions). On the other hand, the curve tau_8 represent the two-point extreme distribution. Lastly, the red and light blue curves represent the singletons and uniform distribution, respectively. In this way, to quantify the distance between a pdf and the singleton distribution, it is only a matter of choosing an appropriate metric, which would assign larger values to distributions with more dispersed BCDFA than the singleton. For the definition and computation of the BCDFA see Appendix A3.The selection of a measure to compare probability distribution is not a trivial matter and usually depends on the objectives. In this work, I propose the use of the Jensen-Shannon Divergence (Lin, 1991) . Since its definition is based on the BCDFA as opposed to density functions or CDF, is more regular than the variance, COV, IOV and LOV. In addition, unlike the other dispersion indices, this measure does not need to be normalized since is a bounded value in the unit interval. Another crucial characteristic of this metric is that it is infinite differentiable and its derivates are slowly decreasing than any power metric. A function with this characteristic is called tempered distribution or dual Schwartz functions (Stein & Shakarchi, 2003 p.134). Clearly, there are other functions that belong to this functional space (Taneja, 2001) (Jenssen, Principe, & Erologmus, 2006). However, the JSD is a well-known divergence and its square root is a metric (Endres & Schindelin, 2003). This last property will allow us to increase or reduce the magnitude of the divergence. Without loss of generality I call this class of functions, Divergence Index (DI).Thank you for reading this technical post.All the code required to compute the BCDF, ABCDF and DI is available on my GitHub account:https://github.com/lpinzari/homogeneity-location-indexSee More

The analysis and classification of ordinal categorical data are central in most scientific domains and ubiquitous in governments and businesses.Examples of ordinal data are either found in questionnaires for measuring opinions or self-reported health status. A well-known example of ordinal data is the Likert Scale [1](DISLIKE = 1, DISLIKE SOMEWHAT = 2, NEUTRAL = 3, LIKE SOME WHAT = 4, LIKE = 5).Other examples are age measured in years (0-20, 21-40, 41-60, 61-80, above 80), body mass index (BMI) measured as (< 18.5, 18.5 - 24.9, 25 - 29, >= 30) for (underweight, normal weight, overweight, obese) or income categories and socioeconomic indices grouped in quantiles (e.g quintiles or deciles)."In all cases, ordinal scales result when inherently continuous variables are measured or summarized by analysts by collapsing the possible values into a set of categories"[2]A particular difficulty in dealing with distributions of ordinal data is to specify the concept of dispersion and to define a measure that has adequate properties. Recently, researchers have acknowledged this problem and addresses the issue of measuring the dispersion of ordinal data based on frequency distribution [3].Following this approach, we introduce an easy to use statistical framework for the identification and classification of homogeneous distributions. We propose an Homogeneity and Location Index to measure the concentration and central value of an ordinal categorical distribution. We also provide a transparent set of criteria that a user can follow to establish if a given Homogeneity's value indicates a "high" or "low" concentration of values around the central value of a distribution.We applied our framework to assess the socioeconomic homogeneity of the commonly used SA3 Australian Census Geography.Figure 1: Conceptual framework for the classification of homogeneous areas. Source A Framework for the classification and identification of homogeneous socioeconomic areas in the analysis of health care variation.In Fig. 1, we illustrate the proposed conceptual framework that could be useful for the evaluation of homogeneous areas in health geographic studies. The first decision is the selection of the larger geographic area (e.g. SA3) and its subunits (e.g. SA1: A smaller ABS geography). Then, the contextual dimension along which one wishes to measure the homogeneity of the geographic area must be defined (e.g. SEIFA, Socio-Economic Indexes for Areas). Third, the selection of the variable used in the model must be specified since measuring the homogeneity among multiple unordered or multiple ordered categories of a variable needs a different set of measurement tools (e.g. IRSD decile). Finally, the selection of the statistical model used to represent the distributional characteristics of the area. We are interested in measuring and operationalising the distribution of a categorical ordinal variable such as the proportion of people in each decile category of the IRSD.This set of analyses uses SA3s to assess the homogeneity of a geographic area. However, the approach can be used to evaluate the socioeconomic homogeneity across any specified geographical boundaries. It is important to notice that the methodology does not require access to fine geographic scale data, and it is easily applied to any distribution of a categorical ordinal variable. Therefore, it requires only the distribution of the attributes for the larger area.Our approach is founded on the general theory of probability distributions, and our aim is to provide a natural benchmark for a homogeneity measure in terms of what is a “high” (i.e. homogeneous) and “low” (i.e. heterogeneous) concentration of a probability distribution. Currently, there is no accepted benchmark that could be used to assess the homogeneity of a categorical ordinal variable. In this work, we show how the proposed statistical indices can be used to investigate the diversity of a geographic area and determine when the unit of analysis should not be used for reporting health outcomes by socioeconomic status.The R code and data sets are available on my GitHub account: homogeneity-location-indexThe scripts also include statistical utilities to compute:convolutionautocorrelationGini IndexI hope my work could be beneficial to any organisation or scientific community involved in classification problems.See More

The analysis and classification of ordinal categorical data are central in most scientific domains and ubiquitous in governments and businesses.Examples of ordinal data are either found in questionnaires for measuring opinions or self-reported health status. A well-known example of ordinal data is the Likert Scale [1](DISLIKE = 1, DISLIKE SOMEWHAT = 2, NEUTRAL = 3, LIKE SOME WHAT = 4, LIKE = 5).Other examples are age measured in years (0-20, 21-40, 41-60, 61-80, above 80), body mass index (BMI) measured as (< 18.5, 18.5 - 24.9, 25 - 29, >= 30) for (underweight, normal weight, overweight, obese) or income categories and socioeconomic indices grouped in quantiles (e.g quintiles or deciles)."In all cases, ordinal scales result when inherently continuous variables are measured or summarized by analysts by collapsing the possible values into a set of categories"[2]A particular difficulty in dealing with distributions of ordinal data is to specify the concept of dispersion and to define a measure that has adequate properties. Recently, researchers have acknowledged this problem and addresses the issue of measuring the dispersion of ordinal data based on frequency distribution [3].Following this approach, we introduce an easy to use statistical framework for the identification and classification of homogeneous distributions. We propose an Homogeneity and Location Index to measure the concentration and central value of an ordinal categorical distribution. We also provide a transparent set of criteria that a user can follow to establish if a given Homogeneity's value indicates a "high" or "low" concentration of values around the central value of a distribution.We applied our framework to assess the socioeconomic homogeneity of the commonly used SA3 Australian Census Geography.Figure 1: Conceptual framework for the classification of homogeneous areas. Source A Framework for the classification and identification of homogeneous socioeconomic areas in the analysis of health care variation.In Fig. 1, we illustrate the proposed conceptual framework that could be useful for the evaluation of homogeneous areas in health geographic studies. The first decision is the selection of the larger geographic area (e.g. SA3) and its subunits (e.g. SA1: A smaller ABS geography). Then, the contextual dimension along which one wishes to measure the homogeneity of the geographic area must be defined (e.g. SEIFA, Socio-Economic Indexes for Areas). Third, the selection of the variable used in the model must be specified since measuring the homogeneity among multiple unordered or multiple ordered categories of a variable needs a different set of measurement tools (e.g. IRSD decile). Finally, the selection of the statistical model used to represent the distributional characteristics of the area. We are interested in measuring and operationalising the distribution of a categorical ordinal variable such as the proportion of people in each decile category of the IRSD.This set of analyses uses SA3s to assess the homogeneity of a geographic area. However, the approach can be used to evaluate the socioeconomic homogeneity across any specified geographical boundaries. It is important to notice that the methodology does not require access to fine geographic scale data, and it is easily applied to any distribution of a categorical ordinal variable. Therefore, it requires only the distribution of the attributes for the larger area.Our approach is founded on the general theory of probability distributions, and our aim is to provide a natural benchmark for a homogeneity measure in terms of what is a “high” (i.e. homogeneous) and “low” (i.e. heterogeneous) concentration of a probability distribution. Currently, there is no accepted benchmark that could be used to assess the homogeneity of a categorical ordinal variable. In this work, we show how the proposed statistical indices can be used to investigate the diversity of a geographic area and determine when the unit of analysis should not be used for reporting health outcomes by socioeconomic status.The R code and data sets are available on my GitHub account: homogeneity-location-indexThe scripts also include statistical utilities to compute:convolutionautocorrelationGini IndexI hope my work could be beneficial to any organisation or scientific community involved in classification problems.See More