The analysis and classification of ordinal categorical data are central to most scientific domains and ubiquitous in government and business.

Examples of ordinal data are found in questionnaires that measure opinions or self-reported health status. A well-known example is the Likert scale [1]:

(DISLIKE = **1**, DISLIKE SOMEWHAT = **2**, NEUTRAL = **3**, LIKE SOMEWHAT = **4**, LIKE = **5**).

Other examples are age grouped into bands of years (0-20, 21-40, 41-60, 61-80, above 80), body mass index (BMI) grouped as (< 18.5, 18.5 - 24.9, 25 - 29.9, >= 30) for (underweight, normal weight, overweight, obese), and income categories or socioeconomic indices grouped in quantiles (e.g. quintiles or deciles).
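As a rough illustration of how a continuous measurement collapses into ordered categories, the sketch below bins BMI values using the cut points listed above and then summarises a sample as a frequency distribution. The function names and the sample values are hypothetical, chosen only for this example.

```python
from bisect import bisect_right

# Lower bounds of the three upper BMI classes, taken from the ranges in the text.
BMI_CUTS = [18.5, 25.0, 30.0]
BMI_LABELS = ["underweight", "normal weight", "overweight", "obese"]


def bmi_category(bmi):
    """Return (rank, label) of the ordinal BMI class for a continuous BMI value."""
    # bisect_right sends a value equal to a cut point to the upper class.
    index = bisect_right(BMI_CUTS, bmi)
    return index + 1, BMI_LABELS[index]


def frequency_distribution(values):
    """Proportion of observations in each ordinal class, in class order."""
    counts = [0] * len(BMI_LABELS)
    for value in values:
        rank, _label = bmi_category(value)
        counts[rank - 1] += 1
    return [count / len(values) for count in counts]


if __name__ == "__main__":
    sample = [17.0, 22.0, 23.0, 31.0]  # hypothetical BMI values
    print(frequency_distribution(sample))
```

Only the ranks (1 to 4) carry the ordering; the distances between categories are not meaningful, which is what makes dispersion harder to define than for interval data.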

"In all cases, ordinal scales result when inherently continuous variables are measured or summarized by analysts by collapsing the possible values into a set of categories" [2].

A particular difficulty in dealing with distributions of ordinal data is to specify the concept of dispersion and to define a measure with adequate properties. Recently, researchers have acknowledged this problem and addressed the issue of measuring the dispersion of ordinal data based on the frequency distribution [3].

Following this approach, we introduce an easy-to-use **statistical framework** for the identification and classification of

We applied our framework to assess the socioeconomic homogeneity of the commonly used SA3 Australian Census Geography.

**Figure 1**: Conceptual framework for the classification of homogeneous areas. Source: *A Framework for the classification and identification of homogeneous socioeconomic areas in the analysis of health care variation*.

In **Fig. 1**, we illustrate the proposed conceptual framework, which could be useful for evaluating homogeneous areas in health geographic studies. First, select the larger geographic area (e.g. SA3) and its subunits (e.g. SA1, a smaller ABS geography). Second, define the contextual dimension along which the homogeneity of the geographic area is to be measured (e.g. SEIFA, Socio-Economic Indexes for Areas). Third, specify the variable used in the model (e.g. IRSD decile), since measuring homogeneity among multiple unordered categories requires a different set of measurement tools from measuring it among multiple ordered categories. Finally, select the statistical model used to represent the distributional characteristics of the area. We are interested in measuring and operationalising the distribution of a categorical ordinal variable, such as the proportion of people in each decile category of the IRSD.
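The first three steps of the framework amount to building, for each larger area, the proportion of people in each ordered category. The sketch below shows one way that could look. The area codes, populations and record layout are entirely hypothetical and are not taken from the Census data or from the accompanying R scripts.

```python
from collections import defaultdict

# Hypothetical records: (SA3 code, SA1 code, IRSD decile 1..10, population).
SA1_RECORDS = [
    ("SA3_A", "SA1_001", 1, 400),
    ("SA3_A", "SA1_002", 1, 350),
    ("SA3_A", "SA1_003", 2, 250),
    ("SA3_B", "SA1_101", 1, 300),
    ("SA3_B", "SA1_102", 5, 300),
    ("SA3_B", "SA1_103", 10, 400),
]


def decile_distribution(records, n_categories=10):
    """Map each larger area to its population share in every ordered category."""
    counts = defaultdict(lambda: [0] * n_categories)
    for area, _subunit, decile, population in records:
        counts[area][decile - 1] += population

    distributions = {}
    for area, area_counts in counts.items():
        total = sum(area_counts)
        distributions[area] = [count / total for count in area_counts]
    return distributions


if __name__ == "__main__":
    for area, shares in decile_distribution(SA1_RECORDS).items():
        print(area, [round(share, 2) for share in shares])
```

In this toy data, the first area concentrates its population in the two lowest deciles, while the second spreads it across deciles 1, 5 and 10. The final step of the framework is about telling such areas apart with a statistical model.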

This set of analyses uses SA3s to assess the homogeneity of a geographic area. However, the approach can be used to evaluate socioeconomic homogeneity across any specified geographical boundaries. Note that the methodology does not require access to fine geographic scale data: it requires only the distribution of the attributes for the larger area, so it is easily applied to any distribution of a categorical ordinal variable.

Our approach is founded on the general theory of probability distributions. Our aim is to provide a natural benchmark for a homogeneity measure, that is, a way to say what counts as a “high” (i.e. homogeneous) and a “low” (i.e. heterogeneous) concentration of a probability distribution. Currently, **there is no accepted benchmark that could be used to assess the homogeneity of a categorical ordinal variable**. In this work, we show how the proposed statistical indices can be used to investigate the diversity of a geographic area and to determine when the unit of analysis should not be used for reporting health outcomes by socioeconomic status.

The R code and data sets are available on my GitHub account: homogeneity-location-index

The scripts also include statistical utilities to compute:

- convolution
- autocorrelation
- Gini Index
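For readers who do not use R, here is a minimal Python sketch of what utilities like these typically compute. These are not the functions from the repository, and the names are hypothetical. "Gini Index" is ambiguous, so I am assuming here that it means the Gini heterogeneity index of a categorical distribution (one minus the sum of squared proportions) rather than the Gini coefficient of income inequality.

```python
def convolve(p, q):
    """Discrete convolution of two sequences.

    If p and q are probability distributions of independent variables on
    0..len-1, the result is the distribution of their sum.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


def autocorrelation(x, lag):
    """Sample autocorrelation of a sequence at a non-negative lag."""
    n = len(x)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    mean = sum(x) / n
    denominator = sum((value - mean) ** 2 for value in x)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    numerator = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return numerator / denominator


def gini_heterogeneity(proportions):
    """Gini heterogeneity index: 1 - sum(p_i^2); 0 when one category holds all mass."""
    return 1.0 - sum(p * p for p in proportions)


if __name__ == "__main__":
    print(convolve([0.5, 0.5], [0.5, 0.5]))
    print(autocorrelation([1, 2, 3, 4], 1))
    print(gini_heterogeneity([0.5, 0.5]))
```

The Gini heterogeneity index ignores the order of the categories, so it is best read as a nominal baseline alongside an ordinal measure.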

I hope this work will be useful to any organisation or scientific community involved in classification problems.


© 2019 Data Science Central ® Powered by
