At a conference I attended a few years ago, a data scientist in a roundtable discussion was asked what she considered the most important mathematical function in her work. Her reply: "the division operator". That clever response provided grist for my later answer to a similar question about my favorite statistical procedure: "frequencies and crosstabs". The commonality is, of course, the simplicity and ubiquity of the functions.

I spend much of my current analytics time in what used to be called exploratory data analysis (EDA) and is now just called data analysis (DA). DA sits between business intelligence and statistical modeling, using comprehensible computations and visualizations to tell data stories. Among the leading "statistics" are simple counts or frequencies, and their multivariate analogs, crosstabs or contingency tables. Actually, for me, they're all just frequencies, be they uni- or multi-attribute.
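To make the "they're all just frequencies" point concrete, here is a minimal sketch in Python using only the standard library. It is not the R function discussed later in this post; the `frequencies` helper, the field names, and the handful of records are all made up for illustration. The same function handles one attribute or several, since a crosstab is simply a count over combinations of attributes.

```python
from collections import Counter

# Hypothetical records, invented purely for illustration (not the Chicago data).
records = [
    {"type": "THEFT", "year": 2017},
    {"type": "THEFT", "year": 2018},
    {"type": "BATTERY", "year": 2017},
    {"type": "THEFT", "year": 2017},
    {"type": "ASSAULT", "year": 2018},
]


def frequencies(rows, *attrs):
    """Count rows by the combination of the given attributes.

    Returns a list of (key_tuple, count, percent) sorted by descending count.
    """
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    total = sum(counts.values())
    return [(key, n, 100.0 * n / total) for key, n in counts.most_common()]


# Uni-attribute frequencies: one key per distinct crime type.
by_type = frequencies(records, "type")

# Multi-attribute frequencies (a crosstab in long form): one key per type/year pair.
by_type_year = frequencies(records, "type", "year")
```

Keeping the result in long form (one row per combination) rather than as a wide table is a design choice that makes the uni- and multi-attribute cases look identical.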

Counts and frequencies play a foundational role in statistical analysis. In my early career, I used Poisson regression extensively. "Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables." In later years, my emphasis has been more on time series analysis, where count data such as visits, hits, defections, etc. are central.
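For readers who haven't met the model, the usual textbook form of Poisson regression ties the expected count to the predictors through a log link. The notation below is my own choice for illustration: $Y_i$ is the count for observation $i$, $x_{i1}, \dots, x_{ip}$ are its predictors, and the $\beta$s are the coefficients to be estimated.

$$
Y_i \sim \mathrm{Poisson}(\mu_i), \qquad \log \mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}
$$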

The analysis that follows is all about frequencies and was done in R using its splendid data.table and tidyverse capabilities for data analysis. I've also done similar computations in Python/pandas and am confident the work could be done as well with standard SQL. Indeed, I believe most good BI/OLAP tools can handle the demands. I know several Tableau geeks who'd say it's a piece of cake!

Why another frequencies function in R? After all, there are the table and xtabs functions from base, count from plyr, and countless others from lesser-known packages. The answer is simple: frequenciesdyn is built on data.table, a very powerful and flexible data management add-on package that performs group computations (e.g. frequencies) faster than others. It also fits nicely in tidyverse pipelines.

A data set on crime in Chicago is used for the analyses. The data, representing all reported crime in Chicago from 2001 onward, are updated daily and posted a week in arrears. Attributes revolve around the what, where, and when of crime events. The file at this point consists of over 6.5M records.

The technologies deployed below are JupyterLab running an R 3.4 kernel. The scripts are driven primarily by the R data.table and tidyverse packages. Hopefully, readers will see just how powerful these tools are in combination. Notably, neither data.table nor tidyverse is part of "core" R; each is an add-on maintained by the energetic R ecosystem.

The remainder of the blog can be found here.
