Steve Miller's Blog (46)

Multi-Dimensional Frequencies with R data.table.

A few years ago, in a Q&A session following a presentation I gave on data analysis (DA) to a group of college recruits for my then consulting company, I was asked to name what I considered the most important analytic technique. Though a surprise to the audience, my answer, counts and frequencies, was a no-brainer for…

Added by steve miller on March 11, 2020 at 10:30am — No Comments

Dataframe Storage Efficiency in Python-Pandas

Summary: It's no secret that Python-Pandas is central to data management for analytics and data science today. Indeed, what we're seeing now is Pandas being extended to handle ever-larger data. Underappreciated is that Pandas is a tunable platform, supporting its own datatypes as well as those from the numerical library NumPy. Together, these comprise…
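As a rough illustration of the kind of tuning this summary alludes to, the sketch below compares a dataframe's memory footprint before and after swapping in a Pandas categorical and a smaller NumPy integer type. The column names, row count, and dtype choices are assumptions made for this example and are not taken from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and sizes are illustrative only.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "state": rng.choice(["IL", "OH", "WI"], size=n),  # low-cardinality strings
    "age": rng.integers(0, 100, size=n),              # small non-negative ints
})

before = df.memory_usage(deep=True).sum()

# Swap in tighter dtypes: a Pandas categorical and a NumPy 8-bit integer.
slim = df.assign(
    state=df["state"].astype("category"),
    age=df["age"].astype(np.int8),  # safe here because values are in 0..99
)

after = slim.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

Downcasting is only safe when the values fit the narrower type, so checking column ranges first is advisable.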

Added by steve miller on February 18, 2020 at 4:46am — 5 Comments

Multi Gigabyte R data.table for Ohio Voter Registration/History

Summary: This blog details R data.table programming to handle multi-gigabyte data. It shows how the data can be efficiently loaded, "normalized", and counted. Readers can readily copy and enhance the code below for their own analytic needs. An intermediate level of R coding sophistication is assumed.

In my travels over the holidays, I…

Added by steve miller on January 15, 2020 at 5:29am — No Comments

Using "record id's" to facilitate processing in Python-Pandas and R-data.table.

Both R and Python-Pandas are array-oriented platforms that support fast filtering through vectors of record-ids. In Python-Pandas, such vectors are implemented via Pandas' powerful index construct; in R-data.table, they're accessible through the "which" and "row.names" functions. In both instances, joins to record-id vectors generate fast subsetted access.
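A minimal sketch of the Pandas side of this idea, assuming a small made-up dataframe whose default integer index serves as the record id. The column names and the filter are hypothetical; the post's own code may differ.

```python
import pandas as pd

# Hypothetical data; the default RangeIndex acts as the record id.
df = pd.DataFrame({
    "team": ["Bears", "Cubs", "Bulls", "Sox"],
    "wins": [12, 8, 15, 10],
})
df.index.name = "recid"

# Build a vector of record ids once from a filter condition...
ids = df.index[df["wins"] > 10]

# ...then reuse it for index-based subsetting.
subset = df.loc[ids]
print(subset)
```

Because `ids` is an ordinary index object, it can be stored and applied later to other frames that share the same record ids.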

How is the record-id vector approach helpful? For starters, the analyst can encapsulate common…

Added by steve miller on December 13, 2019 at 5:51am — No Comments

Working with Control Breaks Data in R.

Added by steve miller on November 4, 2019 at 9:04am — No Comments

AWK -- a Blast from Wrangling Past.

I recently came across an interesting account by a practical data scientist on how to munge 25 TB of data. What caught my eye at first was the article's title: "Using AWK and R to parse 25tb". I'm a big R user now and made a living with AWK 30 years ago as a budding data analyst. I also empathized with the author's recountings of…

Added by steve miller on September 21, 2019 at 5:30am — 2 Comments

Jobs, Unemployment and 45's Performance.

Despite the consuming controversy surrounding his presidency, POTUS 45 has been able to secure solid ratings on the performance of the economy over his so-far 30-month administration. And he certainly isn't bashful about taking credit for the successes, opining loudly and often that his tax cuts and de-regulation initiatives…

Added by steve miller on September 4, 2019 at 8:39am — No Comments

Using Python and R to Load Relational Database Tables, Part II

Last time I wrote on using Python/Pandas as an adjunct to loading PostgreSQL tables. In this sequel, I demo how R can be used to collaborate with the database in…

Added by steve miller on August 8, 2019 at 6:22am — No Comments

Using Python and R to Load Relational Database Tables, Part I

I enjoy data prep munging for analyses with computational platforms such as R, Python-Pandas, Julia, Apache Spark, and even relational databases. The wrangling cycle provides the opportunity to get a feel for and preliminarily explore data that are to be later analyzed/modeled.

A critical task I prefer handling in computation over database is…

Added by steve miller on July 30, 2019 at 6:34am — 1 Comment

Writing/Reading Large R dataframes/data.tables -- Addendum.

After posting my most recent blog using …

Added by steve miller on July 2, 2019 at 9:00am — No Comments

Writing/Reading Large R dataframes/datatables.

I recently downloaded a 5-year Public Use Microdata Sample (PUMS) from the latest release of the American Community Survey (ACS) census data. The data contain a wealth of demographic information on both American households and…

Added by steve miller on June 24, 2019 at 12:42pm — 1 Comment

Simulated Significance

I pulled out a dusty copy of Think Stats by Allen Downey the other day. I highly recommend this terrific little read that teaches statistics with easily understood examples using Python. When I purchased the book eight years ago, the Python code proved invaluable as…

Added by steve miller on May 30, 2019 at 7:56am — No Comments

Nowcasting Chicago Crime with Python-Pandas, and R.

In my many years as a data scientist, I've spent more time doing forecast work than any other type of predictive modeling. Often as not, the challenges have involved forecasting demand for an organization's many products/lines of business a year or more out based on five or more years of actual data, generally of daily…

Added by steve miller on May 7, 2019 at 5:34am — 1 Comment

Frequencies in Pandas Redux

A little less than a year ago, I posted a blog on generating multivariate frequencies with the Python Pandas data management library, at the same time showcasing Python/R graphics interoperability. For my…

Added by steve miller on April 25, 2019 at 5:33am — No Comments

March Madness, KenPom and Python/Pandas.

March Madness officially arrived at 6 PM CDT, Sunday 3/17/2019. 68 D1 schools -- 32 league champions and 36 at-large selections -- received invitations to this year's tournament, which starts…

Added by steve miller on March 18, 2019 at 5:35am — No Comments

A Blast from Python Past -- Part 3

Last time, I posted Part 2 of a blog trilogy on data programming with Python. That article revolved around showcasing …

Added by steve miller on March 4, 2019 at 9:01am — No Comments

A Blast from Python Past -- Part 2

Last week I posted the first of a three-part series on basic data programming with Python. For that article, I resurrected scripts written 10 years ago that deployed core Python data structures and functions to assemble a Python list for…

Added by steve miller on February 5, 2019 at 7:55am — No Comments

A Blast from Python Past

I had an interesting discussion with one of my son's friends at a neighborhood gathering over the holidays. He's just reached the halfway point of a Chicago-area Master's in Analytics program and wanted to pick my brain on the state of the discipline.

Of the four major program foci of business, data, computation, and algorithms, he acknowledged…

Added by steve miller on January 28, 2019 at 8:24am — No Comments

Kicking Chicago with R.

Like most Chicago football fans, I was pretty distraught after the Bears lost last Sunday's playoff game courtesy of a missed field goal at the end -- a kick that first hit the goalpost and then the crossbar before ultimately failing miserably. While most local fans were grief-stricken like me, some were irrationally inconsolable, demanding the…

Added by steve miller on January 11, 2019 at 8:22am — No Comments

XGBoost with Python -- Part 0

After posting my last blog, I decided next to do a 2-part series on …

Added by steve miller on December 20, 2018 at 7:30am — 2 Comments

© 2020 Data Science Central ®