
I recently came across an interesting account by a practicing data scientist on how to munge 25 TB of data. What caught my eye at first was the article's title: "Using AWK and R to parse 25tb". I'm a big R user now and made a living with AWK 30 years ago as a budding data analyst. I also empathized with the author's recounting of his painful but steady education in working with that volume of data: "I didn't fail a thousand times, I just discovered a thousand ways not to parse lots of data into an easily query-able format." Been there, done that.

After reading the article, I was intrigued with AWK again after all these years. A Unix-based munging predecessor of perl and python, AWK is particularly adept at working with delimited text files, automatically splitting each record into fields identified as $1, $2, etc. My use of AWK generally revolved around selecting columns (projecting) and rows (filtering) from text files, in turn piping the results to other scripts for additional processing. I found that AWK did these simple tasks very well but didn't scale for more demanding data programming -- remembering well that trouble lurked when I attempted to contort AWK to do something it wasn't intended to do. And indeed, I pretty much abandoned AWK when the more comprehensive perl emerged in the late 80s. In retrospect, I'm not sure that was the best course. Optimal might have been to continue using AWK for the simpler file projection and filtering work, saving perl (and then python) for more complex tasks.
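That project-and-filter pattern is easy to sketch. A minimal example on made-up inline data (toy column names, not the census files used later), assuming any POSIX awk:

```shell
# Project columns 1 and 3 and filter rows where column 2 exceeds 20,
# skipping the header record (NR>1) -- hypothetical sample data.
printf 'id,age,income\n1,25,50000\n2,17,0\n3,40,72000\n' |
awk -F, 'BEGIN{OFS=","} NR>1 && $2>20 {print $1,$3}'
# prints:
# 1,50000
# 3,72000
```

The `-F,` flag sets the input field separator and `OFS` the output separator; the pattern before the braces filters rows, and the `print` projects columns.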

So I just had to reacquaint myself with AWK, and downloaded the GNU version, gawk. I then divined several quick tasks on a pretty large data source to test the language. The data for analyses consist of 4 large census files totaling over 14 GB which, in sum, comprise 15.8M records and 286 attributes. I use AWK to project/filter the input data, then pipe the results to python or R for analytic processing. AWK does some pretty heavy albeit simple lifting. In my tests, both R and python/pandas could have handled AWK's tasks as well, but it's not hard to imagine a pipeline that requires upstream projection/filtering.

Unlike other blogs I've written using Jupyter Notebook, this one does not execute in a python or R kernel; rather the notebook simply displays the AWK, python, and R scripts and their outputs.

AWK script to match attribute number with column header name.

dict.awk

gawk -F, 'BEGIN{OFS=","} NR==1 {for (i=1; i<=NF; i++) print i, $i; exit}' psam_pusa.csv

Command line to run dict.awk (and some of the output).

./dict.awk 2>/dev/null | head

1,RT
2,SERIALNO
3,DIVISION
4,SPORDER
5,PUMA
6,REGION
7,ST
8,ADJINC
9,PWGTP
10,AGEP
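The header-enumeration idea is easy to verify on a toy header line (made-up field names, assuming any POSIX awk):

```shell
# Print position,name for each field of the first record, then stop.
printf 'RT,SERIALNO,DIVISION\nP,100,6\n' |
awk -F, 'BEGIN{OFS=","} NR==1 {for (i=1; i<=NF; i++) print i, $i; exit}'
# prints:
# 1,RT
# 2,SERIALNO
# 3,DIVISION
```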

Basic AWK script to filter and project pertinent data from the four pums files. The pattern (NR==1 || FNR!=1) keeps the header row of the first file while dropping the headers of the remaining three; the filter retains records with AGEP (field 10) of at least 18 and PINCP (field 104) above 1000. (The header row itself slips through the numeric filter because AWK compares its non-numeric values as strings.)

pums.awk

gawk -F, 'BEGIN{OFS=","} (NR==1 || FNR!=1) && ($10>=18 && $104>1000) \
{print $2,$3,$5,$6,$7,$8,$10,$39,$67,$69,$89,$104,$112}' psam_pusa.csv psam_pusb.csv psam_pusc.csv psam_pusd.csv
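The `(NR==1 || FNR!=1)` idiom works because NR counts records across all input files while FNR restarts at 1 for each file. A small sketch with two throwaway files (hypothetical names and data):

```shell
# Two tiny CSVs with identical headers -- temporary files for illustration.
printf 'h1,h2\na,1\n' > /tmp/f1.csv
printf 'h1,h2\nb,2\n' > /tmp/f2.csv
# NR==1 matches only the first file's header; FNR!=1 matches any data record.
awk -F, '(NR==1 || FNR!=1)' /tmp/f1.csv /tmp/f2.csv
# prints:
# h1,h2
# a,1
# b,2
rm /tmp/f1.csv /tmp/f2.csv
```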

Next, a python script that consumes/processes the data piped from awk. The script creates a dataframe from the input data and writes a high-performance feather file that interoperates between python and R.

pums.py

import sys, pandas as pd, numpy as np, feather

import warnings
warnings.filterwarnings('ignore')

# directory of personal helper functions
functdir = "c:/steve/jupyter/notebooks/functions"
sys.path.append(functdir)

from newmyfuncs import *

fthname = "pumspy.feather"

# read the projected/filtered CSV piped in from pums.awk
pums = pd.read_csv(sys.stdin, header=0)
pums.columns = [c.lower() for c in pums.columns]

print(pums.shape, "\n")
print(pums.columns, "\n")
print(pums.dtypes, "\n")

# persist as a feather file readable from both python and R
feather.write_dataframe(pums, fthname)

# personal helper that displays the dataframe's head and tail
prarr(pums)

time ./pums.awk 2>/dev/null | python pums.py

(10925570, 13)

Index(['serialno', 'division', 'puma', 'region', 'st', 'adjinc', 'agep', 'mar',
'schl', 'sex', 'hisp', 'pincp', 'rac1p'],
dtype='object')

serialno int64
division int64
puma int64
region int64
st int64
adjinc int64
agep int64
mar int64
schl int64
sex int64
hisp int64
pincp int64
rac1p int64
dtype: object

        serialno  division  puma  region  st  ...  schl  sex  hisp  pincp  rac1p
0  2013000000154         6  2500       3   1  ...    20    2     1  52000      2
1  2013000000154         6  2500       3   1  ...    16    1     1  99000      2
2  2013000000156         6  1700       3   1  ...    21    2     1  39930      2
3  2013000000160         6  2200       3   1  ...    14    2     1  10300      3
4  2013000000160         6  2200       3   1  ...    16    1     1   1100      3

[5 rows x 13 columns]

               serialno  division  puma  region  ...  sex  hisp  pincp  rac1p
10925565  2017001533765         8   300       4  ...    1     1  29800      1
10925566  2017001534830         8   200       4  ...    2     1   9000      1
10925567  2017001534957         8   300       4  ...    1    24   4000      1
10925568  2017001535265         8   200       4  ...    2     1   5000      9
10925569  2017001536156         8   400       4  ...    1     1   2400      1

[5 rows x 13 columns]

real 1m9.569s
user 1m1.576s
sys 0m2.359s

Now a similar exercise with R. The same pums.awk feeds the R script.

pums.r

suppressMessages(library(readr))
suppressMessages(library(data.table))
suppressMessages(library(fst))
suppressMessages(library(feather))

wd <- "c:/bigdata/steve/awk"
setwd(wd)

fstnme <- "pumsfst.fst"
fthnme <- "pumsfth.feather"

# read the projected/filtered CSV piped in from pums.awk via stdin
f <- file("stdin")
open(f)
pums <- data.table(read_csv(readLines(f,n=-1)))

setnames(pums,tolower(names(pums)))

print(dim(pums))
print(str(pums))
head(pums)
tail(pums)

# persist in both fst and feather formats
write_fst(pums,fstnme)
write_feather(pums, fthnme)

Command line to run pums.awk and pipe its output to pums.r, output included. Elapsed time is just under two minutes, which includes writing both the feather and fst files. Note that the R readr loader makes different data type choices than pandas -- zero-padded columns such as puma, st, and pincp load as character, preserving their leading zeros, where pandas read them as int64.

time ./pums.awk 2>/dev/null | Rscript pums.r

[1] 10925570       13
Classes 'data.table' and 'data.frame':  10925570 obs. of  13 variables:
 $ serialno: num  2.01e+12 2.01e+12 2.01e+12 2.01e+12 2.01e+12 ...
 $ division: num  6 6 6 6 6 6 6 6 6 6 ...
 $ puma    : chr  "02500" "02500" "01700" "02200" ...
 $ region  : num  3 3 3 3 3 3 3 3 3 3 ...
 $ st      : chr  "01" "01" "01" "01" ...
 $ adjinc  : num  1061971 1061971 1061971 1061971 1061971 ...
 $ agep    : num  55 56 63 61 20 78 81 59 56 70 ...
 $ mar     : num  1 1 3 4 5 2 2 1 1 1 ...
 $ schl    : chr  "20" "16" "21" "14" ...
 $ sex     : num  2 1 2 2 1 2 2 1 2 1 ...
 $ hisp    : chr  "01" "01" "01" "01" ...
 $ pincp   : chr  "000052000" "000099000" "000039930" "000010300" ...
 $ rac1p   : num  2 2 2 3 3 2 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>

       serialno division  puma region st  adjinc agep mar schl sex hisp     pincp rac1p
1: 2.013000e+12        6 02500      3 01 1061971   55   1   20   2   01 000052000     2
2: 2.013000e+12        6 02500      3 01 1061971   56   1   16   1   01 000099000     2
3: 2.013000e+12        6 01700      3 01 1061971   63   3   21   2   01 000039930     2
4: 2.013000e+12        6 02200      3 01 1061971   61   4   14   2   01 000010300     3
5: 2.013000e+12        6 02200      3 01 1061971   20   5   16   1   01 000001100     3
6: 2.013000e+12        6 02400      3 01 1061971   78   2   01   2   01 000003900     2

       serialno division  puma region st  adjinc agep mar schl sex hisp     pincp rac1p
1: 2.017002e+12        8 00500      4 56 1011189   51   3   20   1   01 000035300     1
2: 2.017002e+12        8 00300      4 56 1011189   19   5   18   1   01 000029800     1
3: 2.017002e+12        8 00200      4 56 1011189   20   5   19   2   01 000009000     1
4: 2.017002e+12        8 00300      4 56 1011189   18   5   16   1   24 000004000     1
5: 2.017002e+12        8 00200      4 56 1011189   31   5   18   2   01 000005000     9
6: 2.017002e+12        8 00400      4 56 1011189   60   3   19   1   01 000002400     1

real 1m53.440s
user 1m0.655s
sys 0m2.779s

Finally, a more sophisticated pums.awk that uses an associative array to look up fields by column name rather than by position. It works in the command lines above with both pums.py and pums.r.

pums.awk

gawk -F, 'BEGIN{OFS=","} NR==1 {split($0,dict,",");for (d in dict) newdict[dict[d]] = d} \
(NR==1 || FNR!=1) && ($newdict["AGEP"]>=18 && $newdict["PINCP"]>1000) \
{print $newdict["SERIALNO"],$newdict["DIVISION"],$newdict["REGION"],$newdict["ST"],$newdict["PUMA"],$newdict["AGEP"],$newdict["HICOV"],\
$newdict["MAR"],$newdict["RAC1P"],$newdict["HISP"],$newdict["SEX"],$newdict["SCHL"],$newdict["PINCP"],$newdict["ADJINC"]}' psam_pusa.csv psam_pusb.csv psam_pusc.csv psam_pusd.csv
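The trick here is that AWK's `$` operator accepts any expression, so `$newdict["AGEP"]` resolves to whichever column AGEP occupies. A self-contained sketch with hypothetical field names:

```shell
# Build a name->position map from the header, then reference fields by name.
printf 'NAME,AGEP,PINCP\nann,34,52000\nbob,17,0\n' |
awk -F, 'NR==1 {for (i=1; i<=NF; i++) pos[$i]=i; next}
         $pos["AGEP"]>=18 {print $pos["NAME"], $pos["PINCP"]}'
# prints:
# ann 52000
```

Because fields are referenced by name, the script keeps working even if the files' column order changes between releases.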

Fun. My take is that dusty AWK can indeed serve an important role in 2019 data pipelines -- provided that it's used simply and to its projecting/filtering strengths. That's it for now. Off to Oktoberfest. Happy wrangling!

The technology used here is Windows 10, JupyterLab 0.35.4, Anaconda Python 3.7.3, Pandas 0.24.2, R 3.6.0, Cygwin 3.0.7, and GNU Awk (gawk) 5.0.1. All gawk, python, and R scripts are simply components in pipelines generated from bash shell command lines in Cygwin windows.



Comment by John L. Ries on September 23, 2019 at 9:20pm

I never stopped using AWK.  If patterns work, then I use it and it works nicely.  If I need more complex text processing, I use Perl or (more recently) Python.

I use sed too, but only for the simplest text processing.

UNIX toys are my friends.

Comment by Vincent Granville on September 21, 2019 at 9:04am

Wow, that brings back old memories, like SED (stream editor).
