I recently came across an interesting account by a practical data scientist on how to munge 25 TB of data. What caught my eye at first was the article's title: "Using AWK and R to parse 25tb". I'm a big R user now and made a living with AWK 30 years ago as a budding data analyst. I also empathized with the author's recounting of his painful but steady education in working with that volume of data: "I didn’t fail a thousand times, I just discovered a thousand ways not to parse lots of data into an easily query-able format." Been there, done that.

After reading the article, I was again intrigued by AWK after all these years. A Unix-based munging predecessor of perl and python, AWK's particularly adept at working with delimited text files, automatically splitting each record into fields identified as $1, $2, etc. My use of AWK generally revolved around selecting columns (projecting) and rows (filtering) from text files, in turn piping the results to other scripts for additional processing. I found that AWK did these simple tasks very well but didn't scale for more demanding data programming -- remembering well that trouble lurked when I attempted to contort AWK into doing something it wasn't intended to do. And indeed, I pretty much abandoned AWK when the more comprehensive perl emerged in the late 80s. In retrospect, I'm not sure that was the best course. Optimal might have been to continue using AWK for the simpler file projection and filter work, saving perl (and then python) for more complex tasks.

So I just had to reacquaint myself with AWK, and downloaded the GNU version, gawk. I then divined several quick tasks on a pretty large data source to test the language. The data for the analyses consist of 4 large files of census information totaling over 14 GB which, in sum, comprise 15.8M records and 286 attributes. I use AWK to project/filter the input data, and then pipe the results to python or R for analytic processing. AWK does some pretty heavy albeit simple processing. In my tests, both R and python/pandas could have handled AWK's tasks as well, but it's not hard to imagine a pipeline that requires such pre-projection/filtering.

Unlike other blogs I've written using Jupyter Notebook, this one does not execute in a python or R kernel; rather, the notebook simply displays the AWK, python, and R scripts and their outputs.

The technology used below is Windows 10, JupyterLab 0.35.4, Anaconda Python 3.7.3, Pandas 0.24.2, R 3.6.0, Cygwin 3.0.7, and GNU Awk (gawk) 5.0.1. All gawk, python, and R scripts are simply components in pipelines generated from bash shell command lines in Cygwin windows.

Read the entire blog here.
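To make the project/filter idiom described above concrete, here is a minimal sketch of such a pipeline in the spirit of the blog's bash command lines. The file name census.csv, the field positions, the filter value "06", and the downstream script analyze.py are hypothetical placeholders, not details taken from the blog:

    gawk 'BEGIN { FS = "," } $7 == "06" { print $1 "," $12 }' census.csv | python analyze.py

Here gawk splits each comma-delimited record into fields, keeps only rows whose seventh field equals "06" (the filter), prints just the first and twelfth columns (the projection), and pipes the reduced stream to a python script for the analytic processing.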

The explanation of Logistic Regression as a Generalized Linear Model, and its use as a classifier, is often confusing. In this article, I try to explain this idea from first principles. This blog is part of my forthcoming book on the Mathematical foundations of Data Science. If you are interested in knowing more, please follow me on LinkedIn: Ajit Jaokar.

We take the following approach:

- We first see briefly how linear regression works.
- We then explore the assumptions and limitations of linear regression.
- Following this, we show how these limitations can be overcome by the Generalized Linear Model (GLM).
- Finally, we explore Logistic Regression as a GLM.

Explanation of Linear Regression

Machine learning involves creating a model of a process. To create a model of a process, we need to identify patterns in data. Broadly, patterns in data can be of two types: the signal (the data generating process) and the variation (the error generating process). The simplest model to start with is the Linear Regression model. Linear models have some advantages: for example, they are relatively simple to implement, and many phenomena can be modelled using linear regression.

Assumptions of Linear Regression

Linear regression has the following requirements (assumptions for use):

- As per the name, linear regression needs the relationship between the independent and dependent variables to be linear.
- Linear regression analysis requires all variables to follow a multivariate normal distribution.
- No multicollinearity in the data. Multicollinearity occurs when the independent variables are highly correlated with each other.
- Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation is the correlation of a signal with a delayed copy of itself as a function of delay.
- Homoscedasticity (the residuals are equal across the regression line).

Overcoming the requirement that the dependent (response) variable is normally distributed

The requirement that the response variable is normally distributed excludes many cases, such as:

- where the response variable is expected to be always positive and varying over a wide range, or
- where constant input changes lead to geometrically varying, rather than constantly varying, output changes.

We can illustrate these using examples. Suppose we have a model which predicts that a 10 degree temperature decrease would lead to 1,000 fewer people visiting the beach. This model does not hold across both small and large beaches. (Here, we could consider a small beach as one where expected attendance is 50 people and a large beach as one where expected attendance is 10,000.) For the small beach (50 people), the model implies that -950 people would attend the beach, which is obviously not correct. The model would also not work where the output is bounded on both sides, for example in the case of a yes/no choice. This is represented by a Bernoulli variable, where the probabilities are bounded on both ends (they must be between 0 and 1). If our model predicted that every 10 degree rise in temperature makes a person twice as likely to go to the beach, the model breaks down as temperatures keep increasing, because a probability cannot keep doubling.
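To make the contrast with GLMs concrete, the standard linear regression model described above can be written out explicitly:

    y = β0 + β1x1 + … + βpxp + ε, where ε is normally distributed with mean 0 and variance σ²

so that the expected response E(y) = β0 + β1x1 + … + βpxp itself varies linearly with the predictors. This linear, normally distributed, unbounded form is exactly what the beach examples above violate.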
Generalised linear models (GLMs) cater to these situations by allowing for response variables that have arbitrary distributions (other than only normal distributions), and by using an arbitrary function of the response variable (called the link function) to vary linearly with the predicted values (rather than assuming that the response itself must vary linearly with the predictors). Thus, in a generalised linear model, each outcome Y of the dependent variables is assumed to be generated from the exponential family of distributions (which includes the normal, binomial, Poisson and gamma distributions, among others). GLM thus expands the scenarios in which linear regression can apply by expanding the possibilities for the outcome variable. GLM uses maximum likelihood estimation of the model parameters for the exponential family; for normal linear models this reduces to least squares. (Note: this section is adapted from Wikipedia.)

Logistic Regression as GLM

To understand how logistic regression can be seen as a GLM, we can elaborate this approach as follows. Logistic regression measures the relationship between the dependent variable and one or more independent variables (features) by estimating probabilities using the underlying logit function. In statistics, the logit function, or log-odds, is the logarithm of the odds. Given a probability p, the corresponding odds are calculated as p / (1 – p), and the logit function is the logarithm of the odds: logit(p) = log(p / (1 – p)). The odds describe the ratio of the probability of success to the probability of failure. The odds ratio is the ratio of two odds, calculated as the ratio of the odds for each group.

The inverse of the logit function is the sigmoid function. The formula for the sigmoid function is σ(x) = 1/(1 + exp(-x)). The sigmoid function maps any real number (such as a log-odds value) to a probability in the range (0, 1) – and this is what makes logistic regression usable as a classifier. Thus, many models have data generating processes that can be linearized by considering the inverse function: applying the logit to the predicted probability recovers a quantity that is linear in the predictors. The logit and the sigmoid functions are useful in analysis because their gradients are simple to calculate. Many optimization and machine learning techniques make use of gradients (for example, in neural networks). The biggest drawback of the sigmoid function for many analytics practitioners is the so-called “vanishing gradient” problem.
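In GLM notation, the link function g relates the mean of the response to the linear predictor, g(E(Y)) = β0 + β1x1 + … + βpxp, and logistic regression is the special case where g is the logit. To tie together the logit, the sigmoid, and the gradient property mentioned above, the relationships can be written out explicitly:

    logit(p) = log(p / (1 – p)) = x is equivalent to p = σ(x) = 1/(1 + exp(-x))

    σ'(x) = σ(x)(1 – σ(x))

The gradient is largest at x = 0, where σ'(0) = 0.25, and shrinks towards 0 as x moves away from 0 in either direction; this flattening at the extremes is the “vanishing gradient” problem referred to above.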