Ok, I'm turning to the expertise on this forum to sanity check what I believe I've chanced upon.

To put us on the same page: the goal of logistic regression is to learn a vector of real numbers that serve as coefficients for a linear combination of the feature values (plus a constant intercept); that linear combination is then passed through the logistic (sigmoid) function to yield a probability.

With that in mind, I'm going to present a learning question, a simplified data set, and a naive approach to engineering a feature for training. I'll then explain why, based on how the algorithm works, I think this feature engineering choice is a bad fit for logistic regression, and what approach I think is better suited.

I'm brand new to ML/Stats and I'm hoping that someone with more experience/insight can validate where I am on track and point out where I am off track, so here goes:

I have a set of outbound phone calls, each labeled 'picked up' or 'did not pick up', i.e. {1, 0} respectively.

I have a timestamp for when the call was made. Let's assume all callees are in the same timezone to simplify everything. 

Learning Question: how does the hour of day of the call affect the pick-up probability P(y = 1 | x), where x = hour of day?

Here's the naive approach to engineering the timestamp feature: 

Transform the timestamp into the hour of the day [1,24], and train the model with it.
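A minimal sketch of what this naive approach looks like in code, assuming scikit-learn (the hours and labels here are invented for illustration):

```python
# Naive approach: feed the raw hour value to logistic regression as a
# single numeric feature. Data below is a made-up toy example.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([8, 9, 12, 14, 20, 9, 8, 12])   # hour of day per call
picked_up = np.array([1, 0, 1, 0, 0, 0, 1, 1])   # 1 = picked up

X = hours.reshape(-1, 1)   # one column: the raw hour value
model = LogisticRegression().fit(X, picked_up)

# The model learns exactly one coefficient for "hour", so the predicted
# probability can only rise or fall steadily as the hour increases.
print(model.coef_, model.intercept_)
```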

Here's where I think this is a very bad way to engineer the hour of day feature:

Since all the algorithm will do is find a single optimal coefficient for hour of day, it has no choice but to treat the log-odds of a pick-up as a linear function of the hour, and therefore the pick-up probability as a monotonic one. The coefficient can reward either higher or lower values, but the built-in predictive assumption is that the probability of a pick-up rises or falls steadily with the integer value of the hour.

Thus this feature engineering approach means the model cannot capture a non-monotonic relationship at all, e.g. calls made at hour 8 having a relatively high pick-up rate, calls at 9 a low one, calls at 12 a high one again, and so forth...
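The monotonicity claim can be checked directly: with a single coefficient on the raw hour, the sigmoid of a linear function either increases across all 24 hours or decreases across all of them (the coefficient and intercept below are hypothetical values, not fitted):

```python
# With one coefficient w on the raw hour, predicted probability is
# sigmoid(w * hour + b), which is monotonic in hour for any fixed w.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 0.15, -2.0            # hypothetical coefficient and intercept
hours = np.arange(0, 24)
probs = sigmoid(w * hours + b)

# Strictly increasing for w > 0: the model cannot express
# "high at 8, low at 9, high again at 12".
assert np.all(np.diff(probs) > 0)
```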

I don't think this changes when there are more features, because at the end of the day the algorithm is limited to defining a scalar to represent the relationship between any given feature and the target label. 

QUESTION: Does this make sense? Is this analysis correct, incorrect, or something in between?

A better approach (I'd like to say even a good approach, please tell me what you think) is to one-hot encode the hour of day, so that each hour becomes an individual feature with its own fitted coefficient.

QUESTION: Does this make sense? Is it a better approach, a good approach, or something in between?

Tags: feature-engineering, logistic-regression


Replies to This Discussion

Hi Naftali, 

what I would suggest is to flatten the hour of day down into separate columns and mark each as 1 or 0 depending on whether the targeted predicate holds for that call.

best regards,


My immediate reaction was "Don't one-hot encode. That destroys the ordinal information in the variable." In principle, one should not throw information away. But you're right: it's unlikely that the response function would be linear. So go ahead, but explore a range of encodings, e.g. rather than one-hot encoding hours 1:24, code them into 3- or 4-hour chunks, or into meaningful ranges like before work/school, business hours, etc.
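One way to explore the coarser encodings suggested above, using pandas (the bin edges and range names here are invented examples, not a recommendation):

```python
# Bin hours of day into meaningful ranges, then one-hot encode the bins.
# Each bin gets one 0/1 column and hence one coefficient in the model.
import pandas as pd

hours = pd.Series([7, 9, 12, 18, 22, 2])
bins = [0, 6, 9, 17, 21, 24]    # hypothetical edges: [0,6), [6,9), ...
labels = ["night", "before work", "business hours", "evening", "late"]

chunk = pd.cut(hours, bins=bins, labels=labels, right=False)
X = pd.get_dummies(chunk)       # one 0/1 column per time-of-day chunk
print(X.columns.tolist())
```

This keeps the one-hot structure (so the response need not be monotonic in the hour) while using far fewer columns than 24 separate hour indicators.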

Thanks for sharing


© 2021 TechTarget, Inc.