To put us on the same page: the goal of a logistic regression is to determine a vector of real numbers that serve as coefficients in a linear combination of the feature vector values (plus a constant). That linear combination is then passed through the logistic (sigmoid) function to produce a probability.
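As a minimal sketch of that setup (the function names, coefficients, and feature values below are made up for illustration, not taken from any real model), the linear combination plus a constant is squashed into a probability like this:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights: list[float], bias: float, features: list[float]) -> float:
    """P(y = 1 | x) as sigmoid of the linear combination w . x + b."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients and feature vector, for illustration only.
p = predict_proba(weights=[0.5, -1.0], bias=0.0, features=[2.0, 1.0])
print(p)
```

Training amounts to choosing `weights` and `bias` so that these probabilities match the labels as well as possible.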

That said, I'm now going to present a learning question, a simplified data set, and a naive approach to engineering a feature for the training. I'm then going to point out why, based on the algorithm, I think this feature engineering choice is a bad fit for this algorithm and what approach I think is better suited.

I'm brand new to ML/Stats and I'm hoping that someone with more experience/insight can validate where I am on track and point out where I am off track, so here goes:

I have a set of made phone calls, labeled as 'picked up' and 'did not pick up', or `{1,0}`, respectively.

I have a timestamp for when the call was made. Let's assume all callees are in the same timezone to simplify everything.

Learning Question: how does the hour of day of the call affect the pick up rate for the call, `P( y = 1 | x)`, where `x = hour of day`?

Here's the naive approach to engineering the timestamp feature:

Transform the timestamp into the hour of the day, `[1,24]`, and train the model with it.
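For concreteness, here's a rough sketch of that transformation. The function name `hour_of_day`, the ISO-8601 timestamp format, and the choice to map midnight's hour to 1 (so the range is 1 to 24) are all my assumptions for illustration:

```python
from datetime import datetime

def hour_of_day(timestamp: str) -> int:
    """Return the hour of day in [1, 24] for an ISO-8601 timestamp string."""
    hour = datetime.fromisoformat(timestamp).hour  # standard library gives 0..23
    return hour + 1  # assumption: hour 1 covers 00:00-00:59, hour 24 covers 23:00-23:59

# Hypothetical call timestamps, all assumed to be in the same timezone.
calls = ["2024-01-15 08:30:00", "2024-01-15 23:59:59"]
hours = [hour_of_day(ts) for ts in calls]
print(hours)
```

The resulting integer is then fed to the model as a single numeric feature.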

Here's where I think this is a very bad way to engineer the hour of day feature:

Since all the algorithm will do is find the optimal coefficient for hour of day, it has no choice but to see pick up rate (more or less, depending on feature cardinality) as a linear function of hour of day (strictly, it is the log-odds that are linear, so the probability itself can only be monotonic). Meaning that the coefficient can be set to reward either a higher value or a lower value, with the predictive assumption being that the probability of a pick up will steadily rise or steadily fall with the integer value of the hour of day.

Thus this feature engineering approach makes it so that the model cannot speak at all to a pick up pattern that is not linear (or at least monotonic) in the hour, like say when calls made at hour 8 have a relatively high pick up rate, but calls made at 9 do not, but those made at 12 do, and so forth...
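To illustrate the limitation, here's a small sketch with made-up numbers (the coefficients `0.2` and `-0.2`, the intercepts, and the target pick up rates are all hypothetical). Whatever sign the single coefficient takes, the predicted probability only ever moves in one direction across the hours, so it cannot reproduce an up-down-up pattern:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def is_monotonic(values: list[float]) -> bool:
    """True if the sequence never changes direction (all rising or all falling)."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

hours = list(range(1, 25))

# One coefficient for the integer hour: try a positive and a negative value.
curves = {}
for coef in (0.2, -0.2):
    intercept = -coef * 12  # hypothetical intercept, centres the curve near noon
    curves[coef] = [sigmoid(intercept + coef * h) for h in hours]

# Hypothetical pick up rates of the kind described: high at 8, low at 9, high at 12.
target = {8: 0.8, 9: 0.2, 12: 0.7}
target_values = [target[h] for h in sorted(target)]

print(is_monotonic(curves[0.2]), is_monotonic(curves[-0.2]), is_monotonic(target_values))
```

Both fitted-style curves are monotonic while the target pattern is not, which is the mismatch in question.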

I don't think this changes when there are more features, because at the end of the day the algorithm is limited to defining a scalar to represent the relationship between any given feature and the target label.

QUESTION: Does this make sense? Is this analysis correct, incorrect, or something in between?

A better approach (I'd like to say even a good approach, please tell me what you think) is to one-hot encode the hour of day, such that each hour of day becomes an individual feature with its own fine-tuned coefficient.
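Here's a minimal sketch of what I mean by that encoding. The helper names `one_hot_hour` and `predict_proba` and the coefficient values are hypothetical, chosen only to show that each hour can get its own probability:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def one_hot_hour(hour: int, n_hours: int = 24) -> list[int]:
    """Encode an hour in [1, n_hours] as a vector with a single 1."""
    if not 1 <= hour <= n_hours:
        raise ValueError(f"hour must be in [1, {n_hours}], got {hour}")
    vec = [0] * n_hours
    vec[hour - 1] = 1
    return vec

def predict_proba(coefs: list[float], intercept: float, hour: int) -> float:
    """With one-hot input, only the coefficient for this hour contributes."""
    x = one_hot_hour(hour, len(coefs))
    z = intercept + sum(c * xi for c, xi in zip(coefs, x))
    return sigmoid(z)

# Hypothetical per-hour coefficients: high at 8, low at 9, high again at 12.
coefs = [0.0] * 24
coefs[8 - 1] = 1.5
coefs[9 - 1] = -1.5
coefs[12 - 1] = 1.0

p8, p9, p12 = (predict_proba(coefs, 0.0, h) for h in (8, 9, 12))
print(p8, p9, p12)
```

Because each hour has its own coefficient, the predicted pick up rate is free to go up and down from one hour to the next.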

QUESTION: Does this make sense? Is it a better approach? A good approach, or not, or something in between?