Is there a way to state and statistically test a hypothesis that requires a minimum performance of a predictive analytic model, such as requiring at least one result has a confidence level of at least 75%?
- - Null hypothesis: "A predictive model utilizing logistic regression cannot predict at least one customer will churn in 90 days, with this individual prediction being at a minimum of 75% confidence, using the selected set of independent variables."
In other words, is there a way to test that the 75% confidence (probability) of a prediction satisfies p < 0.05?
Or is the following the only way to do it, so that the minimum performance of the model has to be left to the discussion section?
- - Null hypothesis: There is no significant relationship between a full model utilizing logistic regression created with the selected independent variables and customer churn.
- - Alternative hypothesis: There is a significant relationship between a full model utilizing logistic regression created with the selected independent variables and customer churn.
I'm not sure what you mean by an "individual prediction". Statistics is about populations.
There are several parts to your question. Given a model and a test set, you can work out its false positive rate on the test set. On the assumption that the test set is a random sample from the population of future queries, you can work out error bars on the false positive rate for future queries by standard statistical methods. If that's your question, then the answer is yes.
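As a rough illustration of the "error bars" point, here is a minimal sketch that puts a Wilson score interval around a false positive rate measured on a test set. The Wilson interval is just one standard choice for a binomial proportion, and the counts (12 false positives among 200 actual non-churners) are made up for the example. The function name `wilson_interval` is my own.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion errors/n.

    z=1.96 corresponds to an approximate 95% two-sided interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = errors / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical counts: 12 false positives among 200 true non-churners.
false_positives, negatives = 12, 200
low, high = wilson_interval(false_positives, negatives)
print(f"FPR = {false_positives / negatives:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The interval is only meaningful under the assumption already stated: the test set has to be a random sample of the cases the model will see in future.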
Secondly, given your training set and a choice of method (e.g. logistic regression), can you show that the model you have made is the best possible? With some methods this is hard. But fitting a logistic regression is a convex optimization problem with a well-defined algorithmic solution, and it has few metaparameters (only regularization), so this should be possible.
Thirdly, if these results are unsatisfactory, is it better to try other methods or to collect more training data? That's more of a discussion. You can build models using subsets of the training data of different sizes, which will give some evidence about how the training set size is influencing your results.
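The subset idea can be sketched as a simple learning curve. Everything here is an assumption for illustration: the data are simulated with one feature, the tiny gradient-descent fit stands in for whatever logistic regression routine you actually use, and the subset sizes are arbitrary.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lr=0.5, epochs=300):
    """Fit intercept b0 and slope b1 by batch gradient descent on the log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def accuracy(params, xs, ys):
    b0, b1 = params
    hits = sum((sigmoid(b0 + b1 * x) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def simulate(n, rng):
    """Hypothetical data: churn probability is sigmoid(-0.5 + 2x)."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(-0.5 + 2 * x) else 0 for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = simulate(2000, rng)
test_x, test_y = simulate(2000, rng)

# Refit on growing prefixes of the training set, always scoring on the same test set.
learning_curve = []
for size in (20, 50, 200, 1000, 2000):
    params = fit_logistic(train_x[:size], train_y[:size])
    learning_curve.append((size, accuracy(params, test_x, test_y)))

for size, acc in learning_curve:
    print(f"n_train={size:5d}  test accuracy={acc:.3f}")
```

If the test score has flattened out well before the full training set is used, more data of the same kind is unlikely to help much, which points towards trying other methods or other variables.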
It can also be useful to make some simulated data, and show how well the chosen method works with it. In your case, this would be some data that conforms to the linearity assumption implicit in logistic regression (that the log-odds of churn are linear in the predictors) and some that violates it.
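A minimal sketch of such a simulation follows. Both data-generating functions are invented for the example: in one the log-odds are linear in a single feature `x`, in the other they depend on `x` squared, so a model that is linear in `x` cannot capture the pattern. The small gradient-descent fit is again only a stand-in for a real logistic regression routine.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lr=0.5, epochs=300):
    """Fit intercept and slope by batch gradient descent on the log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def simulate(logit_fn, n, rng):
    """Draw x ~ N(0, 1) and a binary outcome with log-odds logit_fn(x)."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(logit_fn(x)) else 0 for x in xs]
    return xs, ys

def held_out_accuracy(logit_fn, rng, n=2000):
    """Fit on one simulated sample and score on a fresh one from the same process."""
    train_x, train_y = simulate(logit_fn, n, rng)
    test_x, test_y = simulate(logit_fn, n, rng)
    b0, b1 = fit_logistic(train_x, train_y)
    hits = sum((sigmoid(b0 + b1 * x) >= 0.5) == (y == 1)
               for x, y in zip(test_x, test_y))
    return hits / n

rng = random.Random(1)
acc_linear = held_out_accuracy(lambda x: 3.0 * x, rng)               # assumption holds
acc_nonlinear = held_out_accuracy(lambda x: 2.0 - 3.0 * x * x, rng)  # assumption violated

print(f"log-odds linear in x:     accuracy = {acc_linear:.3f}")
print(f"log-odds quadratic in x:  accuracy = {acc_nonlinear:.3f}")
```

Comparing the two scores shows how much of any shortfall on the real data could be down to the model form rather than to the amount of data or the choice of variables.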