This is not very simple to choose a machine learning method and letting it go wild on the data. Particularly, understanding the core business problem and objective of the outcome and frame accordingly is one of the vital factors in machine learning. A general approach is difficult to recommend without intimate knowledge of the data. However, it sounds like we need to formalize the aspects of your model. Following questions may help to decide on machine learning problem or otherwise:
Machine learning comprises of supervised, unsupervised, semi- supervised and reinforced learning. Articulation of the Problem is very important. Whether it is a classification or a number problem or a just a clustering problem to solve. There are several subtypes of classification and regression.
Classification: How many categories to pick?
In classification, data is labelled meaning it is assigned a class, for example prepaid/postpaid or website visitors/non-website visitors. The decision being modelled is to assign labels to new unlabeled pieces of data.
Regression: How many numbers are the outputs,
Data is labelled with a real value (think floating point) rather than a label. Time series data like the price of a stock over time, the decision being modelled is what value to predict for new unpredicted data.
Clustering: Data is not labelled but can be divided into groups based on similarity and other measures of natural structure in the data. An example would be organizing customers by usage without names/identify, where the customer/user assign names to groups like Digital Natives, iPhoto on the Mac, Campaign hunters etc.
Rule Extraction: Data is used as the basis for the extraction of propositional rules (if-then). Such rules may, but are typically not directed, meaning that the methods discover statistically supportable relationships between attributes in the data, not necessarily involving something that is being predicted. An example is the discovery of the relationship between the purchase of air ticket and hotel booking (this is data mining folk-law, true or not, it’s illustrative of the desire and opportunity).
Once articulation problem has been solved, now it’s time to Identify Data Sources and provide answers to the following questions about the labels:
Identifying the data that the ML system should use to make predictions. Each row constitutes one piece of data for which one prediction is made. Only include information that is available at the moment of the prediction is made. Pick 1-3 inputs that are easy to obtain and that would produce be a reasonable, initial outcome.
Ability to Learn: Will the ML model be able to learn? List aspects of the problem that might cause difficulty learning. For example, the data set doesn't contain enough positive labels, the training data doesn't contain enough examples, the labels are too noisy, the system memorizes the training data, but has difficulty generalizing to new cases.
Think About Potential Bias: Many datasets are biased in some way. These biases may adversely affect training and the predictions made. For example, a biased data source may not translate across multiple contexts, the training sets may not be representative of the ultimate users of the models and may therefore provide them with a negative experience.