This is not very simple to choose a machine learning method and letting it go wild on the data. Particularly, understanding the core business problem and objective of the outcome and frame accordingly is one of the vital factors in machine learning. A general approach is difficult to recommend without intimate knowledge of the data. However, it sounds like we need to formalize the aspects of your model. Following questions may help to decide on machine learning problem or otherwise:
- What am I trying to predict? What are my outcomes?
- What data can I use to train my model, and what are my inputs? What market factors can I train my model with to predict the outcomes?
Machine learning comprises of supervised, unsupervised, semi- supervised and reinforced learning. Articulation of the Problem is very important. Whether it is a classification or a number problem or a just a clustering problem to solve. There are several subtypes of classification and regression.
Classification: How many categories to pick?
- Binary classification (yes/no)
- Multi-class classification (types of animal). For multi-class classification
- how many categories for a single example, Multi -class single level (Which type of animal is this)
- Multi-class multi-level (What are all the animal in this picture).
In classification, data is labelled meaning it is assigned a class, for example prepaid/postpaid or website visitors/non-website visitors. The decision being modelled is to assign labels to new unlabeled pieces of data.
Regression: How many numbers are the outputs,
- A uni-dimensional regression e.g. how many numbers of ticket booked
- A multi-dimensional regression e.g. what is the latitude and longitude of the location.
Data is labelled with a real value (think floating point) rather than a label. Time series data like the price of a stock over time, the decision being modelled is what value to predict for new unpredicted data.
Clustering: Data is not labelled but can be divided into groups based on similarity and other measures of natural structure in the data. An example would be organizing customers by usage without names/identify, where the customer/user assign names to groups like Digital Natives, iPhoto on the Mac, Campaign hunters etc.
Rule Extraction: Data is used as the basis for the extraction of propositional rules (if-then). Such rules may, but are typically not directed, meaning that the methods discover statistically supportable relationships between attributes in the data, not necessarily involving something that is being predicted. An example is the discovery of the relationship between the purchase of air ticket and hotel booking (this is data mining folk-law, true or not, it’s illustrative of the desire and opportunity).
Once articulation problem has been solved, now it’s time to Identify Data Sources and provide answers to the following questions about the labels:
- How much labeled data do I have?
- What is the source of my label?
- Is my label closely connected to the decision I am going to make?
Identifying the data that the ML system should use to make predictions. Each row constitutes one piece of data for which one prediction is made. Only include information that is available at the moment of the prediction is made. Pick 1-3 inputs that are easy to obtain and that would produce be a reasonable, initial outcome.
Ability to Learn: Will the ML model be able to learn? List aspects of the problem that might cause difficulty learning. For example, the data set doesn’t contain enough positive labels, the training data doesn’t contain enough examples, the labels are too noisy, the system memorizes the training data, but has difficulty generalizing to new cases.
Think About Potential Bias: Many datasets are biased in some way. These biases may adversely affect training and the predictions made. For example, a biased data source may not translate across multiple contexts, the training sets may not be representative of the ultimate users of the models and may therefore provide them with a negative experience.