Hello there, I have a use case and I would like your opinion on it: what would your overall approach be? :) I have some ideas, but I would like to cross-check them with yours.
Let me summarize: there is a marketplace that lets users submit a request to organize a group trip.
For each request, five carriers propose a quote. The quote covers the transport (bus/minibus/car, driver, charges, and so on).
So we have a system with two tables :
- Request table: ID_Request / Number of people / Departure Date / Return Date / Departure Address / Return Address / Type of travel
- Quote table (quotes proposed by carriers): ID_quote / ID_demande / Price_TTC (price including tax)
The marketplace wishes to be able to estimate the price of a request before submitting it to the carriers and to their clients via their web platform.
The question is: how do I set up a model to estimate this price and put it into production?
We need a process here, not code, just insight.
My view: first of all, we need to join our tables, whatever the software (SQL/NoSQL), on ID_demande, export the data to .csv for example, and then move on to cleaning and machine learning.
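To make that merge step concrete, here's a rough sketch with pandas and made-up rows (column names are taken from the tables above; the assumption that ID_demande references ID_Request is mine):

```python
import pandas as pd

# Made-up stand-ins for the two tables; in practice they would come from
# pd.read_sql() or from .csv exports of the database.
requests = pd.DataFrame({
    "ID_Request": [1, 2],
    "Number_of_people": [30, 12],
    "Departure_Date": ["2024-06-01", "2024-07-15"],
})
quotes = pd.DataFrame({
    "ID_quote": [10, 11, 12],
    "ID_demande": [1, 1, 2],  # assumed to reference ID_Request
    "Price_TTC": [1500.0, 1420.0, 800.0],
})

# One row per quote, enriched with the request's features.
merged = quotes.merge(requests, left_on="ID_demande",
                      right_on="ID_Request", how="left")
merged.to_csv("training_data.csv", index=False)  # export for cleaning/ML
```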
Do you have any ideas regarding the prerequisite statistical tests? I'm a little lost here.
Regarding possible models, I have some ideas (Linear Regression, Lasso, Ridge, ElasticNet, DecisionTreeRegressor, RandomForestRegressor, and so on).
As for the metrics: RMSE, R², MAE, MAPE, and so on.
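To sketch how those model candidates and metrics fit together, here's a rough scikit-learn comparison on synthetic data (the feature set and the price formula are invented, since I don't have the real data yet):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the merged data: price driven by group size and
# trip distance plus noise (purely illustrative).
n = 500
X = np.column_stack([rng.integers(5, 60, n), rng.uniform(10.0, 800.0, n)])
y = 200 + 8 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 50, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    pred = model.fit(X_train, y_train).predict(X_test)
    results[type(model).__name__] = {
        "RMSE": mean_squared_error(y_test, pred) ** 0.5,
        "MAE": mean_absolute_error(y_test, pred),
        "MAPE": float(np.mean(np.abs((y_test - pred) / y_test))),
        "R2": r2_score(y_test, pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The same loop extends naturally to Lasso, Ridge, ElasticNet, and DecisionTreeRegressor, ideally with cross-validation instead of a single split.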
I'm a little lost regarding the steps for putting the model into production.
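The smallest possible version of that production step: persist the fitted model, then reload it wherever predictions are served (a web API, a batch job). A sketch with joblib and made-up numbers; in a real deployment the reload would sit behind something like Flask or FastAPI:

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Train on placeholder data, then persist the fitted model to disk.
X = np.array([[10, 100.0], [20, 250.0], [40, 600.0]])  # [people, distance_km]
y = np.array([430.0, 735.0, 1420.0])                   # invented prices
model = LinearRegression().fit(X, y)
joblib.dump(model, "price_model.joblib")

# In the serving layer, reload the artifact and predict for a new request.
reloaded = joblib.load("price_model.joblib")
estimate = reloaded.predict(np.array([[25, 300.0]]))[0]
print(f"estimated price: {estimate:.2f}")
```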
And for the languages, I guess Python/R/Julia plus SQL/NoSQL could do the trick?
Any insight is most welcome :) Sorry for the long post; I wanted to be clear. Don't hesitate to ask if I'm not!
P.S.: If this isn't the right place to ask this kind of question, don't hesitate to point me to where I should go :)
I'd recommend that before you get into questions about the best tools, you figure out the general nature of the problem itself.
For instance, you don't know how much actual data you have. If each vendor has only one mode of transportation, this will likely be something you can solve as a set of linear equations or with a basic regression. The second thing that's unclear is what specifically you are attempting to optimize for: the lowest fare for the customer? The highest fare for the vendor?
You don't know which features are significant and which are spurious, and there are hints that more features remain unidentified. For example, does the fare price per passenger decrease as the number of passengers increases? Is there a discount (or penalty) when longer distances are taken into account?
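Questions like these can be checked directly once the tables are joined. A hypothetical sketch (column names assumed from the question, numbers invented): compute the price per passenger and correlate it with group size; a clearly negative correlation would suggest a volume discount.

```python
import pandas as pd

# Hypothetical merged data; column names assumed from the question.
df = pd.DataFrame({
    "Number_of_people": [8, 15, 30, 45, 55],
    "Price_TTC": [400.0, 690.0, 1200.0, 1650.0, 1925.0],
})

# Per-passenger price vs. group size.
df["price_per_person"] = df["Price_TTC"] / df["Number_of_people"]
corr = df["price_per_person"].corr(df["Number_of_people"])
print(f"correlation: {corr:.2f}")  # negative would hint at a volume discount
```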
You've also presented two tables, but there is no obvious common key between them.
The point I'm trying to make is that before you pull out any statistical tools, work out the model on paper. With large data sets you can get away with inferring behavior from statistical patterns, but my suspicion is that here you don't have enough data for such a model to be accurate enough to be useful. Create a graph that identifies the vectors and the weights associated with those vectors, and then, and only then, begin putting a plan in place to find the best statistical methods for the questions you're asking.
It's a training exercise, to see the kind of process I could build. I need to see the process for both cases: little data and big data.
But I agree with you, this exercise is kind of abstract.
There's an adage in stochastic theory (with some mathematical basis) that the minimum reasonable sample size for a problem is in the neighborhood of 700 records; that seems to be the point where the intrinsic bias from taking a random sample begins to settle down into something more stochastic in nature.
With machine learning, I'd push that up to perhaps 10,000 records, because what you are generally looking for there are enough data points that you can determine normal feature vectors, and from there start reducing the feature set so that only the most prominent independent features remain. Remember that in the end you're essentially trying to create a mesh or manifold that suggests the shape of an n-dimensional differential surface, in order to locate minima in that space and calibrate the appropriate coefficients.
Now, while I have a pretty decent background in stochastic theory (I was a physics major in university), what I know is about thirty years out of date, though I'm slowly regaining currency. However, I think my advice is probably going to be the same as any of your advisors': understand the mathematics first, rather than getting hung up on algorithms and software, and even more, spend some time studying the domain. I had a friend who created a very sophisticated mathematical model of tree growth patterns for determining long-term lumber production. He did it in Excel because, while his math was not that strong, he understood trees. I've always kept that in the back of my mind as an object lesson.
Everyone is talking about big data these days, but in fact a lot of the talk is exaggerated. Employment data shows that corporate recruiters seem to want big data skills; other data, however, shows that companies do not know what to do with these big data professionals.
More important than big data itself are the analysis and the automated management of that data. This trend is driving the emergence of a large number of automated server-configuration tools; Puppet and others are the forces behind the "DevOps" trend.