Hello there, I have a use case and would like your opinion on it. What would your overall approach be? :) I have some ideas, but I would like to cross-check them with yours.

Let me summarize: there is a marketplace that lets users submit a request to organise a group trip.

For each request, 5 carriers propose a quote. The quote covers the transport (bus/minibus/car, driver, charges and so on).

So we have a system with two tables:

- Request table: ID_Request / Number of people / Departure date / Return date / Departure address / Return address / Type of travel

- Quote table (quotes proposed by carriers): ID_quote / ID_demande / Price_TTC

The marketplace wishes to estimate the price of a request before submitting that price to the carriers and to its clients via its web platform.

The question is: how do we set up a model to estimate this price and put it into production?

We need a process here, not code, just insight.

For me, first of all, we need to join our tables on ID_Demande, whatever the software (SQL/NoSQL), extract the data to a .csv file for example, and then move on to cleaning and machine learning.
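A minimal sketch of that first step, assuming the data lives in a SQL database: it joins the two tables on the request ID, averages the quotes per request, and writes the result as CSV. The table and column names follow the post (with spaces replaced by underscores, which is an assumption), and the rows are made-up toy values for illustration only.

```python
import csv
import io
import sqlite3

# Hypothetical in-memory database; the rows are invented toy values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Request (ID_Request INTEGER PRIMARY KEY, "
    "Number_of_people INTEGER, Type_of_travel TEXT)"
)
conn.execute(
    "CREATE TABLE Quote (ID_quote INTEGER PRIMARY KEY, "
    "ID_demande INTEGER, Price_TTC REAL)"
)
conn.executemany(
    "INSERT INTO Request VALUES (?, ?, ?)",
    [(1, 30, "school trip"), (2, 8, "airport transfer")],
)
conn.executemany(
    "INSERT INTO Quote VALUES (?, ?, ?)",
    [(10, 1, 900.0), (11, 1, 1100.0), (12, 2, 200.0), (13, 2, 240.0)],
)

# Join quotes to requests on the request ID and aggregate to one row per request.
rows = conn.execute(
    """
    SELECT r.ID_Request, r.Number_of_people, r.Type_of_travel,
           COUNT(q.ID_quote) AS n_quotes,
           AVG(q.Price_TTC)  AS mean_price_ttc
    FROM Request AS r
    JOIN Quote AS q ON q.ID_demande = r.ID_Request
    GROUP BY r.ID_Request
    ORDER BY r.ID_Request
    """
).fetchall()

# Export as CSV (written to a string here; a real file handle works the same way).
header = ["ID_Request", "Number_of_people", "Type_of_travel",
          "n_quotes", "mean_price_ttc"]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Whether the target should be the mean quote, the lowest quote, or the accepted quote is a modelling decision that the post leaves open; the mean is used here only as a placeholder.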

Do you have any ideas regarding the prerequisite statistical tests? I'm a little lost here.

Regarding the possible models, I have some ideas (Linear Regression/Lasso/Ridge/ElasticNet/DecisionTreeRegressor/RandomForestRegressor and so on).

As for the metrics: RMSE/R²/MAE/MAPE and so on.
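For reference, the four metrics named above can be computed from their standard definitions in a few lines. This is a sketch in plain Python; the function name `regression_metrics` and the sample values are hypothetical, and MAPE is only defined when no true price is zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in percent) and R² for two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the true value, so every y_true must be non-zero.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "RMSE": rmse, "MAPE_percent": mape, "R2": r2}


# Made-up prices, for illustration only.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
metrics = regression_metrics(y_true, y_pred)
```

MAE and RMSE are in the same unit as the price, which makes them easy to explain to the marketplace; MAPE is relative, so it treats a small trip and a large trip more evenly.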

I'm a little lost regarding the steps for putting the model into production.
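One common shape for those steps is: train offline, persist the fitted model as an artifact, then load that artifact in the service that answers price requests and validate inputs before predicting. The sketch below illustrates this with a deliberately tiny one-feature least-squares model stored as JSON. All function names, the file name, and the data are hypothetical, and a real pipeline would use more features and a proper model library.

```python
import json
import os
import tempfile


def fit_line(xs, ys):
    """Ordinary least squares for price = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def save_model(model, path):
    """Persist the fitted coefficients so another process can load them."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)


def load_model(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def predict_price(model, number_of_people):
    """Validate the input, then apply the stored coefficients."""
    if number_of_people <= 0:
        raise ValueError("number_of_people must be positive")
    return model["intercept"] + model["slope"] * number_of_people


# 1. Offline training step (made-up data, for illustration only).
people = [10, 20, 30, 40]
prices = [300.0, 500.0, 700.0, 900.0]
model = fit_line(people, prices)

# 2. Persist the artifact.
model_path = os.path.join(tempfile.mkdtemp(), "price_model.json")
save_model(model, model_path)

# 3. Serving step: load the artifact and answer a request.
served = load_model(model_path)
estimate = predict_price(served, 25)
```

In practice the serving step would sit behind the marketplace's web platform, and the training step would be rerun on a schedule as new quotes arrive, with the metrics tracked between versions.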

And for the languages, I guess Python/R/Julia and SQL/NoSQL could do the trick?

Any insight is most welcome :) Sorry for the long post, I wanted to be clear. Don't hesitate to tell me if I'm not!

Thanks !

P.S.: If it's not the right place to ask this kind of thing, don't hesitate to tell me where I need to go :)

Replies to This Discussion

I'd recommend that, before you get into the question of the best tools, you figure out the general nature of the problem itself.

For instance, you don't know how much actual data you have. If each vendor has only one mode of transportation, then this will likely be something you can solve as a set of linear equations or as a basic regression. The second thing that's unclear is what specifically you are attempting to optimize for: the lowest fare for the customer, or the highest fare for the vendor?

You don't know which features are significant and which are spurious, and there are hints that further features have not yet been identified. As an example, does the fare per passenger decrease as the number of passengers increases? Is there a discount (or penalty) when longer distances are taken into account?
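A question like the per-passenger one can be checked directly before any modelling. As a rough sketch, the snippet below derives the price per person and measures its Pearson correlation with group size; the function name `pearson` and the numbers are invented for illustration and are not taken from the marketplace's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither list is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up requests: group size and total quoted price.
people = [10, 20, 40, 50]
total_price = [400.0, 600.0, 1000.0, 1200.0]

# Derived feature: price per passenger.
price_per_person = [price / n for price, n in zip(total_price, people)]

# A clearly negative value would suggest a per-passenger discount for larger groups.
r = pearson(people, price_per_person)
```

A correlation only captures a linear trend, so plotting price per person against group size is a useful companion check.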

You've also presented two tables, but there are no obvious common keys between the tables.

The point I'm trying to make here is that before you pull out any statistical tools, you should work out the model on paper. With large data sets you can get away with trying to infer behavior from statistical patterns, but my suspicion is that here you don't have enough data for that model to be accurate enough to be useful. Create a graph that identifies the vectors and the weights associated with those vectors, and then and only then begin putting a plan into place to find the best statistical methods for the questions that you're asking.

It's a training exercise, to see the kind of process I could build. I need to see the process for both cases: small data and big data.

But I agree with you, this test is rather abstract.

There's an adage in stochastic theory (with some mathematical basis) that the minimum reasonable sample size for a problem is in the neighborhood of 700 records. That seems to be the point where the intrinsic bias from taking a random sample begins to settle down into something that is more stochastic in nature.

With machine learning, I'd push that up to perhaps 10,000 records, because what you are generally looking for there is enough data points that you can actually determine normal feature vectors and, from that, start reducing the feature set so that only the most prominent independent features remain. Remember that in the end you're basically trying to create a mesh or manifold that allows you to at least suggest the shape of an n-dimensional differential surface, in order to locate minima in that space and calibrate the appropriate coefficients.

Now, while I have a pretty decent background in stochastic theory (I was a Physics major in university), what I know is about thirty years out of date, though I'm slowly regaining currency. However, I think my advice is probably going to be the same as that of any of your advisors: understand the mathematics first, rather than getting hung up on algorithms and software, and even more, spend some time studying the domain. I had a friend once who created a very sophisticated mathematical model of tree growth patterns for determining long-term lumber production. He did it in Excel because, while his math was not that strong, he understood trees. I've always kept that in the back of my mind as an object lesson.

Everyone is talking about big data these days, but in fact a lot of that talk is exaggerated. Employment data shows that corporate recruiters are looking for big data skills. However, other data shows that companies do not know what to do with these big data professionals.

More important than big data itself, however, are analysis and the automated management of big data. This trend is leading to the emergence of a large number of automated server configuration tools. Puppet and others are the forces behind the "DevOps" trend.

That is very interesting, and yes, it's a training exercise, and you can follow this


Data has always been an integral part of every discipline. In any business, managing data has always been a task. Virtually all industries are digitizing nowadays. Additionally, data collection from sensors, weblogs, mobile devices, and instruments has increased recently. Believe it or not, there is a boom in new technologies emerging simply to organize this avalanche of information. With the support of Data science classes in pune, experts are able to identify the patterns and regularities in all sorts of information, which lets companies produce value. It would not be wrong to say that data scientists are the future of the coming generation.

First of all, Seven Mentor is an old and trustworthy institute with trainers who have 15+ years of experience. The course is designed so that you can learn and become a skilled data scientist. Our Data science Course in Mumbai delivers hands-on experience on live projects and assured job placement as a data scientist.

I agree with the point you shared here. If we are not clear about how much data we have, then it will be like solving a maths problem with only some of the equations. For an easier way to solve big data problems, get comfortable with basic maths. At first I was slow at doing calculations; then I completed the Data science Course from Learnbay, where they taught me the easiest way to calculate with shortcuts. Check YouTube tutorials, they can help you. For more guidance, go to an institute.

What is the definition of a data science use case? It's usually an inquiry or a supposition. Sometimes you're seeking an answer, sometimes you're looking for an explanation, and sometimes you're just looking for confirmation that something is true. At the end of the day, it's all about a question that can be addressed with facts.

For data scientists, identifying possible use cases is deceptively simple. Everyone has a lot of ideas, but there are two types of bias that cause people to discard good ones.

They have a bias for items they are personally familiar with. Nobody, especially when it comes to data, has comprehensive visibility into the entire organization. There is simply too much information for everyone to comprehend.

Take a Data science certificate course for more growth and to get placed in a highly paid company. I completed my Online Data science course at Learnbay. This may help you.


© 2021 TechTarget, Inc.
