<p><strong>Advices for use case in Data Science</strong> (forum thread, Data Science Central: https://www.datasciencecentral.com/forum/topics/advices-for-use-case-in-data-science)</p>
<p><em>Comment by Kurt Cagle, 2020-11-15</em></p>
<p>There's an adage in stochastic theory (with some mathematical basis) that the minimum reasonable sample size for a problem is in the neighborhood of 700 records; that seems to be the point where the intrinsic bias from taking a random sample begins to settle down into something that is more stochastic in nature.</p>
<p>With Machine Learning, I'd push that up to perhaps 10,000 records, because what you are generally looking for there are enough data points that you can determine normal feature vectors and, from that, start reducing the feature set so that only the most prominent independent features remain. Remember that, in the end, you're basically trying to create a mesh or manifold that at least suggests the shape of an n-dimensional differential surface, in order to locate minima in that space and calibrate the appropriate coefficients.</p>
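<p>To make the feature-reduction step concrete, here's a minimal sketch in plain NumPy; the data, thresholds, and feature names are invented for illustration. With enough records, near-constant and near-duplicate features can be filtered out before any modelling:</p>

```python
import numpy as np

# Invented data for illustration: ~10,000 records, 6 candidate features.
rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 6
X = rng.normal(size=(n_samples, n_features))
X[:, 3] = 0.99 * X[:, 0] + rng.normal(scale=0.05, size=n_samples)  # near-duplicate of feature 0
X[:, 5] = 0.001 * rng.normal(size=n_samples)                       # near-constant feature

variances = X.var(axis=0)
corr = np.corrcoef(X, rowvar=False)  # 6x6 feature-correlation matrix

# Keep a feature only if it varies and isn't a near-copy of an earlier one.
keep = [j for j in range(n_features)
        if variances[j] > 1e-3
        and not any(abs(corr[j, k]) > 0.95 for k in range(j))]
print(keep)  # → [0, 1, 2, 4]
```

<p>The point isn't the particular thresholds, it's that with 10,000 points the variance and correlation estimates are stable enough for this kind of pruning to mean something.</p>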
<p>Now, while I have a pretty decent background in stochastic theory (I was a Physics major in university), what I know is about thirty years out of date, though I'm slowly regaining currency. However, I think my advice is probably going to be the same as any of your advisors': understand the mathematics first, rather than getting hung up on algorithms and software, and even more, spend some time studying the domain. I had a friend once who created a very sophisticated mathematical model of tree growth patterns for determining long-term lumber production. He did it in Excel because, while his math was not that strong, he understood trees. I've always kept that in the back of my mind as an object lesson.</p>
<p><em>Comment by Alonso, 2020-11-12</em></p>
<p>It's a training exercise, to see the kind of process I could build. I need to see the process for both little data and big data.</p>
<p>But I agree with you: this test is kind of abstract.</p>
<p><em>Comment by Kurt Cagle, 2020-11-11</em></p>
<p>I'd recommend that, before you get into a question about the best tools, you figure out the general nature of the question itself.</p>
<p>For instance, you don't know how much actual data you have. If each vendor has only one mode of transportation, then this is likely something you can solve as a set of linear equations or with a basic regression. The second thing that's unclear is what specifically you are attempting to optimize for: the lowest fare for the customer? The highest fare for the vendor?</p>
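<p>In the simple one-mode-per-vendor case, the whole "model" may be an ordinary least-squares line. A sketch with invented fares and distances (none of these numbers come from your tables):</p>

```python
import numpy as np

# Invented fares: suppose fare ≈ base + rate * distance for one vendor.
distance_km = np.array([2.0, 5.0, 8.0, 12.0, 20.0])
fare = np.array([4.0, 7.0, 10.0, 14.0, 22.0])  # generated as 2 + 1.0 * distance

# Ordinary least squares: solve [1, d] @ [base, rate] ≈ fare.
A = np.column_stack([np.ones_like(distance_km), distance_km])
(base, rate), *_ = np.linalg.lstsq(A, fare, rcond=None)
print(base, rate)  # → 2.0 1.0 (up to floating-point noise)
```

<p>If a fit like this explains the fares, reaching for heavier machinery first would only obscure that.</p>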
<p>You don't know which features are significant and which are spurious, and there are hints that more features remain unidentified. For example, does the fare price per passenger decrease as the number of passengers increases? Is there a discount (or penalty) when longer distances are taken into account?</p>
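<p>Questions like these can be answered with a one-line aggregation before any modelling. A sketch with invented trip data (the column names and values are hypothetical, not from your tables):</p>

```python
import pandas as pd

# Invented trips to illustrate the check; real data would replace this.
trips = pd.DataFrame({
    "passengers": [1, 1, 2, 2, 4, 4],
    "fare":       [10.0, 12.0, 18.0, 20.0, 32.0, 36.0],
})
trips["fare_per_passenger"] = trips["fare"] / trips["passengers"]

# Mean per-passenger fare by group size: a decline suggests a volume discount.
per_group = trips.groupby("passengers")["fare_per_passenger"].mean()
print(per_group.to_dict())  # → {1: 11.0, 2: 9.5, 4: 8.5}
```
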
<p>You've also presented two tables, but there are no obvious common keys between the tables.</p>
<p>The point I'm trying to make here is that before you pull out any statistical tools, work out the model on paper. With large data sets you can get away with inferring behavior from statistical patterns, but my suspicion is that here you don't have enough data for that model to be accurate enough to be useful. Create a graph that identifies the vectors and the weights associated with them, and then, and only then, begin putting a plan in place to find the best statistical methods for the questions you're asking.</p>
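<p>That paper graph can literally start life as a dictionary of weighted edges. A toy sketch, with entirely invented vendor and route names and weights:</p>

```python
# Hypothetical vendor→route edges with weights; names and numbers invented.
edges = {
    ("vendor_a", "route_1"): 0.6,
    ("vendor_a", "route_2"): 0.4,
    ("vendor_b", "route_1"): 1.0,
}

# Sum the weight leaving each vendor as a coherence check on the paper model.
totals = {}
for (vendor, _route), weight in edges.items():
    totals[vendor] = totals.get(vendor, 0.0) + weight
print(totals)
```

<p>Once the edges and weights are written down explicitly, it becomes much easier to see which statistical method, if any, the question actually calls for.</p>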