Data Science Central, 2019-05-26

Recommendation system evaluation (2019-05-24)
<p>I have created a recommendation system using KNN. Is there any way to evaluate or validate the system?</p>
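<p>One common offline approach is a leave-one-out hold-out: hide one liked item per user, then check whether the recommender puts it back into the top-n list (hit rate@n). Below is a minimal sketch with a toy rating matrix and a cosine-similarity user-KNN; all numbers and names are illustrative, not a prescribed method. Real evaluations also use precision@k, recall@k, RMSE on held-out ratings, or online A/B tests.</p>

```python
import numpy as np

# Toy user-item rating matrix (rows = users, cols = items); 0 = unrated.
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 4],
    [1, 0, 4, 5, 0],
], dtype=float)

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

def recommend(R, user, k=2, top_n=2):
    # Score unrated items by the ratings of the k most similar users.
    sims = np.array([cosine_sim(R[user], R[v]) if v != user else -1.0
                     for v in range(R.shape[0])])
    neighbors = sims.argsort()[::-1][:k]
    scores = R[neighbors].sum(axis=0)
    scores[R[user] > 0] = -np.inf          # never re-recommend rated items
    return list(scores.argsort()[::-1][:top_n])

def hit_rate_at_n(R, top_n=2):
    # Leave-one-out: hide each user's highest-rated item, then check
    # whether the recommender puts it back into the top-n list.
    hits = total = 0
    for u in range(R.shape[0]):
        held_out = int(R[u].argmax())
        R_train = R.copy()
        R_train[u, held_out] = 0
        hits += held_out in recommend(R_train, u, top_n=top_n)
        total += 1
    return hits / total

print("hit rate@2:", hit_rate_at_n(R))
```

<p>The same harness works for any recommender: only the <code>recommend</code> function changes, so KNN can be compared directly against alternatives on the same hidden items.</p>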
Statistical significance match with multi-variables (2019-05-24)
<p>I need a test that matches ratios of stable isotope proportions from one sample to another with some degree of significance, or a positive match: in particular lead, with 4 stable isotopes, and zinc, with 5. I am matching ecological indicator results with possible sources of contamination.</p>
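<p>One possible starting point, sketched below, is a chi-square test of homogeneity on the measured isotope counts of the two samples: it asks whether both profiles are consistent with a common underlying ratio. The counts are made up for illustration. Note that isotope data are compositional, so specialised approaches (e.g. log-ratio transforms before a multivariate test) may be more appropriate, and a non-significant result is only a plausible match, not proof of a common source.</p>

```python
import numpy as np

# Hypothetical ion counts for the four stable lead isotopes
# (204Pb, 206Pb, 207Pb, 208Pb) in two samples; illustrative numbers only.
sample = np.array([140, 2410, 2210, 5240], dtype=float)
source = np.array([150, 2380, 2260, 5210], dtype=float)

def chi2_homogeneity(a, b):
    # Expected counts under the hypothesis of a common isotope profile.
    table = np.vstack([a, b])
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    return ((table - expected) ** 2 / expected).sum()

stat = chi2_homogeneity(sample, source)
# Critical value for chi-square with df = (2-1)*(4-1) = 3 at alpha = 0.05.
print(stat, "plausible match" if stat < 7.815 else "profiles differ")
```

<p>For zinc, the same test applies with 5 columns (df = 4, critical value 9.488 at the 0.05 level).</p>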
Scanned documents OCR data cleaning (2019-05-24)
<p>Hi! I am reading data from scanned medical documents (provider notes) using Pytesseract OCR. The resulting text contains noise and misspellings. My ultimate goal is to extract useful medical information from the data, but right now I'm stuck on how to correct both medical and English misspellings. I need to create a dictionary that contains both medical and English words. I'm looking for direction on what steps to perform.</p>
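<p>A basic first step is exactly the dictionary idea: merge a general English wordlist with a medical vocabulary (e.g. from UMLS or MeSH), then replace each out-of-dictionary token with its closest dictionary entry by edit similarity. A minimal stdlib sketch, with a tiny made-up lexicon standing in for the real one:</p>

```python
import difflib
import re

# Hypothetical combined lexicon: general English words plus medical terms.
# In practice, load a full English wordlist and a medical vocabulary here.
lexicon = {"patient", "presented", "with", "acute", "nausea",
           "hypertension", "prescribed", "lisinopril", "daily"}

def correct(text, cutoff=0.75):
    out = []
    # Split into alphabetic runs and everything else, preserving punctuation.
    for token in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        word = token.lower()
        if not word.isalpha() or word in lexicon:
            out.append(token)
            continue
        # Closest dictionary entry by similarity ratio, if close enough.
        match = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
        out.append(match[0] if match else token)
    return "".join(out)

print(correct("Patiemt presented witb acute nausae"))
```

<p>This naive approach ignores context; for medical text, context-aware correction (e.g. a noisy-channel model over n-grams, or fine-tuned spell checkers) usually performs noticeably better, and domain terms should never be "corrected" into common English words, which is why they must be in the lexicon.</p>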
Traffic/commute data and processing related question (2019-05-23)
<p>Hello All,</p>
<p>I am very new to big data. I am exploring a commercial solution to the traffic/congestion problem. To that end, I am looking for your tips and suggestions on where I can find the following:</p>
<p>1) Raw anonymized/non-personal location-tracking data on cell-phone subscribers, to understand traffic/commute patterns. Ideally this data should be resolvable to a commute starting address and a commute ending address. I think cellular network carriers sell this data, but I am not sure how to get samples of it. I also understand the Census survey contains some of this data, but to the best of my understanding it only records city of residence and city of work, so it is somewhat non-specific.</p>
<p></p>
<p>2) Any pointers on tools/algorithms that resolve GPS coordinate data into the shortest-distance and shortest-time routes typically seen in map guidance systems. I assume the technology is common enough that there are standard algorithms/tools for it.</p>
<p></p>
<p>Thanks in advance for any suggestions and pointers.</p>
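<p>On point 2: routing engines (OSRM, GraphHopper, Valhalla, all built on OpenStreetMap data) run variants of Dijkstra's algorithm or A* over a road graph whose edge weights are distance or travel time. A minimal stdlib sketch of the core idea, on a made-up toy network:</p>

```python
import heapq

# Toy road network: node -> [(neighbor, travel_time_minutes), ...].
# Node names and weights are illustrative only.
graph = {
    "home":       [("junction_a", 5), ("junction_b", 9)],
    "junction_a": [("junction_b", 2), ("office", 11)],
    "junction_b": [("office", 3)],
    "office":     [],
}

def dijkstra(graph, start, goal):
    # Classic Dijkstra: repeatedly expand the cheapest frontier node.
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph[node]:
            if nxt not in seen:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

print(dijkstra(graph, "home", "office"))
# prints (10, ['home', 'junction_a', 'junction_b', 'office'])
```

<p>Swapping travel-time weights for distance weights gives the shortest-distance route with the same algorithm; production systems add heuristics (A*) and precomputation (contraction hierarchies) for continent-scale graphs.</p>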
Moments of Order Statistics (2019-05-23)
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2663469045?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2663469045?profile=RESIZE_710x" class="align-center"/></a></p>
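<p>As a companion to the figure, here is a quick numerical check of one classical result (which may or may not be the one shown): for n i.i.d. Uniform(0,1) draws, the k-th order statistic is Beta(k, n-k+1) distributed, so its first moment is k/(n+1). A minimal simulation sketch:</p>

```python
import random

# For n i.i.d. Uniform(0,1) draws, the k-th order statistic is
# Beta(k, n-k+1), with mean k/(n+1).
def simulated_moment(n, k, trials=100_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sorted(rng.random() for _ in range(n))[k - 1]
    return total / trials

n, k = 5, 2
print(simulated_moment(n, k), "vs theory:", k / (n + 1))
```

<p>Higher moments follow the same pattern by averaging powers of the sorted draw, and the Beta form gives them in closed form for the uniform case.</p>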
Data science audio book for layperson (2019-05-23)
<p>Please suggest a data science audio book for a beginner with little statistics, little linear algebra, and almost no programming background.</p>
Question about the big O notation (2019-05-23)
<p>We all know that exponential functions grow faster than polynomials. Let us consider functions of the form f(<em>n</em>) = <em>n</em>^<em>a</em> ⋅ (log <em>n</em>)^<em>b</em> ⋅ (log log <em>n</em>)^<em>c</em> ⋅ (log log log <em>n</em>)^<em>d</em>⋯ where the leading exponent <em>a</em> is positive.</p>
<p>I think anything that is "slowly growing" has this type of asymptotic expansion. In short, this type of representation is "complete": there is nothing between <em>n</em> and <em>n⋅</em> log <em>n</em> other than a member of the above class, for instance <em>n</em> ⋅ (log <em>n</em>)^1/2 ⋅ (log log log <em>n</em>)^5 / (log log <em>n</em>)^3.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2652532828?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2652532828?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Is my statement true? How can one make it rigorous, assuming it is correct? Also,</p>
<ul>
<li>What would be the slowest-growing function that grows faster than the fastest-growing function in the above class?</li>
<li>Provide an example of a function growing faster than the fastest-growing function in that class, but more slowly than exp <em>n</em>.</li>
<li>What happens if you allow the coefficients not to be bounded, and the sequence to be infinite? For instance consider f(<em>n</em>) = <em>n</em> ⋅ (log <em>n</em>)^2 ⋅ (log log <em>n</em>)^4 ⋅ (log log log <em>n</em>)^8⋯</li>
</ul>
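<p>On the second bullet, one standard candidate is <em>n</em>^(log <em>n</em>) = exp((log <em>n</em>)^2): its logarithm, (log <em>n</em>)^2, eventually dominates <em>a</em> log <em>n</em> + <em>b</em> log log <em>n</em> + ⋯ for any fixed exponents, yet is o(<em>n</em>), so the function sits strictly between the class and exp <em>n</em>. A quick numerical sanity check on a log scale (exponents a = 3, b = 5 chosen arbitrarily):</p>

```python
import math

# Compare growth on a log scale:
#   log f(n) for f(n) = n^a (log n)^b,
#   log g(n) for g(n) = n^(log n) = exp((log n)^2),
#   log exp(n) = n.
def log_f(n, a=3, b=5):
    return a * math.log(n) + b * math.log(math.log(n))

def log_g(n):
    return math.log(n) ** 2

n = 10 ** 6
print(log_f(n) < log_g(n) < n)
# prints True: g(n) sits strictly between the class member and exp(n)
```

<p>A numeric check at one point is of course not a proof; the asymptotic claim follows by comparing the logarithms term by term.</p>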
<p>In short, is there some kind of topological framework that handles manipulations of these functions? Indeed, they are not really functions, but asymptotic representations or quantities: a different type of mathematical object, with its own arithmetic and topology.</p>

Unrelated dimensions with common target value (2019-05-19)
<p><span>The problem is that we have many unrelated dimensions with a common target value. We want to build a formula that predicts this target variable (continuous numeric).</span></p>
<p><span>Is there any way to build some kind of ensemble model to predict the value? Maybe a weighted formula can help?</span></p>
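<p>A weighted formula is indeed one simple ensemble: fit one model per dimension, then blend their predictions with weights proportional to inverse validation error. The sketch below uses synthetic data and per-dimension least-squares fits purely for illustration; any per-dimension model would slot in the same way.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: three unrelated features, one continuous target (illustrative).
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
X_tr, X_val, y_tr, y_val = X[:150], X[150:], y[:150], y[150:]

# One simple least-squares model (slope + intercept) per dimension.
models = []
for j in range(X.shape[1]):
    A = np.c_[X_tr[:, j], np.ones(len(X_tr))]
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    models.append(coef)

def predict_one(j, x):
    slope, intercept = models[j]
    return slope * x[:, j] + intercept

# Weight each model by its inverse validation MSE: a basic weighted ensemble.
mses = np.array([np.mean((predict_one(j, X_val) - y_val) ** 2)
                 for j in range(X.shape[1])])
weights = (1 / mses) / (1 / mses).sum()
blend = sum(w * predict_one(j, X_val) for j, w in enumerate(weights))
print("weights:", np.round(weights, 3))
print("blend val MSE:", np.mean((blend - y_val) ** 2))
```

<p>In practice, stacking (fitting the blend weights themselves by regression on out-of-fold predictions) usually beats a fixed inverse-error formula, and gradient-boosted trees handle many weakly related features directly.</p>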
<p><span>Thanks</span></p>
Advanced big data algorithms (2019-05-15)
<p><span style="font-size: 12pt;">Recently I finished my book on probabilistic data structures, and I noticed that even though they are used in almost every well-known product that works with data (<em>Apache Spark, Elasticsearch, Redis, CouchDB, Cassandra</em>, ...), not many developers and software engineers are actually aware of them. However, their basic principles are very simple and amazingly smart.</span></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2560762961?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2560762961?profile=RESIZE_710x" width="400" class="align-center" style="padding: 10px;"/></a></p>
<p><span style="font-size: 12pt;">The problem is that they are treated as too "advanced" and as having a very specific use case: when you need to handle real "big data", not just data that crashes your MS Excel. There are not many books about them, the original publications are hard to read, and you are unlikely to find them in core university courses or online. But from my point of view, they are great material even for the big data interview process, when you want to find really smart and experienced employees.</span></p>
<p></p>
<p><span style="font-size: 12pt;"><strong>Let me explain them a bit further to spark your interest in learning them.</strong></span></p>
<p></p>
<p><span style="font-size: 16px;">When dealing with huge data, the first thing that comes to mind is <strong>parallel processing</strong>: allocate many independent workers and use, for instance, MapReduce. At some point, as our system handles more and more data, we recognize that we cannot scale our infrastructure any further: either it becomes too big and unmanageable, or too slow, or we have no money to pay for new servers, or all of the above.</span></p>
<p></p>
<p><span style="font-size: 16px;">At this point, the usual way to go is <strong>sampling</strong>: process only a subset of the data, skip the rest, and hope that the subset is representative enough to extrapolate the results to the whole dataset. The advantage of this approach is that we keep the same system and the same algorithms; nothing changes except the amount of data.</span></p>
<p></p>
<p><span style="font-size: 16px;">But what if we cannot ignore any data? Imagine we need to count the number of unique elements, find the most frequent items, or simply remember every element so we can check new elements for existence later. In such cases we can use <strong>hashing</strong>: compute a short representation of our elements, making them smaller. This is already an "advanced" decision that is not used so often. Still, pure hashing alone doesn't help much, since the problem often lies in the algorithms themselves, which require at least linear memory, many passes through the data, or polynomial time to complete.</span></p>
<p></p>
<p><span style="font-size: 16px;">If you have come this far, you are ready to learn <strong>probabilistic data structures and algorithms</strong>, which are optimized to use fixed or sublinear memory and constant execution time, and which have many other interesting properties. They are based on hashing and don't give you an exact answer, but they can guarantee an acceptable error.</span></p>
<p></p>
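<p>To make this concrete, here is a minimal sketch of one such structure, the Bloom filter: a fixed-size bit array plus k hash functions. It answers membership queries with possible false positives but never false negatives, in constant time and fixed memory. Sizes and the hashing scheme below are illustrative, not tuned.</p>

```python
import hashlib

class BloomFilter:
    # Fixed-size bit array + k hash functions; may report false
    # positives, never false negatives.
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
for word in ("spark", "redis", "cassandra"):
    bf.add(word)
print(bf.might_contain("redis"), bf.might_contain("elasticsearch"))
```

<p>The false-positive rate is tuned by choosing the bit-array size and hash count for the expected number of elements; the other structures named below trade exactness for fixed memory in the same spirit.</p>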
<p><span style="font-size: 16px;">Let me name a few examples here: <em>Bloom filter</em>, <em>quotient filter</em>, <em>Count-Min sketch</em>, <em>HyperLogLog</em>, <em>t-digest</em>, <em>MinHash</em>, and many others.</span></p>

DS career advice for experienced MBA with marketing analytics focus? (2019-05-09)
<p>I have an MBA in Marketing (U of Minnesota), 20+ years' marketing experience in CPG, and have been an independent consultant for the past 7 years. I have solid business stats skills (including multiple/logistic regression, matrices, segmentation) and have been extending my skillset into R, SQL, Python, and Tableau. I'd like to either focus my practice more on Marketing Data Science or return to a full-time role in that area. So my question is whether I need the accreditation of a Master's in Data Analytics/Stats/similar, or whether I can leverage my MBA and simply continue building my own skills as I have done. Is there true value in that second Master's, or can I demonstrate my proficiency by achieving industry credentials/certifications? Thank you for any and all counsel you can provide.</p>