<p>Comments - Black-box Confidence Intervals: Excel and Perl Implementation - Data Science Central (feed updated 2019-09-18T17:43:33Z)</p>
<p><a href="https://www.datasciencecentral.com/profile/SanderStepanov">Sander Stepanov</a>, 2016-06-20T19:47:36.907Z:</p>
<p>Regarding shuffling: how can one ensure that the bins differ enough from one another (that is, that no two bins are similar)? If the number of bins is 2^n, then a Hadamard matrix may be used.</p>
<p><a href="https://en.wikipedia.org/wiki/Hadamard_matrix" target="_blank">https://en.wikipedia.org/wiki/Hadamard_matrix</a></p>
<p>But if not...</p>
<p>I am very surprised that nobody pays attention to this problem. Or maybe someone does?</p>
<p><a href="https://www.datasciencecentral.com/profile/DouglasDame">Douglas A Dame</a>, 2016-02-13T19:09:29.484Z:</p>
<p>This has been re-posted (+/-), and so brought back to my attention ... and my immediate reaction was "why bother (doing this by a new method)?" Which was my immediate reaction the first time around too. So this time I will post some thoughts.</p>
<p>If one is doing work for other people, often/usually you do a walk-through of the methodology/data/limitations, before you get to the fun stuff, aka the Results. The Results are where you want to spend your time, and it's where the customer wants to spend time. The time it takes to explain "C.I.s by binning" is a diversion. Moreover, it brings in two risks: the low quant part of the audience may not understand it (given your fast description), and thus begin to have lingering doubts about your work, whereas the high quant part of your audience will understand it, but possibly not agree with your choice to use an unconventional approach when others are readily available, and thus begin to have lingering doubt about your work.</p>
<p></p>
<p>So I see downsides of using this approach, but no compelling upside. The binning method means that each observation can be in one and only one bin. So it's sampling without replacement. I'm not the greatest of theoretical statisticians but I have it in mind that sampling with replacement is generally accepted as more likely to root out the true variability of the data in hand. Data that lives in Excel by my definition cannot be "Big Data," and it's easy to kick it out to R or some other s/w that makes bootstrapping easy.</p>
<p>Why is this preferable to doing an empirical estimation of confidence intervals by bootstrap resampling, or Monte Carlo resampling? (For estimating CIs, I would say the difference between the two is that bootstrapping grabs samples with replacement that approximate the size of the original sample, while Monte Carlo would tend to use smaller samples.)</p>
<p>I can see where, for really really big data, binning in one pass could be materially faster than bootstrapping. (Especially if you eliminate the need to do a randomized sort of all the data before assigning bins.) But is the time saved worth it for datasets of moderate size, where a bootstrapping or Monte Carlo approach of, say, 1000 iterations might take only a couple of minutes in the worst case?</p>
<p></p>
<p>I work with health care data, none of which ever seems to have a normal distribution, and routinely use the Monte Carlo approach to put a range of fuzziness on what would otherwise be point estimates. It's not at all difficult, and not very time-consuming, measured in either human or computer effort.</p>
<p></p>
<p>Please help me understand under what conditions this binning approach would be a better choice to use.</p>
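<p>The two resampling schemes contrasted in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: <code>bootstrap_ci</code> is a plain percentile bootstrap (resampling with replacement at the original sample size), and <code>binned_stats</code> is one possible reading of "each observation in one and only one bin" (shuffle once, split into disjoint bins, one statistic per bin). Both function names, the defaults, and the round-robin bin assignment are hypothetical; the article's exact procedure for turning bin statistics into an interval is not reproduced here.</p>

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.fmean, n_iter=1000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample WITH replacement, same size as data."""
    rng = random.Random(seed)
    n = len(data)
    # One statistic per resample, sorted so percentiles can be read off by index.
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_iter))
    alpha = (1.0 - level) / 2.0
    lo = stats[int(alpha * (n_iter - 1))]
    hi = stats[int((1.0 - alpha) * (n_iter - 1))]
    return lo, hi


def binned_stats(data, n_bins=20, stat=statistics.fmean, seed=0):
    """Assumed binning scheme: shuffle once, then split WITHOUT replacement.

    Each observation lands in exactly one bin; bin sizes differ by at most one.
    Returns one statistic per bin.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    # Round-robin slicing gives disjoint bins that together cover the data.
    return [stat(shuffled[i::n_bins]) for i in range(n_bins)]
```

<p>Note that each bin holds only about n / n_bins observations, so the spread of the per-bin statistics reflects that smaller sample size, whereas each bootstrap resample has the full size n. Any comparison of the two has to account for that difference.</p>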
<p><a href="https://www.datasciencecentral.com/profile/DmitryLPetrov">Dmitry Petrov</a>, 2015-09-21T07:07:27.341Z:</p>
<p>This is just a terminology issue. When I present a result I cannot call this a "confidence interval", because people will assume it is a regular statistical CI. Nor do I want to explain this method to an audience. I'd prefer to calculate the CI using a regular method for a presentation.</p>
<p>This method is good in practice: it allows fast iteration through hypotheses and data sets. There is no doubt that data scientists need more methods like this.</p>
<p><a href="https://www.datasciencecentral.com/profile/VincentGranville">Vincent Granville</a>, 2015-09-20T22:09:18.554Z:</p>
<p>Dmitry and Thomas: How can you claim it's wrong when it produces the exact same results as "standard" theory, asymptotically, for normal distributions? Actually it is easy to explain why it produces the same results: you just have to use some limit theorems to prove the validity.</p>
<p><a href="https://www.datasciencecentral.com/profile/DmitryLPetrov">Dmitry Petrov</a>, 2015-09-20T21:33:16.849Z:</p>
<p>I agree that this approach is not correct from a statistical point of view.<br/><br/>Why can this method be interesting? Because it is easy to implement in SQL/NoSQL/Hadoop and to run on terabyte-scale data sets. I use one simple hypothesis testing method (population proportion) which is not "statistically correct" but is easy to implement in SQL/NoSQL. I use the method a lot and should probably write a blog post about it.<br/><br/>These "not correct" methods are good in ad hoc analytics projects with large data sets. In these projects you have to iterate through many hypotheses, and iteration time is crucial. Once the problem is localized and the dataset is reduced, I use correct statistical methods.</p>
<p><a href="https://www.datasciencecentral.com/profile/ThomasGerbaud">Thomas Gerbaud</a>, 2015-04-20T16:40:39.374Z:</p>
<p>Dear Dr Granville,</p>
<p>This article is more than 6 months old.<br/>And totally wrong, from a statistical point of view.</p>
<p><br/>BTW I do love the "rebel statistical science" you are referring to, it made my day.<br/><br/>Best<br/>T</p>
<p><a href="https://www.datasciencecentral.com/profile/KhurramNadeem">Khurram Nadeem</a>, 2014-08-12T14:03:21.797Z:</p>
<p>I agree, Peter. The best practice is to use a tool for the purpose it is actually designed for. Progress will come from learning the deficiencies in the existing tools and making improvements in the same spirit as Data Science Central is advocating for.</p>
<p><a href="https://www.datasciencecentral.com/profile/PeterVijn">Peter Vijn</a>, 2014-08-12T13:56:08.718Z:</p>
<p>Hi Khurram,</p>
<p>Complexities will kick in, as skewness will induce heteroscedasticity. No problem when using the right model but creating chaos when left unaddressed. Time series data will raise challenges like non-stationarity. Vincent's approach is refreshing but just scratching the surface. I'm new to Data Science Central but am happy watching from the sidelines for now. My point here is that we should not throw away the achievements of classic descriptive statistics and modelling just because classic inferential statistics (P-values) is under attack. The benefit is in joining the old and the new...</p>
<p><a href="https://www.datasciencecentral.com/profile/KhurramNadeem">Khurram Nadeem</a>, 2014-08-12T13:33:50.311Z:</p>
<p>Thanks Peter for highlighting important points here. I also had the same impression as indicated <a href="https://www.linkedin.com/groups/Blackbox-Confidence-Intervals-Excel-Perl-4066593.S.5904352356828463108?view=&gid=4066593&type=member&item=5904352356828463108&trk=eml-group_discussion_new_comment-respond-btn" target="_blank">here</a>. Also, as for the comment that "<em>Our methodology is better </em><span><em>when n (the number of observations) is small (n < 100), or for high confidence levels (> 0.98) or when your data has outliers</em>", it is irrelevant for big data problems where the sample size is large. The central limit theorem will kick in for moderately large data sets, rendering the Gaussian CIs robust against outliers as well. However, if the task is to obtain prediction intervals for new data, one can rely on nonparametric density estimation techniques. Nonetheless, it will be interesting to see how this method compares with existing approaches when data arises from highly skewed multimodal populations.</span></p>
<p><a href="https://www.datasciencecentral.com/profile/PeterVijn">Peter Vijn</a>, 2014-08-12T11:09:53.928Z:</p>
<p>To really feed your method with some realistically simulated data, I suggest generating lognormal data with a range of different sigmas and seeing how you compete with the traditional techniques. A good link to theory, application and generality of the lognormal is <a href="http://stat.ethz.ch/~stahel/lognormal/bioscience.pdf" target="_blank">here</a>.</p>
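<p>A minimal sketch of the simulation suggested above, under stated assumptions: it draws lognormal samples (mu = 0) for several sigmas and estimates how often a standard normal-theory 95% interval for the mean covers the true mean, exp(sigma^2 / 2). The helper names <code>normal_ci</code> and <code>coverage</code>, the sample size, and the trial count are hypothetical choices. The article's binning method is not implemented here; any other interval function with the same signature could be passed as <code>ci</code> to compare it on the same data.</p>

```python
import math
import random
import statistics


def normal_ci(sample, z=1.96):
    """Traditional normal-theory interval for the mean (z = 1.96 for about 95%)."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se


def coverage(sigma, n=100, trials=1000, seed=0, ci=normal_ci):
    """Fraction of simulated lognormal samples whose interval contains the true mean."""
    rng = random.Random(seed)
    true_mean = math.exp(sigma ** 2 / 2.0)  # mean of a lognormal with mu = 0
    hits = 0
    for _ in range(trials):
        sample = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
        lo, hi = ci(sample)
        hits += lo <= true_mean <= hi
    return hits / trials


if __name__ == "__main__":
    # Larger sigma means heavier right skew; nominal coverage is 0.95.
    for sigma in (0.25, 0.5, 1.0, 1.5, 2.0):
        print(f"sigma={sigma}: coverage={coverage(sigma):.3f}")
```

<p>Estimated coverage well below the nominal level at large sigma would indicate where the traditional interval struggles, which is the regime where a competing method would have to show an advantage.</p>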