]]>

Monday newsletter published by Data Science Central. Previous editions can be found here. The contribution flagged with a + is our selection for the picture of the week.

Announcement

Maximize your job prospects. Earn a Penn State Master’s in Data Analytics part-time and online. Learn to design and apply data management techniques to solve business problems involving high volumes of structured and unstructured data. Strive toward a leadership role in the field of data science. GRE/GMAT waivers available. Improve your marketability -- apply now for spring 2019. See details.

Featured Resources and Technical Contributions

Introduction to Deep Learning
Free Book: Process Improvement Using Data
Free Book: Introduction to Statistics
Data Science Glossary
An overview of feature selection strategies
Comparing Machine Learning as a Service - Amazon, Azure, Google Cloud AI, Watson
Conversational UI is our Future
A Study of Reddit Politics
Question: Hyperlinks not working in Excel
Question: Choosing the best free IAOPs suite tools
NLP: Video Tutorial

Featured Articles

Statistical Significance and p-Values Take Another Blow
The Fourth Way to Practice Data Science – Purpose Built Analytic Modules
100 Day Plan for a new Chief Data Analytics Officer +
How do you know if you’re getting value from your data?
Aggregated Data Dilemma
Defining AI Not as Important as Exploiting AI
Applying Noise Reduction to Stock Market Data
Helping Non-Profit Organizations as a Data Scientist
Wrongness of the Nogs
What is the difference between Machine Learning and Artificial Intelligence?
Data Science vs AI: Get to the Fundamentals

Picture of the Week

Source for picture: contribution marked with a +

From our Sponsors

Make a Data-Driven Decision
Next Generation BI: AI-Powered Analytics
Machine Learning: How to Build a Model from Scratch
Data Prep: A Better, Faster Way
How to Avoid Data Lake Failure: Gartner
Future-Proof Your Analytics Stack
Online M.S. in Applied Data Science From Syracuse
Self-Service Data Prep: Ovum Decision Matrix
Deep Learning - Training your Neural Network
4 Ways to Tackle Common Data Prep Issues
Free Book: Applied Stochastic Processes (members only)
Math for ML: Open Doors to Data Science and AI (eBook)

To make sure you keep getting these emails, please add mail@newsletter.datasciencecentral.com to your address book or whitelist us.

Hire a Data Scientist | Search DSC | Classifieds | Find a Job | Post a Blog | Ask a Question

Full title: Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems. Published June 2, 2018. Author: Vincent Granville, PhD. (104 pages, 16 chapters.)

This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.

New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without using jargon or arcane theory. The book unifies topics that are usually part of different fields (data science, operations research, dynamical systems, computer science, number theory, probability), broadening the knowledge and interest of the reader in ways that are not found in any other book. This short book contains a large amount of condensed material that would typically be covered in 500 pages in traditional publications. Thanks to cross-references and redundancy, the chapters can be read independently, in random order.

This book is available for Data Science Central members exclusively. The text in blue consists of clickable links to provide the reader with additional references. Source code and Excel spreadsheets summarizing computations are also accessible as hyperlinks for easy copy-and-paste or replication purposes. The most recent version of this book is available from this link, accessible to DSC members only.
About the author

Vincent Granville is a start-up entrepreneur, patent owner, author, investor, pioneering data scientist with 30 years of corporate experience in companies small and large (eBay, Microsoft, NBC, Wells Fargo, Visa, CNET) and a former VC-funded executive, with a strong academic and research background including Cambridge University.

Download the book (members only)

Click here to get the book. For Data Science Central members only. If you have any issues accessing the book please contact us at info@datasciencecentral.com.

Content

The book covers the following topics:

1. Introduction to Stochastic Processes

We introduce these processes, used routinely by Wall Street quants, with a simple approach consisting of re-scaling random walks to make them time-continuous, with a finite variance, based on the central limit theorem.

Construction of Time-Continuous Stochastic Processes
From Random Walks to Brownian Motion
Stationarity, Ergodicity, Fractal Behavior
Memory-less or Markov Property
Non-Brownian Process

2. Integration, Differentiation, Moving Averages

We introduce more advanced concepts about stochastic processes, yet make them easy to understand even for the non-expert. This is a follow-up to Chapter 1.

Integrated, Moving Average and Differential Process
Proper Re-scaling and Variance Computation
Application to Number Theory Problem

3. Self-Correcting Random Walks

We investigate here a breed of stochastic processes that are different from the Brownian motion, yet are better models in many contexts, including Fintech.

Controlled or Constrained Random Walks
Link to Mixture Distributions and Clustering
First Glimpse of Stochastic Integral Equations
Link to Wiener Processes, Application to Fintech
Potential Areas for Research
Non-stochastic Case

4. Stochastic Processes and Tests of Randomness

In this transition chapter, we introduce a different type of stochastic process, with number theory and cryptography applications, analyzing statistical properties of numeration systems along the way -- a recurrent theme in the next chapters, offering many research opportunities and applications. While we are dealing with deterministic sequences here, they behave very much like stochastic processes, and are treated as such. Statistical testing is central to this chapter, introducing tests that will also be used in the last chapters.

Gap Distribution in Pseudo-Random Digits
Statistical Testing and Geometric Distribution
Algorithm to Compute Gaps
Another Application to Number Theory Problem
Counter-Example: Failing the Gap Test

5. Hierarchical Processes

We start discussing random number generation, and numerical and computational issues in simulations, applied to an original type of stochastic process. This will become a recurring theme in the next chapters, as it applies to many other processes.

Graph Theory and Network Processes
The Six Degrees of Separation Problem
Programming Languages Failing to Produce Randomness in Simulations
How to Identify and Fix the Previous Issue
Application to Web Crawling

6. Introduction to Chaotic Systems

While typically studied in the context of dynamical systems, the logistic map can be viewed as a stochastic process, with an equilibrium distribution and probabilistic properties, just like numeration systems (next chapters) and processes introduced in the first four chapters.

Logistic Map and Fractals
Simulation: Flaws in Popular Random Number Generators
Quantum Algorithms

7. Chaos, Logistic Map and Related Processes

We study processes related to the logistic map, including a special logistic map discussed here for the first time, with a simple equilibrium distribution. This chapter offers a transition between chapter 6 and the next chapters on numeration systems (the logistic map being one of them).

General Framework
Equilibrium Distribution and Stochastic Integral Equation
Examples of Chaotic Sequences
Discrete, Continuous Sequences and Generalizations
Special Logistic Map
Auto-regressive Time Series
Literature
Source Code with Big Number Library
Solving the Stochastic Integral Equation: Example

8. Numerical and Computational Issues

These issues have been mentioned in chapter 7, and also appear in chapters 9, 10 and 11. Here we take a deeper dive and offer solutions, using high precision computing with BigNumber libraries.

Precision Issues when Simulating, Modeling, and Analyzing Chaotic Processes
When Precision Matters, and when it does not
High Precision Computing (HPC)
Benchmarking HPC Solutions
How to Assess the Accuracy of your Simulation Tool

9. Digits of Pi, Randomness, and Stochastic Processes

Deep mathematical and data science research (including a result about the randomness of Pi, which is just a particular case) is presented here, without using arcane terminology or complicated equations. Numeration systems discussed here are a particular case of deterministic sequences behaving just like the stochastic processes investigated earlier, in particular the logistic map, which is a particular case.

Application: Random Number Generation
Chaotic Sequences Representing Numbers
Data Science and Mathematical Engineering
Numbers in Base 2, 10, 3/2 or Pi
Nested Square Roots and Logistic Map
About the Randomness of the Digits of Pi
The Digits of Pi are Randomly Distributed in the Logistic Map System
Paths to Proving Randomness in the Decimal System
Connection with Brownian Motions
Randomness and the Bad Seeds Paradox
Application to Cryptography, Financial Markets, Blockchain, and HPC
Digits of Pi in Base Pi

10. Numeration Systems in One Picture

Here you will find a summary of much of the material previously covered on chaotic systems, in the context of numeration systems (in particular, chapters 7 and 9).

Summary Table: Equilibrium Distribution, Properties
Reverse-engineering Number Representation Systems
Application to Cryptography

11. Numeration Systems: More Statistical Tests and Applications

In addition to featuring new research results and building on the previous chapters, the topics discussed here offer a great sandbox for data scientists and mathematicians.

Components of Number Representation Systems
General Properties of these Systems
Examples of Number Representation Systems
Examples of Patterns in Digits Distribution
Defects found in the Logistic Map System
Test of Uniformity
New Numeration System with no Bad Seed
Holes, Autocorrelations, and Entropy (Information Theory)
Towards a more General, Better, Hybrid System
Faulty Digits, Ergodicity, and High Precision Computing
Finding the Equilibrium Distribution with the Percentile Test
Central Limit Theorem, Random Walks, Brownian Motions, Stock Market Modeling
Data Set and Excel Computations

12. The Central Limit Theorem Revisited

The central limit theorem explains the convergence of discrete stochastic processes to Brownian motions, and has been cited a few times in this book. Here we also explore a version that applies to deterministic sequences. Such sequences are treated as stochastic processes in this book.

A Special Case of the Central Limit Theorem
Simulations, Testing, and Conclusions
Generalizations
Source Code

13. How to Detect if Numbers are Random or Not

We explore here some deterministic sequences of numbers, behaving like stochastic processes or chaotic systems, together with another interesting application of the central limit theorem.

Central Limit Theorem for Non-Random Variables
Testing Randomness: Max Gap, Auto-Correlations and More
Potential Research Areas
Generalization to Higher Dimensions

14. Arrival Time of Extreme Events in Time Series

Time series, as discussed in the first chapters, are also stochastic processes. Here we discuss a topic rarely investigated in the literature: the arrival times, as opposed to the extreme values (a classic topic), associated with extreme events in time series.

Simulations
Theoretical Distribution of Records over Time

15. Miscellaneous Topics

We investigate topics related to time series as well as other popular stochastic processes such as spatial processes.

How and Why: Decorrelate Time Series
A Weird Stochastic-Like, Chaotic Sequence
Stochastic Geometry, Spatial Processes, Random Circles: Coverage Problem
Additional Reading (Including Twin Points in Point Processes)

16. Exercises

]]>

]]>

]]>

]]>

This morning I read an article about a top Cornell food researcher who had 13 studies retracted (see here). It prompted me to write this blog post. It is about data science charlatans and unethical researchers in academia who destroy the value of p-values, yet again, with a well-known trick called p-hacking, in order to get published in top journals and obtain grant money or tenure. The issue is widespread, not just in academic circles, and makes people question the validity of scientific methods. It fuels the fake "theories" of those who have lost faith in science.

The trick consists of repeating an experiment sufficiently many times, until the conclusions fit with your agenda. It can also consist of cherry-picking the data you use, or even discarding observations deemed to have a negative impact on the conclusions. Sometimes, causation and correlation are mixed up on purpose, or misleading charts are displayed. Sometimes, the author lacks statistical acumen. Usually, these experiments are not reproducible. Even top journals sometimes accept these articles, due to:

Poor peer-review process
Incentives to publish sensational material

By contrast, research that is aimed at finding the truth sometimes uses neither p-values nor classical tests of hypotheses. For instance, my recent article checking whether two types of distributions are identical does not rely on these techniques. Also, the theoretical answer is known, so I would be lying to myself by showing results that merely fit with my gut feelings or intuition. In some of my tests, I clearly state that my sample size is too small to draw a conclusion. And the presentation style is simple so that non-experts can understand it. Finally, I share my data and all the computations. You can read that article here.
I hope it will inspire those interested in doing sound analyses.

Below are some extracts from the article I read this morning:

Some of the retracted papers include studies suggesting people who grocery shop hungry buy more calories; that preordering lunch can help you choose healthier food; and that serving people out of large bowls encourages them to serve themselves larger portions.

The papers were retracted not because the conclusions are necessarily wrong, but because these studies are based on questionable data and misuse of statistical techniques. Here are further extracts from the article that reported the issue:

P-values of .05 aren’t that hard to find if you sort the data differently or perform a huge number of analyses. In flipping coins, you’d think it would be rare to get 10 heads in a row. You might start to suspect the coin is weighted to favor heads and that the result is statistically significant. But what if you just got 10 heads in a row by chance (it can happen) and then suddenly decided you were done flipping coins? If you kept going, you’d stop believing the coin is weighted.

Stopping an experiment when a p-value of .05 is achieved is an example of p-hacking. But there are other ways to do it, like collecting data on a large number of outcomes but only reporting the outcomes that achieve statistical significance. By running many analyses, you’re bound to find something significant just by chance alone.

There’s a movement of scientists who seek to rectify practices in science like the ones that Wansink is accused of. Together, they basically call for three main fixes that are gaining momentum.

Preregistration of study designs: This is a huge safeguard against p-hacking. Preregistration means that scientists publicly commit to an experiment’s design before they start collecting data. This makes it much harder to cherry-pick results.

Open data sharing: Increasingly, scientists are calling on their colleagues to make all the data from their experiments available for anyone to scrutinize (there are exceptions, of course, for particularly sensitive information). This ensures that shoddy research that makes it through peer review can still be double-checked.

Registered replication reports: Scientists are hungry to see if previously reported findings in the academic literature hold up under more intense scrutiny. There are many efforts underway to replicate (exactly or conceptually) research findings with rigor.

There are other potential fixes too: There’s a group of scientists calling for a stricter definition of statistically significant. Others argue that arbitrary cutoffs for significance are always going to be gamed. And increasingly, scientists are turning to other forms of mathematical analysis, such as Bayesian statistics, which asks a slightly different question of data. (While p-values ask, “How rare are these numbers?” a Bayesian approach asks, “What’s the probability my hypothesis is the best explanation for the results we’ve found?”)

Related articles

The Death of the Statistical Tests of Hypotheses

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn, or visit my old web page here.

DSC Resources

Invitation to Join Data Science Central
Free Book: Applied Stochastic Processes
Comprehensive Repository of Data Science and ML Resources
Advanced Machine Learning with Basic Excel
Difference between ML, Data Science, AI, Deep Learning, and Statistics
Selected Business Analytics, Data Science and ML articles
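
The two forms of p-hacking described in the extracts above (stopping as soon as p drops below .05, and testing many outcomes but reporting only the significant ones) can be illustrated with a small simulation. The sketch below is not taken from the quoted article or from any of the retracted studies; it assumes a fair coin, a two-sided test based on the normal approximation, and hypothetical function names and sample sizes chosen purely for illustration.

```python
import math
import random


def two_sided_p_value(heads: int, flips: int) -> float:
    """Normal-approximation p-value for the null hypothesis 'the coin is fair'."""
    z = (heads - 0.5 * flips) / math.sqrt(0.25 * flips)
    return math.erfc(abs(z) / math.sqrt(2))


def fixed_sample_rejects(rng, flips=200, alpha=0.05) -> bool:
    """Honest design: flip a fair coin a pre-decided number of times, test once."""
    heads = sum(rng.random() < 0.5 for _ in range(flips))
    return two_sided_p_value(heads, flips) < alpha


def optional_stopping_rejects(rng, max_flips=200, min_flips=20, alpha=0.05) -> bool:
    """P-hacked design: test after every flip and stop at the first p < alpha."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += 1 if rng.random() < 0.5 else 0
        if n >= min_flips and two_sided_p_value(heads, n) < alpha:
            return True
    return False


def any_outcome_rejects(rng, outcomes=20, flips=200, alpha=0.05) -> bool:
    """P-hacked design: test many unrelated outcomes, report if any is significant."""
    return any(fixed_sample_rejects(rng, flips, alpha) for _ in range(outcomes))


def rate(trial, rng, runs=1000) -> float:
    """Fraction of simulated studies in which the fair coin is declared biased."""
    return sum(trial(rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    rng = random.Random(2018)  # fixed seed so the run is repeatable
    print("fixed sample size :", rate(fixed_sample_rejects, rng))
    print("optional stopping :", rate(optional_stopping_rejects, rng))
    print("best of 20 outcomes:", rate(any_outcome_rejects, rng, runs=300))
```

Since the coin is fair in every simulated study, each rejection is a false positive. The fixed-sample design should stay close to the nominal 5% level, while the two p-hacked designs should reject far more often. For the multiple-outcome case this matches a simple calculation: with 20 independent tests at the .05 level, the chance of at least one false positive is 1 − 0.95^20 ≈ 0.64.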

]]>

]]>