<p><strong>How to deal with missing data</strong><br/>
<em>Posted by mark kertzner, 2019-11-22</em></p>
<p><em>Originally posted by<span> </span><a href="https://www.datasciencecentral.com/profile/VincentOAjayi" target="_blank" rel="noopener">Vincent Ajayi</a>. </em></p>
<p>The most common challenge faced by data scientists (DS) and data analysts (DA) is missing data; both spend several hours every day dealing with it. Why is missing data a problem? Analysts presume that every variable should hold a value for every observation, and when a value is absent, we call it missing data. Missing data can have severe effects on a statistical model, and ignoring it may lead to biased estimates that invalidate statistical results.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3730889478?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3730889478?profile=RESIZE_710x" class="align-center"/></a></p>
<p>In this article, I suggest ways to resolve the problem of missing data. Although different studies have proposed various methods for dealing with it, I have noticed that none of these methods comes with theoretical or mathematical support to justify its process. Here I analyse nine essential steps a data scientist must follow to address missing data, based on my experience of more than seven years as a quantitative researcher and data scientist.</p>
<p><strong>Basic steps for dealing with missing data</strong></p>
<ol>
<li><strong>Aims and objectives</strong>: Before jumping to any method of estimating missing data, we must know the motivation behind the project to identify the research problem. The aim of the project must be outlined to specify key variables that are likely to be relevant for the project. You must be able to list the relevant data that can help answer the questions that define the objectives of the project.</li>
<li><strong>Check for the appropriate variables:</strong><span> </span>If you have been provided with the dataset, ask yourself: does it contain all the relevant variables needed to address the research questions? For example, a data scientist may be interested in predicting inflation with a multivariate model, yet the data received might not contain likely inflation indicators such as the consumer price index or the GDP deflator. To address the issue, contact your line manager or the data department and ask for a dataset that contains the relevant variables.</li>
<li><strong>Visualise the data and check for missing values:</strong><span> </span>If values are missing, check against the source database; the best way to verify a missing value is to look for it at the source, since the extraction process itself may be at fault.</li>
<li><strong>Variable substitution:</strong><span> </span>A straightforward way to deal with missing data is to substitute the variable with a similar indicator, especially when a large percentage of the data is missing. I strongly suggest this for continuous variables: for example, the GDP deflator could be used instead of the consumer price index to measure or forecast inflation. However, be careful when applying this method, because different proxies may lead to different results.</li>
<li><strong>Mean/mode/median substitution:</strong><span> </span>This method can be applied when the percentage of missing values is small (e.g., less than 30%). For continuous variables, a missing value can be replaced by the variable's mean or median; for categorical variables, by its mode. The limitation of this method is that it reduces the variability of your data.</li>
<li><strong>Delete the missing attribute</strong>: If a large percentage of the data is missing (e.g., more than 30%), the affected rows or columns can be dropped, provided the variable is an independent variable that is neither strongly related to the dependent variable nor essential to the model. For example, if you want to use multiple regression to predict revenue and a product-number variable has missing values, that variable could be removed instead of imputed. Note that you may lose samples and important information, and underfit the model.</li>
<li><strong>Evaluation</strong><span> </span><strong>and prediction</strong>: You can use statistical or theoretical models to estimate the missing value; for instance, a model fitted on the complete cases can predict the missing value from the available data.</li>
<li><strong>Apply sophisticated statistical models that are robust in</strong><span> </span><strong>handling missing data without requiring imputation</strong>: For example, if you have missing data, an XGBoost model can be applied for prediction instead of linear regression. XGBoost handles missing values by default: at each split during training it learns the direction that minimises the training loss for observations whose value is missing.</li>
<li><strong>Sample reduction:</strong><span> </span>This step applies to time-series data. If you have missing data, the sample can be reduced so that the estimation is based on a shorter sample that contains no missing values. Note that sample reduction can significantly affect the precision and accuracy of the results.</li>
</ol>
<hr/>
<p><strong>Simulating Distributions with One-Line Formulas, even in Excel</strong><br/>
<em>Posted by mark kertzner, 2019-11-10</em></p>
<p>If you don't like using black-box R functions, or you don't have access to these functions, here are simple options to simulate deviates from various distributions. They can even be implemented in Excel! You first need to simulate uniform deviates on [0, 1]. If you don't trust the function available in your programming language, here is a good alternative:</p>
<pre><code>rnd = 1000
for (n = 0; n &lt; 20000; n++) {
  rnd = (10232193 * rnd + 3701101) % 54198451371
  Rand = rnd / 54198451371
}
</code></pre>
<p>This code produces 20,000 deviates of a uniform distribution on [0, 1]. The deviates are stored in the variable named Rand. The symbol % stands for the modulo operator.</p>
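Using the multiplier, increment, and modulus from the snippet above, a direct Python translation might look like this (the snippet overwrites Rand on every pass, so here the deviates are collected in a list instead):

```python
def lcg_uniform(n, seed=1000, a=10232193, c=3701101, m=54198451371):
    """Linear congruential generator producing n uniform deviates on [0, 1)."""
    rnd = seed
    deviates = []
    for _ in range(n):
        rnd = (a * rnd + c) % m   # integer recurrence
        deviates.append(rnd / m)  # rescale to [0, 1)
    return deviates

sample = lcg_uniform(20000)
```

Being deterministic, the sequence is reproducible from the seed, which is handy when you want to audit your simulation.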
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3706746208?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3706746208?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Now, assuming Rand, Rand1 and Rand2 are uniform deviates on [0, 1], here is how to sample deviates from various other distributions:</p>
<p><strong>Normal(0, 1) and log-normal deviates</strong>:</p>
<ul>
<li><span style="text-decoration: underline;">Normal</span>: x = sqrt(-2* log(Rand1)) * cos(2* Pi *Rand2) </li>
<li><span style="text-decoration: underline;">Log-normal</span>: y = exp(x)</li>
</ul>
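The Box-Muller formula above can be exercised in Python as a sanity check (guarding against log(0), which the one-line formula glosses over):

```python
import math
import random

def normal_deviate(rand1, rand2):
    """Box-Muller transform: two uniform deviates on (0, 1] -> one Normal(0, 1)."""
    return math.sqrt(-2.0 * math.log(rand1)) * math.cos(2.0 * math.pi * rand2)

random.seed(42)
# 1.0 - random.random() lies in (0, 1], so log() never sees zero
normals = [normal_deviate(1.0 - random.random(), random.random())
           for _ in range(20000)]
lognormals = [math.exp(x) for x in normals]  # y = exp(x)
```

With 20,000 draws, the sample mean should sit near 0 and the sample variance near 1, and every log-normal deviate is strictly positive.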
<p><strong>Exponential deviates of parameter Lambda:</strong></p>
<ul>
<li>x = - log(1 - Rand) / Lambda</li>
</ul>
<p><strong>Geometric deviates of parameter P:</strong></p>
<ul>
<li>if (Rand < P) { x = 0 } else { x = int(log(1 - Rand) / log(1 - P)) }</li>
</ul>
<p><strong>Power law deviates with exponent B, on [0, A]:</strong></p>
<ul>
<li>x = A * Rand^(1 / B)</li>
</ul>
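The three inverse-transform formulas above translate directly into code; a Python sketch:

```python
import math
import random

def exponential_deviate(u, lam):
    """Exponential(lambda) deviate via the inverse CDF."""
    return -math.log(1.0 - u) / lam

def geometric_deviate(u, p):
    """Geometric(p) deviate: number of failures before the first success."""
    return 0 if u < p else int(math.log(1.0 - u) / math.log(1.0 - p))

def power_law_deviate(u, a, b):
    """Power-law deviate with exponent B on [0, A] via the inverse CDF."""
    return a * u ** (1.0 / b)

random.seed(7)
exp_draws = [exponential_deviate(random.random(), 2.0) for _ in range(20000)]
geo_draws = [geometric_deviate(random.random(), 0.5) for _ in range(20000)]
pow_draws = [power_law_deviate(random.random(), 1.0, 2.0) for _ in range(20000)]
```

The sample means land near the theoretical values: 1/Lambda for the exponential, (1 - P)/P for the geometric, and A*B/(B + 1) for the power law.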
<p>Do you know any simple formula to generate other types of deviates?</p>
<hr/>
<p><strong>Hybrid method of Data Envelopment Analysis with Supervised Learning</strong><br/>
<em>Posted by mark kertzner, 2019-11-10</em></p>
<p>Dear members of data science central,</p>
<p>I would welcome any suggestions related to my paper about convenience-store performance measurement.</p>
<p><strong>Background Problems</strong>: Convenience stores have recently become a popular place for Indonesians to shop for daily necessities. This trend has boosted the number of convenience stores and pushed management to improve performance in the face of tight business competition, while a store's performance is ultimately determined by the efficiency of its various product categories. In relation to this, the benchmarking concept behind Data Envelopment Analysis (DEA) is a well-known method for measuring a company's efficiency and hence its performance. However, DEA has limitations in handling large amounts of data, and supervised learning techniques can be used as an alternative to overcome them.</p>
<p><strong>Main Objectives</strong>: This study provides an integrated model that applies the benchmarking concept and a supervised learning technique to measure convenience-store performance by considering the efficiency of various product categories.</p>
<p><strong>Novelty</strong>: This is the first study to utilize an SVM algorithm based on DEA for measuring the performance of a local convenience store.</p>
<p><strong>Research Methods</strong>: The proposed approach has several steps. First, calculate the efficiency score of each product category using the DEA method. Second, use the efficiency score as the class label for the data set, train the SVM model with 5-fold cross-validation, and then predict the efficiency scores of the test set. Finally, evaluate the number of efficient and inefficient product categories to determine the store's performance.</p>
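Since the paper's data isn't available, here is a toy sketch of the second step under stated assumptions: two hypothetical, centred category features stand in for the real inputs, a simulated DEA-style score determines the efficient/inefficient label, and a minimal linear SVM trained by Pegasos-style sub-gradient descent stands in for a full SVM library.

```python
import random

def train_linear_svm(X, y, lam=0.001, epochs=100):
    """Minimal linear SVM via Pegasos-style sub-gradient descent (labels +1/-1)."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularisation shrink
            if margin < 1.0:                          # hinge-loss sub-gradient
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0.0 else -1

# Hypothetical centred features per product category; the "DEA score" here is
# simulated, and categories too close to the efficiency frontier are skipped
# so the two classes stay separable.
random.seed(0)
X, y = [], []
while len(X) < 200:
    f = [random.random() - 0.5, random.random() - 0.5]
    score = f[0] + f[1]  # stand-in for a DEA efficiency score
    if abs(score) < 0.1:
        continue
    X.append(f)
    y.append(1 if score > 0 else -1)  # +1 = efficient, -1 = inefficient

w = train_linear_svm(X, y)
accuracy = sum(predict(w, xi) == yi for xi, yi in zip(X, y)) / len(X)
```

In practice one would use a library SVM instead of a hand-rolled optimiser, e.g. scikit-learn's `SVC` with `cross_val_score(..., cv=5)` to reproduce the paper's 5-fold scheme.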
<p><strong>Conclusion</strong>: The proposed method has been successfully established and shown to be valid in predicting the efficiency of product categories to measure convenience-store performance. Furthermore, the present research indicates that the local convenience store has 39.4% inefficient product categories, while the other 60.6% are efficient.</p>
<hr/>
<p><strong>Artificial Intelligence Taxonomy</strong><br/>
<em>Posted by mark kertzner, 2019-10-30</em></p>
<p>Hello DSC members,</p>
<p>I am trying to get a better understanding of AI taxonomy. I did a Google search and found an article by Bernard Golstein, but I would like to understand other possible approaches to developing an AI taxonomy that you may prefer and be willing to share.</p>
<hr/>
<p><strong>Data science degree</strong><br/>
<em>Posted by mark kertzner, 2019-10-16</em></p>
<p>Dear forum members,</p>
<p>I started working as a customer data insight analyst after working as a consultant in a different domain for 14 years.</p>
<p>I got this job because I know general SQL and Python and am formally educated in mathematics and computer applications.</p>
<p>My job involves customer churn analysis, and my company mostly uses Excel/Tableau. I am exploring a few Python libraries like pandas, but I am not able to apply data science concepts such as predictive analysis due to pressure to produce outputs, so I end up working in Excel.</p>
<p>There is no data scientist in my company and people are inclined to use Excel, but I aspire to become a data scientist despite not being formally educated in data science.</p>
<p>Can anyone tell me whether taking a data science degree would speed up my ability to apply data science techniques in my company?</p>
<p>Regards,</p>
<p>Lucky </p>
<hr/>
<p><strong>Diminishing returns in econometrics</strong><br/>
<em>Posted by mark kertzner, 2019-10-15</em></p>
<p>I was wondering whether anyone here has much experience building econometric models, specifically in modelling diminishing returns, as there are many different ways to go about this. For simplicity, I have previously used an exponential decay, e^(-a*x), where a is the rate of diminishing returns and x is the level of media spend. But there are many other ways to model this (e.g., linear-log models, Multiplicative Competitive Interaction), and I'd be interested to hear which of these have worked well for people.</p>
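To make the decay concrete: if the marginal return at spend x is proportional to e^(-a*x), the cumulative response is its integral, (1 - e^(-a*x)) / a. A small sketch (the function and parameter names are mine, not from any standard package):

```python
import math

def cumulative_response(spend, a, scale=1.0):
    """Cumulative response when the marginal return at spend x decays as
    scale * exp(-a * x); integrating gives scale * (1 - exp(-a * x)) / a."""
    return scale * (1.0 - math.exp(-a * spend)) / a

# Each extra unit of spend buys less than the previous one:
steps = [cumulative_response(x, a=0.5) for x in (0.0, 1.0, 2.0, 3.0)]
gains = [hi - lo for lo, hi in zip(steps, steps[1:])]
```

The successive gains shrink geometrically, and the response saturates at scale / a as spend grows, which is the defining behaviour of this diminishing-returns curve.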
<hr/>
<p><strong>Recommendation on a data visualization book</strong><br/>
<em>Posted by mark kertzner, 2019-09-30</em></p>
<p>I am looking for the best data visualization book to own. Any recommendations? Thanks in advance.</p>
<hr/>
<p><strong>Optimization algo</strong><br/>
<em>Posted by mark kertzner, 2019-09-26</em></p>
<p>Hi all.<br/>Reading an article about ELO ratings, I have a question. The probability that team A wins is a sigmoid function, 1 / (1 + exp(RankB - RankA)), and after the game we update the ratings as Rank_new = Rank_old ± K * (outcome - probability), where the outcome is 1 for a win and 0 for a loss.</p>
<p>So the main question is how I can use, for example, a neural network (or another algorithm) to find the K parameter that minimises binary cross-entropy. I suspect K should not be constant; I want it to depend on the player's initial rating.</p>
<p>My main difficulty is that after updating the parameters, we need to use the new input ratings to calculate the probability, so the inputs themselves must be updated every epoch.</p>
<hr/>
<p><strong>Insight in data</strong><br/>
<em>Posted by mark kertzner, 2019-09-18</em></p>
<p>I have a situation with a client. They have four sources of data and want to create a single metric out of these four values to gain a generalised insight into how the company is doing overall.</p>
<p>The problem is that each source has a completely different scale, so they are not really comparable: Source A's scale is in the millions, whereas Source B's is in the hundreds.</p>
<p>Further to this, we wanted to weight each source, as some provide more value than others.</p>
<p>We decided to scale all four between 0 and 1 using this formula:</p>
<p>z<sub>i</sub> = (x<sub>i</sub> − min(x)) / (max(x) − min(x))</p>
<p>While it works, I am confused as to what insight I can get out of the numbers.</p>
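A quick sketch of that scaling plus the weighting step (the sample values and weights here are illustrative, not the client's data):

```python
def min_max_scale(values):
    """Rescale a list linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def composite_score(sources, weights):
    """Weighted sum of independently min-max-scaled sources."""
    scaled = [min_max_scale(s) for s in sources]
    return [sum(w * col[i] for w, col in zip(weights, scaled))
            for i in range(len(sources[0]))]

source_a = [1_200_000, 3_400_000, 2_100_000]  # scale in the millions
source_b = [150, 900, 480]                    # scale in the hundreds
scores = composite_score([source_a, source_b], [0.6, 0.4])
```

One caution: min-max scores are positions within the observed range, not magnitudes on a ratio scale. A score of 0 only means "the worst period observed", so dividing one composite score by another does not tell you that one month was some multiple worse than another.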
<p></p>
<p>Here is the Google Sheet I am working with:</p>
<p><a href="https://docs.google.com/spreadsheets/d/1Eua7tmqD3B0l3M04QnXDcU5HCAmFIfP65lsA7l52604/edit?usp=sharing">https://docs.google.com/spreadsheets/d/1Eua7tmqD3B0l3M04QnXDcU5HCAmFIfP65lsA7l52604/edit?usp=sharing</a></p>
<p></p>
<p>If you look at cells H14 and H15, can you say that March was three times worse than February because the March score was 1.1 and the February score was 3.2?</p>
<p></p>
<p>Thanks in advance</p>
<hr/>
<p><strong>Cleaning responses to meet quotas after sampling</strong><br/>
<em>Posted by mark kertzner, 2019-09-15</em></p>
<p>I know that survey sampling is usually done so that once a quota is reached, the survey is closed to respondents who meet the criteria for that quota.</p>
<p>However, at the company I work at, the survey stays open to everyone until every demographic quota is met, and only after that do we start deleting responses until the quotas are exactly met. For example, if we need 500 cases (250 females and 250 males) and we close the survey with 532 responses comprising 273 females and 259 males, we delete 23 female and 9 male responses. It sounds easy, but most studies have 3-4 demographic quotas (e.g., gender, age group, region, settlement type), and it is really difficult and time-consuming to figure out which cases to delete to meet them all.</p>
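For the single-quota example described above (532 responses, 250 needed per gender), the deletion can be sketched in plain Python; the `gender` field name and the dict representation of a response are assumptions:

```python
import random

def trim_to_quota(responses, field, quotas, seed=0):
    """Randomly keep at most quotas[value] responses per value of `field`,
    deleting the surplus (e.g. drop 23 of 273 females, 9 of 259 males)."""
    rng = random.Random(seed)
    groups = {}
    for resp in responses:
        groups.setdefault(resp[field], []).append(resp)
    kept = []
    for value, members in groups.items():
        target = quotas.get(value, len(members))
        kept.extend(rng.sample(members, min(target, len(members))))
    return kept

responses = [{"gender": "female"}] * 273 + [{"gender": "male"}] * 259
cleaned = trim_to_quota(responses, "gender", {"female": 250, "male": 250})
```

With several interlocked quotas (gender × age group × region), choosing which cases to drop becomes an integer feasibility problem: a greedy per-quota trim like this one can over-delete on one dimension while fixing another, so techniques such as iterative proportional fitting (raking) or an integer program are better suited there.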
<p>Is there any way, or any software, that would automatically calculate which cases should be deleted?</p>