Data Science Central2020-02-17T20:16:05ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFrumanhttps://storage.ning.com/topology/rest/1.0/file/get/2646154811?profile=RESIZE_48X48&width=48&height=48&crop=1%3A1https://www.datasciencecentral.com/forum/topic/listForContributor?user=1rei895ofblem&feed=yes&xn_auth=noResume parsertag:www.datasciencecentral.com,2020-02-10:6448529:Topic:9298822020-02-10T05:23:31.347ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p>I am working on a resume parser project. Currently, I am using rule-based regex to extract features like University, Experience, Large Companies, etc.</p>
<p>So basically I have a set of universities' names in a CSV, and if the resume contains one of them then I am extracting that as University Name. In the same way I have a list of Large Companies in CSV and if the resume contains any of them then I flag it as Yes.</p>
<p>These rules can never be foolproof, because different countries use different resume formats. Is there another way of doing this that would improve the accuracy and make it a global solution?</p> Different results between Python and Rtag:www.datasciencecentral.com,2020-02-07:6448529:Topic:9288862020-02-07T04:46:31.045ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p>I'm new to the Data Science Central community and I need some help. I am starting a discussion in the hope that the community can help me understand where I am going wrong in Python. I come from the Business Intelligence field and am trying to advance in Machine Learning and Big Data.</p>
<p>I started learning R maybe 5 years ago. No spectacular projects, just a few courses on EDX and Coursera with their applications. This year I wanted to learn Spark to start a project with a lot of uncertainty about the volume of data and features, applications and so on. I realised I needed a new language with better Spark compatibility, and I chose Python. Then I started to do the same things I did in R but with Python - for example the Analytics Edge course on EDX, which offers a good learning opportunity in my opinion.</p>
<p>My first surprise was to see how many different libraries in Python do the same things. And then there is sklearn, which doesn't offer even minimal details for, say, a linear regression summary, such as the statistical significance of the coefficients or at least the adjusted R2 of the model. It gives only R2, the intercept and the coefficients. To get these details I have to use statsmodels.</p>
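<p>As a side note, the adjusted R2 mentioned above can be derived by hand from the plain R2. The following is only a minimal sketch using the standard formula; the function name <code>adjusted_r2</code> and its arguments are illustrative assumptions and not part of sklearn or statsmodels.</p>

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    r2         : ordinary R^2 of the fitted model
    n_samples  : number of observations (n)
    n_features : number of predictors, excluding the intercept (p)
    """
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Hypothetical numbers: R^2 = 0.80 from 50 rows and 5 predictors.
print(adjusted_r2(0.80, n_samples=50, n_features=5))
```

<p>The adjustment only penalises the number of predictors, so it never exceeds the plain R2.</p>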
<p>Then I moved on to classification and regression trees, random forests, and clustering. Python produced completely different trees and clusters. I took the default parameters from R and used them in Python. The outcome was the same: completely different trees, with different splitting nodes on different variables and different splitting values. The results also look wrong - for example, the movie Men in Black was clustered as a Comedy rather than an Action + Adventure + SciFi movie.</p>
<p>I remember when I learned R it was so smooth and the results were very accurate.</p>
<p>I wonder what to do. I just needed Python to be able to approach Spark. Now I'm so confused: if we take two people, one using R and the other using Python, they can come to completely different results and recommend completely different actions to their management.</p>
<p>With so much randomness already in the field, the differences between platforms add yet another layer of uncertainty.</p>
<p></p>
<p>Thank you very much for your advice.</p> Selecting forecasting methods using AvgRelMAE in Exceltag:www.datasciencecentral.com,2020-02-02:6448529:Topic:9277342020-02-02T22:08:53.325ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p>Hello,</p>
<p></p>
<p>I am working on a proof-of-concept in Excel to determine the "best" (i.e. least loss) forecast time series model.</p>
<p></p>
<p>In Excel there will be various forecast models, and based on the AvgRelMAE (the average relative MAE, computed with the geometric mean; see Davydenko, A., & Fildes, R. (2016). Forecast Error Measures: Critical Review and Practical Recommendations. In Business Forecasting: Practical Problems and Solutions. John Wiley & Sons Inc.), it would select the model with the least loss.</p>
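<p>To make the calculation concrete, here is a minimal Python sketch of the measure as I understand it: for each series, the MAE of the candidate model is divided by the MAE of a benchmark model, and the ratios are combined with a geometric mean weighted by the number of forecasts per series. The function names, the weighting by series length and the sample numbers are assumptions for illustration; please check them against the cited reference before relying on them.</p>

```python
import math


def mae(actual, forecast):
    """Mean absolute error of one series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def avg_rel_mae(actuals, model_forecasts, benchmark_forecasts):
    """Weighted geometric mean of per-series relative MAEs.

    Each argument is a list of series (lists of numbers). A result below 1
    means the model beats the benchmark on average. A series with a zero MAE
    for either method would break the logarithm and must be handled separately.
    """
    weighted_log_sum = 0.0
    total_weight = 0
    for actual, model, bench in zip(actuals, model_forecasts, benchmark_forecasts):
        n = len(actual)                       # weight: forecasts in this series
        ratio = mae(actual, model) / mae(actual, bench)
        weighted_log_sum += n * math.log(ratio)
        total_weight += n
    return math.exp(weighted_log_sum / total_weight)


# Hypothetical data: two series, one candidate model, one benchmark.
actuals = [[10, 12, 14], [5, 5]]
model = [[11, 12, 13], [6, 6]]
benchmark = [[12, 14, 16], [5.5, 5.5]]
print(avg_rel_mae(actuals, model, benchmark))
```

<p>In Excel, the same idea could be laid out with one row per series holding the two MAEs and their ratio. For the unweighted case, the built-in GEOMEAN function applied to the ratio column gives the geometric mean directly.</p>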
<p></p>
<p>Does anyone have experience in setting up this calculation in Excel?</p>
<p></p>
<p>Thanks,</p>
<p>Steven</p> Project Hunting for Masters Capstonetag:www.datasciencecentral.com,2020-01-18:6448529:Topic:9240122020-01-18T01:14:58.668ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p>Hello Everyone,</p>
<p>I am looking for a project for my master's capstone course. I would like a project on the commercial side, using the kind of real-time datasets that companies work with. I can work on deep learning concepts too. Any suggestions for the project are welcome.</p> Suitability of Augmented Analytics...tag:www.datasciencecentral.com,2020-01-07:6448529:Topic:9208382020-01-07T10:28:52.753ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p>Dear All, </p>
<p>It would be really great if someone could answer the following question in some detail:</p>
<p>Well, according to Gartner, "Augmented Analytics" is the future of data and analytics.</p>
<p>Given that an "augmented analytics system" can analyze both structured and unstructured data, my question is: where can it be applied, and where can it not?</p>
<p>I think a clear, broad picture of the applications of "augmented analytics" would be helpful for us.</p>
<p>Best regards,</p>
<p>Salman</p> What is BERT update (NLP)?tag:www.datasciencecentral.com,2019-12-31:6448529:Topic:9189332019-12-31T06:56:53.506ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p><span style="font-weight: 400;">Natural language processing (NLP) is a branch of artificial intelligence designed to help machines understand human language. But how will BERT help machines, and how does it work?</span></p> Creating Polynomial Features in ML using sklearntag:www.datasciencecentral.com,2019-12-26:6448529:Topic:9176682019-12-26T14:43:34.797ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p>I have 10 features and all of them are numeric.</p>
<p>Can polynomial features only be used on continuous variables, and not on discrete variables?<br/> Out of the 10 features, which should I pick for creating polynomial features? What are the selection criteria?</p>
<p>Should I take features that are independent of each other for creating polynomial features, or features that are highly correlated with the dependent variable Y?</p>
<p></p> NLP: POS Tagger for French languagetag:www.datasciencecentral.com,2019-11-25:6448529:Topic:9104872019-11-25T00:09:45.556ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p>Hi,</p>
<p></p>
<p>I'm new to NLP and I am looking for a POS tagger for French.</p>
<p>I already use spaCy, but the results are not optimal for the French language.</p>
<p>Can you help me?</p>
<p>Thank you</p> How to deal with missing datatag:www.datasciencecentral.com,2019-11-22:6448529:Topic:9100832019-11-22T17:49:51.795ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p><em>Originally posted by<span> </span><a href="https://www.datasciencecentral.com/profile/VincentOAjayi" target="_blank" rel="noopener">Vincent Ajayi</a>. </em></p>
<p>The most common challenge faced by data scientists (DS) and data analysts (DA) is missing data. Every day, both DA and DS spend several hours dealing with missing data. The question is why is missing data a problem? Analysts presume that all variables should have a particular value at a particular state, and when there is no value for the variable, we refer to it as missing data. Missing data can have severe effects on a statistical model and ignoring it may lead to a biased estimate that may invalidate statistical results.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3730889478?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3730889478?profile=RESIZE_710x" class="align-center"/></a></p>
<p>In this article, I will suggest ways to resolve the problem of missing data. Although different studies have suggested various methods to deal with missing data, I have noticed that none of these methods have theoretical or mathematical support to justify their processes. In this article, I will analyse the nine essential steps a data scientist must follow to address the issue of missing data. The steps are based on my personal experience as a quantitative researcher and data scientist for more than 7 years.</p>
<p><strong>Basic steps for dealing with missing data</strong></p>
<ol>
<li><strong>Aims and objectives</strong>: Before jumping to any method of estimating missing data, we must know the motivation behind the project to identify the research problem. The aim of the project must be outlined to specify key variables that are likely to be relevant for the project. You must be able to list the relevant data that can help answer the questions that define the objectives of the project.</li>
<li><strong>Check for the appropriate variable:</strong><span> </span>If you have been provided with the dataset, ask yourself a question: does the dataset contain all the relevant variables needed to address the research questions? For example, a data scientist may be interested in predicting inflation with the help of the multivariate model, and the data received might not contain likely inflation indicators such as consumer price or GDP deflator. To address the issue, you should contact your line manager or the data department to provide you with the appropriate dataset that contains the relevant variable.</li>
<li><strong>Visualise the data and check for the missing value:</strong><span> </span>If there is a missing value, check with the database; remember the best approach for finding a missing value is to look for the value at the source. It may be possible that there are problems with the extraction process.</li>
<li><strong>Variable substitution:</strong><span> </span>A straightforward way to deal with missing data is to substitute the variable with a similar indicator, especially if a large percentage of the data is missing. I strongly suggest using another indicator to replace the missing value, especially for continuous variables. For example, the GDP deflator could be used instead of the consumer price index to measure or forecast inflation. However, one needs to be careful in applying this method because different proxies for different variables may lead to different outcomes or results.</li>
<li><strong>Mean/ Mode/ Median substitution:</strong><span> </span>This method can be applied if the percentage of missing values is small (e.g., less than 30%). For continuous variables, the missing value can be replaced by the median or mean value. For categorical variables, the missing value can be replaced by the modal value. The limitation of this method is that it reduces the variability of your data.</li>
<li><strong>Delete the missing attribute</strong>: If a large percentage of the data is missing (e.g., more than 30%), the affected rows or columns can be dropped, provided the variable is an independent variable that is unrelated to the dependent variable and not relevant to the model. For example, if you want to use multiple regression to predict revenue and a product-number variable has missing values, the variable could be removed instead of filling in the missing values. Note that you may lose samples and important information, and may underfit the model.</li>
<li><strong>Evaluation</strong><span> </span><strong>and prediction</strong>: You can use different statistical models or theoretical models to estimate or predict the missing value. For instance, statistical models can estimate or predict the missing value from the available dataset.</li>
<li><strong>Apply sophisticated statistical models that are robust in</strong><span> </span><strong>handling missing data without requiring imputation</strong>: For example, if you have missing data, the XGBoost model can be applied for prediction instead of linear regression. The XGBoost model handles missing values by default: during training, it learns at each split which branch missing values should follow so as to minimise the training loss, so no imputed value has to be supplied.</li>
<li><strong>Sample reduction:</strong><span> </span>This step applies to time-series data. If you have missing data, the sample can be reduced so that, instead of looking for the missing values, the estimation is based on a reduced sample that does not have missing values. Note that sample reduction can significantly affect the precision and accuracy of the results<strong>.</strong></li>
</ol> Simulating Distributions with One-Line Formulas, even in Exceltag:www.datasciencecentral.com,2019-11-10:6448529:Topic:9069802019-11-10T18:24:47.066ZAlexander Frumanhttps://www.datasciencecentral.com/profile/AlexanderFruman
<p>If you don't like using black-box R functions, or you don't have access to these functions, here are simple options to simulate deviates from various distributions. They can even be implemented in Excel! You first need to simulate uniform deviates on [0, 1]. If you don't trust the function available in your programming language, here is a good alternative:</p>
<p><br/> rnd = 1000<br/> for (n=0; n<20000; n++) {<br/> rnd=(10232193 * rnd + 3701101) % 54198451371<br/> Rand= rnd / 54198451371<br/> }</p>
<p>This code produces 20,000 deviates of a uniform distribution on [0, 1]. On each iteration, the current deviate is stored in the variable named Rand. The symbol % stands for the modulo operator.</p>
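<p>For readers who want to run the generator directly, here is a minimal Python sketch of the same recurrence with the same constants and seed. The function name <code>lcg_uniform</code> is an assumption, and the list is my addition so that all deviates are kept rather than only the latest one.</p>

```python
def lcg_uniform(count, seed=1000):
    """Return `count` uniform deviates from the recurrence quoted above."""
    multiplier, increment, modulus = 10232193, 3701101, 54198451371
    rnd = seed
    deviates = []
    for _ in range(count):
        rnd = (multiplier * rnd + increment) % modulus
        deviates.append(rnd / modulus)   # value in [0, 1)
    return deviates


sample = lcg_uniform(20000)
print(len(sample), min(sample), max(sample))
```

<p>Python integers have arbitrary precision, so the product inside the recurrence is exact. In a tool that stores numbers as double-precision floats, such as Excel, the product can exceed 2^53 and lose precision, so the sequence may differ there.</p>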
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3706746208?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3706746208?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Now, assuming Rand, Rand1 and Rand2 are uniform deviates on [0, 1], here is how to sample deviates from various other distributions:</p>
<p><strong>Normal(0, 1) and log-normal deviates</strong>:</p>
<ul>
<li><span style="text-decoration: underline;">Normal</span>: x = sqrt(-2* log(Rand1)) * cos(2* Pi *Rand2) </li>
<li><span style="text-decoration: underline;">Log-normal</span>: y = exp(x)</li>
</ul>
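<p>The two formulas above (the Box-Muller transform followed by an exponential) could be sketched in Python as follows. The function names are assumptions, and I use 1 - Rand1 instead of Rand1 inside the logarithm, a small variation that avoids log(0) when the uniform generator returns exactly 0.</p>

```python
import math
import random


def normal_deviate(rand1, rand2):
    """Normal(0, 1) deviate from two uniform deviates in [0, 1)."""
    # 1 - rand1 lies in (0, 1], so the logarithm is always defined.
    radius = math.sqrt(-2.0 * math.log(1.0 - rand1))
    return radius * math.cos(2.0 * math.pi * rand2)


def lognormal_deviate(rand1, rand2):
    """Log-normal deviate: the exponential of a Normal(0, 1) deviate."""
    return math.exp(normal_deviate(rand1, rand2))


rng = random.Random(42)   # stand-in for any uniform generator on [0, 1)
xs = [normal_deviate(rng.random(), rng.random()) for _ in range(20000)]
mean = sum(xs) / len(xs)
variance = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, variance)     # should be close to 0 and 1
```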
<p><strong>Exponential deviates of parameter Lambda:</strong></p>
<ul>
<li>x = - log(1 - Rand) / Lambda</li>
</ul>
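<p>This is the inverse-transform method: the exponential CDF 1 - exp(-Lambda * x) is set equal to Rand and solved for x. A minimal Python sketch follows; the function name and the use of Python's <code>random</code> module as the uniform source are assumptions.</p>

```python
import math
import random


def exponential_deviate(rand, lam):
    """Exponential deviate with rate `lam` from a uniform deviate in [0, 1)."""
    return -math.log(1.0 - rand) / lam


rng = random.Random(0)
lam = 2.0
xs = [exponential_deviate(rng.random(), lam) for _ in range(20000)]
print(sum(xs) / len(xs))   # sample mean, should be close to 1 / lam
```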
<p><strong>Geometric deviates of parameter P:</strong></p>
<ul>
<li>if (Rand < P) { x = 0 } else { x = int(log(1 - Rand) / log(1 - P)) }</li>
</ul>
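<p>A minimal Python sketch of the rule above, assuming 0 &lt; P &lt; 1 and a uniform deviate in [0, 1). The function name is an assumption. The result counts the failures before the first success, so its mean is (1 - P) / P.</p>

```python
import math
import random


def geometric_deviate(rand, p):
    """Geometric deviate on {0, 1, 2, ...} with success probability `p`."""
    if rand < p:
        return 0
    # int() truncates toward zero, which is the floor for this non-negative ratio.
    return int(math.log(1.0 - rand) / math.log(1.0 - p))


rng = random.Random(1)
p = 0.25
xs = [geometric_deviate(rng.random(), p) for _ in range(20000)]
print(sum(xs) / len(xs))   # sample mean, should be close to (1 - p) / p
```

<p>The explicit test for Rand &lt; P is not strictly required, because the logarithm ratio is already below 1 in that case and truncates to 0. It does make the zero case easy to read.</p>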
<p><strong>Power law deviates with exponent B, on [0, A]:</strong></p>
<ul>
<li>x = A * Rand^(1 / B)</li>
</ul>
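<p>Here the CDF on [0, A] is (x / A)^B, so inverting it gives the formula above. A minimal Python sketch, with the function name assumed and B taken to be positive:</p>

```python
import random


def power_law_deviate(rand, a, b):
    """Deviate on [0, a] with CDF (x / a) ** b, from a uniform deviate."""
    return a * rand ** (1.0 / b)


rng = random.Random(7)
a, b = 10.0, 2.0
xs = [power_law_deviate(rng.random(), a, b) for _ in range(20000)]
print(sum(xs) / len(xs))   # sample mean, should be close to a * b / (b + 1)
```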
<p>Do you know any simple formula to generate other types of deviates?</p>
<p></p>