
Important Note: Our Data Science Cheat Sheet is now available. Please read it (and follow the instructions as needed) if you are not familiar with UNIX, R and scripting languages. It covers the minimum you need to know to get started, if you are starting from scratch. Most candidates in our DSA are already familiar with the concepts explained in our cheat sheet.

Below is the updated list of available projects, for participants in our data science apprenticeship (DSA) program. It includes eight business / applied data science projects and six data science research projects. In addition to these projects, we strongly encourage you to participate in our data science challenges.

Project #8: Detecting fake reviews on Amazon

Business and Applied Data Science

  1. Clustering 2,000+ data science websites, based on websites/blogs suggested by DSC members: scrape these websites (eliminating dead ones) and match each of them against a pre-selected list of 100 top data science keywords (machine learning, Hadoop, data mining, text mining, business analytics, predictive modeling, big data, etc.). You need to build that keyword list using Google Analytics or other tools, such as related-keyword suggestions. Then you must count keyword frequency for each website, for each keyword in the list. And finally, perform website clustering based on these counts. In addition, we would use member sign-up date as a proxy to measure when a particular website was first mentioned. The project requires data cleaning, and the production of scores that measure website popularity and trends, based on the number of citations (by DSC members) broken down per year. It is based on data gathered on our members, after I have eliminated fields containing personal information. Break the websites down into two categories for this analysis: those with a domain name containing words such as data, analytic, or stat, and the other ones.
  2. RSS Feed Exchange. Detect reputable big data, data science and analytics digital publishers that accept RSS feeds (click here for details), and create an RSS feed exchange where publishers can swap or submit feeds.
  3. Analyze 40,000 web pages to optimize content. I can share some traffic statistics about 40,000 pages on DSC, and you work on the data to identify the types of articles and other metrics associated with success (and how do you measure success in the first place?), such as identifying great content for our audience, forecasting articles' lifetime and pageviews based on subject line or category, assessing the impact of re-tweets, likes, and sharing on traffic, and detecting factors impacting Google organic traffic. Also, designing a tool to identify new trends and hot keywords would be useful. Lots of NLP - natural language processing - is involved in this type of project; it might also require crawling our websites. Finally, categorizing each page - creating a page taxonomy - to suggest "related articles" at the bottom of each article or forum question. This project may not be available to all participants; it requires signing an NDA.
  4. URL shortener that correctly counts traffic. Another potential project is the creation of a redirect URL shortener like http://bit.ly, but one that correctly counts the number of clicks. Bit.ly (and also the Google URL shortener) provides statistics that are totally wrong for traffic originating from email clients (e.g. Outlook, which represents our largest traffic source). Their numbers are inflated by more than 300%. It's possible that an easy solution consists of counting and reporting the number of users/visitors (after filtering out robots), rather than pageviews. Test your URL re-director and make sure only real human beings are counted (not robots or fake traffic).
  5. Meaningful list and categorization of top data scientists. Create a list of the top 500 data scientists or big data experts using public data such as Twitter, and rate them based on number of followers or better criteria (also identify new stars and trends - note that new stars have fewer followers even though they might be more popular, as it takes time to build a list of followers). Classify top practitioners into a number of categories (unsupervised clustering) based on their expertise (identified by keywords or hashtags in their postings). Filter out automated from real tweets - in short, identify genuine tweets posted by the author rather than feeds automatically blended with the author's tweets (you can try with my account @AnalyticBridge, which is a blend of external RSS feeds with my own tweets - some posted automatically, some manually). Create groups of data scientists. I started a similar analysis a while back, click here for details.
  6. Data science website. Creating and monetizing (maybe via Amazon books) a blog like ours from scratch, using our RSS feed to provide initial content to visitors: see http://businessintelligence.com/ for an example of such a website - one not producing content, but instead syndicating content from other websites. Scoop.it (and many more) have a similar business model.
  7. Creating a niche search engine and taxonomy of all data science / big data / analytics websites, using selected fields from our anonymized member database, and a web crawler. In short, it consists of creating a niche search engine for data science, better than Google, and a taxonomy for these websites. Candidates interested in this project will have access to the full data, not just the sample that we published. Because this is based on data submitted by users, the raw data is quite messy and requires both cleaning and filtering. I actually completed this project myself (a basic version) in a couple of hours, and you can re-use all my tools, including my script (web crawler) - it's a good example of code used to clean relatively unstructured data. However, it is expected that you will create a much better version of this taxonomy, using a better seed keyword list (you will have to create it) and true clustering of all the data science websites. Read the section Possible Improvements in our article Top 2,500 Data Science, Big Data and Analytics Websites: it describes what you might want to do to make this taxonomy better (more comprehensive, user friendly, etc.). You will also find many of the tools and explanations there.
  8. Detecting Fake Reviews. Click here for details. In this project, you will have to assess the proportion of fake book reviews on Amazon, test a fake review generator, reverse engineer an Amazon algorithm, and identify how the review scoring engine can be improved. Extra mile: create and test your own review scoring engine. Scrape thousands of sampled Amazon reviews and score them, as well as users posting these reviews. Before starting, read this article.
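As a rough illustration of project #1 above, here is a minimal sketch of the count-then-cluster pipeline. Everything in it is a toy stand-in: a five-keyword subset instead of the 100-keyword list, three hard-coded "pages" instead of scraped websites, and a naive greedy grouping rather than a proper clustering algorithm.

```python
from collections import Counter
import math
import re

# Toy subset standing in for the 100-keyword seed list
KEYWORDS = ["machine learning", "hadoop", "data mining", "text mining", "big data"]

def keyword_vector(page_text, keywords=KEYWORDS):
    """Count how often each seed keyword appears in a page's text."""
    text = page_text.lower()
    return [len(re.findall(re.escape(k), text)) for k in keywords]

def cosine(u, v):
    """Cosine similarity between two keyword-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical "scraped" pages standing in for crawled member websites
pages = {
    "siteA": "big data and hadoop tutorials, more hadoop and big data",
    "siteB": "hadoop big data platform notes on big data",
    "siteC": "text mining and data mining with machine learning",
}

vectors = {site: keyword_vector(text) for site, text in pages.items()}

# Greedy single-link grouping: a site joins a cluster if it is
# similar enough to that cluster's first member
clusters = []
for site, vec in vectors.items():
    for cluster in clusters:
        if cosine(vec, vectors[cluster[0]]) > 0.7:
            cluster.append(site)
            break
    else:
        clusters.append([site])

print(clusters)  # → [['siteA', 'siteB'], ['siteC']]
```

In the real project the keyword counts would come from crawled pages, and the greedy grouping would be replaced by a real clustering step (k-means, hierarchical, etc.) on the count vectors.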
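For project #4, the suggested fix - counting unique human visitors rather than raw pageviews - can be sketched as follows. The click log, visitor IDs, and user-agent markers are all hypothetical; real robot filtering needs far more than a substring check on the user agent.

```python
from collections import defaultdict

# Hypothetical click log: (short_url, visitor_id, user_agent)
clicks = [
    ("dsc1", "u1", "Mozilla/5.0"),
    ("dsc1", "u1", "Mozilla/5.0"),      # same person opening the email twice
    ("dsc1", "bot9", "Googlebot/2.1"),  # crawler hitting the redirect
    ("dsc1", "u2", "Mozilla/5.0"),
]

BOT_MARKERS = ("bot", "crawler", "spider")  # naive user-agent filter

def is_robot(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def unique_human_clicks(log):
    """Count distinct human visitors per short URL, not raw pageviews."""
    visitors = defaultdict(set)
    for url, visitor, ua in log:
        if not is_robot(ua):
            visitors[url].add(visitor)
    return {url: len(v) for url, v in visitors.items()}

# Raw pageview counting would report 4 clicks; visitor counting reports 2
print(unique_human_clicks(clicks))  # → {'dsc1': 2}
```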

Data Science Research

  1. Spurious correlations in big data: how to detect and fix them. You have n = 5,000 variables uniformly distributed on [0,1]. What is the expected number m of correlations that are above p = 0.95? Perform simulations or find a theoretical solution. Try with various values of n (from 5,000 to 100,000) and p (from 0.80 to 0.99) and obtain confidence intervals for m (m is a function of n and p). Identify better indicators than correlation to measure whether two time series are really related. The purpose here is twofold: (1) to show that with big data, your strongest correlations are likely to be spurious, and (2) to identify better metrics than correlation in this context. A starting point is my article The curse of big data, also in my book pages 41-45. Or read my article on strong correlations and answer the questions in sections 5 and 6.
  2. Robust, simple, multi-usage regression tool for automated data science. The jackknife regression project involves simulated data to create a very robust and simple tool to perform regression and even clustering. We would like to test the Jackknife regression when applying the clustering step, to group variables into 2, 3 or 4 subsets, to see the improvement in the context of predictive modeling. This is described in section 3 in our original article. In particular, we would like to see the improvement when we have a million variables (thus 0.5 trillion correlations) and use sampling techniques to pick up 10 million correlations (using a robust correlation metric) out of these 0.5 trillion, grouping variables using an algorithm identical to our sparse keyword clustering algor.... So, instead of using a 1 million by 1 million correlation table (for the similarity matrix), we would use a hash table of size 10 million, where each entry consists of a pair-value $hash{Var A | Var B}=Corr(A,B). This is 50,000 times more compact than using the full matrix, and nicely exploits sparsity in the data. Then we would like to measure the loss of accuracy by using a sample 50,000 times smaller than the (highly redundant) full correlation matrix. Click here for details.
  3. Cracking the maths that make all financial transactions secure. Click here for details. A related project consists in designing great non-periodic random number simulators based on digits of irrational numbers. A starting point on this subject is our article Curious formula generating all digits of square root numbers.
  4. Great random number generator. Most random number generators use an algorithm a(k+1) = f(a(k)) to produce a sequence of integers a(1), a(2), etc. that behaves like random numbers. The function f is integer-valued and bounded; because of these two conditions, the sequence a(k) eventually becomes periodic for k large enough. This is an undesirable property, and many public random number generators (those built into Excel, Python, and other languages) are poor and not suitable for cryptographic applications, Markov chain Monte Carlo associated with hierarchical Bayesian models, or large-scale Monte Carlo simulations to detect extreme events (example: fraud detection, big data context). Click here for details about this project.
  5. Solve the Law of Series problem. Why do we get 4 deadly plane crashes in 4 months, and nothing in several years? This is explained by probability laws. Read our article, download our simulations (the password for our Excel spreadsheet is 5150) and provide the mathematical solution, using our numerous hints. This project helps you detect coincidences that are just coincidences, versus those that are not. Useful if you want to specialize in root cause analysis, or data science forensics / litigation.
  6. Zipf's law. Simulate millions of data points and a moving cluster structure (evolving over 20,000,000 iterations) that mimics a very special kind of cluster process - not unlike the formation of companies or galaxy systems - to prove or disprove my explanation about the origin of these mysterious but widespread Zipf systems. This project will also get you familiar with model fitting techniques, as well as programming in Perl, Java, C++ or Python. Zipf processes are a feature of some big data sets, and usually not found in small data sets. Click here for details. Additional questions: (1) Can the simulation algorithm (read section 4 in the reference article) be adapted to a distributed environment, and how? (2) Find a public data set that perfectly illustrates the Zipf distribution - explain, based on your computations and analyses, why your selected data set is a great example.
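For research project #1, a toy simulation illustrates the point before attempting the full-scale version: even though all variables are independent Uniform(0,1), high sample correlations show up by chance. The sizes below are deliberately small - 200 variables instead of 5,000, and only five observations per variable so that spurious correlations are frequent enough to see immediately.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy) if sx and sy else 0.0

def count_spurious(n_vars, n_obs, p, seed=42):
    """Count variable pairs whose sample correlation exceeds p,
    even though all variables are independent Uniform(0,1)."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(n_obs)] for _ in range(n_vars)]
    m = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if pearson(data[i], data[j]) > p:
                m += 1
    return m

# Toy scale; the full project uses n = 5,000..100,000 variables and
# repeated runs over a grid of (n, p) to get confidence intervals for m
m = count_spurious(n_vars=200, n_obs=5, p=0.95)
print(m)
```

Repeating the run over many seeds, and over the (n, p) grid the project asks for, gives the empirical distribution of m and hence the confidence intervals.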
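The hash-table representation in research project #2 can be sketched in miniature: store only the pairs whose correlation clears a threshold, keyed "A|B", instead of materializing the full similarity matrix. The 50 variables, 0.6 threshold, and variable names below are illustrative choices, not values from the project.

```python
import itertools
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy) if sx and sy else 0.0

rng = random.Random(1)
n_obs = 10
variables = {f"v{i}": [rng.random() for _ in range(n_obs)] for i in range(50)}

# Sparse storage in the spirit of $hash{Var A | Var B} = Corr(A,B):
# keep only the strongest correlations, keyed "A|B", instead of
# materializing the full 50 x 50 similarity matrix
threshold = 0.6
corr = {}
for a, b in itertools.combinations(sorted(variables), 2):
    r = pearson(variables[a], variables[b])
    if abs(r) > threshold:
        corr[f"{a}|{b}"] = r

print(len(corr), "of", 50 * 49 // 2, "pairs stored")
```

At the project's scale (1 million variables, 10 million retained pairs out of 0.5 trillion), the same structure gives the 50,000-fold compression described above, with sampling deciding which pairs are ever computed.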
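Research project #4's point about forced periodicity can be demonstrated with a deliberately tiny linear congruential generator. The parameters (a = 5, c = 3, m = 16) are chosen for illustration, not quality: because f is integer-valued and bounded by m, the sequence can visit at most m distinct states before it must repeat.

```python
def lcg_period(seed, a=5, c=3, m=16):
    """Walk a(k+1) = (a*a(k) + c) mod m until a state repeats, and
    return the cycle length. A bounded integer-valued f guarantees a
    repeat within m steps -- the undesirable periodicity described above."""
    seen = {}
    x = seed
    for k in range(m + 1):
        x = (a * x + c) % m
        if x in seen:
            return k - seen[x]
        seen[x] = k
    return None  # unreachable: a repeat must occur within m steps

print(lcg_period(seed=1))  # this toy generator cycles after m = 16 steps
```

Real generators use a huge m so the period is astronomically long, but it is still finite - which is why the project looks at non-periodic alternatives such as digits of irrational numbers.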




Replies to This Discussion

For the spurious correlations project, you could actually create one variable X with arbitrary but fixed values, and check how it correlates with thousands of simulated variables. The reason is that any set of observations for X has the same probability of occurring, under the uniform distribution assumption. This considerably reduces the number of computations, turning an O(n^2) problem with large n into an O(n) one, from a computational complexity point of view.
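The trick described above can be sketched as follows (toy sizes; five observations per series so that spurious correlations are common, and abs() to make the test two-sided):

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy) if sx and sy else 0.0

rng = random.Random(0)
n_obs = 5                                  # very short series
x = [rng.random() for _ in range(n_obs)]   # one fixed variable X

# O(n) scan: correlate X against n simulated variables,
# instead of examining all n*(n-1)/2 pairs
n = 5000
hits = sum(
    1 for _ in range(n)
    if abs(pearson(x, [rng.random() for _ in range(n_obs)])) > 0.95
)
print(hits)
```

The count of hits divided by n estimates the per-pair probability of a spurious correlation, which then scales up to an estimate of m for the full n-choose-2 problem.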

© 2016   Data Science Central
