Sample Projects for Data Scientists in Training

Here is a list of potential projects to help you complete your master's degree in data science or in a related field.

Project #8: Detecting fake reviews on Amazon

Business and Applied Data Science

Clustering 2,000+ data science websites. Match each of them against a pre-selected list of 100 top data science keywords (machine learning, AI, deep learning, IoT, Spark, NLP, business analytics, predictive modeling, big data, etc.). You must count keyword frequency for each website and for each keyword in the list, then cluster the websites based on these counts. In addition, using publication date, relevancy (number of “likes” or comments per article) and posting frequency, if possible, will make the model more robust. The project requires data cleaning and the production of scores that measure website popularity and trends, broken down per year. See here for a solution.
RSS Feed Exchange. Detect reputable big data, data science and analytics digital publishers that accept RSS feeds (click here for details), and create an RSS feed exchange where publishers can swap or submit feeds.
Analyze 40,000 web pages to optimize content. I can share some traffic statistics about 40,000 pages on DSC, and you work on the data to identify the types of articles and other metrics associated with success (and how do you measure success in the first place?), such as identifying great content for our audience, forecasting articles’ lifetime and pageviews based on subject line or category, assessing impact of re-tweets, likes, and sharing on traffic, and detecting factors impacting Google organic traffic. Also, designing a tool to identify new trends and hot keywords would be useful. Lots of NLP (natural language processing) is involved in this type of project; it might also require crawling our websites. Finally, categorize each page, creating a page taxonomy, to suggest “related articles” at the bottom of each article or forum question.
URL shortener that correctly counts traffic. Another potential project is the creation of a redirect URL shortener like http://bit.ly, but one that correctly counts the number of clicks. Bit.ly (and also the Google URL shortener) provides statistics that are totally wrong for traffic originating from email clients (e.g. Outlook, which represents our largest traffic source). Their numbers are inflated by more than 300%. It’s possible that an easy solution consists of counting and reporting the number of users/visitors (after filtering out robots), rather than pageviews. Test your URL re-director and make sure only real human beings are counted (not robots or fake traffic).
Meaningful list and categorization of top data scientists. Another project: create a list of the top 500 data scientists or big data experts using public data such as Twitter, and rate them based on number of followers or better criteria (also identify new stars and trends – note that new stars have fewer followers even though they might be more popular, as it takes time to build a list of followers). Classify top practitioners into a number of categories (unsupervised clustering) based on their expertise (identified by keywords or hashtags in their postings). Filter out automated from real tweets – in short, identify genuine tweets posted by the author rather than feeds automatically blended with the author’s tweets (you can try with my account @AnalyticBridge, which is a blend of external RSS feeds with my own tweets – some posted automatically, some manually). Create groups of data scientists. I started a similar analysis a while back, click here for details.
Data science website. Creating and monetizing (maybe via Amazon books) a blog like ours from scratch, using our RSS feed to provide initial content to visitors: see http://businessintelligence.com/ for an example of such a website – not producing content, but instead syndicating content from other websites. Scoop.it (and many more such as Medium.com, Paper.li, StumbleUpon.com) have a similar business model.
Creating niche search engine and taxonomy of all data science / big data / analytics websites, using selected fields from our anonymized member database, and a web crawler. In short, it consists of creating a niche search engine for data science, better than Google, and a taxonomy for these websites. Candidates interested in this project will have access to the full data, not just the sample that we published. Because this is based on data submitted by users, the raw data is quite messy and requires both cleaning and filtering. I actually completed this project myself (a basic version) in a couple of hours, and you can re-use all my tools, including my script (web crawler) – it’s a good example of code used to clean relatively unstructured data. However, it is expected that you will create a much better version of this taxonomy, using a better seed keyword list (you will have to create it) and true clustering of all the data science websites. Read our section Possible Improvements in our article Top 2,500 Data Science, Big Data and Analytics Websites: this actually describes what you might want to do to make this taxonomy better (more comprehensive, user friendly etc.) You will also find many of the tools and explanations.
Detecting Fake Reviews. Click here for details. In this project, you will have to assess the proportion of fake book reviews on Amazon, test a fake review generator, reverse engineer an Amazon algorithm, and identify how the review scoring engine can be improved. Extra mile: create and test your own review scoring engine. Scrape thousands of sampled Amazon reviews and score them, as well as users posting these reviews. Before starting, read this article.
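The keyword-counting and clustering steps of the first project above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the solution referenced in the project description: the five-keyword list, the three page texts, the function names and the 0.5 similarity threshold are all made up for the example, and the greedy "leader" clustering on cosine similarity is a deliberately simple stand-in for whatever clustering method you end up choosing.

```python
import math
import re

# Hypothetical seed keywords; the project calls for a list of about 100.
KEYWORDS = ["machine learning", "deep learning", "big data", "nlp", "spark"]

def keyword_counts(text, keywords=KEYWORDS):
    """Count whole-word, case-insensitive occurrences of each keyword in text."""
    lowered = text.lower()
    return [len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered)) for kw in keywords]

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def leader_clusters(vectors, threshold=0.5):
    """Greedy one-pass clustering: a vector joins the first cluster whose
    leader (first member) is similar enough, otherwise it starts a new one."""
    leaders, labels = [], []
    for vec in vectors:
        for idx, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(vec)
            labels.append(len(leaders) - 1)
    return labels

# Made-up page texts standing in for crawled websites.
sites = {
    "site-a": "Deep learning and machine learning tutorials. Machine learning news.",
    "site-b": "Big data pipelines with Spark. Spark streaming and big data tools.",
    "site-c": "Machine learning for beginners, plus deep learning basics.",
}
counts = {name: keyword_counts(text) for name, text in sites.items()}
labels = dict(zip(counts, leader_clusters(list(counts.values()))))
print(counts)
print(labels)
```

In a real project the counts would come from crawled pages, and publication date, relevancy and posting frequency would be added as extra features before clustering.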

Data Science Research

Improving Google search. How would you improve or design a search engine? See here and here for starting points. Google search has 4 major issues: favoring (very) old content over new content; inability to detect the real original source when an article is posted on multiple websites, despite the time stamp telling which one is the original; failure to detect web spammers; and favoring popular websites or business partners over high quality but unknown blogs (this is a business rather than a data science issue.) Also, how to deliver better ads to website visitors? See this article on ad matching technology. Despite sophisticated algorithms used by Google to increase your likelihood to click on an ad, all of us still see irrelevant ads most of the time.
Fixing Facebook’s text detection in images. As a Facebook advertiser promoting data science articles, most images in my ads are charts and do not necessarily contain text. For whatever business reason (probably an archaic rule invented long ago and never revisited) Facebook does not like postings (ads in particular) in which the image contains text. Such ads get penalized: they are displayed less frequently, and cost more per click; sometimes they are just rejected. How to fix this? Read more here if you want to help Facebook design a better algorithm, and implement business rules so that “good” advertisers whose ads are wrongly flagged as bad by Facebook's algorithms don’t get penalized, which would help Facebook generate more revenue.
Create your own, legit lottery. How about creating a lottery system with a basic mathematical formula that can predict all future results, as well as identify all past results? From a legal point of view, it would not be a game of chance, but a mathematical challenge. If you publish your mathematical formula (so that anyone can use it to compute future or past winning numbers), but the formula requires billions of years of computing power to actually compute the next winning numbers, yet you have an alternate (secret) formula that computes these numbers in 1 second, then you have your own legit “lottery.” How could you achieve this? There are several ways to accomplish this; one of them is described here.
Spurious correlations in big data, and how to detect and fix them. You have n = 5,000 variables uniformly distributed on [0,1]. What is the expected number m of correlations that are above p = 0.95? Perform simulations or find a theoretical solution. Try with various values of n (from 5,000 to 100,000) and p (from 0.80 to 0.99) and obtain confidence intervals for m (m is a function of n and p). Identify better indicators than correlation to measure whether two time series are really related. The purpose here is twofold: (1) to show that with big data, your strongest correlations are likely to be spurious, and (2) to identify better metrics than correlation in this context. A starting point is my article The curse of big data, also in my book, pages 41-45. Or read my article on strong correlations and answer the questions in sections 5 and 6.
Robust, simple, multi-usage regression tool for automated data science. The jackknife regression project involves simulated data to create a very robust and simple tool to perform regression and even clustering. We would like to test the Jackknife regression when applying the clustering step, to group variables into 2, 3 or 4 subsets, to see the improvement in the context of predictive modeling. This is described in section 3 in our original article. In particular, we would like to see the improvement when we have a million variables (thus 0.5 trillion correlations) and use sampling techniques to pick up 10 million correlations (using a robust correlation metric) out of these 0.5 trillion, grouping variables using an algorithm identical to our sparse keyword clustering algor…. So, instead of using a 1 million by 1 million correlation table (for the similarity matrix), we would use a hash table of size 10 million, where each entry consists of a key-value pair $hash{Var A | Var B}=Corr(A,B). This is 50,000 times more compact than using the full matrix, and nicely exploits sparsity in the data. Then we would like to measure the loss of accuracy by using a sample 50,000 times smaller than the (highly-redundant) full correlation matrix. Click here for details.
Cracking the maths that make all financial transactions secure: Click here for details. A related project consists of designing great non-periodic random number simulators based on digits of irrational numbers. A starting point on this subject is our article Curious formula generating all digits of square root numbers.
Great random number generator: Most random number generators use an algorithm a(k+1) = f(a(k)) to produce a sequence of integers a(1), a(2), etc. that behaves like random numbers. The function f is integer-valued and bounded; because of these two conditions, the sequence a(k) eventually becomes periodic for k large enough. This is an undesirable property, and many public random number generators (those built in Excel, Python, and other languages) are poor and not suitable for cryptographic applications, Markov chain Monte Carlo associated with hierarchical Bayesian models, or large-scale Monte Carlo simulations to detect extreme events (example: fraud detection, big data context). Click here for details about this project.
Solve the Law of Series problem. Why do we get 4 deadly plane crashes in 4 months, and nothing in several years? This is explained by probability laws. Read our article, download our simulations (the password for our Excel spreadsheet is 5150) and provide the mathematical solution, using our numerous hints. This project helps you detect coincidences that are just coincidences, versus those that are not. Useful if you want to specialize in root cause analysis, or data science forensics / litigation.
Zipf’s law. Simulate millions of data points and a moving cluster structure (evolving over 20,000,000 iterations) that mimics a very special kind of cluster process – not unlike the formation of companies or galaxy systems – to prove or disprove my explanation about the origin of these mysterious but widespread Zipf systems. This project will also get you familiar with model fitting techniques, as well as programming in Perl, Java, C++ or Python. Zipf processes are a feature of some big data sets, and usually not found in small data sets. Click here for details. Additional questions: (1) Can the simulation algorithm (read section 4 in the reference article) be adapted to a distributed environment, and how? (2) Find a public data set that perfectly illustrates the Zipf distribution – explain, based on your computations and analyses, why your selected data set is a great example.
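As a starting point for the simulation part of the spurious correlations project above, here is a minimal Python sketch. It rests on assumptions not stated in the project description: the number of observations per variable (n_obs) is not given, so it is a free parameter here, and the values used (200 variables, 5 observations) are kept tiny so the example runs in a moment. The function names are hypothetical. The sketch counts the pairs of independent uniform variables whose sample Pearson correlation exceeds p.

```python
import math
import random
from itertools import combinations

def standardize(values):
    """Center a series and scale it to unit length, so that the Pearson
    correlation of two standardized series is simply their dot product."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def count_high_correlations(n_vars, n_obs, p, seed=0):
    """Count pairs of independent uniform [0,1] variables whose sample
    correlation is above p. n_obs is an assumption of this sketch."""
    rng = random.Random(seed)
    series = [
        standardize([rng.random() for _ in range(n_obs)])
        for _ in range(n_vars)
    ]
    return sum(
        1
        for a, b in combinations(series, 2)
        if sum(x * y for x, y in zip(a, b)) > p
    )

n_vars, n_obs, p = 200, 5, 0.95
m = count_high_correlations(n_vars, n_obs, p)
total_pairs = n_vars * (n_vars - 1) // 2
print(f"{m} of {total_pairs} pairs have correlation above {p}")
```

Repeating the run over many seeds gives an empirical distribution for m, from which confidence intervals can be read. The brute-force loop over all pairs will not scale to n = 100,000, so sampling pairs is one option to explore at that size.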

Stochastic Processes

For more recent projects with a more theoretical, probabilistic flavor, yet solved with the help of data science techniques, you can check the following:

If you are looking for data sets, check out this resource.

Good luck!
