Scikit-learn is an open source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Here are 5 case studies that use scikit-learn specifically for text and document classification.
News Classification for Startup Intelligence: CB Insights, a startup intelligence data provider, shows an example of classifying news into HR and employee-related categories. Its assessment of private company health includes tracking companies' human resources activities, including programmatic monitoring of hiring activity as evidenced by job postings, key hires, and departures. They used scikit-learn to support this work. Human resources classification is a binary classification problem: the classifier should discriminate news about companies' human resources events from all other news. The classification problem had 5 non-orthogonal components: 1) preprocessing and feature extraction (document representation), 2) feature selection, 3) classification, 4) evaluation and comparison of different classifiers, and 5) choosing the best classifier for a given measure (classification accuracy, F_beta score, precision, or recall).
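The five components listed above map fairly directly onto a scikit-learn pipeline. The following is a minimal sketch under stated assumptions, not CB Insights' actual implementation: the headlines are invented toy data, and the choice of TF-IDF features, chi-squared feature selection, and the two candidate classifiers is purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy headlines (assumption): label 1 = HR event, 0 = other news.
hr_news = [
    "Acme hires new chief financial officer",
    "Globex posts fifty engineering job openings",
    "Initech chief executive resigns after two years",
    "Umbrella appoints former rival executive as president",
    "Hooli lays off ten percent of staff",
    "Soylent recruits head of marketing from competitor",
    "Vandelay founder steps down as chairman",
    "Wonka expands hiring for sales team",
]
other_news = [
    "Acme launches redesigned mobile app",
    "Globex raises series B funding round",
    "Initech opens data center in Texas",
    "Umbrella reports record quarterly revenue",
    "Hooli acquires small analytics startup",
    "Soylent signs partnership with retailer",
    "Vandelay releases update to import software",
    "Wonka wins patent dispute in court",
]
texts = hr_news + other_news
labels = [1] * len(hr_news) + [0] * len(other_news)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)


def build_pipeline(classifier, k=10):
    """Steps 1-3: feature extraction, feature selection, classification."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("select", SelectKBest(chi2, k=k)),
        ("clf", classifier),
    ])


candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(),
}

# Step 4: evaluate every candidate on the same held-out headlines.
results = {}
for name, classifier in candidates.items():
    model = build_pipeline(classifier).fit(X_train, y_train)
    predicted = model.predict(X_test)
    results[name] = {
        "precision": precision_score(y_test, predicted, zero_division=0),
        "recall": recall_score(y_test, predicted, zero_division=0),
        "f_beta": fbeta_score(y_test, predicted, beta=1.0, zero_division=0),
    }

# Step 5: pick the best classifier for one chosen measure.
measure = "f_beta"
best = max(results, key=lambda name: results[name][measure])
print(results)
print("best by", measure, "->", best)
```

On a data set this small the scores are not meaningful; the point is only how the steps compose. Changing `measure` to `"precision"` or `"recall"` selects the winner by a different criterion, which is why step 5 is stated relative to a given measure.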
News Classification for Investing: This article by QuantStart includes a tutorial on how to carry out natural language document classification for the purposes of sentiment analysis and, ultimately, automated trade filtering or signal generation. The article uses Support Vector Machines (SVM) to classify text documents into mutually exclusive groups.
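As a rough illustration of classifying documents into mutually exclusive groups with an SVM, here is a small sketch. It is an assumption-laden example rather than the article's code: the documents and the three topic labels are made up, and `LinearSVC` with TF-IDF features is one reasonable choice among several SVM variants in scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled documents (assumption); each belongs to exactly one topic.
corpus = [
    ("Wheat and corn harvests rise on good weather", "grain"),
    ("Grain exports of wheat fall sharply", "grain"),
    ("Corn prices climb as wheat supply tightens", "grain"),
    ("Crude oil prices jump after supply cut", "crude"),
    ("Oil producers agree to lower crude output", "crude"),
    ("Crude oil inventories rise for third week", "crude"),
    ("Central bank raises interest rates again", "interest"),
    ("Interest rates held steady by central bank", "interest"),
    ("Bank lowers lending rates to boost credit", "interest"),
]
documents = [text for text, _ in corpus]
topics = [topic for _, topic in corpus]

# TF-IDF turns each document into a sparse vector; LinearSVC handles the
# multi-class case with a one-vs-rest scheme, so each document gets one label.
svm_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
svm_model.fit(documents, topics)

new_documents = [
    "Wheat shipments delayed at port",
    "Oil output falls after pipeline outage",
]
predicted_topics = svm_model.predict(new_documents)
for text, topic in zip(new_documents, predicted_topics):
    print(f"{topic}: {text}")
```

A trading filter would then act on the predicted label, for example by ignoring news outside the topics a strategy cares about.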
Web Page Classification: Scrapinghub uses succinct tries to optimize the memory usage of scikit-learn models by changing the model, using simpler features, doing feature selection, changing the classifier to a less memory-intensive one, using simpler preprocessing steps, etc.
Email Spam Classification: Zac Stewart demonstrates email spam classification using scikit-learn document classification. The data set is a combination of the Enron-Spam data sets (in raw form) and the SpamAssassin public corpus. Both are publicly available for download. The project starts with raw, labeled emails and ends with a working, reasonably accurate spam filter.
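The path from labeled emails to a working filter can be sketched in a few lines. This is an illustrative example only: the messages are invented stand-ins for the Enron-Spam and SpamAssassin corpora, and the bag-of-words plus multinomial naive Bayes combination is an assumption chosen because it is a common baseline for spam filtering.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical messages (assumption): label 1 = spam, 0 = legitimate mail.
spam = [
    "Win a free prize now, click here",
    "Cheap pills online, limited offer",
    "You have won cash, claim your prize",
    "Free offer, click now to claim cash",
    "Limited time deal on cheap watches online",
    "Claim your free gift card now",
]
ham = [
    "Meeting moved to three on Thursday",
    "Please review the attached quarterly report",
    "Lunch tomorrow with the trading team",
    "Notes from the project call are attached",
    "Can you send the updated schedule",
    "Reminder about the budget review on Friday",
]
messages = spam + ham
is_spam = [1] * len(spam) + [0] * len(ham)

spam_filter = Pipeline([
    ("counts", CountVectorizer()),   # word counts per message
    ("nb", MultinomialNB()),         # naive Bayes over those counts
])

# Stratified 3-fold cross-validation, scored with F1 on the spam class.
fold_scores = cross_val_score(spam_filter, messages, is_spam, cv=3, scoring="f1")
print("F1 per fold:", fold_scores)

# Fit on everything and classify an unseen message.
spam_filter.fit(messages, is_spam)
prediction = spam_filter.predict(["Click here to claim your free cash prize"])
print("spam" if prediction[0] == 1 else "ham")
```

With real corpora, the same pipeline would be fed the raw email bodies after parsing, and the cross-validation scores would give a first estimate of how accurate the filter is.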
Matching user profiles to music listener profiles: IBM shares a case study that uses scikit-learn to build and apply a model to simulated customer product purchase histories. In a sample scenario, a model assigns music-listener profiles to individual customers, based on the specific products each customer purchases and the corresponding textual product descriptions.