After applying tf-idf weighting, we’re now giving “loved”, “great” and “awful” higher importance relative to “movie”, since “movie” was so common across all text documents. These weights will help our models since they will bias the models to pay attention to what we care about and to ignore less important words.

As bag-of-words is commonly followed by tf-idf weighting, scikit-learn includes a single utility which combines both steps: TfidfVectorizer. We can replace CountVectorizer in our code above with TfidfVectorizer, and we’ll get a tf-idf weighted term-document matrix.

`from sklearn.feature_extraction.text import TfidfVectorizer`

`tfidf = TfidfVectorizer()`

X = tfidf.fit_transform(df.text)

#### Working in High-Dimensional Space

As mentioned above, when using bag-of-words or tf-idf weighting, our term-document matrix has a column for each word in our vocabulary. It’s not unusual to have a vocabulary of tens of thousands of words. This means our term-document matrix will often have tens of thousands of columns or more. Such a wide matrix creates a few problems:

- Wide matrices use a large amount of computer memory.
- Having so many columns can make it difficult to train models that perform well on new, unseen data.

We can partially solve the first issue by being smart about how we store our matrix in memory. Instead of a standard matrix where we store all the values, we use a sparse matrix.In a sparse matrix, only the non-zero values are stored, and everything else is assumed to be zero. This works well for term-document matrices since each text document only uses a small subset of all the words in the vocabulary. In other words, each row in the term-document matrix will be mostly zeros, and the sparse matrix will use significantly less memory than a standard matrix.

Code that was written for standard matrices will often not handle sparse matrices without modification. Fortunately, many of scikit-learn’s algorithms support sparse matrices, making text data much easier to work with in Python. In fact, CountVectorizer and TfidfVectorizer will return sparse matrices by default, and many classifiers and regressors will handle sparse matrix inputs natively.

To address the second issue with high dimensional data (i.e. generalizability of the model), we have a few options. First of all, we can model the wide matrix directly using an algorithm like scikit-learn’s LogisticRegression. This model uses a technique called regularization to pick the columns that are relevant for classification and filter out those that aren’t. In practice, this often works well and is relatively simple to use and implement. However, it doesn’t take advantage of similarities between words.

We can potentially do better by compressing our term-document matrix before we begin to model. Instead of directly modeling a wide dataset, we first reduce the number of dimensions of our data and build a model on the much narrower dataset. This compression is very much like clustering, where a text document is assigned a score for each cluster to measure how associated it is with the cluster.

The intuition behind this idea is that we expect certain words like “enjoy,” “happy,” and “delight” to have similar meanings. Instead of including a column for each word separately, we compress them into a column that measures the broader idea of *enjoyment*.

This compression has the benefit of helping our models learn about related words like “enjoy,” “happy,” and “delight” simultaneously. For instance, we may have many training examples that include the word “enjoy” but only a few that include “delight.” Without compression, the model can’t learn very much about “delight” since there are so few rows to learn from. Yet, if we can squeeze “enjoy” together with “delight,” then everything our model learns about “enjoy” can also be applied to “delight.”

There are many methods to reduce the dimensionality of a term-document matrix. A very common method is to apply singular-value decomposition(SVD) to a tf-idf weighted term-document matrix. This procedure is often called latent semantic analysis (LSA). LSA is frequently used when comparing the similarity of two texts. After transforming two texts with LSA, the cosine distance between them is a good measure of their relatedness.

In text classification (and regression) we can also feed the tf-idf weighted matrix directly into a neural net to perform dimensionality reduction. In this approach, we can think of the first hidden layer as a compression step and the later layers as the classification model. The benefit to this approach — relative to LSA — is that the compression takes into account the labels we are trying to predict. In other words, the compression can ignore anything that isn’t helpful for predicting our labels.

#### Putting It All Together

Let’s combine the methods discussed above with a model to see what a complete text classification workflow looks like. For the model, I’ll be using a multilayer perceptron (MLP) from muffnn, a package open-sourced by Civis. This package wraps the Tensorflow neural network library in a scikit-learn interface, so we can chain our entire text classification process as a single scikit-learn Pipeline. Scikit-learn also has an MLP model, although it doesn’t include as many recent neural network techniques like dropout.

We’ll build our model using a tf-idf weighted term-document matrix as input and an indicator for whether a review was positive or negative (0 or 1) as labels. We’ll perform 5-fold cross-validation and average the out-of-sample accuracies to measure how well our model performs on new, unseen data.

`from muffnn import MLPClassifier`

import numpy as np

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import cross_val_score

from sklearn.pipeline import make_pipeline

`# Read Data. Source:`

# http://archive.ics.uci.edu/ml/machine-learning-databases/00331/sent...

`filename = 'imdb_labelled.txt'`

names = ['text', 'label']

df = pd.read_csv(filename, header=None, names=names, sep='\t', quoting=3)

`# Chain together tf-idf and an MLP with a single hidden layer of size 256`

tfidf = TfidfVectorizer()

mlp = MLPClassifier(hidden_units=(256,))

classifier = make_pipeline(tfidf, mlp)

`# Get cross-validated accuracy of the model`

cv_accuracy = cross_val_score(classifier, df.text, df.label, cv=5)

print("Mean Accuracy: {}".format(np.mean(cv_accuracy)))

Great! This model achieves a respectable accuracy of around 78% across the 5 cross-validation folds. From here we can start to build out a more robust framework for modeling, including tuning our model parameters with grid search and following best practices such as holding out some data as a final validation of the model. After tuning our model, we can likely do better than 78% accuracy, although we probably won’t ever achieve 100% due to inconsistencies in what people consider positive and negative.

We can do a lot with text classifiers, from analyzing customer reviews to filtering spam to detecting fraud. Yet since text data doesn’t come in a ready-to-use spreadsheet format, it can often go unused. This blog post shows a few ways to shape text into a form we can plug in our favorite models. With tools like scikit-learn and Tensorflow, it’s easier than ever to get started.

## You need to be a member of Data Science Central to add comments!

Join Data Science Central