Text (word) analysis and modeling of tokenized text can be intimidating, especially when you are new to machine learning. Thankfully, Python and its extended libraries offer warm support for text analytics and machine learning. Scikit-learn is a savior and an excellent aid in text processing once you also understand concepts like "bag of words", "clustering" and "vectorization". Vectorization is a must-know technique for all machine learning learners, text miners and algorithm implementers. I personally consider it a revolution in analytical calculations. Read one of my earlier posts about vectorization. Let's look at the implementers of vectorization and try to zero in on the process of text analysis.

Fundamentally, before we start any text analysis we first need to tokenize every word in a given text, so that we can apply a mathematical model to these words. Once the text is tokenized, it can be transformed into the *bag of words* model of document classification. This *bag of words* model is used as a feature to train classifiers. We'll observe in code how the terms **feature** and **classifier** can be explored and implemented using Scikit-learn. But before that, let us explore how to tokenize the text and bring it into vector shape. The *bag of words* representation follows a 3-step process: tokenizing, counting and finally normalizing the vector.

- **Tokenizing:** Tokenize strings and give an integer id to each possible token.
- **Counting:** Once tokenized, count the occurrences of tokens in each document.
- **Normalizing:** Normalize and weight with diminishing importance the tokens that occur in the majority of samples / documents.
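To make the three steps concrete, here is a minimal plain-Python sketch. It is not how scikit-learn implements them internally; the helper names (`tokenize`, `build_vocabulary`, `count_vector`, `l2_normalize`) and the two tiny documents are made up for illustration, and the normalizing step shown is only a simple unit-length scaling, not the TF-IDF weighting discussed later.

```python
import math
import re
from collections import Counter

# Words of two or more characters; the same pattern CountVectorizer
# reports as its default token_pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Step 1a: split a lowercased string into word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(docs):
    """Step 1b: give every distinct token an integer id (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(doc, vocabulary):
    """Step 2: count how often each vocabulary token occurs in one document."""
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vec[vocabulary[tok]] = n
    return vec


def l2_normalize(vec):
    """Step 3 (simplified): scale a count vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


docs = ["Big Data hype", "Big Data is big"]  # hypothetical documents
vocab = build_vocabulary(docs)
counts = [count_vector(d, vocab) for d in docs]
normalized = [l2_normalize(v) for v in counts]

print(vocab)   # {'big': 0, 'data': 1, 'hype': 2, 'is': 3}
print(counts)  # [[1, 1, 1, 0], [2, 1, 0, 1]]
```

Each document ends up as a vector of the same length, which is what lets us compare documents with ordinary vector arithmetic.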

* The code below needs Python 2.7 or above, NumPy 1.3 or above and scikit-learn 0.14. All of it was run on Ubuntu 12.04 LTS.

Scikit's functions and classes are imported via the sklearn package as follows:

<code snippet>

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)

</code snippet>

Here we do not have to write custom code for counting words and representing those counts as a vector. Scikit's CountVectorizer does the job very efficiently and has a very convenient interface. The parameter min_df (minimum document frequency) determines how CountVectorizer treats words that are not used frequently. If it is set to an integer, all words occurring in fewer documents than that value will be dropped. If it is a fraction, all words that occur in less than that fraction of the overall dataset will be dropped. The parameter **max_df** works in a similar manner for words that occur too often. Once we vectorize two posts using this feature vector functionality, we'll have 2 simple vectors. We can then calculate the Euclidean distance between these two vectors and pick the nearest one to identify *similarities*. This is nothing but a step towards clustering/classification of *similar* posts.
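The effect of min_df and max_df can be sketched in a few lines of plain Python. This is only an illustration of the idea under the assumption that integers mean absolute document counts and fractions mean a share of the corpus; the function names and the three toy posts are hypothetical and this is not scikit-learn's own code.

```python
import re


def document_frequency(docs):
    """Count in how many documents each token appears."""
    df = {}
    for doc in docs:
        for tok in set(re.findall(r"(?u)\b\w\w+\b", doc.lower())):
            df[tok] = df.get(tok, 0) + 1
    return df


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies between min_df and max_df."""
    n_docs = len(docs)
    # Integers are absolute document counts, floats are fractions of the corpus.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    df = document_frequency(docs)
    return sorted(tok for tok, n in df.items() if low <= n <= high)


posts = ["big data hype", "big data solutions", "big bubble"]  # toy corpus

rare_dropped = filter_vocabulary(posts, min_df=2)
frequent_dropped = filter_vocabulary(posts, max_df=0.9)

print(rare_dropped)      # ['big', 'data']
print(frequent_dropped)  # ['bubble', 'data', 'hype', 'solutions']
```

With min_df=2 the words seen in only one post disappear, while max_df=0.9 removes "big" because it shows up in every post.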

Hold on, we haven't yet reached the phase of implementing clustering algorithms. We need to move cautiously through the steps below to bring our raw text to a more meaningful *bag of words*. We also try to correlate some of the technical terms with each step:

- Tokenizing the text. -- vectorization and tokenizing
- Throwing away some less important words. -- stop words
- Throwing away words that occur way too often to be of any help in detecting relevant posts. -- max_df
- Throwing away words that occur so seldom that there is only a small chance that they occur in future posts. -- min_df
- Counting the remaining words.
- Calculating TF-IDF values from the counts, considering the whole text corpus. -- TF-IDF
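As a rough sketch of how these steps can map onto scikit-learn, TfidfVectorizer bundles tokenizing, stop word removal, document-frequency filtering, counting and TF-IDF weighting in one class. The three shortened posts and the thresholds below are made up for illustration, and the fitted `vocabulary_` attribute is read instead of calling get_feature_names() so that the sketch does not depend on a particular scikit-learn release.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, shortened posts used only for this illustration.
posts = [
    "Bursting the Big Data bubble starts with appreciating certain nuances",
    "The real solutions for dealing with Big Data will be in demand",
    "The hype around Big Data falls into the trough of disappointment",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # throw away common English stop words
    max_df=0.9,            # throw away terms found in more than 90% of the posts
    min_df=1,              # keep terms found in at least one post
)
tfidf = vectorizer.fit_transform(posts)

# vocabulary_ maps each surviving term to its column index.
vocabulary = sorted(vectorizer.vocabulary_)

print(tfidf.shape)
print(vocabulary)
```

In this toy corpus "big" and "data" occur in every post, so max_df=0.9 drops them along with the stop words, and only the more distinctive terms remain as features.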

With this process, we'll be able to convert a bunch of noisy text into a concise representation of feature values. Hopefully, you're familiar with the term TF-IDF. If not, the explanation below will help build an understanding of it:

When we use feature extraction to vectorize the text, the feature values simply count occurrences of terms in a post. We silently assumed that higher values for a term also mean that the term is of greater importance to the given post. But what about, for instance, the word "subject", which naturally occurs in each and every single post? Alright, we could tell CountVectorizer to remove it as well by means of its max_df parameter. We could, for instance, set it to 0.9 so that all words that occur in more than 90 percent of all posts would always be ignored. But what about words that appear in 89 percent of all posts? How low would we be willing to set max_df? The problem is that however we set it, there will always be the problem that some terms are just more discriminative than others. This can only be solved by counting term frequencies for every post and, in addition, discounting those that appear in many posts. In other words, we want a high value for a given term in a given post if that term occurs often in that particular post and very rarely anywhere else. This is exactly what term frequency-inverse document frequency (TF-IDF) does.
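Here is a small sketch of that idea in plain Python, using one common textbook definition: term frequency is the share of a post's tokens that equal the term, and inverse document frequency is the logarithm of the number of posts divided by the number of posts containing the term. The function name and the three tokenized posts are hypothetical, and scikit-learn's own TF-IDF classes apply a smoothed variant by default, so their numbers will differ from these.

```python
import math


def tfidf(term, doc, corpus):
    """TF-IDF of `term` in `doc`; assumes `term` occurs somewhere in `corpus`."""
    tf = doc.count(term) / len(doc)                     # how dominant in this post
    n_containing = sum(1 for d in corpus if term in d)  # posts mentioning the term
    idf = math.log(len(corpus) / n_containing)          # discount for common terms
    return tf * idf


# Already-tokenized, made-up posts.
a = ["subject", "big", "data"]
b = ["subject", "big", "data", "hype", "hype"]
c = ["subject", "bubble"]
corpus = [a, b, c]

print(tfidf("subject", b, corpus))  # 0.0, because "subject" is in every post
print(tfidf("big", b, corpus))      # small: appears in two of three posts
print(tfidf("hype", b, corpus))     # largest: frequent in b, absent elsewhere
```

The word "subject" gets a score of zero without us having to pick any max_df threshold, while "hype" stands out as characteristic of the second post.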

So, continuing from the previous code where we imported CountVectorizer to tokenize and vectorize text, in the example below we are going to compare the term "Big Data Hype" with 2 different posts published about the "hype" of "Big Data". To do this we first need to vectorize the two existing posts and then vectorize the third (new) post using the same scikit method. Once we have the vectors, we can calculate the distance of the new post to each of them. This code snippet **ONLY** covers tokenizing and vectorizing the text.

<code snippet>

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> content = ["Bursting the Big Data bubble starts with appreciating certain nuances about its products and patterns","the real solutions that are useful in dealing with Big Data will be needed and in demand even if the notion of Big Data falls from the height of its hype into the trough of disappointment"]
>>> vectorizer = CountVectorizer(min_df=1)
>>> print(vectorizer)
CountVectorizer(analyzer=word, binary=False, charset=None, charset_error=None,
        decode_error=strict, dtype=<type 'numpy.int64'>, encoding=utf-8,
        input=content, lowercase=True, max_df=1.0, max_features=None,
        min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=(?u)\b\w\w+\b, tokenizer=None,
        vocabulary=None)
>>> # The vectorizer must be fitted before its feature names can be read.
>>> X_train = vectorizer.fit_transform(content)
>>> # scikit-learn 1.0 and later name this method get_feature_names_out().
>>> vectorizer.get_feature_names()
[u'about', u'and', u'appreciating', u'are', u'be', u'big', u'bubble', u'bursting', u'certain', u'data', u'dealing', u'demand', u'disappointment', u'even', u'falls', u'from', u'height', u'hype', u'if', u'in', u'into', u'its', u'needed', u'notion', u'nuances', u'of', u'patterns', u'products', u'real', u'solutions', u'starts', u'that', u'the', u'trough', u'useful', u'will', u'with']
>>> num_samples, num_features = X_train.shape
>>> print("#samples: %d, #features: %d" % (num_samples, num_features))
#samples: 2, #features: 37
>>> vectorizer = CountVectorizer(min_df=1, stop_words='english')
..............

</code snippet>
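The distance step that the snippet above leaves out could look roughly like the following. This is a plain-Python sketch rather than the scikit-learn code from the original post: the helpers `tokenize`, `vectorize` and `euclidean` are made up for illustration, and raw counts are compared without any normalization or TF-IDF weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep words of two or more characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def vectorize(text, vocabulary):
    """Raw count vector of `text` over a fixed, ordered vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


posts = [
    "Bursting the Big Data bubble starts with appreciating certain nuances "
    "about its products and patterns",
    "the real solutions that are useful in dealing with Big Data will be "
    "needed and in demand even if the notion of Big Data falls from the "
    "height of its hype into the trough of disappointment",
]
new_post = "Big Data Hype"

# The vocabulary comes from the existing posts only, as with fit_transform.
vocabulary = sorted({tok for post in posts for tok in tokenize(post)})
new_vec = vectorize(new_post, vocabulary)
distances = [euclidean(new_vec, vectorize(post, vocabulary)) for post in posts]
nearest = min(range(len(posts)), key=distances.__getitem__)

print(len(vocabulary))  # 37
print(distances)
print(nearest)          # 0
```

With raw counts the shorter first post comes out nearest even though only the second one mentions "hype", because every extra word in a long post adds to the distance. That is one reason the normalizing and TF-IDF steps matter before comparing posts.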

I would highly recommend the book "Building Machine Learning Systems with Python", available from Packt or on Amazon.

Original post:

http://datumengineering.wordpress.com/2013/09/26/python-scikit-lear...

Posted 1 March 2021