Bag of Words Using Python

Introduction to Bag of Words

The Bag of Words (BoW) model is a technique for pre-processing text by converting it into a numeric/vector format that keeps a count of the total occurrences of each word in a document. The model is usually visualized as a table that maps each word to its count. In other words, it can be described as a way to extract features from text documents and use those features to train machine learning algorithms. It builds a vocabulary of all the unique words occurring in the training set of documents.
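The idea can be sketched in a few lines of plain Python (a minimal sketch with a toy two-sentence corpus, assumed purely for illustration): build the vocabulary of unique words, then represent each document by its word counts over that vocabulary.

```python
from collections import Counter

# A hypothetical toy corpus, assumed for illustration
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# The vocabulary: every unique word across the corpus, in sorted order
vocabulary = sorted({word for doc in documents for word in doc.split()})

def bag_of_words(doc):
    # Count each word in the document, then read the counts off
    # in vocabulary order to get a fixed-length vector
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc) for doc in documents]
print(vocabulary)
print(vectors)
```

Every document is mapped to a vector of the same length (the vocabulary size), which is exactly the fixed-size numeric input that machine learning algorithms expect.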


Let's take a popular example to explain Bag of Words (BoW) and make this journey of understanding a smoother one. We all love online shopping, and it's always worth looking at reviews for a product before deciding to buy it. So we'll use that scenario here.

Here's a sample of reviews for a particular cosmetic product:

Review 1: This product is useful and fancy
Review 2: This product is useful but not trending
Review 3: This product is awesome and fancy

We could easily collect a hundred such contrasting reviews about the product and its features. There are lots of interesting insights we can draw from them, and ultimately we can predict which product is the best one for us to buy.

Now, the essential requirement is to process the text and convert it into a vectorized form. This can be done easily using Bag of Words, which is the simplest form of representing text as numbers.


But this text-representation technique has some pitfalls, which is why developers often prefer TF-IDF or word2vec when handling a large amount of data.

Let’s discuss them:

The first issue arises when new sentences contain new words. When that happens, the vocabulary grows, and with it the length of every vector.
Additionally, the vectors contain many 0s, resulting in a sparse matrix (which is something we would like to avoid).
Secondly, we capture no grammatical information, nor do we preserve the order of the words in the text.
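The first two pitfalls are easy to demonstrate (a minimal sketch in plain Python; the added fourth review is a hypothetical example invented to introduce unseen words):

```python
reviews = [
    "this product is useful and fancy",
    "this product is useful but not trending",
    "this product is awesome and fancy",
]

def vocab_of(docs):
    # All unique words across the corpus, in sorted order
    return sorted({w for d in docs for w in d.split()})

def vectorize(docs):
    vocab = vocab_of(docs)
    return [[d.split().count(w) for w in vocab] for d in docs]

before = vectorize(reviews)
vocab_size_before = len(before[0])

# A new review full of unseen words (hypothetical example):
# the vocabulary grows, so every vector gets longer
reviews.append("absolutely wonderful packaging and superb quality")
after = vectorize(reviews)
vocab_size_after = len(after[0])
print(vocab_size_before, "->", vocab_size_after)

# Most entries are now zero: a sparse matrix
zeros = sum(v.count(0) for v in after)
total = sum(len(v) for v in after)
print(f"{zeros / total:.0%} of the entries are zero")
```

One new six-word review grows the vocabulary from 10 to 15 words here, and well over half of the matrix entries become zeros; with hundreds of real reviews, the sparsity only gets worse.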

