Text Classification using Bag of Words and TF-IDF with TensorFlow

Introduction

Text classification is one of the most widely used applications of Machine Learning and NLP (Natural Language Processing). It can be used to separate spam from non-spam emails, sort business documents into categories, analyze the sentiment of tweets, and so on.

In this practical guide, you’ll learn how to use Bag of Words and TF-IDF for text classification with TensorFlow. More advanced text feature extraction methods such as Word2Vec, GloVe, and fastText will be covered in future articles.

We’ll be using the IMDb Dataset of 50,000 movie reviews from Kaggle. You can download the dataset from Kaggle, or create a new notebook in Kaggle and follow along.

The code accompanying this article can also be found on GitHub and Kaggle.

Quick Recap

Before we delve into coding, let’s quickly recap how text features are extracted with the Bag of Words and TF-IDF methods:

Bag of Words (CountVectorizer): The collection of text documents is represented as a matrix with one row per document and one column per word in the vocabulary. Each cell holds the number of times that word occurs in that document.

TF-IDF: The matrix has the same shape, but each cell holds a TF-IDF (Term Frequency-Inverse Document Frequency) value instead of a raw count, so words that occur in many documents are weighted down.
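To make the two representations concrete, here is a minimal plain-Python sketch on a made-up three-sentence corpus. The sample sentences and variable names are illustrative assumptions, and the TF-IDF step uses the textbook formula count × log(N / df). Scikit-learn and TensorFlow apply their own smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Made-up corpus, for illustration only.
docs = [
    "the movie was good",
    "the movie was bad",
    "good acting good story",
]

# Tokenize on whitespace and build a sorted vocabulary over all documents.
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Bag of Words: one row per document, one column per vocabulary word.
bow = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

# Document frequency: the number of documents each word appears in.
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

# Textbook TF-IDF: raw count times log(N / document frequency).
n_docs = len(docs)
tfidf = [
    [count * math.log(n_docs / doc_freq[word]) for count, word in zip(row, vocab)]
    for row in bow
]

print(vocab)
for counts, weights in zip(bow, tfidf):
    print(counts, [round(w, 2) for w in weights])
```

In the third sentence, "good" has a count of 2 but appears in two of the three documents, so its TF-IDF weight ends up lower than that of "acting", which occurs once and in only one document.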

You have probably used both with Scikit-learn. In this article, you’ll implement both methods directly in TensorFlow. Let’s get started by importing all the libraries.

Import libraries

Import all the required libraries and modules.

import os
import numpy as np
import pandas as pd
pd.options.plotting.backend = "plotly"  # DataFrame.plot() will render with plotly
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import losses
from tensorflow.keras.regularizers import L2  # used by the Dense layers in the models
from tensorflow.keras.utils import plot_model
from tensorflow.keras import Sequential
print("Tensorflow version:", tf.__version__)

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score  # used to score the test predictions

Import dataset and EDA

The dataset for this article comes from Kaggle and contains 50,000 reviews. Positive and negative reviews are equally distributed (balanced classes), so we don’t have to worry about class imbalance issues.

df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

df['sentiment'].value_counts().plot(kind='bar')

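A quick look at the first two reviews and their sentiment (for example, with df.head(2)) shows that the labels are stored as the strings positive and negative. Map them to 1 and 0 so that the model can train on them:

df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})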

Data cleaning is an important step in the NLP pipeline. Though we are skipping this step here to keep this article short, we strongly suggest you apply the necessary data-cleaning techniques such as punctuation removal, stop-word removal, lemmatization, etc.

However, note that the TextVectorization layer we use for text feature extraction in the next section applies default text cleaning (lowercasing and punctuation stripping) and can optionally accept a custom function for data cleaning.
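As one possible illustration of such cleaning, the sketch below lowercases a review, drops HTML line-break tags that can appear in scraped review text, and strips punctuation. The function name clean_review is hypothetical, and it is written in plain Python for use on the pandas column. A function passed to TextVectorization itself would need to operate on tensors instead.

```python
import re
import string

# Character class matching any ASCII punctuation character.
PUNCTUATION = f"[{re.escape(string.punctuation)}]"


def clean_review(text):
    """Lowercase, remove <br> tags and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)   # HTML line breaks -> space
    text = re.sub(PUNCTUATION, "", text)      # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace


print(clean_review("Great movie!<br /><br />Loved it."))
```

With pandas, it could be applied as df['review'].map(clean_review) before splitting the data.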

Train test split

The next step is to split the data. The dataset comes as a single file, IMDB Dataset.csv, so let’s split it into train, validation, and test sets.

train, test = train_test_split(df, test_size=0.2, stratify=df['sentiment'])
train, validation = train_test_split(train, test_size=0.2, stratify=train['sentiment'])

train.shape, validation.shape, test.shape

Next, create TensorFlow dataset objects for the training and validation sets using the from_tensor_slices() method, as shown below.

training_dataset = tf.data.Dataset.from_tensor_slices(
    (train['review'].values, train['sentiment'].values)).batch(batch_size=32)

validation_dataset = tf.data.Dataset.from_tensor_slices(
    (validation['review'].values, validation['sentiment'].values)).batch(batch_size=32)
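The split sizes and batch counts implied by the code above can be checked with quick arithmetic. This is a sketch under the assumption that the test share is rounded up when it is not a whole number; the helper name split_sizes is hypothetical.

```python
import math


def split_sizes(n_rows, test_percent):
    """Return (train_rows, test_rows) for an integer test percentage."""
    test_rows = math.ceil(n_rows * test_percent / 100)
    return n_rows - test_rows, test_rows


total_reviews = 50_000
batch_size = 32

# First split: 80% train+validation, 20% test.
train_val_rows, test_rows = split_sizes(total_reviews, 20)
# Second split: 20% of the remaining rows become the validation set.
train_rows, val_rows = split_sizes(train_val_rows, 20)

# Number of batches each tf.data dataset yields per epoch.
train_batches = math.ceil(train_rows / batch_size)
val_batches = math.ceil(val_rows / batch_size)

print(train_rows, val_rows, test_rows)  # 32000 8000 10000
print(train_batches, val_batches)       # 1000 250
```

Because both splits are stratified on sentiment, each of the three sets keeps the 50/50 class balance of the full dataset.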

Text feature extraction

As mentioned earlier, CountVectorizer and TF-IDF are the commonly used text feature extraction methods. We can achieve this directly on the pandas dataframe using Scikit-learn’s CountVectorizer and TfidfVectorizer.

But we’ll use the TextVectorization layer provided by TensorFlow to implement Bag of Words and TF-IDF. Setting the output_mode parameter to count or tf_idf gives Bag of Words or TF-IDF output, respectively. Note that output_mode also accepts two more values, int and multi_hot, that we are not covering in this article.

The advantage of using TextVectorization is that you can add the vectorization as another layer when building the model, which you’ll see in the next section.

The syntax for TextVectorization:

tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    ngrams=None,
    output_mode='int',
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    idf_weights=None,
    sparse=False,
    ragged=False,
    encoding='utf-8',
    **kwargs
)

max_tokens: the maximum size of the vocabulary, not the length of an input string. The vocabulary includes the out-of-vocabulary (OOV) token, so in count and tf_idf modes at most max_tokens - 1 distinct tokens are kept. It must be set if pad_to_max_tokens is True.
standardize: how to clean the text. The default, lower_and_strip_punctuation, converts the text to lowercase and then removes all punctuation. The other options are lower, strip_punctuation, and None. You can also write a custom standardization function and pass the function itself.
split: how to split the text into tokens. The values are None, whitespace, and character. The default is whitespace.
ngrams: which n-grams to create from the split tokens. The possible values are None, an integer, or a tuple of integers. If an integer is passed, all n-grams up to that size are created. If a tuple of integers is passed, n-grams are created only for the sizes listed in the tuple.
output_mode: controls the output. The possible values are int, multi_hot, count, and tf_idf.
output_sequence_length: used only if output_mode is int. The output is padded or truncated to exactly output_sequence_length values.
pad_to_max_tokens: used only if output_mode is multi_hot, count, or tf_idf. If True, the output is padded with zeros to length max_tokens even when the vocabulary contains fewer than max_tokens tokens.
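The difference between passing an integer and a tuple for ngrams is easiest to see on a short sentence. The following is a plain-Python sketch of the idea, not the TensorFlow implementation; the function name make_ngrams and the sample sentence are assumptions made for illustration.

```python
def make_ngrams(tokens, ngrams):
    """Join neighbouring tokens into n-grams.

    An int means "all sizes from 1 up to that value";
    a tuple means "exactly the sizes listed".
    """
    widths = range(1, ngrams + 1) if isinstance(ngrams, int) else ngrams
    result = []
    for width in widths:
        for start in range(len(tokens) - width + 1):
            result.append(" ".join(tokens[start:start + width]))
    return result


tokens = "this movie was great".split()

print(make_ngrams(tokens, 2))
# ['this', 'movie', 'was', 'great', 'this movie', 'movie was', 'was great']

print(make_ngrams(tokens, (2, 3)))
# ['this movie', 'movie was', 'was great', 'this movie was', 'movie was great']
```

Note that a tuple such as (2, 3) drops single words entirely, so a vocabulary built this way contains only bigrams and trigrams.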

Bag of Words (Count Vectorizer): To use the Bag of Words method, set output_mode to count. The code below produces Bag of Words output with the maximum vocabulary size set to 10000.

vocab_size = 10000

count_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='count'
)

count_vectorizer.adapt(train['review'].values)

TF-IDF: To use the TF-IDF feature extraction method, set output_mode to tf_idf. The code below produces TF-IDF output with the maximum vocabulary size set to 10000. We also pass ngrams=(2, 3), so the features are bigrams and trigrams.

vocab_size = 10000

tfidf_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='tf_idf',  # value name as listed in the TextVectorization API docs
    ngrams=(2, 3)          # bigrams and trigrams only
)

tfidf_vectorizer.adapt(train['review'].values)

After creating the TextVectorization objects, you should run the adapt method on the train set. This is similar to the fit method in Scikit-learn. The adapt method learns the vocabulary and stores it in the object. You can view the learned vocabulary using the get_vocabulary() method.

By passing sample text to the TextVectorization objects, you can inspect the Bag of Words or TF-IDF output. The same thing happens when you predict the sentiment of a text. Notice in the next section that the vectorizer is added as the first layer of the model, so the input text is vectorized first, then fed into the dense layers, and finally into the prediction layer.
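To build intuition for what adapt and a subsequent call do in count mode, here is a simplified plain-Python stand-in. The class name TinyCountVectorizer and the sample texts are hypothetical, and the sketch only mimics the idea (learn the most frequent tokens, reserve index 0 for out-of-vocabulary words); the real layer's tie-breaking and standardization may differ.

```python
from collections import Counter


class TinyCountVectorizer:
    """Toy stand-in for a count-mode text vectorizer."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.vocabulary = ["[UNK]"]  # index 0 collects unknown words

    @staticmethod
    def _tokenize(text):
        return text.lower().split()

    def adapt(self, texts):
        """Learn the (max_tokens - 1) most frequent tokens."""
        counts = Counter(tok for text in texts for tok in self._tokenize(text))
        top = [tok for tok, _ in counts.most_common(self.max_tokens - 1)]
        self.vocabulary = ["[UNK]"] + top

    def __call__(self, text):
        """Return one count per vocabulary entry."""
        index = {tok: i for i, tok in enumerate(self.vocabulary)}
        row = [0] * len(self.vocabulary)
        for tok in self._tokenize(text):
            row[index.get(tok, 0)] += 1  # unknown words go to index 0
        return row


vectorizer = TinyCountVectorizer(max_tokens=4)
vectorizer.adapt(["good movie", "good good film", "bad movie"])

print(vectorizer.vocabulary)               # ['[UNK]', 'good', 'movie', 'film']
print(vectorizer("good bad movie movie"))  # [1, 1, 2, 0]
```

The word "bad" did not make it into the three-word vocabulary, so it is counted in the out-of-vocabulary slot at index 0.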

Model

Now we have everything ready to build the neural network model. In this section, we will build a separate model for each of Bag of Words and TF-IDF, compile it, and train it. Notice that the only difference between the two models is the first layer, i.e. count_vectorizer in the Bag of Words model and tfidf_vectorizer in the TF-IDF model.

Bag of Words model

# Named model_1 so that the prediction step can refer to the Bag of Words model.
model_1 = Sequential([
    count_vectorizer,  # raw text -> word-count vector
    layers.Dense(64, activation='relu', kernel_regularizer=L2(1e-3)),
    layers.Dropout(0.1),
    layers.Dense(64, activation='relu', kernel_regularizer=L2(1e-3)),
    layers.Dropout(0.1),
    layers.Dense(1, activation='sigmoid')  # probability of positive sentiment
])

model_1.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
)

history_1 = model_1.fit(
    training_dataset,
    steps_per_epoch=len(training_dataset),
    epochs=5,
    validation_data=validation_dataset,
    validation_steps=len(validation_dataset)
)
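As a rough sanity check on the size of this network, the trainable parameters of the three Dense layers can be counted by hand. The sketch assumes the vectorizer emits one vector of length vocab_size (10000) per review; dense_params is a hypothetical helper name, and the model's own summary() method reports the actual figures.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


vocab_size = 10_000
layer_units = [64, 64, 1]  # the three Dense layers, in order

n_inputs = vocab_size
total_params = 0
for units in layer_units:
    params = dense_params(n_inputs, units)
    print(f"Dense({units}): {params:,} parameters")
    total_params += params
    n_inputs = units  # this layer's output feeds the next layer

print(f"Total: {total_params:,}")  # Total: 644,289
```

Almost all of the parameters sit in the first Dense layer, which is why the vocabulary size is one of the most influential settings to tune.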

TF-IDF model

# Same architecture as the Bag of Words model; only the first layer differs.
model_2 = Sequential([
    tfidf_vectorizer,  # raw text -> TF-IDF vector over bigrams and trigrams
    layers.Dense(64, activation='relu', kernel_regularizer=L2(1e-3)),
    layers.Dropout(0.1),
    layers.Dense(64, activation='relu', kernel_regularizer=L2(1e-3)),
    layers.Dropout(0.1),
    layers.Dense(1, activation='sigmoid')  # probability of positive sentiment
])

model_2.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
)

history_2 = model_2.fit(
    training_dataset,
    steps_per_epoch=len(training_dataset),
    epochs=5,
    validation_data=validation_dataset,
    validation_steps=len(validation_dataset)
)
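Both models are compiled with binary cross-entropy on a sigmoid output, and each hidden layer adds an L2 penalty on its weights. The following plain-Python sketch shows these pieces for a single example. The function names and sample numbers are illustrative assumptions, and Keras' own implementation works on batches of tensors.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def l2_penalty(weights, factor=1e-3):
    """L2 regularization term: factor times the sum of squared weights."""
    return factor * sum(w * w for w in weights)


p = sigmoid(2.0)                   # a fairly confident "positive" prediction
print(binary_crossentropy(1, p))   # small loss: the label agrees
print(binary_crossentropy(0, p))   # large loss: the label disagrees
print(l2_penalty([1.0, -2.0]))     # penalty added on top of the data loss
```

The penalty grows with the squared size of the weights, which nudges the optimizer toward smaller weights and helps limit overfitting on a 10000-dimensional input.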

Prediction

We have tried two different models, based on Bag of Words and TF-IDF. The Bag of Words model gave us the best accuracy. Let’s get predictions on unseen test data using the Bag of Words model. Since we have the true labels for the test data, we can compute the test accuracy as well.

# Round the predicted probabilities to 0/1 and flatten to a 1-D array of labels.
predictions = model_1.predict(test['review'].values).round().ravel()
accuracy_score(test['sentiment'].values, predictions)
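What the rounding and scoring steps compute can be reproduced by hand on a few made-up values. The probabilities and labels below are invented for illustration, and the helper name accuracy is hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Made-up sigmoid outputs and true labels, for illustration only.
probabilities = [0.91, 0.12, 0.55, 0.40]
true_labels = [1, 0, 0, 0]

# Rounding a probability is the same as thresholding it at 0.5.
predicted_labels = [1 if p > 0.5 else 0 for p in probabilities]

print(predicted_labels)                         # [1, 0, 1, 0]
print(accuracy(true_labels, predicted_labels))  # 0.75
```

The third review is predicted positive with probability 0.55 but is labeled negative, so three of the four predictions are correct.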

In this article, we built a simple sentiment analysis model using the Bag of Words and TF-IDF methods with TensorFlow. The Bag of Words model achieved a decent accuracy of ~86% on validation and test data. You can further improve the model with different techniques: tuning TextVectorization parameters (vocab size, etc.), regularization, the dropout rate, tuning the model’s hyperparameters with KerasTuner, etc.

References

https://www.tensorflow.org/tutorials
https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization

Chetan Ambi

A Software Engineer & Team Lead with over 10 years of IT experience and a Technical Blogger with a passion for cutting-edge technology. Currently working in the field of Python, Machine Learning & Data Science. Chetan Ambi holds a Bachelor of Engineering Degree in Computer Science.
