Text Preprocessing and Visualization Made Easy with TextHero

Table of Contents

Introduction

When we are working on any NLP project or competition, we spend most of our time on preprocessing the text such as removing digits, punctuations, stopwords, whitespaces, etc. and sometimes visualization too. After experimenting TextHero on a couple of NLP datasets I found this library to be extremely useful for preprocessing and visualization. This will save us some time writing custom functions. Aren’t you excited!!? So let’s dive in.

We will apply techniques that we are going to learn in this article to Kaggle’s Spooky Author Identification dataset. You can find the dataset here. The complete code is given at the end of the article.

Texthero

Installation

				
					pip install texthero

Preprocessing

As the name itself says clean method is used to clean the text. By default, the clean method applies 7 default pipelines to the text.

fillna(s)
lowercase(s)
remove_digits()
remove_punctuation()
remove_diacritics()
remove_stopwords()
remove_whitespace()

				
					from texthero import preprocessing
df['clean_text'] = preprocessing.clean(df['text'])

We can confirm the default pipelines used with the below code:

Apart from the above 7 default pipelines, TextHero provides many more pipelines that we can use. See the complete list here with descriptions. These are very useful as we deal with all these during text preprocessing.

Based on our requirements, we can also have our custom pipelines as shown below. Here in this example, we are using two pipelines. However, we can use as many pipelines as we want.

				
					from texthero import preprocessing 
custom_pipeline = [preprocessing.fillna, preprocessing.lowercase] 
df['clean_text'] = preprocessing.clean(df['text'], custom_pipeline)

NLP

As of now, this NLP functionality provides only dependency_parse, named_entity, and noun_phrases methods. See the sample code below. I believe, more functionalities will be added later.

Named entity

				
					s = pd.Series(“Narendra Damodardas Modi is an Indian politician serving as the 14th and current Prime Minister of India since 2014”)
print(nlp.named_entities(s)[0])

Output:

				
					[('Narendra Damodardas Modi', 'PERSON', 0, 24),  
('Indian', 'NORP', 31, 37),  
('14th', 'ORDINAL', 64, 68),  
('India', 'GPE', 99, 104),  
('2014', 'DATE', 111, 115)]

Noun phrases

				
					s = pd.Series(“Narendra Damodardas Modi is an Indian politician serving as the 14th and current Prime Minister of India since 2014”)
print(nlp.noun_chunks(s)[0])

Output:

				
					[(‘Narendra Damodardas Modi’, ‘NP’, 0, 24), 
(‘an Indian politician’, ‘NP’, 28, 48), 
(‘the 14th and current Prime Minister’, ‘NP’, 60, 95), 
(‘India’, ‘NP’, 99, 104)]

Representation

This functionality is used to map text data into vectors (Term Frequency, TF-IDF), for clustering (kmeans, dbscan, meanshift) and also for dimensionality reduction (PCA, t-SNE, NMF).

Let’s look at an example with TF-IDF and PCA on the Spooky author identification train dataset.

				
					train['pca'] = (
    train['text']
    .pipe(preprocessing.clean)
    .pipe(representation.tfidf, max_features=1000)
    .pipe(representation.pca)
)
visualization.scatterplot(train, 'pca', color='author', title="Spooky Author identification")

Output:

Visualization

This functionality is used for plotting scatter-plots, word cloud, and also used to get top n words from the text. Refer to the examples below.

Scatter-plot example

				
					train['tfidf'] = (
    train['text']
    .pipe(preprocessing.clean)
    .pipe(representation.tfidf, max_features=1000)
)

train['kmeans_labels'] = (
    train['tfidf']
    .pipe(representation.kmeans, n_clusters=3)
    .astype(str)
)

train['pca'] = train['tfidf'].pipe(representation.pca)

visualization.scatterplot(train, 'pca', color='kmeans_labels', title="K-means Spooky author")

Output:

Wordcloud example

				
					from texthero import visualization
visualization.wordcloud(train['clean_text'])

Output:

Top words example

Code

Conclusion

We have gone thru most of the functionalities provided by TextHero. I found that most of the features are really useful which we can try to use them for the next NLP project for data cleaning/preprocessing and quick visualization of the data.

References

[1]. Texthero · Text preprocessing, representation and visualization from zero to hero.

Chetan Ambi

A Software Engineer & Team Lead with over 10+ years of IT experience, a Technical Blogger with a passion for cutting edge technology. Currently working in the field of Python, Machine Learning & Data Science. Chetan Ambi holds a Bachelor of Engineering Degree in Computer Science.

Text Preprocessing and Visualization Made Easy with TextHero

Introduction

Texthero

Installation

Preprocessing

NLP

Named entity

Noun phrases

Representation

Visualization

Scatter-plot example

Wordcloud example

Top words example

Code

Conclusion

References

Chetan Ambi

Chetan Ambi

Recent Posts

Text Classification using Bag of Words and TF-IDF with TensorFlow

Transfer Learning for Image Classification with TensorFlow

Image Classification using CNN and TensorFlow 2

Follow Us

Follow python simplified

Get In Touch

Subscribe to our Newsletter!

Loaded with interesting and short articles on Python, Machine Learning & Data Science!

Subscribe to our Monthly Newsletter!

Loaded with interesting and short articles on Python, Machine Learning & Data Science!

Follow python simplified

Get In Touch