Python Simplified

PythonSimplifiedcomLogo

Text Preprocessing and Visualization Made Easy with TextHero

Introduction

When we are working on any NLP project or competition, we spend most of our time on preprocessing the text such as removing digits, punctuations, stopwords, whitespaces, etc. and sometimes visualization too. After experimenting TextHero on a couple of NLP datasets I found this library to be extremely useful for preprocessing and visualization. This will save us some time writing custom functions. Aren’t you excited!!? So let’s dive in.

We will apply techniques that we are going to learn in this article to Kaggle’s Spooky Author Identification dataset. You can find the dataset here. The complete code is given at the end of the article.

Texthero

Installation

				
					pip install texthero
				
			

Preprocessing

As the name itself says clean method is used to clean the text. By default, the clean method applies 7 default pipelines to the text.

  1. fillna(s)
  2. lowercase(s)
  3. remove_digits()
  4. remove_punctuation()
  5. remove_diacritics()
  6. remove_stopwords()
  7. remove_whitespace()
				
					from texthero import preprocessing
df['clean_text'] = preprocessing.clean(df['text'])
				
			

We can confirm the default pipelines used with the below code:

texthero example 1

Apart from the above 7 default pipelines, TextHero provides many more pipelines that we can use. See the complete list here with descriptions. These are very useful as we deal with all these during text preprocessing.

Based on our requirements, we can also have our custom pipelines as shown below. Here in this example, we are using two pipelines. However, we can use as many pipelines as we want.

				
					from texthero import preprocessing 
custom_pipeline = [preprocessing.fillna, preprocessing.lowercase] 
df['clean_text'] = preprocessing.clean(df['text'], custom_pipeline)
				
			

NLP

As of now, this NLP functionality provides only dependency_parse, named_entity,  and noun_phrases methods. See the sample code below. I believe, more functionalities will be added later.

Named entity

				
					s = pd.Series(“Narendra Damodardas Modi is an Indian politician serving as the 14th and current Prime Minister of India since 2014”)
print(nlp.named_entities(s)[0])
				
			

Output:

				
					[('Narendra Damodardas Modi', 'PERSON', 0, 24),  
('Indian', 'NORP', 31, 37),  
('14th', 'ORDINAL', 64, 68),  
('India', 'GPE', 99, 104),  
('2014', 'DATE', 111, 115)]
				
			

Noun phrases

				
					s = pd.Series(“Narendra Damodardas Modi is an Indian politician serving as the 14th and current Prime Minister of India since 2014”)
print(nlp.noun_chunks(s)[0])
				
			

Output:

				
					[(‘Narendra Damodardas Modi’, ‘NP’, 0, 24), 
(‘an Indian politician’, ‘NP’, 28, 48), 
(‘the 14th and current Prime Minister’, ‘NP’, 60, 95), 
(‘India’, ‘NP’, 99, 104)]
				
			

Representation

This functionality is used to map text data into vectors (Term Frequency, TF-IDF), for clustering (kmeans, dbscan, meanshift) and also for dimensionality reduction (PCA, t-SNE, NMF).

Let’s look at an example with TF-IDF and PCA on the Spooky author identification train dataset.

				
					train['pca'] = (
    train['text']
    .pipe(preprocessing.clean)
    .pipe(representation.tfidf, max_features=1000)
    .pipe(representation.pca)
)
visualization.scatterplot(train, 'pca', color='author', title="Spooky Author identification")
				
			

Output:

texthero example 2

Visualization

This functionality is used for plotting scatter-plotsword cloud, and also used to get top n words from the text. Refer to the examples below.

Scatter-plot example

				
					train['tfidf'] = (
    train['text']
    .pipe(preprocessing.clean)
    .pipe(representation.tfidf, max_features=1000)
)

train['kmeans_labels'] = (
    train['tfidf']
    .pipe(representation.kmeans, n_clusters=3)
    .astype(str)
)

train['pca'] = train['tfidf'].pipe(representation.pca)

visualization.scatterplot(train, 'pca', color='kmeans_labels', title="K-means Spooky author")
				
			

Output:

texthero example 3

Wordcloud example

				
					from texthero import visualization
visualization.wordcloud(train['clean_text'])
				
			

Output:

texthero example 4

Top words example

texthero example 5

Code

Conclusion

We have gone thru most of the functionalities provided by TextHero. I found that most of the features are really useful which we can try to use them for the next NLP project for data cleaning/preprocessing and quick visualization of the data.

References

Share on facebook
Share on twitter
Share on linkedin
Share on whatsapp
Share on email
Chetan Ambi

Chetan Ambi

A Software Engineer & Team Lead with over 10+ years of IT experience, a Technical Blogger with a passion for cutting edge technology. Currently working in the field of Python, Machine Learning & Data Science. Chetan Ambi holds a Bachelor of Engineering Degree in Computer Science.
Scroll to Top