NLP: Text Cleaning & Preprocessing Methods

NLP, or natural language processing, is a convergence of linguistics, computer science, machine learning, and artificial intelligence. It is a set of algorithms that aims to analyze and model high volumes of human language. In simple words, it acts as a bridge between human language and computers and provides users with useful results.

As we all know, computers understand only binary code, 0s and 1s, rather than words, so every piece of text has to be converted into a numerical representation before it can be processed. That is why so much research and development is constantly happening in natural language processing.

Some practical implementations of NLP are Microsoft Cortana, Amazon Alexa, Apple Siri, and Google Assistant. With smart coding and applications of ML and AI, they can analyze and understand the questions we ask and answer within a few seconds, whether about weather updates or the news. Spam mail filtering is another everyday example of NLP.

But the applications of NLP are not limited to the above: it also plays a big role in text classification, sentiment analysis, and text summarization, among other tasks. The input data always arrives the way humans naturally produce it, as sentences and paragraphs. During processing, NLP converts human language into a machine-understandable form by reducing variations of words to their root format before producing the desired output, which is why it plays a crucial role in our everyday lives.

For building any machine learning or artificial intelligence model, data preprocessing is the fundamental step: it makes the data cleaner and helps reduce dimensionality. The Python library used for preprocessing tasks in NLP is NLTK, the Natural Language Toolkit.
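
Before running the snippets below, NLTK has to be installed (for example with pip install nltk) and a few data packages downloaded. Here is a minimal setup sketch; the resource names 'punkt', 'stopwords', and 'wordnet' are the standard downloads these examples rely on, and depending on your NLTK version additional resources (such as 'omw-1.4') may be needed:

import nltk

# one-time downloads used by the examples in this article
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # dictionary used by the lemmatizer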

The various preprocessing steps involved in NLP are:

Tokenization 

It is the process of splitting sentences into individual words, called tokens.

import nltk
from nltk.tokenize import word_tokenize

# break the sentence into word-level tokens
token = word_tokenize("My Email address is: imexpert@gmail.com")
print(token)
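
Tokenization also works at the sentence level. As a small companion sketch, NLTK's sent_tokenize splits raw text into sentences before word-level tokenization (the sample text here is illustrative):

from nltk.tokenize import sent_tokenize

text = "NLP is fun. It bridges language and computers."
# split into sentences first, then each sentence into words
print(sent_tokenize(text))
print([word_tokenize(s) for s in sent_tokenize(text)])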

Lowercasing 

It converts the tokenized words into lowercase. The words 'nlp' and 'NLP' have the same meaning, but if they are not converted to lowercase, the two are treated as non-identical words in vector space models.
lower_tokens = []
for word in token:
    # convert each token to lowercase
    lower_tokens.append(word.lower())

print(lower_tokens)

Stop Words Removal 

Stop words such as 'a', 'an', and 'the' carry little significance when distinguishing two documents, so they are usually removed, often along with punctuation.

from nltk.corpus import stopwords
from string import punctuation

stop_words = stopwords.words('english')
punct = list(punctuation)

# filter stop words and punctuation out of the tokens from the earlier example
tokens = word_tokenize("My Email address is: imexpert@gmail.com")
print(len(tokens))
filtered = [t for t in tokens if t.lower() not in stop_words and t not in punct]
print(filtered)

Stemming

It is the process in which words are reduced to their base form. In simple words, when the stemmer sees a variety of words sharing a common root, it treats them all as the same term.

from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem('jumping'))  # jump
print(ps.stem('lately'))   # late
print(ps.stem('assess'))   # assess
print(ps.stem('ran'))      # ran -- stemming misses irregular forms

Lemmatization 

The major difference between stemming and lemmatization is that lemmatization reduces words to root forms that actually exist in the language.

For example, stemming cuts 'has' down to 'ha', whereas lemmatization maps 'is' to its dictionary form 'be'.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('ran', 'v'))     # run
print(lemmatizer.lemmatize('better', 'a'))  # good
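
Putting the pieces together, the sketch below chains tokenization, lowercasing, stop word removal, and lemmatization into a single pass. The preprocess function name and the sample sentence are illustrative, and treating every token as a verb when lemmatizing is a simplification (real pipelines usually run part-of-speech tagging first):

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # tokenize and lowercase
    tokens = [t.lower() for t in word_tokenize(text)]
    # drop stop words and punctuation
    tokens = [t for t in tokens if t not in stop_words and t not in punctuation]
    # lemmatize, naively treating each token as a verb
    return [lemmatizer.lemmatize(t, 'v') for t in tokens]

print(preprocess("The dogs are running and jumping in the park"))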
