Part 1: NLP Basics: Demystifying AI For Everyone
Let’s get started in the age of ChatGPT. Over the years, humans have developed effective ways to communicate with one another, and speech is one of the most common. We communicate with one another using different languages, such as English, German, French, and Hindi.
Natural Language Processing (NLP) is one component of Artificial Intelligence that helps computers understand and process human language.
NLP is used to build language models that let machines work with human languages. Ex:- GPT-3, the third generation of OpenAI’s Generative Pre-trained Transformer language model.
Why do we care about NLP? NLP is something that we all use in our daily lives, whether we know it or not.
Ever wonder how auto-correction suggestions are generated when you type messages? Or how Google Lens interprets the words in images?
NLP powers all of these. Let’s take a look at a few examples of NLP applications.
Natural Language Processing (NLP) use cases:
Sentiment Analysis: This is the analysis of the sentiment a speaker or writer expresses.
Ex:- An analysis of customer reviews and tweets to find out what customers think about a company’s products.
Document Summarization: This is used to produce a short summary of large blocks of text.
Ex:- Summary of customer feedback or Book summary
Language Translation: Translate between one language and another
Ex:- English to Japanese, or vice versa.
Text-to-speech and Speech-to-text: These convert text into audio, or transcribe audio into text. The transcribed text can then be sent to the computer for further processing.
Ex:- Amazon Alexa
There are many other uses, but I hope these give you an idea of some.
Let’s now look at how computers understand text data.
Computers only understand binary information, 1s and 0s; in short, numerical information.
We need to convert text data into numerical format in order to feed it into NLP machine learning models.
However, converting text into numbers is not enough on its own. We first need to do some work to clean and organize the text data.
These are the steps of a typical text preprocessing pipeline. Some steps can be skipped depending on the context of the problem.
Remove white spaces (extra spaces in the text, present due to formatting issues)
Remove punctuation
Remove numbers
Remove stop words (common words which won’t give much information as they are present in all documents. Ex:- a, an, of, the, etc.)
Remove symbols (Ex:- @, <, $, %, etc.)
Lowercase all words
Perform stemming/lemmatization on all words (Ex:- Runs, Running, Run all become run)
As I mentioned earlier, this is just an example of a standard, general preprocessing pipeline; it should be customized on a project-by-project basis.
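The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production pipeline: the stop-word list is a tiny made-up sample, `naive_stem` is a crude hypothetical stand-in for a real stemmer or lemmatizer, and the regular expressions assume English text written in plain ASCII letters.

```python
import re

# Tiny illustrative stop-word list (an assumption; real projects use a fuller list).
STOP_WORDS = {"a", "an", "of", "the", "is", "in", "and", "to"}


def naive_stem(word):
    """Very crude suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ning", "ing", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Clean one document and return it as a single space-separated string."""
    text = text.lower()                      # lowercase all words
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)    # remove punctuation and symbols
    words = text.split()                     # split() also drops extra white space
    words = [w for w in words if w not in STOP_WORDS]   # remove stop words
    words = [naive_stem(w) for w in words]   # crude stemming
    return " ".join(words)


cleaned = preprocess("  The dog RUNS, running & Run 42 times!! ")
print(cleaned)
```

The order of the steps here differs slightly from the list (lowercasing comes first so that the stop-word check and the regular expressions only have to deal with lowercase letters). In practice a library would normally handle the stemming or lemmatization step.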
After this, we must tokenize the documents. Tokenization refers to the process of breaking down text documents into smaller pieces called tokens, typically individual words.
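As a rough sketch, word-level tokenization of already cleaned text can be as simple as splitting on white space. The function name `tokenize` and the sample documents are made up for illustration; real tokenizers handle punctuation, contractions, and other edge cases that this one ignores.

```python
def tokenize(document):
    """Split a cleaned document into word tokens on white space."""
    return document.split()


# Two hypothetical, already preprocessed documents.
documents = ["dog run fast", "cat sleep"]
tokens = [tokenize(d) for d in documents]
print(tokens)
```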
Now our input data will look like this: every word becomes a column and every sentence (or document) is a row.
This tokenized input is then vectorized.
Vectorization simply means that words are converted into numeric vectors so computers can work with them.
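To make the "words as columns, sentences as rows" idea concrete, here is a minimal sketch that counts how often each word appears in each document. Simple word counting is just one possible vectorization scheme, and the function names and sample data are assumptions made for this example.

```python
def build_vocabulary(tokenized_docs):
    """Collect every distinct token, sorted so the column order is stable."""
    return sorted({tok for doc in tokenized_docs for tok in doc})


def count_vectorize(tokenized_docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    vocab = build_vocabulary(tokenized_docs)
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(vocab)
        for tok in doc:
            row[index[tok]] += 1   # count occurrences of each word
        matrix.append(row)
    return vocab, matrix


# Two hypothetical tokenized documents.
docs = [["dog", "run", "run"], ["cat", "run"]]
vocab, matrix = count_vectorize(docs)
print(vocab)    # column labels
print(matrix)   # one numeric row per document
```

Each row is now a list of numbers that a machine learning model can take as input.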
Voila! You have now understood the basics of NLP, or, as I might put it, the core.
There are several vectorization methods:
Bag of Words (BoW)
TF-IDF
Word Embeddings
This topic needs a full article, so I will address it in the next one.
I hope you enjoyed the post. I tried to keep it simple.
Libraries take care of all the above-mentioned steps, so you don’t have to code them from scratch.
When I started learning NLP, I was afraid of everything. It was easy once I started to take an interest.
Keep learning and taking small steps towards NLP. If you’re willing to put in the effort, nothing will be difficult.
All the best on your journey.
————————————————————————————————————————————————————————————
By: Himanshu Joshi
Title: Demystifying AI for everyone: Part 1 -NLP Basics
Sourced From: medium.com/@himanshujoshi_67631/acc30c1f2c70