Stemming and Lemmatization in Python NLTK with Examples
⚡ Smart Summary
Stemming and Lemmatization are text normalization techniques in Python NLTK that reduce word variations to a common base, where stemming crops suffixes quickly without context and lemmatization returns a true dictionary word using meaning.

What is Stemming and Lemmatization in Python NLTK?
Stemming and lemmatization in Python NLTK are text normalization techniques for Natural Language Processing, and both are widely used in text preprocessing. The difference is that stemming is faster because it strips word endings by rule without considering context, while lemmatization is slower because it uses a word's meaning and part of speech to find its dictionary form.
What is Stemming?
Stemming is a method of normalizing words in Natural Language Processing. It reduces the inflected forms of a word to a common stem so that they can be looked up and counted as a single item. Words that share a meaning but vary with tense, number, or sentence context are mapped to the same form. In other words, there is one root word with many variations. For example, the root word is “eat” and its variations include “eats” and “eating”. With stemming in Python, we can reduce such regular variations to a shared stem, although irregular forms such as “eaten” are not always reduced, and the stem is not guaranteed to be a dictionary word.
For example:
He was riding. He was taking the ride.
In the above two sentences, the meaning is the same: a riding activity in the past. A human can easily see that, but to a machine “riding” and “ride” are different tokens, so the two sentences produce different features and are hard to map to the same data row. If equivalent inputs are represented inconsistently, a model trained on them predicts poorly. It is therefore necessary to normalize each word when preparing a dataset for machine learning. Stemming does this by grouping words under their common root.
Let us implement this with a Python program. NLTK provides a class named PorterStemmer, which implements the Porter stemming algorithm. Its stem() method takes a single word and returns its stem, so a list of tokenized words is processed one word at a time.
Program for Understanding Stemming
from nltk.stem import PorterStemmer

e_words = ["wait", "waiting", "waited", "waits"]
ps = PorterStemmer()
for w in e_words:
    rootWord = ps.stem(w)
    print(rootWord)
Output:
wait
wait
wait
wait
Code Explanation:
- NLTK has a stem module. Because only one class is needed, we import just “PorterStemmer” from it instead of referring to the whole module.
- We prepared a dummy list of variation data of the same word.
- An object is created that belongs to the class nltk.stem.porter.PorterStemmer.
- We passed each word to the stemmer's stem() method one by one using a “for” loop. Finally, we got the root word of each word in the list.
From the above, stemming is an important preprocessing step because it removes redundant variations of the same word. As a result, the data is normalized, which helps in training a better model. Now we pass a complete sentence and check its output.
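As a minimal sketch, the same stemmer can be applied to the two forms from the riding example earlier. It assumes NLTK is installed; the names `variants` and `stems` are illustrative and not part of the original program.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The two surface forms from the example sentences about riding.
variants = ["riding", "ride"]

# Map each surface form to the stem that PorterStemmer produces for it.
stems = {word: stemmer.stem(word) for word in variants}

for word, stem in stems.items():
    print(word, "->", stem)

# Check whether both forms collapse to a single shared stem.
print("same stem:", len(set(stems.values())) == 1)
```

If both forms share one stem, the two sentences contribute the same token to a feature set, which is the effect described above.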
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

sentence = "Hello Guru99, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)
ps = PorterStemmer()
for w in words:
    rootWord = ps.stem(w)
    print(rootWord)
Output:
hello
guru99
,
you
have
to
build
a
veri
good
site
and
i
love
visit
your
site
.
Code Explanation:
- The class PorterStemmer is imported from the module stem.
- The functions sent_tokenize and word_tokenize are imported for tokenizing sentences and words. Only word_tokenize is used here.
- A sentence is written and then split into tokens with word_tokenize.
- An object of PorterStemmer is created and assigned to ps.
- A loop is run, and each token is stemmed using the ps object.
In short, stemming is a data-preprocessing step. The English language has many variations of a single word, and these variations add noise to machine learning training and prediction. To build a successful model, it helps to reduce such words to a common form using stemming. This is one form of text normalization.
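The effect on vocabulary size can be sketched as follows. This is an illustration under stated assumptions: the sample text is invented, and a simple regular expression stands in for word_tokenize so that no NLTK data download is needed.

```python
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Invented sample text containing several variations of "wait".
text = "He waits. She waited. They are waiting and will wait."

# Assumption for this sketch: a regex tokenizer that keeps lowercase letter runs.
tokens = re.findall(r"[a-z]+", text.lower())

# Stem every token so that variations of one word share a single form.
stemmed = [stemmer.stem(token) for token in tokens]

print("unique tokens:", len(set(tokens)))
print("unique stems: ", len(set(stemmed)))
```

The count of unique stems is smaller than the count of unique tokens because the four forms of "wait" collapse into one entry.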
What is Lemmatization?
Lemmatization in NLTK is the algorithmic process of finding the lemma of a word depending on its meaning and context. Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It helps in returning the base or dictionary form of a word, known as the lemma. The NLTK lemmatizer, WordNetLemmatizer, is based on WordNet’s built-in morphy function. Text preprocessing can include both stemming and lemmatization. Many people confuse the two terms or treat them as the same, but they differ. Lemmatization is often preferred over stemming for the reasons below.
Why is Lemmatization better than Stemming?
The stemming algorithm works by cutting the suffix from the word. In a broader sense, it cuts either the beginning or the end of the word. In contrast, lemmatization is a more powerful operation that takes the morphological analysis of words into account. It returns the lemma, which is the base form shared by all inflectional forms of a word. In-depth linguistic knowledge is required to create the dictionaries and to look up the proper form of a word. Stemming is a general, rule-based operation, while lemmatization looks up the proper form in a dictionary. Hence, lemmatization often helps in forming better machine learning features.
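The contrast between cutting suffixes and looking words up can be sketched with a toy example in plain Python. Neither function is NLTK's actual algorithm: `SUFFIXES`, `LEMMA_TABLE`, `naive_stem`, and `naive_lemma` are hypothetical names, and the lookup table is hand-written for illustration only.

```python
# Toy illustration only: neither function reproduces NLTK's behavior.

# Hypothetical, deliberately tiny list of suffixes to strip.
SUFFIXES = ("ing", "ies", "es", "s")

# Hypothetical hand-written table standing in for a real dictionary such as WordNet.
LEMMA_TABLE = {
    "studies": "study",
    "studying": "study",
    "cries": "cry",
    "better": "good",
}


def naive_stem(word):
    """Strip the first matching suffix without checking that the result is a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def naive_lemma(word):
    """Look the word up in the table and fall back to the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)


for word in ["studies", "studying", "cries", "better"]:
    print(word, "| stem:", naive_stem(word), "| lemma:", naive_lemma(word))
```

The rule-based function can return fragments that are not words, while the lookup returns a dictionary form but only for words the table knows, which is why building such dictionaries takes linguistic effort.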
Code to distinguish between Lemmatization and Stemming
Stemming Code:
import nltk
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))
Output:
Stemming for studies is studi
Stemming for studying is studi
Stemming for cries is cri
Stemming for cry is cri
Lemmatization Code:
import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))
Output:
Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry
Discussion of Output
For studies and studying, the stemmer returns the same output (studi), which is not a dictionary word. The NLTK lemmatizer returns real words: study for studies and studying for studying. The second token is left unchanged because lemmatize treats every word as a noun unless a part of speech is supplied. So when we need to build a feature set to train a model, lemmatization is usually the better choice.
Use Case of Lemmatizer
The lemmatizer minimizes text ambiguity. Words like bicycle and bicycles are converted to the base word bicycle. It converts words that have the same meaning but different surface forms to their base form, which shrinks the vocabulary and helps prepare accurate features for training a model. The cleaner the data, the more accurate your machine learning model is likely to be. A smaller vocabulary can also reduce memory use and computational cost in later processing.
Practical example showing use of WordNet Lemmatization and POS Tagging in Python:
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = "guru99 is a totally new kind of learning experience."
tokens = word_tokenize(text)
lemma_function = WordNetLemmatizer()
for token, tag in pos_tag(tokens):
    lemma = lemma_function.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)
Output:
guru99 => guru99
is => be
a => a
totally => totally
new => new
kind => kind
of => of
learning => learn
experience => experience
. => .
Code Explanation:
- The corpus reader wordnet is imported as wn, and WordNetLemmatizer is imported from nltk.stem.wordnet.
- word_tokenize and pos_tag are imported from nltk, and defaultdict from collections.
- A dictionary is created where the first letter of the POS tag is the key, mapped to the matching WordNet part-of-speech constant. Any other tag falls back to wn.NOUN.
- The text is written and tokenized, and the object lemma_function is created for use inside the loop.
- The loop is run, and lemmatize takes two arguments: the token and the WordNet part of speech looked up from the first letter of its tag.
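The tag lookup on its own can be sketched without NLTK. WordNet's part-of-speech constants are single letters, so they are written out as plain strings here; the list of sample tags is an assumption chosen to resemble the Penn Treebank tags that pos_tag returns.

```python
from collections import defaultdict

# Plain-string equivalents of wn.NOUN, wn.VERB, wn.ADJ and wn.ADV,
# written out so the sketch runs without NLTK.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

# Any tag whose first letter is not J, V or R falls back to NOUN.
tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

# A few sample Penn Treebank style tags; only the first letter is used as the key.
for tag in ["VBG", "JJ", "RB", "NN", "DT"]:
    print(tag, "->", tag_map[tag[0]])
```

Using a defaultdict keeps the mapping short: only the three non-noun categories need explicit entries, and every other tag, including punctuation, is lemmatized as a noun.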
Python Lemmatization has a close relation with the WordNet dictionary, so it is essential to study that topic next.
