Toggle Menu Close

⚡ Smart Summary

Stemming and Lemmatization are text normalization techniques in Python NLTK that reduce word variations to a common base, where stemming crops suffixes quickly without context and lemmatization returns a true dictionary word using meaning.

🌱 Stemming: The PorterStemmer crops suffixes to reach a root that may not be a real word.
📚 Lemmatization: The WordNetLemmatizer returns the dictionary base form, called the lemma.
⚖️ Difference: Stemming is faster, lemmatization is slower but more accurate and context-aware.
🔤 Normalization: Both shrink word density so machines treat variants as one feature.
🏷️ POS Tags: Passing a part-of-speech tag makes the lemmatizer far more precise.
🤖 AI Benefit: Cleaner normalized text produces stronger machine learning features and models.

What is Stemming and Lemmatization in Python NLTK?

Stemming and Lemmatization in Python NLTK are text normalization techniques for Natural Language Processing. These techniques are widely used for text preprocessing. The difference between stemming and lemmatization is that stemming is faster because it cuts words without knowing the context, while lemmatization is slower because it knows the context of words before processing.

What is Stemming?

Stemming is a method of normalizing words in Natural Language Processing. It is a technique in which a set of words in a sentence are converted into a sequence to shorten the lookup. In this method, the words that have the same meaning but some variations according to the context or sentence are normalized. In other words, there is one root word, but there are many variations of the same word. For example, the root word is “eat” and its variations are “eats, eating, eaten, and so on.” In the same way, with the help of Stemming in Python, we can find the root word of any variation.

For example:

He was riding.
He was taking the ride.

In the above two sentences, the meaning is the same, that is, a riding activity in the past. A human can easily understand that both meanings are the same. But for machines, both sentences are different. Thus it becomes hard to convert them into the same data row. If we do not provide the same dataset, then the machine fails to predict. So it is necessary to differentiate the meaning of each word to prepare the dataset for machine learning. Here stemming is used to categorize the same type of data by getting its root word.

Let us implement this with a Python program. NLTK has an algorithm named PorterStemmer. This algorithm accepts the list of tokenized words and stems them into root words.

Program for Understanding Stemming

from nltk.stem import PorterStemmer
e_words= ["wait", "waiting", "waited", "waits"]
ps =PorterStemmer()
for w in e_words:
    rootWord=ps.stem(w)
    print(rootWord)

Output:

wait
wait
wait
wait

Code Explanation:

There is a stem module in NLTK which is imported. If you import the complete module, the program becomes heavy as it contains thousands of lines of code. So from the entire stem module, we only imported “PorterStemmer.”
We prepared a dummy list of variation data of the same word.
An object is created that belongs to the class nltk.stem.porter.PorterStemmer.
We passed it to PorterStemmer one by one using a “for” loop. Finally, we got the output root word of each word mentioned in the list.

From the above, stemming is an important preprocessing step because it removes redundancy and variations in the same word. As a result, data is filtered, which helps in better machine training. Now we pass a complete sentence and check its output.

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
sentence="Hello Guru99, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)
ps = PorterStemmer()
for w in words:
	rootWord=ps.stem(w)
	print(rootWord)

Output:

hello
guru99
,
you
have
build
a
veri
good
site
and
I
love
visit
your
site

Code Explanation:

The package PorterStemmer is imported from the module stem.
Packages for tokenization of sentences as well as words are imported.
A sentence is written which is to be tokenized in the next step.
An object for PorterStemmer is created here.
A loop is run and stemming of each word is done using the object created in code line 5.

In short, stemming is a data-preprocessing module. The English language has many variations of a single word, which create ambiguity in machine learning training and prediction. To build a successful model, it is vital to filter such words into the same sequenced data using stemming. This is also known as normalization.

What is Lemmatization?

Lemmatization in NLTK is the algorithmic process of finding the lemma of a word depending on its meaning and context. Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It helps in returning the base or dictionary form of a word, known as the lemma. The NLTK Lemmatization method is based on WordNet’s built-in morph function. Text preprocessing includes both stemming and lemmatization. Many people find the two terms confusing. Some treat these as the same, but there is a difference between stemming and lemmatization. Lemmatization is preferred over the former for the reason below.

Why is Lemmatization better than Stemming?

The stemming algorithm works by cutting the suffix from the word. In a broader sense, it cuts either the beginning or the end of the word. On the contrary, Lemmatization is a more powerful operation, and it takes into consideration the morphological analysis of the words. It returns the lemma, which is the base form of all its inflectional forms. In-depth linguistic knowledge is required to create dictionaries and look for the proper form of the word. Stemming is a general operation, while lemmatization is an intelligent operation where the proper form is looked up in the dictionary. Hence, lemmatization helps in forming better machine learning features.

Code to distinguish between Lemmatization and Stemming

Stemming Code:

import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer  = PorterStemmer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
print("Stemming for {} is {}".format(w,porter_stemmer.stem(w)))

Output:

Stemming for studies is studi
Stemming for studying is studi
Stemming for cries is cri
Stemming for cry is cri

Lemmatization Code:

import nltk
	from nltk.stem import 	WordNetLemmatizer
	wordnet_lemmatizer = WordNetLemmatizer()
	text = "studies studying cries cry"
	tokenization = nltk.word_tokenize(text)
	 for w in tokenization:
		print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

Output:

Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry

Discussion of Output

If you look at stemming for studies and studying, the output is the same (studi), but the NLTK lemmatizer provides a different lemma for both tokens: study for studies and studying for studying. So when we need to make a feature set to train a machine, it would be great if lemmatization is preferred.

Use Case of Lemmatizer

The Lemmatizer minimizes text ambiguity. Words like bicycle or bicycles are converted to the base word bicycle. It converts all words with the same meaning but different representation to their base form, reducing word density and helping prepare accurate features for training a machine. The cleaner the data, the more accurate your machine learning model will be. The NLTK Lemmatizer also saves memory and computational cost.

Real-time example showing use of WordNet Lemmatization and POS Tagging in Python:

from nltk.corpus import wordnet as wn
	from nltk.stem.wordnet import WordNetLemmatizer
	from nltk import word_tokenize, pos_tag
	from collections import defaultdict
	tag_map = defaultdict(lambda : wn.NOUN)
	tag_map['J'] = wn.ADJ
	tag_map['V'] = wn.VERB
	tag_map['R'] = wn.ADV
	text = "guru99 is a totally new kind of learning experience."
	tokens = word_tokenize(text)
	lemma_function = WordNetLemmatizer()
	for token, tag in pos_tag(tokens):
		lemma = lemma_function.lemmatize(token, tag_map[tag[0]])
		print(token, "=>", lemma)

Output:

guru99 => guru99
is => be
totally => totally
new => new
kind => kind
of => of
learning => learn
experience => experience
. => .

Code Explanation:

The corpus reader wordnet is imported, and WordNetLemmatizer is imported from wordnet.
Word tokenize and parts of speech tag are imported from nltk, and Default Dictionary from collections.
A dictionary is created where the first letter of pos_tag is the key, mapped to the value from the wordnet dictionary.
The text is written and tokenized, and the object lemma_function is created for use inside the loop.
The loop is run, and lemmatize takes two arguments: the token and a mapping of pos_tag with the wordnet value.

Python Lemmatization has a close relation with the WordNet dictionary, so it is essential to study that topic next.

FAQs

The Porter stemmer is NLTK’s PorterStemmer class. It applies suffix-stripping rules from Martin Porter’s 1980 algorithm to reduce English words to their stem, so “waiting,” “waited,” and “waits” all become “wait.”

Use stemming when speed and simplicity matter more than readability, such as search indexing, large-scale sentiment analysis, or feature reduction. Choose lemmatization when the output must be a valid, human-readable word.

Text normalization converts different forms of a word into one common representation. Stemming and lemmatization are two normalization techniques that reduce word density so a model treats variants like “study” and “studies” as a single feature.

NLTK provides PorterStemmer (and other stemmers like Snowball) for stemming, and WordNetLemmatizer for lemmatization. The lemmatizer relies on the WordNet dictionary to return the correct base form of each word.

They remove redundant word variations, so a machine learning model sees cleaner, lower-dimensional features. Fewer duplicate tokens mean less ambiguity during training and more accurate predictions on unseen text.

Modern AI models infer context automatically, but NLTK’s WordNetLemmatizer needs a part-of-speech tag to disambiguate. Supplying that tag lets it pick the right lemma, for example turning “learning” into “learn” when tagged as a verb.

A word can be a noun or a verb with different lemmas. Without a part-of-speech tag, WordNetLemmatizer assumes a noun. Passing the correct tag, such as verb or adjective, gives the proper base form.

No. Stemming only crops suffixes, so “studies” becomes “studi” and “cries” becomes “cri,” which are not valid words. Lemmatization, by contrast, always returns a real dictionary word like “study.”

Stemming and Lemmatization in Python NLTK with Examples

What is Stemming and Lemmatization in Python NLTK?

What is Stemming?

Program for Understanding Stemming

What is Lemmatization?

Why is Lemmatization better than Stemming?

Code to distinguish between Lemmatization and Stemming

Discussion of Output

Use Case of Lemmatizer

FAQs

Summarize this post with:

Sign up for the newsletter

What is Stemming and Lemmatization in Python NLTK?

What is Stemming?

Program for Understanding Stemming

RELATED ARTICLES

What is Lemmatization?

Why is Lemmatization better than Stemming?

Code to distinguish between Lemmatization and Stemming

Discussion of Output

Use Case of Lemmatizer

FAQs

Summarize this post with:

Sign up for the newsletter