Stemming vs. Lemmatization in NLP
Stemming and Lemmatization are text normalization techniques in Natural Language Processing (NLP). Both methods reduce words to their base or root form, but they differ in how they achieve this.
1. What is Stemming?
Stemming is the process of reducing a word to its root form by removing prefixes and suffixes (affixes). It applies heuristic rules (not dictionary-based), which may sometimes produce non-meaningful words.
Example of Stemming:
| Original Word | Stemmed Word |
|---|---|
| Running | run |
| Studies | studi |
| Happily | happi |
| Better | better (irregular comparative left unchanged; no stemmer rule can reach "good") |
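This heuristic behavior can be seen in a deliberately minimal suffix-stripping sketch (the rule list and function below are invented for illustration and match no real stemmer's rules):

```python
# A minimal sketch of heuristic stemming: blindly strip a few common suffixes.
# Because there is no dictionary check, results may not be real words.
SUFFIXES = ["ning", "ies", "ily", "ing", "ed", "s"]  # illustrative rules only

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least a 3-letter stem so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ["Running", "Studies", "Happily", "Better"]])
# ['run', 'stud', 'happ', 'better'] -- "stud" and "happ" are not dictionary words
```

Note how the irregular "better" slips through every rule untouched, while regular words may be chopped into non-words.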
2. What is Lemmatization?
Lemmatization reduces words to their dictionary form (lemma) based on linguistic rules. Unlike stemming, it considers context and part of speech (POS), producing meaningful root words.
Example of Lemmatization:
| Original Word | Lemmatized Word |
|---|---|
| Running | run |
| Studies | study |
| Happily | happily (an adverb is already in lemma form) |
| Better | good |
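In contrast, lemmatization is essentially dictionary lookup plus morphological rules. A toy version with a hand-made lookup table (the table and function names here are invented for illustration; real lemmatizers use large lexicons such as WordNet):

```python
# Toy dictionary-based lemmatization: look the word up in a small lexicon.
# Real lemmatizers combine large lexicons with morphological rules.
LEMMA_TABLE = {
    "running": "run",
    "studies": "study",
    "better": "good",  # irregular comparative mapped to its base adjective
}

def toy_lemmatize(word):
    word = word.lower()
    return LEMMA_TABLE.get(word, word)  # unknown words fall back unchanged

print([toy_lemmatize(w) for w in ["Running", "Studies", "Better"]])
# ['run', 'study', 'good']
```

Because the mapping is a lookup rather than suffix chopping, irregular forms like "better" → "good" pose no problem; the cost is building and consulting the dictionary.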
3. Differences Between Stemming and Lemmatization
| Feature | Stemming | Lemmatization |
|---|---|---|
| Definition | Reduces words by chopping off prefixes/suffixes | Converts words to their dictionary form |
| Speed | Faster (rule-based) | Slower (dictionary-based) |
| Accuracy | Less accurate | More accurate |
| Output | May produce non-meaningful words | Always produces valid words |
| Context-aware? | No (blindly removes suffixes) | Yes (considers POS & meaning) |
| Use Case | Simple text pre-processing (e.g., search engines) | NLP tasks requiring high accuracy (e.g., chatbots, machine translation) |
4. Stemming in Python (Using NLTK)
NLTK provides different stemming algorithms, with Porter Stemmer and Lancaster Stemmer being the most common.
4.1 Using Porter Stemmer
```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "studies", "happily", "better"]

# Apply the Porter stemmer to each word
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
```
Output:
```
['run', 'studi', 'happili', 'better']
```
Note that "studi" and "happili" are not dictionary words, and the irregular "better" is left untouched.
4.2 Using Lancaster Stemmer
The Lancaster Stemmer is more aggressive than Porter Stemmer.
```python
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
words = ["running", "studies", "happily", "better"]

# Apply the more aggressive Lancaster stemmer
stemmed_words = [ls.stem(word) for word in words]
print(stemmed_words)
```
Output:
```
['run', 'study', 'happy', 'bet']
```
The aggressiveness cuts both ways: "studies" and "happily" come out cleaner than with Porter, but "better" is over-stemmed to the unrelated word "bet".
5. Lemmatization in Python (Using NLTK & SpaCy)
5.1 Lemmatization Using WordNetLemmatizer (NLTK)
```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()

# Pair each word with its WordNet POS tag: v=verb, n=noun, r=adverb, a=adjective
words = [("running", "v"), ("studies", "n"), ("happily", "r"), ("better", "a")]
lemmatized_words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in words]
print(lemmatized_words)
```
Output:
```
['run', 'study', 'happily', 'good']
```
With the right POS tag, "better" → "good" (a transformation no stemmer can make), while "happily" stays as-is because it is already a valid adverb lemma. The POS tag matters: lemmatize("running", pos="n") would leave "running" unchanged, since WordNet also lists it as a noun.
5.2 Lemmatization Using SpaCy
SpaCy runs part-of-speech (POS) tagging automatically as part of its pipeline, so each token is lemmatized according to its predicted POS with no manual hints.
```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "running studies happily better"
doc = nlp(text)

# Each token carries a lemma derived from its predicted POS tag
lemmatized_words = [token.lemma_ for token in doc]
print(lemmatized_words)
```
Output:
```
['run', 'study', 'happily', 'good']
```
"happily" remains unchanged because, as an adverb, it is already in its lemma form.
6. When to Use Stemming vs. Lemmatization?
| Scenario | Use Stemming? | Use Lemmatization? |
|---|---|---|
| Search Engines (Fast but approximate matching) | Yes | No |
| Chatbots & NLP Apps (Accuracy needed) | No | Yes |
| Sentiment Analysis | No | Yes |
| Machine Translation | No | Yes |
| Keyword Extraction (General text processing) | Yes | No |
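As a sketch of the search-engine row above: stemming both the query and the document with the same normalizer lets approximate matches succeed quickly (the crude stemmer and `matches` helper here are hypothetical, invented for illustration):

```python
def crude_stem(word):
    # Deliberately tiny heuristic for illustration only
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def matches(query, document):
    # Normalize both sides with the same stemmer before comparing tokens
    q = {crude_stem(w) for w in query.lower().split()}
    d = {crude_stem(w) for w in document.lower().split()}
    return bool(q & d)

print(matches("cats", "the cat sat"))  # True: "cats" and "cat" both stem to "cat"
print(matches("dogs", "the cat sat"))  # False: no shared stems
```

Speed is the point here: set intersection over pre-stemmed tokens is cheap, and an occasional bad stem only costs a slightly imprecise match.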
7. Challenges in Stemming and Lemmatization
7.1 Challenges in Stemming
- May produce non-dictionary words ("studies" → "studi")
- Different stemmers give different results
- Struggles with irregular words ("better" → "bet")
7.2 Challenges in Lemmatization
- Requires POS tagging for better results
- Slower due to dictionary lookup
- Requires external libraries like WordNet or SpaCy
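The POS-tagging requirement can be made concrete with a toy (word, POS) lookup; the data and function below are invented purely for illustration:

```python
# The same surface form can have different lemmas depending on its POS,
# which is why lemmatizers need POS information while stemmers ignore it.
POS_LEMMAS = {
    ("saw", "verb"): "see",  # "I saw it" -> past tense of the verb "see"
    ("saw", "noun"): "saw",  # "a sharp saw" -> the tool, already a lemma
}

def pos_lemmatize(word, pos):
    return POS_LEMMAS.get((word.lower(), pos), word)

print(pos_lemmatize("saw", "verb"))  # see
print(pos_lemmatize("saw", "noun"))  # saw
```

Without the POS tag, a lemmatizer has no way to choose between the two readings, which is exactly why tools like SpaCy bundle a tagger with the lemmatizer.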