Stemming vs. Lemmatization in NLP

 


Stemming and Lemmatization are text normalization techniques in Natural Language Processing (NLP). Both methods reduce words to their base or root form, but they differ in how they achieve this.

1. What is Stemming?

Stemming is the process of reducing a word to a root form by stripping affixes, in practice mostly suffixes. It applies heuristic rules rather than a dictionary lookup, so the result is not always a real word.

Example of Stemming:

Original Word    Stemmed Word (Porter)
Running          run
Studies          studi
Happily          happili
Better           better (unchanged; a stemmer cannot relate it to "good")

Stemming does not guarantee valid words (e.g., "happily" → "happili").
It is faster but less accurate than lemmatization.
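
To make the idea of heuristic suffix stripping concrete, here is a minimal sketch of a toy stemmer. The rule list and the names `SUFFIX_RULES` and `toy_stem` are illustrative assumptions; this is not the Porter algorithm, which applies several ordered phases of rules with extra conditions.

```python
# Toy suffix rules, tried in order. Illustrative only; this is NOT the Porter algorithm.
SUFFIX_RULES = [
    ("ies", "i"),  # studies -> studi
    ("ing", ""),   # running -> runn (no rule undoes the doubled consonant)
    ("ly", ""),    # happily -> happi
    ("s", ""),     # cats -> cat
]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters before it."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["Running", "Studies", "Happily", "Better"]:
    print(w, "->", toy_stem(w))
```

The non-words it returns ("runn", "happi") show why rule-based stems are not guaranteed to be valid words, and "better" passes through untouched because no suffix rule applies to it.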

2. What is Lemmatization?

Lemmatization reduces words to their dictionary form (lemma) based on linguistic rules. Unlike stemming, it considers context and part of speech (POS), producing meaningful root words.

Example of Lemmatization:

Original Word    Lemmatized Word
Running          run (as a verb)
Studies          study
Happily          happily (adverbs are usually left unchanged)
Better           good (as an adjective)

Lemmatization returns valid dictionary words (e.g., "studies" → "study").
It is slower but more accurate than stemming because it relies on a dictionary lookup and, ideally, on the word's part of speech.
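
The role of the dictionary and the POS tag can be sketched with a tiny lookup table. The lexicon below is a hypothetical stand-in with a handful of entries, and the names `TOY_LEXICON` and `toy_lemmatize` are assumptions made for illustration; real lemmatizers draw on much larger resources such as WordNet.

```python
# Hypothetical mini-lexicon mapping (word, POS) -> lemma.
# Real lemmatizers rely on much larger resources such as WordNet.
TOY_LEXICON = {
    ("running", "VERB"): "run",
    ("studies", "VERB"): "study",
    ("studies", "NOUN"): "study",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if it is unknown."""
    word = word.lower()
    return TOY_LEXICON.get((word, pos), word)


print(toy_lemmatize("better", "ADJ"))   # good
print(toy_lemmatize("better", "ADV"))   # well
print(toy_lemmatize("better", "VERB"))  # better (no entry, so the word is returned as-is)
```

The same surface form maps to different lemmas depending on its part of speech, which is why a lemmatizer given the wrong POS can return the word unchanged.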

3. Differences Between Stemming and Lemmatization

Feature          Stemming                                            Lemmatization
Definition       Reduces words by chopping off prefixes/suffixes     Converts words to their dictionary form
Speed            Faster (rule-based)                                 Slower (dictionary-based)
Accuracy         Less accurate                                       More accurate
Output           May produce non-meaningful words                    Produces valid dictionary words
Context-aware?   No (blindly removes suffixes)                       Yes (considers POS & meaning)
Use Case         Simple text pre-processing (e.g., search engines)   NLP tasks requiring high accuracy (e.g., chatbots, machine translation)

4. Stemming in Python (Using NLTK)

NLTK provides different stemming algorithms, with Porter Stemmer and Lancaster Stemmer being the most common.

4.1 Using Porter Stemmer

from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "studies", "happily", "better"]

stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)

Output

['run', 'studi', 'happili', 'better']

"studies" → "studi" and "happily" → "happili" are not dictionary words, which is typical of stemmer output.

4.2 Using Lancaster Stemmer

The Lancaster Stemmer is more aggressive than Porter Stemmer.

from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
words = ["running", "studies", "happily", "better"]

stemmed_words = [ls.stem(word) for word in words]
print(stemmed_words)

Output

['run', 'study', 'happy', 'bet']

"better" → "bet" is a case of over-stemming: the result is a different English word with an unrelated meaning.

5. Lemmatization in Python (Using NLTK & spaCy)

5.1 Lemmatization Using WordNetLemmatizer (NLTK)

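import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
# WordNet POS tags: "v" = verb, "r" = adverb, "a" = adjective (the default is "n", noun)
tagged_words = [("running", "v"), ("studies", "v"), ("happily", "r"), ("better", "a")]

lemmatized_words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in tagged_words]
print(lemmatized_words)

Output

['run', 'study', 'happily', 'good']

"better" → "good" only when the word is tagged as an adjective (pos="a"). If every word is passed with pos="v", "happily" and "better" come back unchanged.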

5.2 Lemmatization Using spaCy

spaCy runs a part-of-speech (POS) tagger as part of its pipeline and uses the predicted tags during lemmatization.

import spacy

nlp = spacy.load("en_core_web_sm")
text = "running studies happily better"
doc = nlp(text)

lemmatized_words = [token.lemma_ for token in doc]
print(lemmatized_words)

Output

['run', 'study', 'happily', 'good']

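"happily" remains unchanged because it is an adverb. The exact lemmas depend on the model version and on the POS tags the pipeline assigns; in a bare word list with no sentence context, "better" may be tagged as an adverb and lemmatized to "well" instead of "good".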


6. When to Use Stemming vs. Lemmatization?

Scenario                                         Use Stemming?   Use Lemmatization?
Search Engines (Fast but approximate matching)   Yes             No
Chatbots & NLP Apps (Accuracy needed)            No              Yes
Sentiment Analysis                               No              Yes
Machine Translation                              No              Yes
Keyword Extraction (General text processing)     Yes             No
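
As a rough illustration of the search-engine row, the sketch below normalizes both indexed tokens and the query with the same crude suffix stripper, so different inflections land on the same index key. The function names, the suffix list, and the sample documents are all hypothetical; a real system would plug in an actual stemmer.

```python
from collections import defaultdict


def crude_normalize(token):
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    token = token.lower()
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs, normalize=crude_normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize=crude_normalize):
    """Return the sorted ids of documents whose tokens share the query's normalized form."""
    return sorted(index.get(normalize(query), set()))


docs = {
    1: "she connected the cables",
    2: "connecting two networks",
    3: "a stable connection",
}
index = build_index(docs)
print(search(index, "connects"))  # [1, 2]
```

Document 3 is missed because "connection" is not covered by the toy rules, which is the kind of approximate matching the table refers to.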

7. Challenges in Stemming and Lemmatization

7.1 Challenges in Stemming

  1. May produce non-dictionary words ("studies" → "studi")
  2. Different stemmers give different results
  3. Struggles with irregular words ("better" is never mapped to "good", and Lancaster reduces it to "bet")
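
One way to see how much stemmers disagree is to run several over the same word list and keep only the words where the outputs differ. This is a minimal sketch: `compare_stemmers`, `disagreements`, and the two stand-in stemmers are hypothetical names chosen so the example runs without any library installed.

```python
def compare_stemmers(words, stemmers):
    """Return {word: {stemmer_name: stem}} for every word and stemmer."""
    return {w: {name: stem(w) for name, stem in stemmers.items()} for w in words}


def disagreements(table):
    """Keep only the words on which the stemmers produced different stems."""
    return {w: stems for w, stems in table.items() if len(set(stems.values())) > 1}


# Hypothetical stand-ins for real stemmers, used only to keep the sketch self-contained.
def light(word):
    return word[:-1] if word.endswith("s") else word


def aggressive(word):
    return word[:4]


table = compare_stemmers(["cats", "better", "run"], {"light": light, "aggressive": aggressive})
print(disagreements(table))
```

With NLTK installed, the same helper should accept the bound methods `PorterStemmer().stem` and `LancasterStemmer().stem` in place of the stand-ins.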

7.2 Challenges in Lemmatization

  1. Requires POS tagging for better results
  2. Slower due to dictionary lookup
  3. Requires external resources such as WordNet data or a spaCy model
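
A common way to supply POS information is to translate tagger output into the tags the lemmatizer expects. The sketch below assumes Penn Treebank tags (the tag set returned by `nltk.pos_tag`) and the single-letter WordNet tags accepted by `WordNetLemmatizer.lemmatize`; the helper name `penn_to_wordnet` is hypothetical, and the prefix check is a simple heuristic.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback for every other tag


for tag in ["VBG", "NNS", "RB", "JJR"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter can then be passed as the `pos` argument when lemmatizing each tagged token.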

