Stemming vs. Lemmatization in NLP
Stemming and Lemmatization are text normalization techniques in Natural Language Processing (NLP). Both methods reduce words to their base or root form, but they differ in how they achieve this.
1. What is Stemming?
Stemming is the process of reducing a word to its root form by removing prefixes and suffixes (affixes). It applies heuristic rules (not dictionary-based), which may sometimes produce non-meaningful words.
Example of Stemming:
| Original Word | Stemmed Word |
|---|---|
| Running | run |
| Studies | studi |
| Happily | happi |
| Better | better (irregular comparative left unchanged; no stemmer rule can reach "good") |
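This heuristic behavior can be seen in a deliberately minimal suffix-stripping sketch (the rule list and function below are invented for illustration and match no real stemmer's rules):

```python
# A minimal sketch of heuristic stemming: blindly strip a few common suffixes.
# Because there is no dictionary check, results may not be real words.
SUFFIXES = ["ning", "ies", "ily", "ing", "ed", "s"]  # illustrative rules only

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least a 3-letter stem so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ["Running", "Studies", "Happily", "Better"]])
# ['run', 'stud', 'happ', 'better'] -- "stud" and "happ" are not dictionary words
```

Note how the irregular "better" slips through every rule untouched, while regular words may be chopped into non-words.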
2. What is Lemmatization?
Lemmatization reduces words to their dictionary form (lemma) based on linguistic rules. Unlike stemming, it considers context and part of speech (POS), producing meaningful root words.
Example of Lemmatization:
| Original Word | Lemmatized Word |
|---|---|
| Running | run |
| Studies | study |
| Happily | happily (an adverb is already in lemma form) |
| Better | good |
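In contrast, lemmatization is essentially dictionary lookup plus morphological rules. A toy version with a hand-made lookup table (the table and function names here are invented for illustration; real lemmatizers use large lexicons such as WordNet):

```python
# Toy dictionary-based lemmatization: look the word up in a small lexicon.
# Real lemmatizers combine large lexicons with morphological rules.
LEMMA_TABLE = {
    "running": "run",
    "studies": "study",
    "better": "good",  # irregular comparative mapped to its base adjective
}

def toy_lemmatize(word):
    word = word.lower()
    return LEMMA_TABLE.get(word, word)  # unknown words fall back unchanged

print([toy_lemmatize(w) for w in ["Running", "Studies", "Better"]])
# ['run', 'study', 'good']
```

Because the mapping is a lookup rather than suffix chopping, irregular forms like "better" → "good" pose no problem; the cost is building and consulting the dictionary.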
3. Differences Between Stemming and Lemmatization
| Feature | Stemming | Lemmatization |
|---|---|---|
| Definition | Reduces words by chopping off prefixes/suffixes | Converts words to their dictionary form |
| Speed | Faster (rule-based) | Slower (dictionary-based) |
| Accuracy | Less accurate | More accurate |
| Output | May produce non-meaningful words | Always produces valid words |
| Context-aware? | No (blindly removes suffixes) | Yes (considers POS & meaning) |
| Use Case | Simple text pre-processing (e.g., search engines) | NLP tasks requiring high accuracy (e.g., chatbots, machine translation) |
4. Stemming in Python (Using NLTK)
NLTK provides different stemming algorithms, with Porter Stemmer and Lancaster Stemmer being the most common.
4.1 Using Porter Stemmer
```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "studies", "happily", "better"]

# Apply the Porter stemmer to each word
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
```
Output:
```
['run', 'studi', 'happili', 'better']
```
Note that "studi" and "happili" are not dictionary words, and the irregular "better" is left untouched.
4.2 Using Lancaster Stemmer
The Lancaster Stemmer is more aggressive than Porter Stemmer.
```python
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
words = ["running", "studies", "happily", "better"]

# Apply the more aggressive Lancaster stemmer
stemmed_words = [ls.stem(word) for word in words]
print(stemmed_words)
```
Output:
```
['run', 'study', 'happy', 'bet']
```
The aggressiveness cuts both ways: "studies" and "happily" come out cleaner than with Porter, but "better" is over-stemmed to the unrelated word "bet".
5. Lemmatization in Python (Using NLTK & SpaCy)
5.1 Lemmatization Using WordNetLemmatizer (NLTK)
```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()

# Pair each word with its WordNet POS tag: v=verb, n=noun, r=adverb, a=adjective
words = [("running", "v"), ("studies", "n"), ("happily", "r"), ("better", "a")]
lemmatized_words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in words]
print(lemmatized_words)
```
Output:
```
['run', 'study', 'happily', 'good']
```
With the right POS tag, "better" → "good" (a transformation no stemmer can make), while "happily" stays as-is because it is already a valid adverb lemma. The POS tag matters: lemmatize("running", pos="n") would leave "running" unchanged, since WordNet also lists it as a noun.
5.2 Lemmatization Using SpaCy
SpaCy runs part-of-speech (POS) tagging automatically as part of its pipeline, so each token is lemmatized according to its predicted POS with no manual hints.
```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "running studies happily better"
doc = nlp(text)

# Each token carries a lemma derived from its predicted POS tag
lemmatized_words = [token.lemma_ for token in doc]
print(lemmatized_words)
```
Output:
```
['run', 'study', 'happily', 'good']
```
"happily" remains unchanged because, as an adverb, it is already in its lemma form.
6. When to Use Stemming vs. Lemmatization?
| Scenario | Use Stemming? | Use Lemmatization? |
|---|---|---|
| Search Engines (Fast but approximate matching) | Yes | No |
| Chatbots & NLP Apps (Accuracy needed) | No | Yes |
| Sentiment Analysis | No | Yes |
| Machine Translation | No | Yes |
| Keyword Extraction (General text processing) | Yes | No |
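As a sketch of the search-engine row above: stemming both the query and the document with the same normalizer lets approximate matches succeed quickly (the crude stemmer and `matches` helper here are hypothetical, invented for illustration):

```python
def crude_stem(word):
    # Deliberately tiny heuristic for illustration only
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def matches(query, document):
    # Normalize both sides with the same stemmer before comparing tokens
    q = {crude_stem(w) for w in query.lower().split()}
    d = {crude_stem(w) for w in document.lower().split()}
    return bool(q & d)

print(matches("cats", "the cat sat"))  # True: "cats" and "cat" both stem to "cat"
print(matches("dogs", "the cat sat"))  # False: no shared stems
```

Speed is the point here: set intersection over pre-stemmed tokens is cheap, and an occasional bad stem only costs a slightly imprecise match.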
7. Challenges in Stemming and Lemmatization
7.1 Challenges in Stemming
- May produce non-dictionary words ("studies" → "studi")
- Different stemmers give different results
- Struggles with irregular words ("better" → "bet")
7.2 Challenges in Lemmatization
- Requires POS tagging for better results
- Slower due to dictionary lookup
- Requires external libraries like WordNet or SpaCy
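The POS-tagging requirement can be made concrete with a toy (word, POS) lookup; the data and function below are invented purely for illustration:

```python
# The same surface form can have different lemmas depending on its POS,
# which is why lemmatizers need POS information while stemmers ignore it.
POS_LEMMAS = {
    ("saw", "verb"): "see",  # "I saw it" -> past tense of the verb "see"
    ("saw", "noun"): "saw",  # "a sharp saw" -> the tool, already a lemma
}

def pos_lemmatize(word, pos):
    return POS_LEMMAS.get((word.lower(), pos), word)

print(pos_lemmatize("saw", "verb"))  # see
print(pos_lemmatize("saw", "noun"))  # saw
```

Without the POS tag, a lemmatizer has no way to choose between the two readings, which is exactly why tools like SpaCy bundle a tagger with the lemmatizer.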