Tokenization in NLP
Tokenization
Tokenization is the process of splitting text into smaller units called tokens (words, phrases, or subwords) for analysis.
Types of Tokenization
(a) Word Tokenization
- Splits text into words based on spaces or punctuation.
- Example:
from nltk.tokenize import word_tokenize

text = "I love Natural Language Processing!"
print(word_tokenize(text))

Output: ['I', 'love', 'Natural', 'Language', 'Processing', '!']
(b) Sentence Tokenization
- Splits text into sentences based on punctuation like "." or "!".
- Example:

from nltk.tokenize import sent_tokenize

text = "NLP is amazing. It helps machines understand language."
print(sent_tokenize(text))

Output: ['NLP is amazing.', 'It helps machines understand language.']
(c) Subword Tokenization
- Breaks words into smaller meaningful units, used in deep learning models (e.g., BERT, WordPiece).
- Example:
"unhappiness" → "un", "##happiness"
"playing" → "play", "##ing"
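The splitting idea can be sketched as a greedy longest-match loop, the core of WordPiece-style tokenizers. The tiny `VOCAB` set below is invented for illustration; real models learn vocabularies of tens of thousands of subwords.

```python
# Minimal greedy longest-match subword tokenizer (WordPiece-style sketch)
# using a tiny hypothetical vocabulary.
VOCAB = {"un", "play", "##ing", "##happiness"}

def wordpiece(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # "##" marks a word-internal continuation
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
            end -= 1  # shrink the candidate until it matches the vocabulary
        else:
            return ["[UNK]"]  # no subword matched
    return tokens

print(wordpiece("playing"))      # ['play', '##ing']
print(wordpiece("unhappiness"))  # ['un', '##happiness']
```

Greedy longest match is only one strategy; BPE-based tokenizers instead merge frequent character pairs learned from a corpus.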
(d) Character Tokenization
- Splits text into individual characters, useful for languages without spaces (like Chinese).
- Example:
"hello" → ['h', 'e', 'l', 'l', 'o']
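In Python this needs no library at all, since a string is already a sequence of characters:

```python
# Character tokenization: a string is an iterable of characters
text = "hello"
print(list(text))  # ['h', 'e', 'l', 'l', 'o']
```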
Tokenization Libraries in Python
Using NLTK (Natural Language Toolkit)
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt")  # download the tokenizer models (needed once)

text = "Tokenization is an important step in NLP. It helps process text."
print(word_tokenize(text))  # Word tokenization
print(sent_tokenize(text))  # Sentence tokenization
Using SpaCy
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Tokenization is important in NLP."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens) # ['Tokenization', 'is', 'important', 'in', 'NLP', '.']
Using Hugging Face Tokenizers (For Transformer Models)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Tokenization in NLP is crucial!")
print(tokens) # ['tokenization', 'in', 'nlp', 'is', 'crucial', '!']
Challenges in Tokenization
- Handling Different Languages – Some languages (e.g., Chinese) don’t have spaces.
- Ambiguity – "New York" should be one token, not two.
- Slang & Abbreviations – "gonna" vs. "going to".
- Handling Hyphenated Words – "mother-in-law" should be one token.
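A naive whitespace split makes some of these problems concrete: it happens to keep "mother-in-law" whole, but it cannot merge "New York" into one token and it leaves punctuation glued to words. The sentence below is an illustrative example, not from any library's documentation.

```python
text = "My mother-in-law is gonna visit New York."
tokens = text.split()  # naive whitespace tokenization
print(tokens)
# ['My', 'mother-in-law', 'is', 'gonna', 'visit', 'New', 'York.']
# "New" and "York." stay separate, and the final period sticks to "York"
```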
Applications of Tokenization
- Search Engines (Indexing words for faster search).
- Chatbots & Assistants (Understanding user queries).
- Text Classification (Tokenized words as input for ML models).
- Sentiment Analysis (Breaking text for polarity detection).
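For the text-classification use case, the simplest way tokens feed an ML model is as bag-of-words counts. A pure-Python sketch (real pipelines would use a vectorizer such as scikit-learn's CountVectorizer):

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace tokens as simple ML features."""
    tokens = text.lower().split()
    return Counter(tokens)

print(bag_of_words("NLP is fun and NLP is useful"))
# Counter({'nlp': 2, 'is': 2, 'fun': 1, 'and': 1, 'useful': 1})
```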