Tokenization in NLP
Tokenization
Tokenization is the process of splitting text into smaller units called tokens (words, phrases, or subwords) for analysis.
Types of Tokenization
(a) Word Tokenization
- Splits text into words based on spaces or punctuation.
- Example:
from nltk.tokenize import word_tokenize

text = "I love Natural Language Processing!"
print(word_tokenize(text))

Output: ['I', 'love', 'Natural', 'Language', 'Processing', '!']
(b) Sentence Tokenization
- Splits text into sentences based on punctuation like "." or "!".
- Example:

from nltk.tokenize import sent_tokenize

text = "NLP is amazing. It helps machines understand language."
print(sent_tokenize(text))

Output: ['NLP is amazing.', 'It helps machines understand language.']
(c) Subword Tokenization
- Breaks words into smaller meaningful units, used in deep learning models (e.g., BERT, WordPiece).
- Example:
"unhappiness" → "un", "##happiness"
"playing" → "play", "##ing"
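The splitting idea can be sketched as a greedy longest-match loop, the core of WordPiece-style tokenizers. The tiny `VOCAB` set below is invented for illustration; real models learn vocabularies of tens of thousands of subwords.

```python
# Minimal greedy longest-match subword tokenizer (WordPiece-style sketch)
# using a tiny hypothetical vocabulary.
VOCAB = {"un", "play", "##ing", "##happiness"}

def wordpiece(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # "##" marks a word-internal continuation
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
            end -= 1  # shrink the candidate until it matches the vocabulary
        else:
            return ["[UNK]"]  # no subword matched
    return tokens

print(wordpiece("playing"))      # ['play', '##ing']
print(wordpiece("unhappiness"))  # ['un', '##happiness']
```

Greedy longest match is only one strategy; BPE-based tokenizers instead merge frequent character pairs learned from a corpus.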
(d) Character Tokenization
- Splits text into individual characters, useful for languages without spaces (like Chinese).
- Example:
"hello" → ['h', 'e', 'l', 'l', 'o']
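In Python this needs no library at all, since a string is already a sequence of characters:

```python
# Character tokenization: a string is an iterable of characters
text = "hello"
print(list(text))  # ['h', 'e', 'l', 'l', 'o']
```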
Tokenization Libraries in Python
Using NLTK (Natural Language Toolkit)
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt")  # download the tokenizer models (needed once)

text = "Tokenization is an important step in NLP. It helps process text."
print(word_tokenize(text))  # Word tokenization
print(sent_tokenize(text))  # Sentence tokenization
Using SpaCy
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Tokenization is important in NLP."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens) # ['Tokenization', 'is', 'important', 'in', 'NLP', '.']
Using Hugging Face Tokenizers (For Transformer Models)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Tokenization in NLP is crucial!")
print(tokens) # ['tokenization', 'in', 'nlp', 'is', 'crucial', '!']
Challenges in Tokenization
- Handling Different Languages – Some languages (e.g., Chinese) don’t have spaces.
- Ambiguity – "New York" should be one token, not two.
- Slang & Abbreviations – "gonna" vs. "going to".
- Handling Hyphenated Words – "mother-in-law" should be one token.
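A naive whitespace split makes some of these problems concrete: it happens to keep "mother-in-law" whole, but it cannot merge "New York" into one token and it leaves punctuation glued to words. The sentence below is an illustrative example, not from any library's documentation.

```python
text = "My mother-in-law is gonna visit New York."
tokens = text.split()  # naive whitespace tokenization
print(tokens)
# ['My', 'mother-in-law', 'is', 'gonna', 'visit', 'New', 'York.']
# "New" and "York." stay separate, and the final period sticks to "York"
```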
Applications of Tokenization
- Search Engines (Indexing words for faster search).
- Chatbots & Assistants (Understanding user queries).
- Text Classification (Tokenized words as input for ML models).
- Sentiment Analysis (Breaking text for polarity detection).
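For the text-classification use case, the simplest way tokens feed an ML model is as bag-of-words counts. A pure-Python sketch (real pipelines would use a vectorizer such as scikit-learn's CountVectorizer):

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace tokens as simple ML features."""
    tokens = text.lower().split()
    return Counter(tokens)

print(bag_of_words("NLP is fun and NLP is useful"))
# Counter({'nlp': 2, 'is': 2, 'fun': 1, 'and': 1, 'useful': 1})
```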