Tokenization in NLP

 

Tokenization is the process of splitting text into smaller units called tokens (words, subwords, sentences, or characters) for analysis.

Types of Tokenization

(a) Word Tokenization

  • Splits text into words based on spaces or punctuation.
  • Example (requires a one-time nltk.download('punkt')):
    from nltk.tokenize import word_tokenize
    text = "I love Natural Language Processing!"
    print(word_tokenize(text))
    
    Output: ['I', 'love', 'Natural', 'Language', 'Processing', '!']

(b) Sentence Tokenization

  • Splits text into sentences using punctuation cues such as ., !, or ?, while handling common abbreviations.
  • Example:
    from nltk.tokenize import sent_tokenize
    text = "NLP is amazing. It helps machines understand language."
    print(sent_tokenize(text))
    
    Output: ['NLP is amazing.', 'It helps machines understand language.']

(c) Subword Tokenization

  • Breaks words into smaller meaningful units; used by deep learning models such as BERT (via the WordPiece algorithm).
  • Example:
    • "unhappiness" → "un", "##happiness"
    • "playing" → "play", "##ing"
    • The "##" prefix marks a piece that continues the previous one; the exact splits depend on the tokenizer's learned vocabulary.
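The core idea behind WordPiece is greedy longest-match splitting against a fixed vocabulary. A minimal pure-Python sketch (the tiny vocabulary below is a toy assumption for illustration, not BERT's real vocabulary):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match subword split, WordPiece-style."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a continuation piece
            if piece in vocab:
                match = piece  # longest piece in the vocab wins
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

vocab = {"play", "##ing", "un", "##happiness"}
print(wordpiece_tokenize("playing", vocab))      # ['play', '##ing']
print(wordpiece_tokenize("unhappiness", vocab))  # ['un', '##happiness']
```

Real implementations learn the vocabulary from a corpus so that frequent words stay whole and rare words decompose into reusable pieces.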

(d) Character Tokenization

  • Splits text into individual characters, useful for languages without spaces (like Chinese).
  • Example: "hello" → ['h', 'e', 'l', 'l', 'o']
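In Python, character tokenization needs no library at all, since converting a string to a list yields its characters:

```python
text = "hello"
tokens = list(text)  # each character becomes a token
print(tokens)  # ['h', 'e', 'l', 'l', 'o']
```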

Tokenization Libraries in Python

Using NLTK (Natural Language Toolkit)

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models

text = "Tokenization is an important step in NLP. It helps process text."
print(word_tokenize(text))  # Word tokenization
print(sent_tokenize(text))  # Sentence tokenization

Using spaCy

import spacy
nlp = spacy.load("en_core_web_sm")  # install first: python -m spacy download en_core_web_sm

text = "Tokenization is important in NLP."
doc = nlp(text)

tokens = [token.text for token in doc]
print(tokens)  # ['Tokenization', 'is', 'important', 'in', 'NLP', '.']

Using Hugging Face Tokenizers (For Transformer Models)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Tokenization in NLP is crucial!")
print(tokens)  # lower-cased subword tokens; exact splits depend on BERT's WordPiece vocabulary

Challenges in Tokenization

  1. Handling Different Languages – Some languages (e.g., Chinese) don’t have spaces.
  2. Multiword Expressions – "New York" is usually best treated as one token, not two.
  3. Slang & Abbreviations – "gonna" vs. "going to".
  4. Handling Hyphenated Words – "mother-in-law" should be one token.
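The multiword-expression problem is often handled with a post-processing pass that merges known phrases back into single tokens. A minimal sketch (the phrase list here is a toy assumption):

```python
def merge_mwes(tokens, mwes):
    """Merge known multiword expressions into single tokens."""
    merged, i = [], 0
    while i < len(tokens):
        for mwe in mwes:
            if tokens[i:i + len(mwe)] == list(mwe):
                merged.append(" ".join(mwe))  # collapse the phrase
                i += len(mwe)
                break
        else:
            merged.append(tokens[i])  # no phrase starts here
            i += 1
    return merged

tokens = "I live in New York".split()
print(merge_mwes(tokens, [("New", "York")]))
# ['I', 'live', 'in', 'New York']
```

NLTK ships a similar utility, nltk.tokenize.MWETokenizer, which takes a list of phrase tuples and a separator.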

Applications of Tokenization

  • Search Engines (Indexing words for faster search).
  • Chatbots & Assistants (Understanding user queries).
  • Text Classification (Tokenized words as input for ML models).
  • Sentiment Analysis (Breaking text for polarity detection).
