Introduction to Natural Language Processing

Created: March 9, 2026 Larry Qu 10 min read

Introduction

Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. From chatbots to translation to sentiment analysis, NLP powers many modern AI applications. This guide covers NLP fundamentals and practical techniques.

What Is NLP

Defining NLP

NLP is a subfield of AI focused on enabling computers to process and understand human language. It bridges linguistics and computer science.

Why NLP Matters

  • Scale: Analyze massive text data
  • Automation: Automate text-based tasks
  • Insight: Extract meaning from unstructured data
  • Accessibility: Enable human-computer interaction

Text Preprocessing

Essential Steps

Tokenization

Breaking text into words, sentences, or subwords:

import nltk
# One-time setup: nltk.download("punkt_tab")  (older NLTK releases use "punkt")

# Word tokenization
text = "Natural Language Processing is fascinating."
tokens = nltk.word_tokenize(text)

# Sentence tokenization
sentences = nltk.sent_tokenize(text)
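
To show what a tokenizer does without any downloads, here is a minimal sketch that uses only regular expressions. The names `simple_word_tokenize` and `simple_sent_tokenize` are illustrative and not part of NLTK, and the rules are deliberately naive: real tokenizers handle abbreviations, contractions, and many other edge cases.

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text: str) -> list[str]:
    """Split after ., ! or ? when followed by whitespace (naive: breaks on 'Dr. Smith')."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "NLP is fun. Tokenizers split text!"
word_tokens = simple_word_tokenize(sample)
sentence_list = simple_sent_tokenize(sample)
print(word_tokens)    # ['NLP', 'is', 'fun', '.', 'Tokenizers', 'split', 'text', '!']
print(sentence_list)  # ['NLP is fun.', 'Tokenizers split text!']
```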

Lowercasing

text = "HELLO World"
text = text.lower()  # "hello world"

Removing Stopwords

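from nltk.corpus import stopwords

# One-time setup: nltk.download("stopwords")
stop_words = set(stopwords.words('english'))
# The stopword list is lowercase, so compare case-insensitively
filtered = [w for w in tokens if w.lower() not in stop_words]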

Stemming and Lemmatization

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

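# Stemming: rule-based suffix stripping; the result may not be a real word
stemmer.stem("running")  # "run"
stemmer.stem("studies")  # "studi"

# Lemmatization: dictionary lookup (one-time setup: nltk.download("wordnet"))
# The default part of speech is noun, so pass pos="v" for verbs
lemmatizer.lemmatize("running", pos="v")  # "run"
lemmatizer.lemmatize("running")           # "running" (treated as a noun)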

Advanced Text Preprocessing

Modern NLP pipelines require more sophisticated preprocessing for real-world text:

import re
import unicodedata

def advanced_clean(text: str) -> str:
    """Advanced text cleaning for real-world NLP."""
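    # Normalize Unicode compatibility forms (e.g., full-width characters, ligatures)
    text = unicodedata.normalize('NFKC', text)
    # NFKC leaves curly quotes unchanged, so map them to straight quotes explicitly
    text = text.translate(str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}))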

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

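    # Expand contractions (naive: irregular forms go first, and "'s" may be
    # possessive rather than "is", so treat the result as approximate)
    contractions = {
        "won't": "will not", "can't": "cannot",
        "n't": " not", "'re": " are", "'s": " is",
        "'ll": " will", "'ve": " have", "'m": " am"
    }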
    for short, long in contractions.items():
        text = text.replace(short, long)

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Subword tokenization with Hugging Face
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(
    "Natural Language Processing is fascinating!",
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)
print(f"Input IDs: {tokens['input_ids']}")
print(f"Attention mask: {tokens['attention_mask']}")
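
BERT-style tokenizers split rare words into subword pieces using WordPiece. The sketch below shows only the greedy longest-match-first splitting step, under the assumption that a vocabulary already exists; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the toy vocabulary is far smaller than a real one.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, invented for illustration
toy_vocab = {"token", "##ization", "##ize", "##s", "fascinat", "##ing"}
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("fascinating", toy_vocab))   # ['fascinat', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because unknown words fall back to smaller pieces, a subword vocabulary of a few tens of thousands of entries can cover almost any input text.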

Regular Expressions for Pattern Extraction

import re

def extract_entities(text: str) -> dict:
    """Extract structured information using regex patterns."""
    patterns = {
        "email": r'\b[\w\.-]+@[\w\.-]+\.\w{2,}\b',
        # The lookbehind keeps a leading "+"; the lookahead skips ISO-style dates
        "phone": r'(?<!\w)(?!\d{4}[-/])\+?\d[\d\s\-\(\)]{7,}\d\b',
        "url": r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+',
        "price": r'\$\d+(?:,\d{3})*(?:\.\d{2})?',
        "date": r'\b\d{4}[-/]\d{1,2}[-/]\d{1,2}\b',
    }
    found = {}
    for name, pattern in patterns.items():
        matches = re.findall(pattern, text)
        if matches:
            found[name] = matches
    return found

# user@example.com is a placeholder address
text = "Contact user@example.com or call +1-555-123-4567. Price: $1,299.99. Date: 2026-05-24."
print(extract_entities(text))
# {'email': ['user@example.com'], 'phone': ['+1-555-123-4567'], 'price': ['$1,299.99'], 'date': ['2026-05-24']}

Text Representation

Bag of Words (BoW)

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is a sample.",
    "NLP is interesting.",
    "Machine learning is powerful."
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
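
To see what the count matrix contains, the following sketch rebuilds a bag-of-words representation with the standard library only. It assumes the same three documents and mirrors `CountVectorizer`'s default behavior of lowercasing and keeping tokens of two or more characters; the names `docs`, `vocabulary`, and `bow_matrix` are illustrative.

```python
from collections import Counter
import re

docs = ["This is a sample.", "NLP is interesting.", "Machine learning is powerful."]

def tokenize(doc: str) -> list[str]:
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# The vocabulary is the sorted set of all tokens; each becomes one column
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

bow_matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, so word order is discarded entirely; that loss is the main limitation of the representation.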

TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
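
TF-IDF down-weights terms that appear in many documents. As a rough sketch of the arithmetic, the code below uses the smoothed variant that scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The pre-tokenized documents and helper names are assumptions made for illustration.

```python
import math
from collections import Counter

tokenized_docs = [
    ["this", "is", "sample"],
    ["nlp", "is", "interesting"],
    ["machine", "learning", "is", "powerful"],
]
n_docs = len(tokenized_docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in tokenized_docs for term in set(doc))

def idf(term: str) -> float:
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc: list[str]) -> dict[str, float]:
    tf = Counter(doc)
    weights = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 normalization
    return {term: w / norm for term, w in weights.items()}

first_doc = tfidf_vector(tokenized_docs[0])
print(first_doc)
```

The word "is" occurs in every document, so it receives the lowest weight in each vector, while terms unique to one document score highest.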

Word Embeddings

Word2Vec:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["nlp", "is", "fun"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Find similar words
model.wv.most_similar("hello")
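
Similarity between word vectors is usually measured with cosine similarity. The sketch below computes it by hand on three made-up vectors; the values in `toy_vectors` are invented for illustration and do not come from a trained model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

def most_similar(word: str) -> list[tuple[str, float]]:
    """Rank every other word by cosine similarity to the given word."""
    scored = [
        (other, cosine_similarity(toy_vectors[word], vec))
        for other, vec in toy_vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(most_similar("king"))
```

With a corpus as small as the two sentences above, trained vectors are essentially random; meaningful neighbors need far more text.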

Contextual Embeddings (ELMo, BERT)

Unlike static Word2Vec embeddings, contextual models generate different vectors for the same word in different contexts:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Same word "bank" in different contexts
sentences = [
    "I need to go to the bank to deposit money.",
    "Let's sit on the river bank and enjoy the view."
]

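bank_vectors = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Locate the "bank" token and keep its contextual embedding
    wordpieces = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    bank_vectors.append(outputs.last_hidden_state[0, wordpieces.index("bank")])

# The two "bank" vectors differ because their surrounding contexts differ
similarity = torch.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0)
print(f"Embedding shape: {bank_vectors[0].shape}")
print(f"Cosine similarity between the two 'bank' vectors: {similarity.item():.2f}")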

Sentiment Analysis

Basic Approach

from textblob import TextBlob

text = "This product is amazing!"
blob = TextBlob(text)

print(blob.sentiment)  # Sentiment(polarity=0.75, subjectivity=0.9)

Using Pre-trained Models

from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
result = sentiment("I love this product!")

Named Entity Recognition (NER)

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)

Text Classification

Traditional ML Approach

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

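# Prepare data (tiny illustrative dataset; replace with your own)
train_texts = ["great movie", "loved it", "terrible film", "hated it"]
train_labels = [1, 1, 0, 0]
test_texts = ["great film", "terrible movie"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)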

# Train
model = MultinomialNB()
model.fit(X_train, train_labels)

# Predict
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)

Transformer Architecture Deep-Dive

The Attention Mechanism

The key innovation of transformers is the self-attention mechanism, which computes how much each token should “attend” to every other token in the sequence:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute scaled dot-product attention.

    Q, K, V: (..., seq_len, d_k), with optional leading batch and head dimensions
    """
    scores = torch.matmul(Q, K.transpose(-2, -1))  # (b, h, seq, seq)
    scores = scores / (K.size(-1) ** 0.5)  # Scale by sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

# Example: single attention head
batch, seq_len, d_model = 1, 4, 8
x = torch.randn(batch, seq_len, d_model)

# Linear projections for Q, K, V
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q = x @ W_q
K = x @ W_k
V = x @ W_v

output, attn = scaled_dot_product_attention(Q, K, V)
print(f"Attention weights shape: {attn.shape}")  # (1, 4, 4)
print(f"Output shape: {output.shape}")  # (1, 4, 8)

The attention mechanism lets each token draw directly on information from any other token, which eases the long-range dependency problem that limited RNNs and LSTMs. Multi-head attention runs several attention computations in parallel, each over its own learned projection, so different heads can capture different types of relationships.
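
The head-splitting idea can be sketched in plain Python. This is a simplified illustration, not a full implementation: each head attends over a slice of the input instead of applying learned projections, and the final output projection is omitted. The function names and example vectors are assumptions.

```python
import math

def softmax(row: list[float]) -> list[float]:
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(k[0])
    out = []
    for query in q:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d_k) for key in k]
        weights = softmax(scores)
        out.append([sum(w * vec[i] for w, vec in zip(weights, v)) for i in range(len(v[0]))])
    return out

def multi_head_attention(x, num_heads):
    """Split each vector into num_heads slices, attend per head, then concatenate."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Simplification: slice the input instead of applying learned W_q, W_k, W_v
        part = [vec[h * d_head:(h + 1) * d_head] for vec in x]
        head_outputs.append(attention(part, part, part))
    # Concatenate the per-head outputs back to d_model dimensions per token
    return [sum((head[t] for head in head_outputs), []) for t in range(len(x))]

token_vectors = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
mha_out = multi_head_attention(token_vectors, num_heads=2)
print(len(mha_out), len(mha_out[0]))  # 3 4
```

Each head sees only its own slice of the representation, so the two heads compute different attention weights over the same three tokens.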

Encoder vs. Decoder Architecture

| Architecture | Models | Best For | Characteristics |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, DeBERTa | Classification, NER, QA | Bidirectional context, understanding tasks |
| Decoder-only | GPT-4, Claude, LLaMA | Text generation, chatbots | Autoregressive, left-to-right generation |
| Encoder-Decoder | T5, BART, mT5 | Translation, summarization | Full sequence-to-sequence mapping |
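
In practice, much of the difference between bidirectional and autoregressive models comes down to the attention mask. The sketch below builds both kinds as nested lists, using the convention that 1 means "may attend" and 0 means "blocked", which matches the `mask == 0` check in the attention function earlier. The helper names are illustrative.

```python
def bidirectional_mask(n: int) -> list[list[int]]:
    """Encoder-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n: int) -> list[list[int]]:
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The lower-triangular shape is what prevents a decoder from looking at tokens it has not generated yet.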

Major Model Families

BERT Family: Bidirectional Encoder Representations from Transformers. Pretrained on masked language modeling (predict masked words) and next-sentence prediction. Excels at understanding tasks.

from transformers import pipeline

# BERT for sentiment analysis (this checkpoint predicts 1 to 5 stars)
classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment"
)
result = classifier("This product is excellent!")
print(result)  # e.g., [{'label': '5 stars', 'score': 0.92}]; the exact score may vary

GPT Family: Generative Pretrained Transformers. Decoder-only models trained on next-token prediction. Excel at text generation and few-shot learning.
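
Next-token prediction can be illustrated with a toy bigram model and greedy decoding. This is only a sketch of the generation loop: a real GPT model conditions on the whole preceding context with a neural network, whereas this one counts which token follows which in a hypothetical 13-token corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real model learns from vastly more text
corpus_tokens = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each token follows each other token
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
    bigram_counts[prev][nxt] += 1

def generate(prompt: str, max_new_tokens: int = 5) -> list[str]:
    """Greedy decoding: repeatedly append the most frequent next token."""
    output = prompt.split()
    for _ in range(max_new_tokens):
        candidates = bigram_counts.get(output[-1])
        if not candidates:
            break  # no known continuation for this token
        next_token = candidates.most_common(1)[0][0]
        output.append(next_token)
        if next_token == ".":
            break  # treat the period as end of sequence
    return output

print(generate("the"))
print(generate("mat"))  # ['mat', '.']
```

Sampling from the distribution instead of always taking the top token is what gives real models varied output.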

T5 Family: Text-to-Text Transfer Transformer. Encoder-decoder that frames all NLP tasks as text-to-text problems. Prefix indicates the task (e.g., “translate English to French: …”).

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Translation (T5 frames it as text-to-text)
input_text = "translate English to French: The weather is beautiful today."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)  # "Le temps est beau aujourd'hui."

Efficient Attention in 2026

The latest NLP models use efficient attention mechanisms that reduce quadratic complexity:

| Method | Complexity | Description | Used By |
|---|---|---|---|
| Standard attention | O(n²) | Full pairwise attention | Original transformer |
| Sparse attention | O(n√n) | Each token attends to a subset | DeepSeek V4 CSA |
| Linear attention | O(n) | Low-rank or kernel-based approximation | Linformer, Performer |
| Flash attention | O(n²), about 2x faster in practice | IO-aware exact attention | GPT-4, Claude 4 |
| Sliding window | O(n·w) | Local window, optionally with global tokens | Mistral, Gemma 2 |
| Hierarchical | O(n log n) | Multi-resolution context | DeepSeek V4 HCA |
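
To give a feel for the gap between O(n²) and O(n·w), the sketch below counts how many query-key pairs are scored under full attention and under a causal sliding window. The window size of 512 is an arbitrary assumption, and the function names are illustrative.

```python
def full_attention_pairs(n: int) -> int:
    """Every token attends to every token: n * n score computations."""
    return n * n

def sliding_window_pairs(n: int, window: int) -> int:
    """Each token attends to itself and up to window - 1 previous tokens."""
    return sum(min(i + 1, window) for i in range(n))

for n in (1_000, 10_000, 100_000):
    full = full_attention_pairs(n)
    local = sliding_window_pairs(n, window=512)
    print(f"n={n:>7}: full={full:.2e}  window={local:.2e}  ratio={full / local:.1f}x")
```

The ratio grows roughly linearly with sequence length, which is why local attention matters most for long contexts.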

NLP Evaluation Metrics

Text Generation Metrics

import evaluate  # pip install evaluate; datasets.load_metric was removed from recent releases

# BLEU for translation quality
bleu = evaluate.load("bleu")
predictions = ["the cat is on the mat"]
references = [["the cat is on the mat"]]
result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['bleu']:.2f}")  # 1.00 for an exact match

# ROUGE for summarization (requires the rouge_score package)
rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["The cat sat on the mat"],
    references=[["The cat was sitting on the mat"]]
)
print(f"ROUGE-L: {results['rougeL']:.2f}")

# BERTScore using contextual embeddings (requires the bert_score package)
bertscore = evaluate.load("bertscore")
results = bertscore.compute(
    predictions=["The cat on the mat"],
    references=["The cat sat on the mat"],
    lang="en"
)
print(f"BERTScore F1: {results['f1'][0]:.2f}")

Ranges below use the common 0-100 reporting scale; most libraries return the same scores as fractions between 0 and 1.

| Metric | What It Measures | Range | Best For |
|---|---|---|---|
| BLEU | N-gram precision overlap | 0-100 | Machine translation |
| ROUGE | N-gram recall overlap | 0-100 | Text summarization |
| METEOR | Precision + recall with synonym matching | 0-100 | Translation, generation |
| BERTScore | Semantic similarity via BERT embeddings | 0-100 | Any text generation |
| Perplexity | How well the model predicts held-out text (lower is better) | 1-∞ | Language model quality |
| Exact Match | Exact string matching | 0-100 | Question answering |
| F1 Score | Harmonic mean of precision and recall | 0-100 | Classification, QA |
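
Perplexity is the one metric in the table without a library example, so here is a minimal sketch of the calculation: the exponential of the average negative log-probability that the model assigned to each correct token. The probability lists are hypothetical values, not output from a real model.

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-probability of the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities from two models on the same sentence
confident_model = [0.9, 0.8, 0.95, 0.85]
uncertain_model = [0.25, 0.25, 0.25, 0.25]

print(f"{perplexity(confident_model):.2f}")
print(f"{perplexity(uncertain_model):.2f}")  # 4.00
```

A perplexity of 4 means the model was, on average, as unsure as if it were choosing uniformly among four tokens.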

Classification Metrics

from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

report = classification_report(y_true, y_pred, target_names=["Negative", "Positive"])
print(report)

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"Confusion Matrix:\n{cm}")
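
The numbers in the report can be checked by hand. The sketch below recomputes precision, recall, and F1 for the positive class from the same labels, using only the standard library; the variable names are chosen to avoid clashing with the code above.

```python
labels_true = [0, 1, 1, 0, 1, 0, 1, 1]
labels_pred = [0, 1, 0, 0, 1, 1, 1, 0]

pairs = list(zip(labels_true, labels_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.60 f1=0.67
```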

Multilingual NLP

Cross-Lingual Models

Models like mBERT and XLM-RoBERTa are pretrained on roughly 100 languages at once, which enables cross-lingual transfer: a model fine-tuned on labeled data in one language can often be applied to others.

from transformers import pipeline

# XLM-RoBERTa for multilingual NER
ner = pipeline(
    "ner",
    model="xlm-roberta-large-finetuned-conll03-english",
    aggregation_strategy="simple"
)

# Fine-tuned on English NER data; other languages rely on zero-shot transfer, so quality varies
texts = [
    "Apple is looking to buy a U.S. startup.",  # English
    "苹果公司正在考虑收购一家美国初创公司。",     # Chinese
    "Apple cherche à acheter une startup américaine.",  # French
]

for text in texts:
    entities = ner(text)
    print(f"Text: {text[:30]}...")
    for ent in entities:
        print(f"  {ent['word']}: {ent['entity_group']} (score: {ent['score']:.2f})")

Translation Pipeline

from transformers import MarianMTModel, MarianTokenizer

def translate(text: str, source_lang: str = "en", target_lang: str = "fr") -> str:
    model_name = f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt", padding=True)
    translated = model.generate(**inputs, max_length=200)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

print(translate("Natural language processing is fascinating.", "en", "fr"))
# "Le traitement du langage naturel est fascinant."

Efficient Attention Mechanisms

The biggest trend in 2026 is efficient attention. Models like DeepSeek V4 use Compressed Sparse Attention (CSA) to reduce FLOPs to 27% of previous architectures at long context lengths. This makes million-token contexts economically viable for the first time.

Small Language Models (SLMs)

Smaller, more efficient models (roughly 2B-8B parameters) now approach the performance of 2024’s 70B+ models on many tasks. Compact models like Phi-4, Gemma 3, and LLaMA-3.2-3B can run on consumer hardware while maintaining strong performance on standard NLP benchmarks.

Multimodal NLP

The boundary between NLP and computer vision is dissolving. Models like GPT-5, Claude Opus 4.7, and Gemini 3 process text, images, audio, and video as unified input streams. NLP is becoming “language understanding in any modality.”

Agentic NLP

NLP models are evolving from passive text processors to active agents that can browse the web, execute code, and take actions. Function calling and tool use have become standard capabilities in modern LLMs.
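
At its core, tool use is a dispatch loop: the model emits a structured request, and the host program runs the matching function. The sketch below assumes a simple JSON shape with `name` and `arguments` keys; that shape, the tool names, and the hand-written `model_output` string are all hypothetical and do not reflect any specific vendor API.

```python
import json

def get_word_count(text: str) -> int:
    return len(text.split())

def to_upper(text: str) -> str:
    return text.upper()

# Registry of tools the (hypothetical) model is allowed to call
TOOLS = {"get_word_count": get_word_count, "to_upper": to_upper}

def dispatch(tool_call_json: str) -> dict:
    """Parse a model-produced tool call and run the matching function."""
    call = json.loads(tool_call_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](**args)}

# A hand-written stand-in for what an LLM might emit
model_output = '{"name": "get_word_count", "arguments": {"text": "NLP models can call tools"}}'
print(dispatch(model_output))  # {'result': 5}
```

Restricting calls to an explicit registry matters for safety: the model can only trigger functions the developer has chosen to expose.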

On-Device NLP

Frameworks like Google LiteRT, Qualcomm’s Neural Processing SDK, and Apple’s Core ML enable NLP models to run directly on phones and edge devices. Private, offline inference for tasks like sentiment analysis, translation, and text classification is now mainstream.

End-to-End NLP Project: Text Classification Pipeline

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import joblib

# 1. Sample data
texts = [
    "I love this product, it works perfectly!",       # positive
    "Terrible experience, would not recommend.",       # negative
    "Amazing quality and fast shipping.",              # positive
    "The worst purchase I've ever made.",              # negative
    "Good value for money, satisfied with purchase.",  # positive
    "Disappointed with the quality, broke in a week.", # negative
    "Excellent customer service and support.",         # positive
    "Not worth the price, very poor build quality.",   # negative
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# 2. Split data
# Stratify so that both classes appear in the tiny test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# 3. Build pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),
    ('classifier', LogisticRegression(C=1.0, max_iter=1000))
])

# 4. Train
pipeline.fit(X_train, y_train)

# 5. Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(
    y_test, y_pred, labels=[0, 1],
    target_names=["Negative", "Positive"], zero_division=0
))

# 6. Save model
joblib.dump(pipeline, "sentiment_model.pkl")

# 7. Inference on new text
def predict_sentiment(text: str) -> dict:
    model = joblib.load("sentiment_model.pkl")
    proba = model.predict_proba([text])[0]
    sentiment = "positive" if model.predict([text])[0] == 1 else "negative"
    return {
        "text": text,
        "sentiment": sentiment,
        "confidence": float(max(proba))
    }

print(predict_sentiment("This is absolutely fantastic!"))

Challenges in NLP

  • Ambiguity: Language is inherently ambiguous — “I saw her duck” could mean a bird or a dodging action
  • Context: Understanding long-range context and world knowledge remains difficult
  • Sarcasm and Irony: Detecting non-literal language requires deeper pragmatic understanding
  • Multilingual: 7,000+ languages exist, most with limited training data
  • Bias: Models can reproduce and amplify biases in their training data, requiring careful data curation and debiasing
  • Hallucination: LLMs generate plausible but incorrect information, especially problematic in factual domains
  • Cost: Training and deploying large models requires significant computational resources

Conclusion

NLP has transformed how we interact with technology. From basic text preprocessing to billion-parameter transformer models, the field continues to evolve rapidly. The key to mastering NLP is building a strong foundation in text processing and representation, understanding transformer architectures, and staying current with the rapid pace of model development. Start with fundamentals (tokenization, TF-IDF, word embeddings), then progress to transformers and fine-tuning for your specific use cases.

