Introduction
Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. Python has one of the richest NLP ecosystems of any programming language, from classical tools like NLTK, to production-grade libraries like spaCy, to state-of-the-art transformer models via Hugging Face.
The NLP Pipeline
A typical NLP pipeline processes text through several stages:
Raw Text
↓
Tokenization (split into words/sentences)
↓
Normalization (lowercase, remove punctuation)
↓
Stop Word Removal (remove "the", "is", "at"...)
↓
Stemming/Lemmatization (reduce to root form)
↓
Feature Extraction (TF-IDF, embeddings)
↓
Model / Task (classification, NER, translation...)
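As an illustration only, the first few stages above can be sketched as a dependency-free toy pipeline. The stop-word list and suffix-stripping "stemmer" here are deliberately crude stand-ins for what NLTK and spaCy do properly:

```python
import re

STOP_WORDS = {"the", "is", "at", "a", "an", "of", "and", "are"}  # toy list

def crude_stem(word):
    # Toy stemmer: strip a few common suffixes (real stemmers are far smarter)
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_pipeline(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize + normalize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [crude_stem(t) for t in tokens]                # stemming

print(toy_pipeline("The cats are running at the park"))
# → ['cat', 'runn', 'park']
```

Note how "running" becomes "runn": stems are not guaranteed to be real words, which is exactly the stemming-vs-lemmatization distinction covered below.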
NLTK: The Classic NLP Library
NLTK (Natural Language Toolkit) is the foundational Python NLP library — excellent for learning and research.
pip install nltk
Tokenization
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Natural language processing is fascinating. Python makes it accessible."
# Word tokenization
words = word_tokenize(text)
print(words)
# => ['Natural', 'language', 'processing', 'is', 'fascinating', '.', ...]
# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
# => ['Natural language processing is fascinating.', 'Python makes it accessible.']
# Regex tokenization
import re
tokens = re.split(r'\s+', text)  # crude: punctuation stays attached ("accessible.")
Stemming and Lemmatization
Stemming reduces words to their root form (may not be a real word). Lemmatization reduces to the dictionary form.
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "runs", "ran", "easily", "fairly"]
for word in words:
    print(f"{word:12} | Porter: {porter.stem(word):10} | Lancaster: {lancaster.stem(word):10} | Lemma: {lemmatizer.lemmatize(word)}")
# running      | Porter: run        | Lancaster: run        | Lemma: running
# runs         | Porter: run        | Lancaster: run        | Lemma: run
# Note: lemmatize() assumes nouns by default; lemmatizer.lemmatize("running", pos="v") returns "run"
Stop Words
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a sample sentence showing stop word removal")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)
# => ['sample', 'sentence', 'showing', 'stop', 'word', 'removal']
Part-of-Speech Tagging
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
# => [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
# DT=determiner, JJ=adjective, NN=noun (singular), VBZ=verb (3rd person singular present)
Frequency Distribution
from nltk import FreqDist
tokens = word_tokenize("the cat sat on the mat the cat")
fdist = FreqDist(tokens)
print(fdist.most_common(5))
# => [('the', 3), ('cat', 2), ('sat', 1), ('on', 1), ('mat', 1)]
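FreqDist is essentially a frequency counter; the standard library's collections.Counter produces the same counts without NLTK (using a naive str.split() here in place of word_tokenize):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat".split()
fdist = Counter(tokens)
print(fdist.most_common(3))
# → [('the', 3), ('cat', 2), ('sat', 1)]
```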
spaCy: Production-Grade NLP
spaCy is faster and more production-ready than NLTK. It provides pre-trained models for many languages.
pip install spacy
python -m spacy download en_core_web_sm # small English model
python -m spacy download en_core_web_lg # large English model (better accuracy)
Basic Pipeline
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Tokens
for token in doc:
    print(f"{token.text:15} {token.pos_:8} {token.dep_:10} {token.lemma_}")
# Named Entity Recognition
for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_:10} {spacy.explain(ent.label_)}")
# Apple ORG Companies, agencies, institutions
# U.K. GPE Countries, cities, states
# $1 billion MONEY Monetary values
Named Entity Recognition (NER)
text = """
Elon Musk founded SpaceX in 2002 in Hawthorne, California.
Tesla was incorporated in 2003 and went public in 2010.
"""
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:25} → {ent.label_} ({spacy.explain(ent.label_)})")
# Elon Musk → PERSON (People, including fictional)
# SpaceX → ORG (Companies, agencies, institutions)
# 2002 → DATE (Absolute or relative dates or periods)
# Hawthorne, California → GPE (Countries, cities, states)
Dependency Parsing
doc = nlp("The cat chased the mouse")
for token in doc:
    print(f"{token.text:10} → {token.dep_:10} → head: {token.head.text}")
# The → det → head: cat
# cat → nsubj → head: chased
# chased → ROOT → head: chased
# the → det → head: mouse
# mouse → dobj → head: chased
Text Similarity
nlp = spacy.load("en_core_web_lg") # need large model for vectors
doc1 = nlp("I like cats")
doc2 = nlp("I love dogs")
doc3 = nlp("The stock market crashed")
print(doc1.similarity(doc2)) # => ~0.85 (similar topic)
print(doc1.similarity(doc3)) # => ~0.3 (different topic)
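Doc.similarity is cosine similarity over averaged word vectors. A minimal NumPy version of the underlying formula, with made-up stand-in vectors for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (a · b) / (||a|| · ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 2.0, 0.0])   # stand-in for an averaged doc vector
v2 = np.array([2.0, 4.0, 0.0])   # same direction → similarity ≈ 1.0
v3 = np.array([0.0, 0.0, 3.0])   # orthogonal → similarity 0.0

print(cosine_similarity(v1, v2))  # ≈ 1.0
print(cosine_similarity(v1, v3))  # → 0.0
```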
Hugging Face Transformers: State of the Art
Transformers provides access to thousands of pre-trained models (BERT, GPT, T5, etc.).
pip install transformers torch
Sentiment Analysis
from transformers import pipeline
classifier = pipeline("sentiment-analysis")  # defaults to a DistilBERT model fine-tuned on SST-2
results = classifier([
    "I love this product! It's amazing.",
    "This is the worst experience I've ever had.",
    "The weather is okay today."
])
for result in results:
    print(f"{result['label']:10} ({result['score']:.3f})")
# POSITIVE (0.999)
# NEGATIVE (0.998)
# POSITIVE (0.612)
Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
text = "Hugging Face was founded in New York by Clément Delangue and Julien Chaumond."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']:25} → {entity['entity_group']}")
# Hugging Face → ORG
# New York → LOC
# Clément Delangue → PER
# Julien Chaumond → PER
Text Generation
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "The future of artificial intelligence is",
    max_length=50,
    num_return_sequences=2
)
for r in result:
    print(r['generated_text'])
Question Answering
qa = pipeline("question-answering")
context = """
Python was created by Guido van Rossum and first released in 1991.
It emphasizes code readability and simplicity.
"""
result = qa(question="Who created Python?", context=context)
print(result['answer']) # => "Guido van Rossum"
print(result['score']) # => confidence score
Text Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
The Industrial Revolution was a period of major industrialization and innovation
that took place during the late 1700s and early 1800s. The Industrial Revolution
began in Great Britain and quickly spread throughout Europe and North America.
The first Industrial Revolution began in Great Britain in the 1700s and 1800s
and was a time of significant innovation.
"""
summary = summarizer(article, max_length=60, min_length=20)
print(summary[0]['summary_text'])
Web Scraping for NLP Data
Beautiful Soup extracts text from HTML:
from bs4 import BeautifulSoup
import requests
url = "https://en.wikipedia.org/wiki/Natural_language_processing"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract all paragraph text
paragraphs = soup.find_all('p')
text = ' '.join([p.get_text() for p in paragraphs])
# Now process with spaCy or NLTK
doc = nlp(text[:5000]) # first 5000 chars
Reading PDF Files
pip install pypdf
from pypdf import PdfReader
reader = PdfReader("document.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() or ""  # image-only pages can yield no text
# Process the extracted text
doc = nlp(text)
TF-IDF: Finding Important Words
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
    "Python is great for data science",
    "Machine learning uses Python extensively",
    "Natural language processing is a field of AI",
    "Deep learning has revolutionized NLP"
]
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)
# Get feature names and scores for first document
feature_names = vectorizer.get_feature_names_out()
scores = tfidf_matrix[0].toarray()[0]
# Top words in first document
top_indices = scores.argsort()[-5:][::-1]
for i in top_indices:
    if scores[i] > 0:
        print(f"{feature_names[i]:20} {scores[i]:.4f}")
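Under the hood, TF-IDF multiplies a term's frequency within a document by its inverse document frequency across the corpus. A bare-bones version of the classic formula (scikit-learn's variant adds smoothing and normalization, so its exact numbers differ):

```python
import math
from collections import Counter

docs = [
    "python is great for data science".split(),
    "machine learning uses python extensively".split(),
    "natural language processing is a field of ai".split(),
]

def tf_idf(term, doc, corpus):
    tf = Counter(doc)[term] / len(doc)            # term frequency in this doc
    df = sum(1 for d in corpus if term in d)      # number of docs containing the term
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf

# "python" appears in 2 of 3 docs → low idf; "science" in only 1 → higher score
print(tf_idf("python", docs[0], docs))
print(tf_idf("science", docs[0], docs))
```

Terms that appear in every document get idf = log(1) = 0, which is why ubiquitous words score zero and distinctive words rise to the top.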
Choosing the Right Tool
| Task | Recommended Tool |
|---|---|
| Learning NLP concepts | NLTK |
| Production NLP pipeline | spaCy |
| State-of-the-art accuracy | Hugging Face Transformers |
| Sentiment analysis | Transformers (distilbert) |
| NER | spaCy or Transformers |
| Text classification | Transformers (fine-tuned BERT) |
| Topic modeling | gensim |
| Web scraping for text | BeautifulSoup + requests |
| PDF text extraction | pypdf |
Resources
- NLTK Documentation
- spaCy Documentation
- Hugging Face Transformers
- Natural Language Processing with Python (free book)
- fast.ai NLP Course