Introduction
Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. Python has one of the richest NLP ecosystems of any programming language, from classical tools like NLTK, to production-grade libraries like spaCy, to state-of-the-art transformer models via Hugging Face.
The NLP Pipeline
A typical NLP pipeline processes text through several stages:
Raw Text
↓
Tokenization (split into words/sentences)
↓
Normalization (lowercase, remove punctuation)
↓
Stop Word Removal (remove "the", "is", "at"...)
↓
Stemming/Lemmatization (reduce to root form)
↓
Feature Extraction (TF-IDF, embeddings)
↓
Model / Task (classification, NER, translation...)
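As an illustration only, the first few stages above can be sketched as a dependency-free toy pipeline. The stop-word list and suffix-stripping "stemmer" here are deliberately crude stand-ins for what NLTK and spaCy do properly:

```python
import re

STOP_WORDS = {"the", "is", "at", "a", "an", "of", "and", "are"}  # toy list

def crude_stem(word):
    # Toy stemmer: strip a few common suffixes (real stemmers are far smarter)
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_pipeline(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize + normalize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [crude_stem(t) for t in tokens]                # stemming

print(toy_pipeline("The cats are running at the park"))
# → ['cat', 'runn', 'park']
```

Note how "running" becomes "runn": stems are not guaranteed to be real words, which is exactly the stemming-vs-lemmatization distinction covered below.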
NLTK: The Classic NLP Library
NLTK (Natural Language Toolkit) is the foundational Python NLP library — excellent for learning and research.
pip install nltk
Tokenization
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Natural language processing is fascinating. Python makes it accessible."
# Word tokenization
words = word_tokenize(text)
print(words)
# => ['Natural', 'language', 'processing', 'is', 'fascinating', '.', ...]
# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
# => ['Natural language processing is fascinating.', 'Python makes it accessible.']
# Regex tokenization
import re
tokens = re.split(r'\s+', text)  # crude: punctuation stays attached ("accessible.")
Stemming and Lemmatization
Stemming reduces words to their root form (may not be a real word). Lemmatization reduces to the dictionary form.
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "runs", "ran", "easily", "fairly"]
for word in words:
    print(f"{word:12} | Porter: {porter.stem(word):10} | Lancaster: {lancaster.stem(word):10} | Lemma: {lemmatizer.lemmatize(word)}")
# running      | Porter: run        | Lancaster: run        | Lemma: running
# runs         | Porter: run        | Lancaster: run        | Lemma: run
# Note: lemmatize() assumes nouns by default; lemmatizer.lemmatize("running", pos="v") returns "run"
Stop Words
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a sample sentence showing stop word removal")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)
# => ['sample', 'sentence', 'showing', 'stop', 'word', 'removal']
Part-of-Speech Tagging
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
# => [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
# DT=determiner, JJ=adjective, NN=noun (singular), VBZ=verb (3rd person singular present)
Frequency Distribution
from nltk import FreqDist
tokens = word_tokenize("the cat sat on the mat the cat")
fdist = FreqDist(tokens)
print(fdist.most_common(5))
# => [('the', 3), ('cat', 2), ('sat', 1), ('on', 1), ('mat', 1)]
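FreqDist is essentially a frequency counter; the standard library's collections.Counter produces the same counts without NLTK (using a naive str.split() here in place of word_tokenize):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat".split()
fdist = Counter(tokens)
print(fdist.most_common(3))
# → [('the', 3), ('cat', 2), ('sat', 1)]
```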
spaCy: Production-Grade NLP
spaCy is faster and more production-ready than NLTK. It provides pre-trained models for many languages.
pip install spacy
python -m spacy download en_core_web_sm # small English model
python -m spacy download en_core_web_lg # large English model (better accuracy)
Basic Pipeline
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Tokens
for token in doc:
    print(f"{token.text:15} {token.pos_:8} {token.dep_:10} {token.lemma_}")
# Named Entity Recognition
for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_:10} {spacy.explain(ent.label_)}")
# Apple ORG Companies, agencies, institutions
# U.K. GPE Countries, cities, states
# $1 billion MONEY Monetary values
Named Entity Recognition (NER)
text = """
Elon Musk founded SpaceX in 2002 in Hawthorne, California.
Tesla was incorporated in 2003 and went public in 2010.
"""
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:25} → {ent.label_} ({spacy.explain(ent.label_)})")
# Elon Musk → PERSON (People, including fictional)
# SpaceX → ORG (Companies, agencies, institutions)
# 2002 → DATE (Absolute or relative dates or periods)
# Hawthorne, California → GPE (Countries, cities, states)
Dependency Parsing
doc = nlp("The cat chased the mouse")
for token in doc:
    print(f"{token.text:10} → {token.dep_:10} → head: {token.head.text}")
# The → det → head: cat
# cat → nsubj → head: chased
# chased → ROOT → head: chased
# the → det → head: mouse
# mouse → dobj → head: chased
Text Similarity
nlp = spacy.load("en_core_web_lg") # need large model for vectors
doc1 = nlp("I like cats")
doc2 = nlp("I love dogs")
doc3 = nlp("The stock market crashed")
print(doc1.similarity(doc2)) # => ~0.85 (similar topic)
print(doc1.similarity(doc3)) # => ~0.3 (different topic)
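Doc.similarity is cosine similarity over averaged word vectors. A minimal NumPy version of the underlying formula, with made-up stand-in vectors for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (a · b) / (||a|| · ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 2.0, 0.0])   # stand-in for an averaged doc vector
v2 = np.array([2.0, 4.0, 0.0])   # same direction → similarity ≈ 1.0
v3 = np.array([0.0, 0.0, 3.0])   # orthogonal → similarity 0.0

print(cosine_similarity(v1, v2))  # ≈ 1.0
print(cosine_similarity(v1, v3))  # → 0.0
```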
Hugging Face Transformers: State of the Art
Transformers provides access to thousands of pre-trained models (BERT, GPT, T5, etc.).
pip install transformers torch
Sentiment Analysis
from transformers import pipeline
classifier = pipeline("sentiment-analysis")  # defaults to a DistilBERT model fine-tuned on SST-2
results = classifier([
    "I love this product! It's amazing.",
    "This is the worst experience I've ever had.",
    "The weather is okay today."
])
for result in results:
    print(f"{result['label']:10} ({result['score']:.3f})")
# POSITIVE (0.999)
# NEGATIVE (0.998)
# POSITIVE (0.612)
Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
text = "Hugging Face was founded in New York by Clément Delangue and Julien Chaumond."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']:25} → {entity['entity_group']}")
# Hugging Face → ORG
# New York → LOC
# Clément Delangue → PER
# Julien Chaumond → PER
Text Generation
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "The future of artificial intelligence is",
    max_length=50,
    num_return_sequences=2
)
for r in result:
    print(r['generated_text'])
Question Answering
qa = pipeline("question-answering")
context = """
Python was created by Guido van Rossum and first released in 1991.
It emphasizes code readability and simplicity.
"""
result = qa(question="Who created Python?", context=context)
print(result['answer']) # => "Guido van Rossum"
print(result['score']) # => confidence score
Text Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
The Industrial Revolution was a period of major industrialization and innovation
that took place during the late 1700s and early 1800s. The Industrial Revolution
began in Great Britain and quickly spread throughout Europe and North America.
The first Industrial Revolution began in Great Britain in the 1700s and 1800s
and was a time of significant innovation.
"""
summary = summarizer(article, max_length=60, min_length=20)
print(summary[0]['summary_text'])
Web Scraping for NLP Data
Beautiful Soup extracts text from HTML:
from bs4 import BeautifulSoup
import requests
url = "https://en.wikipedia.org/wiki/Natural_language_processing"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract all paragraph text
paragraphs = soup.find_all('p')
text = ' '.join([p.get_text() for p in paragraphs])
# Now process with spaCy or NLTK
doc = nlp(text[:5000]) # first 5000 chars
Reading PDF Files
pip install pypdf
from pypdf import PdfReader
reader = PdfReader("document.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() or ""  # image-only pages can yield no text
# Process the extracted text
doc = nlp(text)
TF-IDF: Finding Important Words
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
    "Python is great for data science",
    "Machine learning uses Python extensively",
    "Natural language processing is a field of AI",
    "Deep learning has revolutionized NLP"
]
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)
# Get feature names and scores for first document
feature_names = vectorizer.get_feature_names_out()
scores = tfidf_matrix[0].toarray()[0]
# Top words in first document
top_indices = scores.argsort()[-5:][::-1]
for i in top_indices:
    if scores[i] > 0:
        print(f"{feature_names[i]:20} {scores[i]:.4f}")
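Under the hood, TF-IDF multiplies a term's frequency within a document by its inverse document frequency across the corpus. A bare-bones version of the classic formula (scikit-learn's variant adds smoothing and normalization, so its exact numbers differ):

```python
import math
from collections import Counter

docs = [
    "python is great for data science".split(),
    "machine learning uses python extensively".split(),
    "natural language processing is a field of ai".split(),
]

def tf_idf(term, doc, corpus):
    tf = Counter(doc)[term] / len(doc)            # term frequency in this doc
    df = sum(1 for d in corpus if term in d)      # number of docs containing the term
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf

# "python" appears in 2 of 3 docs → low idf; "science" in only 1 → higher score
print(tf_idf("python", docs[0], docs))
print(tf_idf("science", docs[0], docs))
```

Terms that appear in every document get idf = log(1) = 0, which is why ubiquitous words score zero and distinctive words rise to the top.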
Choosing the Right Tool
| Task | Recommended Tool |
|---|---|
| Learning NLP concepts | NLTK |
| Production NLP pipeline | spaCy |
| State-of-the-art accuracy | Hugging Face Transformers |
| Sentiment analysis | Transformers (distilbert) |
| NER | spaCy or Transformers |
| Text classification | Transformers (fine-tuned BERT) |
| Topic modeling | gensim |
| Web scraping for text | BeautifulSoup + requests |
| PDF text extraction | pypdf |
Resources
- NLTK Documentation
- spaCy Documentation
- Hugging Face Transformers
- Natural Language Processing with Python (free book)
- fast.ai NLP Course