Introduction
Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. From chatbots to translation to sentiment analysis, NLP powers many modern AI applications. This guide covers NLP fundamentals and practical techniques.
What Is NLP
Defining NLP
NLP is a subfield of AI focused on enabling computers to process and understand human language. It bridges linguistics and computer science.
Why NLP Matters
- Scale: Analyze massive text data
- Automation: Automate text-based tasks
- Insight: Extract meaning from unstructured data
- Accessibility: Enable human-computer interaction
Text Preprocessing
Essential Steps
Tokenization
Breaking text into words, sentences, or subwords:
import nltk
# nltk.download('punkt')  # one-time download of the tokenizer models
# Word tokenization
text = "Natural Language Processing is fascinating."
tokens = nltk.word_tokenize(text)
# ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
# Sentence tokenization
sentences = nltk.sent_tokenize(text)
Lowercasing
text = "HELLO World"
text = text.lower() # "hello world"
Removing Stopwords
from nltk.corpus import stopwords
# nltk.download('stopwords')  # one-time download of the stopword lists
stop_words = set(stopwords.words('english'))
# Lowercase before comparing: NLTK's stopword lists are lowercase
filtered = [w for w in tokens if w.lower() not in stop_words]
Stemming and Lemmatization
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Stemming
stemmer.stem("running") # "run"
# Lemmatization (pass the part of speech; the default, noun, leaves "running" unchanged)
lemmatizer.lemmatize("running", pos="v") # "run"
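The steps above are usually chained into a single pipeline. Here is a minimal pure-Python sketch of that chain (tokenize, lowercase, drop stopwords) with a toy stopword list; in practice you would use NLTK's tokenizers, stopword corpus, and lemmatizer as shown above.

```python
import re

# Toy stopword list for illustration only; use nltk's stopwords corpus in practice.
STOPWORDS = {"is", "a", "the", "and", "of"}

def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Natural Language Processing is a fascinating field."))
# ['natural', 'language', 'processing', 'fascinating', 'field']
```

The same structure extends naturally: add a stemming or lemmatization call as the last step in the list comprehension.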
Text Representation
Bag of Words (BoW)
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"This is a sample.",
"NLP is interesting.",
"Machine learning is powerful."
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
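Under the hood, CountVectorizer builds a vocabulary and counts how often each word appears in each document. A hand-rolled sketch of the same idea (with made-up two-document input) makes the output array easier to read:

```python
from collections import Counter

texts = ["this is a sample", "nlp is interesting"]

# Vocabulary: one column per unique word, sorted for a stable order.
vocab = sorted({word for doc in texts for word in doc.split()})

def bow_vector(doc):
    """Count vocabulary words in one document; order matches vocab."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)                               # ['a', 'interesting', 'is', 'nlp', 'sample', 'this']
print([bow_vector(doc) for doc in texts])  # [[1, 0, 1, 0, 1, 1], [0, 1, 1, 1, 0, 0]]
```

Each row is one document; each column counts one vocabulary word, which is exactly what `X.toarray()` shows.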
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
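TF-IDF weighs a term by its frequency in a document times its inverse document frequency across the corpus, so words that appear everywhere are down-weighted. A sketch using the textbook formula tf × log(N/df) (note scikit-learn uses a smoothed idf and L2-normalizes rows, so its numbers will differ):

```python
import math

docs = [
    ["this", "is", "a", "sample"],
    ["nlp", "is", "interesting"],
    ["machine", "learning", "is", "powerful"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)               # term frequency in this doc
    df = sum(1 for d in docs if term in d)        # how many docs contain it
    idf = math.log(len(docs) / df)                # inverse document frequency
    return tf * idf

# "is" appears in every document, so its idf -- and hence its weight -- is 0.
print(tf_idf("is", docs[0], docs))                # 0.0
print(round(tf_idf("nlp", docs[1], docs), 3))     # 0.366
```

Rare, document-specific words like "nlp" get the highest weights, which is why TF-IDF usually beats raw counts for retrieval and classification.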
Word Embeddings
Word2Vec:
from gensim.models import Word2Vec
sentences = [["hello", "world"], ["nlp", "is", "fun"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
# Find similar words
model.wv.most_similar("hello")
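`most_similar` ranks words by cosine similarity between their embedding vectors. A sketch of that computation over made-up 3-dimensional vectors (real Word2Vec vectors have hundreds of dimensions and are learned from data):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: similar words get geometrically close vectors.
king = [0.9, 0.1, 0.4]
queen = [0.85, 0.15, 0.45]
banana = [0.1, 0.9, 0.2]

print(round(cosine(king, queen), 3))   # close to 1.0
print(round(cosine(king, banana), 3))  # much lower
```

Similarity near 1 means the vectors point in nearly the same direction; this is the geometry that lets embeddings capture semantic relatedness.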
Sentiment Analysis
Basic Approach
from textblob import TextBlob
text = "This product is amazing!"
blob = TextBlob(text)
print(blob.sentiment) # Sentiment(polarity=..., subjectivity=...) — polarity in [-1, 1], here strongly positive
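Lexicon-based tools like TextBlob score text by looking up words in a polarity dictionary and averaging. A toy version of that idea, with a hypothetical four-word lexicon (real lexicons have thousands of scored entries plus rules for negation and intensifiers):

```python
# Toy polarity lexicon for illustration; real tools use a much larger one.
LEXICON = {"amazing": 0.8, "good": 0.5, "bad": -0.5, "terrible": -0.8}

def polarity(text):
    """Average the polarity of every lexicon word found in the text."""
    words = [w.strip("!.?,") for w in text.lower().split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("This product is amazing!"))  # 0.8
```

The obvious weakness: averaging word scores misses negation ("not amazing") and sarcasm, which is where the model-based approaches below come in.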
Using Pre-trained Models
from transformers import pipeline
sentiment = pipeline("sentiment-analysis")
result = sentiment("I love this product!")
Named Entity Recognition (NER)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
print(ent.text, ent.label_)
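To see what NER produces without loading a model, here is a toy rule-based tagger that finds money amounts and capitalized tokens with regular expressions. This is only a sketch of the task's input/output shape; spaCy's NER is a trained statistical model, not a rule list.

```python
import re

def toy_ner(text):
    """Tag money amounts and capitalized words; illustration only."""
    entities = []
    for m in re.finditer(r"\$\d+(?:\.\d+)?\s?(?:billion|million)?", text):
        entities.append((m.group().strip(), "MONEY"))
    for m in re.finditer(r"\b[A-Z][a-z]+\b", text):
        entities.append((m.group(), "ENT"))
    return entities

print(toy_ner("Apple is buying a startup for $1 billion"))
# [('$1 billion', 'MONEY'), ('Apple', 'ENT')]
```

Rules like these break quickly (lowercase brands, sentence-initial capitals), which is precisely why modern NER learns entity boundaries and types from annotated data.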
Text Classification
Traditional ML Approach
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
# Prepare data (train_texts / train_labels: your labeled training documents)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
# Train
model = MultinomialNB()
model.fit(X_train, train_labels)
# Predict
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)
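After predicting, evaluate against held-out labels. Accuracy is just the fraction of correct predictions; a sketch with hypothetical labels (equivalent to `sklearn.metrics.accuracy_score`):

```python
# Hypothetical predictions vs. gold test labels, for illustration.
predictions = ["pos", "neg", "pos", "pos", "neg"]
test_labels = ["pos", "neg", "neg", "pos", "neg"]

correct = sum(p == y for p, y in zip(predictions, test_labels))
accuracy = correct / len(test_labels)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.80
```

For imbalanced classes, also report precision, recall, and F1 (e.g. via `sklearn.metrics.classification_report`), since accuracy alone can be misleading.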
Transformers and Modern NLP
What Are Transformers
Transformer models use self-attention to process sequential data in parallel, revolutionizing NLP.
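The core operation is scaled dot-product attention: each token's query is compared against every token's key, the scores are softmaxed into weights, and the output is a weighted average of the value vectors. A minimal single-head sketch in pure Python over two toy 2-dimensional token vectors (real transformers learn separate Q, K, V projections and run many heads in parallel):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # attention weights sum to 1
        # Output is the weights-averaged mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Two toy token vectors used as Q, K, and V at once (self-attention).
X = [[1.0, 0.0], [0.0, 1.0]]
print(attention(X, X, X))
```

Because every query attends to every key in one matrix operation, the whole sequence is processed in parallel, unlike the step-by-step recurrence of RNNs.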
Using Hugging Face
from transformers import pipeline
# Text generation
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50)
# Question answering
qa = pipeline("question-answering")
result = qa(question="What is AI?", context="Artificial Intelligence is...")
Fine-tuning
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
dataset = load_dataset("imdb", split="train")
# Tokenize the raw text so the model can consume it
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
Building NLP Applications
Chatbots
from transformers import pipeline, Conversation
chatbot = pipeline("conversational", model="microsoft/DialoGPT-medium")
conversation = Conversation("Hello, how are you?")
response = chatbot(conversation)
Text Summarization
summarizer = pipeline("summarization")
summary = summarizer(
"Long text to summarize...",
max_length=130,
min_length=30
)
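The Hugging Face summarizer above is abstractive: it generates new sentences. The simpler extractive approach selects the highest-scoring sentences verbatim. A toy sketch that scores sentences by average word frequency (real extractive systems also handle stopwords, sentence position, and redundancy):

```python
from collections import Counter
import re

def extractive_summary(text, n_sentences=1):
    """Score each sentence by the average corpus frequency of its words; keep the top n."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(s):
        toks = re.findall(r"[a-z]+", s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    ranked = sorted(sentences, key=score, reverse=True)
    return " ".join(ranked[:n_sentences])

text = ("NLP models process text. Transformer models process text with attention. "
        "Cats sleep a lot.")
print(extractive_summary(text))  # NLP models process text.
```

Extractive summaries can never rephrase, but they are fast, cheap, and guaranteed to be faithful to the source wording.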
Challenges in NLP
- Ambiguity: Language is often ambiguous
- Context: Understanding context is difficult
- Sarcasm: Detecting sarcasm and irony
- Multilingual: Different languages have different structures
- Bias: Models can perpetuate biases
Conclusion
NLP has transformed how we interact with technology. From basic text processing to transformer-based models, the field continues to evolve rapidly. Start with fundamentals and build toward more complex applications.