Introduction
Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. From chatbots to translation to sentiment analysis, NLP powers many modern AI applications. This guide covers NLP fundamentals and practical techniques.
What Is NLP
Defining NLP
NLP is a subfield of AI focused on enabling computers to process and understand human language. It bridges linguistics and computer science.
Why NLP Matters
- Scale: Analyze massive text data
- Automation: Automate text-based tasks
- Insight: Extract meaning from unstructured data
- Accessibility: Enable human-computer interaction
Text Preprocessing
Essential Steps
Tokenization
Breaking text into words, sentences, or subwords:
import nltk
# nltk.download('punkt')  # one-time download of the tokenizer models
# Word tokenization
text = "Natural Language Processing is fascinating."
tokens = nltk.word_tokenize(text)
# ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
# Sentence tokenization
sentences = nltk.sent_tokenize(text)
Lowercasing
text = "HELLO World"
text = text.lower() # "hello world"
Removing Stopwords
from nltk.corpus import stopwords
# nltk.download('stopwords')  # one-time download of the stopword lists
stop_words = set(stopwords.words('english'))
# Lowercase before comparing: NLTK's stopword lists are lowercase
filtered = [w for w in tokens if w.lower() not in stop_words]
Stemming and Lemmatization
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Stemming
stemmer.stem("running") # "run"
# Lemmatization (pass the part of speech; the default, noun, leaves "running" unchanged)
lemmatizer.lemmatize("running", pos="v") # "run"
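The steps above are usually chained into a single pipeline. Here is a minimal pure-Python sketch of that chain (tokenize, lowercase, drop stopwords) with a toy stopword list; in practice you would use NLTK's tokenizers, stopword corpus, and lemmatizer as shown above.

```python
import re

# Toy stopword list for illustration only; use nltk's stopwords corpus in practice.
STOPWORDS = {"is", "a", "the", "and", "of"}

def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Natural Language Processing is a fascinating field."))
# ['natural', 'language', 'processing', 'fascinating', 'field']
```

The same structure extends naturally: add a stemming or lemmatization call as the last step in the list comprehension.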
Text Representation
Bag of Words (BoW)
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"This is a sample.",
"NLP is interesting.",
"Machine learning is powerful."
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
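Under the hood, CountVectorizer builds a vocabulary and counts how often each word appears in each document. A hand-rolled sketch of the same idea (with made-up two-document input) makes the output array easier to read:

```python
from collections import Counter

texts = ["this is a sample", "nlp is interesting"]

# Vocabulary: one column per unique word, sorted for a stable order.
vocab = sorted({word for doc in texts for word in doc.split()})

def bow_vector(doc):
    """Count vocabulary words in one document; order matches vocab."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)                               # ['a', 'interesting', 'is', 'nlp', 'sample', 'this']
print([bow_vector(doc) for doc in texts])  # [[1, 0, 1, 0, 1, 1], [0, 1, 1, 1, 0, 0]]
```

Each row is one document; each column counts one vocabulary word, which is exactly what `X.toarray()` shows.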
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
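TF-IDF weighs a term by its frequency in a document times its inverse document frequency across the corpus, so words that appear everywhere are down-weighted. A sketch using the textbook formula tf × log(N/df) (note scikit-learn uses a smoothed idf and L2-normalizes rows, so its numbers will differ):

```python
import math

docs = [
    ["this", "is", "a", "sample"],
    ["nlp", "is", "interesting"],
    ["machine", "learning", "is", "powerful"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)               # term frequency in this doc
    df = sum(1 for d in docs if term in d)        # how many docs contain it
    idf = math.log(len(docs) / df)                # inverse document frequency
    return tf * idf

# "is" appears in every document, so its idf -- and hence its weight -- is 0.
print(tf_idf("is", docs[0], docs))                # 0.0
print(round(tf_idf("nlp", docs[1], docs), 3))     # 0.366
```

Rare, document-specific words like "nlp" get the highest weights, which is why TF-IDF usually beats raw counts for retrieval and classification.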
Word Embeddings
Word2Vec:
from gensim.models import Word2Vec
sentences = [["hello", "world"], ["nlp", "is", "fun"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
# Find similar words
model.wv.most_similar("hello")
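`most_similar` ranks words by cosine similarity between their embedding vectors. A sketch of that computation over made-up 3-dimensional vectors (real Word2Vec vectors have hundreds of dimensions and are learned from data):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: similar words get geometrically close vectors.
king = [0.9, 0.1, 0.4]
queen = [0.85, 0.15, 0.45]
banana = [0.1, 0.9, 0.2]

print(round(cosine(king, queen), 3))   # close to 1.0
print(round(cosine(king, banana), 3))  # much lower
```

Similarity near 1 means the vectors point in nearly the same direction; this is the geometry that lets embeddings capture semantic relatedness.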
Sentiment Analysis
Basic Approach
from textblob import TextBlob
text = "This product is amazing!"
blob = TextBlob(text)
print(blob.sentiment) # Sentiment(polarity=..., subjectivity=...) — polarity in [-1, 1], here strongly positive
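Lexicon-based tools like TextBlob score text by looking up words in a polarity dictionary and averaging. A toy version of that idea, with a hypothetical four-word lexicon (real lexicons have thousands of scored entries plus rules for negation and intensifiers):

```python
# Toy polarity lexicon for illustration; real tools use a much larger one.
LEXICON = {"amazing": 0.8, "good": 0.5, "bad": -0.5, "terrible": -0.8}

def polarity(text):
    """Average the polarity of every lexicon word found in the text."""
    words = [w.strip("!.?,") for w in text.lower().split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("This product is amazing!"))  # 0.8
```

The obvious weakness: averaging word scores misses negation ("not amazing") and sarcasm, which is where the model-based approaches below come in.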
Using Pre-trained Models
from transformers import pipeline
sentiment = pipeline("sentiment-analysis")
result = sentiment("I love this product!")
Named Entity Recognition (NER)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
print(ent.text, ent.label_)
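To see what NER produces without loading a model, here is a toy rule-based tagger that finds money amounts and capitalized tokens with regular expressions. This is only a sketch of the task's input/output shape; spaCy's NER is a trained statistical model, not a rule list.

```python
import re

def toy_ner(text):
    """Tag money amounts and capitalized words; illustration only."""
    entities = []
    for m in re.finditer(r"\$\d+(?:\.\d+)?\s?(?:billion|million)?", text):
        entities.append((m.group().strip(), "MONEY"))
    for m in re.finditer(r"\b[A-Z][a-z]+\b", text):
        entities.append((m.group(), "ENT"))
    return entities

print(toy_ner("Apple is buying a startup for $1 billion"))
# [('$1 billion', 'MONEY'), ('Apple', 'ENT')]
```

Rules like these break quickly (lowercase brands, sentence-initial capitals), which is precisely why modern NER learns entity boundaries and types from annotated data.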
Text Classification
Traditional ML Approach
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
# Prepare data (train_texts / train_labels: your labeled training documents)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
# Train
model = MultinomialNB()
model.fit(X_train, train_labels)
# Predict
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)
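After predicting, evaluate against held-out labels. Accuracy is just the fraction of correct predictions; a sketch with hypothetical labels (equivalent to `sklearn.metrics.accuracy_score`):

```python
# Hypothetical predictions vs. gold test labels, for illustration.
predictions = ["pos", "neg", "pos", "pos", "neg"]
test_labels = ["pos", "neg", "neg", "pos", "neg"]

correct = sum(p == y for p, y in zip(predictions, test_labels))
accuracy = correct / len(test_labels)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.80
```

For imbalanced classes, also report precision, recall, and F1 (e.g. via `sklearn.metrics.classification_report`), since accuracy alone can be misleading.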
Transformers and Modern NLP
What Are Transformers
Transformer models use self-attention to process sequential data in parallel, revolutionizing NLP.
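The core operation is scaled dot-product attention: each token's query is compared against every token's key, the scores are softmaxed into weights, and the output is a weighted average of the value vectors. A minimal single-head sketch in pure Python over two toy 2-dimensional token vectors (real transformers learn separate Q, K, V projections and run many heads in parallel):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # attention weights sum to 1
        # Output is the weights-averaged mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Two toy token vectors used as Q, K, and V at once (self-attention).
X = [[1.0, 0.0], [0.0, 1.0]]
print(attention(X, X, X))
```

Because every query attends to every key in one matrix operation, the whole sequence is processed in parallel, unlike the step-by-step recurrence of RNNs.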
Using Hugging Face
from transformers import pipeline
# Text generation
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50)
# Question answering
qa = pipeline("question-answering")
result = qa(question="What is AI?", context="Artificial Intelligence is...")
Fine-tuning
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
dataset = load_dataset("imdb", split="train")
# Tokenize the raw text so the model can consume it
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
Building NLP Applications
Chatbots
from transformers import pipeline, Conversation
chatbot = pipeline("conversational", model="microsoft/DialoGPT-medium")
conversation = Conversation("Hello, how are you?")
response = chatbot(conversation)
Text Summarization
summarizer = pipeline("summarization")
summary = summarizer(
"Long text to summarize...",
max_length=130,
min_length=30
)
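The Hugging Face summarizer above is abstractive: it generates new sentences. The simpler extractive approach selects the highest-scoring sentences verbatim. A toy sketch that scores sentences by average word frequency (real extractive systems also handle stopwords, sentence position, and redundancy):

```python
from collections import Counter
import re

def extractive_summary(text, n_sentences=1):
    """Score each sentence by the average corpus frequency of its words; keep the top n."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(s):
        toks = re.findall(r"[a-z]+", s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    ranked = sorted(sentences, key=score, reverse=True)
    return " ".join(ranked[:n_sentences])

text = ("NLP models process text. Transformer models process text with attention. "
        "Cats sleep a lot.")
print(extractive_summary(text))  # NLP models process text.
```

Extractive summaries can never rephrase, but they are fast, cheap, and guaranteed to be faithful to the source wording.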
Challenges in NLP
- Ambiguity: Language is often ambiguous
- Context: Understanding context is difficult
- Sarcasm: Detecting sarcasm and irony
- Multilingual: Different languages have different structures
- Bias: Models can perpetuate biases
Conclusion
NLP has transformed how we interact with technology. From basic text processing to transformer-based models, the field continues to evolve rapidly. Start with fundamentals and build toward more complex applications.