Module 4.3: Support Vector Machines
Support Vector Machines (SVMs) are max-margin classifiers that find a hyperplane separating classes with the largest possible margin. In text classification, linear SVMs often perform very well on high-dimensional sparse data.
1. Theory Recap
- Decision function: f(x) = sign(w·x + b), where w is the learned weight vector and b the bias.
- Geometric margin: with the canonical scaling yᵢ(w·xᵢ + b) ≥ 1, the closest training points lie at distance 1/||w|| from the hyperplane, so maximising the margin is equivalent to minimising ½||w||².
- Soft-margin objective: minimise ½||w||² + C Σᵢ max(0, 1 − yᵢ(w·xᵢ + b)); the hinge-loss term penalises points on the wrong side of, or inside, the margin, and C controls the trade-off (see the numeric sketch below).
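To make the recap concrete, here is a minimal NumPy sketch that evaluates the decision function and per-sample hinge loss. The weight vector, bias and data points are made-up illustrative values, not learned ones:

import numpy as np

# Hypothetical learned parameters (illustrative values only)
w = np.array([2.0, -1.0])   # weight vector
b = -0.5                    # bias

# Three hypothetical 2-D points with labels in {-1, +1}
X = np.array([[1.0, 0.0], [0.0, 2.0], [0.6, 0.5]])
y = np.array([1, -1, 1])

scores = X @ w + b                      # w·x + b for each point
preds  = np.sign(scores)                # predicted labels
hinge  = np.maximum(0, 1 - y * scores)  # hinge loss per sample

print(scores)  # [ 1.5 -2.5  0.2]
print(preds)   # [ 1. -1.  1.]
print(hinge)   # [0.  0.  0.8]

Note the third point: it is classified correctly (score 0.2 > 0) yet still incurs hinge loss, because it lies inside the margin (|score| < 1).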
2. scikit-learn LinearSVC on Text
This example builds a pipeline that combines TF–IDF features with LinearSVC:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy dataset
docs = [
    "cheap meds available now",
    "team meeting at noon",
    "win money win prizes",
    "lunch with project team",
    "exclusive offer just for you",
    "report submission due tomorrow"
]
labels = ['spam','ham','spam','ham','spam','ham']

# Build pipeline: TF-IDF → LinearSVC
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1,2), stop_words='english'),
    LinearSVC(C=1.0, max_iter=10000)
)

# Train & predict
model.fit(docs, labels)

tests = [
    "limited time offer",
    "project team lunch",
    "win exclusive prize",
    "report due today"
]
preds = model.predict(tests)
for text, pred in zip(tests, preds):
    print(f"'{text}' → {pred}")
Running this prints each test sentence together with its predicted spam/ham label.
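Because a linear SVM is a margin classifier, you can also inspect how far each document sits from the hyperplane. model.decision_function returns the signed distance: positive values point toward the positive class (the second entry of model.classes_, here 'spam'), negative values toward 'ham', and larger magnitudes indicate more confident predictions. A short follow-up, reusing model and tests from above:

# Signed distance of each test document from the hyperplane
margins = model.decision_function(tests)
for text, m in zip(tests, margins):
    print(f"'{text}': {m:+.3f}")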
3. Inspecting Top Features
For a linear SVM, the learned coefficient vector w indicates feature importance. In a binary problem, coef_[0] holds the weights for the positive class, which scikit-learn takes to be the second entry of clf.classes_ (here 'spam', since class labels are sorted alphabetically); positive weights therefore push a document toward spam and negative weights toward ham. Extract the top positive and negative features:
import numpy as np

# Access vectorizer and classifier
vectorizer, clf = model.named_steps['tfidfvectorizer'], model.named_steps['linearsvc']
feature_names = vectorizer.get_feature_names_out()
coef = clf.coef_[0]

# Top 5 spam indicators (largest positive weights)
top_spam = np.argsort(coef)[-5:][::-1]
# Top 5 ham indicators (most negative weights)
top_ham = np.argsort(coef)[:5]

print("Top spam features:")
for idx in top_spam:
    print(f"  {feature_names[idx]} ({coef[idx]:.3f})")

print("\nTop ham features:")
for idx in top_ham:
    print(f"  {feature_names[idx]} ({coef[idx]:.3f})")
This lists the most positive (spam-leaning) and most negative (ham-leaning) n-grams together with their weights.
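The snippet above assumes a binary problem, where coef_ is a single row. With more than two classes, LinearSVC trains one-vs-rest classifiers and coef_ holds one row per class. A small helper covering both cases (show_top_features is a name made up here, not a library function):

def show_top_features(vectorizer, clf, n=5):
    """Print the n highest-weight features for each class."""
    names = vectorizer.get_feature_names_out()
    if clf.coef_.shape[0] == 1:
        # Binary: the single row scores the positive class, classes_[1]
        rows = [(clf.classes_[1], clf.coef_[0])]
    else:
        # Multiclass one-vs-rest: one weight row per class
        rows = list(zip(clf.classes_, clf.coef_))
    for label, row in rows:
        top = np.argsort(row)[-n:][::-1]
        feats = ", ".join(f"{names[i]} ({row[i]:.3f})" for i in top)
        print(f"{label}: {feats}")

show_top_features(vectorizer, clf)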
4. (Optional) Kernel SVM
For small datasets or non-linearly separable patterns, sklearn.svm.SVC(kernel='rbf') can be used instead, but note that its fit time scales at least quadratically with the number of training samples, so it scales poorly to large sparse text collections.
from sklearn.svm import SVC

model_rbf = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel='rbf', C=1.0, gamma='scale')
)
# model_rbf.fit(docs, labels)
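In practice, kernel SVMs are sensitive to C and gamma, so they are usually tuned by cross-validated grid search. A minimal sketch with GridSearchCV; the grid values are illustrative, and the commented-out fit would need more data than the six toy documents for a meaningful search:

from sklearn.model_selection import GridSearchCV

# Parameter names use make_pipeline's auto-generated step name 'svc'
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': ['scale', 0.1, 1],
}
search = GridSearchCV(model_rbf, param_grid, cv=3)
# search.fit(docs, labels)
# print(search.best_params_, search.best_score_)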
Continue to Module 4.4: Conditional Random Fields for Sequence Labeling