Module 4 3 Support Vector Machines - iffatAGheyas/NLP-handbook GitHub Wiki

Module 4.3: Support Vector Machines

Support Vector Machines (SVMs) are max-margin classifiers that find a hyperplane separating classes with the largest possible margin. In text classification, linear SVMs often perform very well on high-dimensional sparse data.


1. Theory Recap

  • Decision function: f(x) = sign(w · x + b), where w is the weight vector and b the bias.
  • Margin: the separating hyperplane maximizes the distance 2/‖w‖ to the closest training points (the support vectors).
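
The recap above corresponds to the standard soft-margin primal, sketched here for reference (C is the same regularization constant passed to LinearSVC below; larger C penalizes margin violations more heavily):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^2 \;+\; C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \quad y_i\!\left(\mathbf{w}^\top \mathbf{x}_i + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0
```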

2. scikit-learn LinearSVC on Text

This example builds a pipeline using TF–IDF features and LinearSVC.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy dataset
docs   = [
    "cheap meds available now",
    "team meeting at noon",
    "win money win prizes",
    "lunch with project team",
    "exclusive offer just for you",
    "report submission due tomorrow"
]
labels = ['spam','ham','spam','ham','spam','ham']

# Build pipeline: TF-IDF → LinearSVC
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1,2), stop_words='english'),
    LinearSVC(C=1.0, max_iter=10000)
)

# Train & predict
model.fit(docs, labels)
tests = [
    "limited time offer",
    "project team lunch",
    "win exclusive prize",
    "report due today"
]
preds = model.predict(tests)

for text, pred in zip(tests, preds):
    print(f"'{text}' → {pred}")

Output: one predicted label ('spam' or 'ham') per test sentence.
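Beyond hard labels, a linear SVM also exposes each document's signed distance to the hyperplane via decision_function. A minimal sketch on the same toy data (the two probe phrases are made up for illustration; positive scores map to model.classes_[1], which is 'spam' here because classes are sorted alphabetically):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = [
    "cheap meds available now",
    "team meeting at noon",
    "win money win prizes",
    "lunch with project team",
    "exclusive offer just for you",
    "report submission due tomorrow",
]
labels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words='english'),
    LinearSVC(C=1.0, max_iter=10000),
)
model.fit(docs, labels)

# Signed distance to the hyperplane: sign gives the class,
# magnitude gives (unscaled) confidence
phrases = ["win money now", "project team meeting"]
scores = model.decision_function(phrases)
for text, s in zip(phrases, scores):
    print(f"'{text}': {s:+.3f}")
```

The raw scores are not probabilities; if calibrated probabilities are needed, LinearSVC can be wrapped in sklearn.calibration.CalibratedClassifierCV.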

3. Inspecting Top Features

For a linear SVM, the learned coefficient vector 𝑤 indicates feature importance. Extract the top positive and negative features:

import numpy as np

# Access vectorizer and classifier
vectorizer, clf = model.named_steps['tfidfvectorizer'], model.named_steps['linearsvc']
feature_names = vectorizer.get_feature_names_out()
coef = clf.coef_[0]

# Top 5 spam indicators (largest positive weights)
top_spam = np.argsort(coef)[-5:][::-1]
# Top 5 ham indicators (most negative weights)
top_ham  = np.argsort(coef)[:5]

print("Top spam features:")
for idx in top_spam:
    print(f"  {feature_names[idx]} ({coef[idx]:.3f})")

print("\nTop ham features:")
for idx in top_ham:
    print(f"  {feature_names[idx]} ({coef[idx]:.3f})")

Output: the five highest-weighted features (spam indicators) and the five lowest-weighted features (ham indicators), each with its coefficient.
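The C used above (C=1.0) trades margin width against training errors; on a real corpus it is usually tuned by cross-validation. A minimal sketch with a hypothetical grid (cv=2 only because the toy set has six documents; use more folds in practice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

docs = [
    "cheap meds available now",
    "team meeting at noon",
    "win money win prizes",
    "lunch with project team",
    "exclusive offer just for you",
    "report submission due tomorrow",
]
labels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']

pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words='english'),
    LinearSVC(max_iter=10000),
)
# 'linearsvc' is the step name auto-generated by make_pipeline
grid = GridSearchCV(pipe, {'linearsvc__C': [0.1, 1.0, 10.0]}, cv=2)
grid.fit(docs, labels)
print(grid.best_params_)
```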

4. (Optional) Kernel SVM

For small datasets or non-linear patterns, sklearn.svm.SVC(kernel='rbf') can be used, but note it scales poorly on large sparse text data.

from sklearn.svm import SVC
model_rbf = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel='rbf', C=1.0, gamma='scale')
)
# model_rbf.fit(docs, labels)
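
For completeness, a self-contained sketch that actually fits the RBF pipeline on the toy documents from §2 (no output is claimed, since six training documents give little guarantee about individual predictions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

docs = [
    "cheap meds available now",
    "team meeting at noon",
    "win money win prizes",
    "lunch with project team",
    "exclusive offer just for you",
    "report submission due tomorrow",
]
labels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']

# RBF kernel: O(n^2) kernel matrix in the number of documents,
# hence the scaling caveat above
model_rbf = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel='rbf', C=1.0, gamma='scale'),
)
model_rbf.fit(docs, labels)

preds = model_rbf.predict(["win money prizes", "team meeting tomorrow"])
print(preds)
```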

Continue to Module 4.4: Conditional Random Fields for Sequence Labeling