[function] remove_stopwords() - P3chys/textmining GitHub Wiki

Function: remove_stopwords()

Purpose

Filters stopwords (common words that carry little semantic meaning) out of tokenized text while preserving sentiment-bearing words that matter for text mining analysis.

Syntax

remove_stopwords(tokens, extra_stopwords=None, keep_sentiment_words=True)

Parameters

  • tokens (list, required): List of word tokens to filter
  • extra_stopwords (list/set or None, default None): Additional stopwords to remove beyond NLTK's standard English stopwords
  • keep_sentiment_words (bool, default True): Whether to preserve sentiment-bearing stopwords such as "not" and "very"

Returns

  • Type: list
  • Content: Filtered list of tokens with stopwords removed

Required Import

from nltk.corpus import stopwords

Processing Logic

  1. Base Stopwords: Uses NLTK's English stopwords set (articles, prepositions, etc.)
  2. Custom Additions: Merges additional stopwords if provided via extra_stopwords
  3. Sentiment Preservation: Optionally retains sentiment-critical words that are normally considered stopwords

Sentiment Words Preserved (when keep_sentiment_words=True)

Negation Words: 'no', 'not', 'nor', 'none', 'never', 'hardly', 'barely', 'scarcely', 'rarely', 'seldom'

Intensifiers: 'very', 'really', 'quite', 'extremely', 'absolutely', 'completely', 'totally', 'utterly', 'entirely', 'somewhat'

Algorithm Steps

  1. Create base stopwords set from NLTK
  2. Add custom stopwords if provided
  3. Remove sentiment words from stopwords set (if keep_sentiment_words=True)
  4. Filter tokens using list comprehension

Usage Examples

# Basic stopword removal
tokens = ["this", "product", "is", "not", "very", "good", "at", "all"]
filtered = remove_stopwords(tokens)
# Result: ["product", "not", "very", "good"]

# With custom stopwords
tokens = ["product", "amazon", "quality", "price"]
custom_stops = ["amazon", "product"]
filtered = remove_stopwords(tokens, extra_stopwords=custom_stops)
# Result: ["quality", "price"]

# Without sentiment word preservation
tokens = ["this", "product", "is", "not", "very", "good"]
filtered = remove_stopwords(tokens, keep_sentiment_words=False)
# Result: ["product", "good"]

# Apply to DataFrame
df['filtered_tokens'] = df['review_tokens'].apply(
    lambda x: remove_stopwords(x, keep_sentiment_words=True)
)

Prerequisites

  • Requires NLTK library and stopwords corpus
  • Download required data: nltk.download('stopwords')
  • Input should be tokenized text (output from tokenize_text())

Design Rationale

  • Sentiment preservation: Crucial for sentiment analysis as negations and intensifiers affect meaning
  • Customization: Allows domain-specific stopword removal
  • Performance: Uses set operations for efficient filtering
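The performance point can be illustrated directly: membership tests against a set are O(1) on average, while a list requires a linear scan for every token, so converting the stopword collection to a set once pays off on large token streams. (Illustrative snippet; the stopword list shown is an assumption, not the repository's actual data.)

```python
stop_list = ['a', 'an', 'the', 'is', 'at', 'of', 'on']
stop_set = set(stop_list)  # built once; average O(1) membership tests

tokens = ['the', 'price', 'of', 'the', 'product', 'is', 'fair'] * 1000

# Both comprehensions produce the same result; the set version avoids
# rescanning the whole stopword collection for every token.
filtered_list = [t for t in tokens if t not in stop_list]
filtered_set = [t for t in tokens if t not in stop_set]
assert filtered_list == filtered_set
```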