[function] remove_stopwords() - P3chys/textmining GitHub Wiki

Function: remove_stopwords()

Purpose

Filters stopwords (common words that carry little semantic meaning) out of tokenized text while preserving sentiment-bearing words that matter for text mining analysis.

Syntax

remove_stopwords(tokens, extra_stopwords=None, keep_sentiment_words=True)

Parameters

  • tokens (list, required): List of word tokens to filter
  • extra_stopwords (list/set or None, default None): Additional stopwords to remove beyond NLTK's standard English stopwords
  • keep_sentiment_words (bool, default True): Whether to preserve sentiment-bearing stopwords such as "not" and "very"

Returns

  • Type: list
  • Content: Filtered list of tokens with stopwords removed

Required Import

from nltk.corpus import stopwords

Processing Logic

  1. Base Stopwords: Uses NLTK's English stopwords set (articles, prepositions, etc.)
  2. Custom Additions: Merges additional stopwords if provided via extra_stopwords
  3. Sentiment Preservation: Optionally retains sentiment-critical words that are normally considered stopwords

Sentiment Words Preserved (when keep_sentiment_words=True)

Negation Words: 'no', 'not', 'nor', 'none', 'never', 'hardly', 'barely', 'scarcely', 'rarely', 'seldom'

Intensifiers: 'very', 'really', 'quite', 'extremely', 'absolutely', 'completely', 'totally', 'utterly', 'entirely', 'somewhat'

Algorithm Steps

  1. Create base stopwords set from NLTK
  2. Add custom stopwords if provided
  3. Remove sentiment words from stopwords set (if keep_sentiment_words=True)
  4. Filter tokens using list comprehension

Usage Examples

# Basic stopword removal
tokens = ["this", "product", "is", "not", "very", "good", "at", "all"]
filtered = remove_stopwords(tokens)
# Result: ["product", "not", "very", "good"]

# With custom stopwords
tokens = ["product", "amazon", "quality", "price"]
custom_stops = ["amazon", "product"]
filtered = remove_stopwords(tokens, extra_stopwords=custom_stops)
# Result: ["quality", "price"]

# Without sentiment word preservation
tokens = ["this", "product", "is", "not", "very", "good"]
filtered = remove_stopwords(tokens, keep_sentiment_words=False)
# Result: ["product", "good"]

# Apply to DataFrame
df['filtered_tokens'] = df['review_tokens'].apply(
    lambda x: remove_stopwords(x, keep_sentiment_words=True)
)

Prerequisites

  • Requires NLTK library and stopwords corpus
  • Download required data: nltk.download('stopwords')
  • Input should be tokenized text (output from tokenize_text())

Design Rationale

  • Sentiment preservation: Crucial for sentiment analysis as negations and intensifiers affect meaning
  • Customization: Allows domain-specific stopword removal
  • Performance: Uses set operations for efficient filtering
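The performance point can be illustrated directly: membership tests against a set are O(1) on average, while a list requires a linear scan for every token, so converting the stopword collection to a set once pays off on large token streams. (Illustrative snippet; the stopword list shown is an assumption, not the repository's actual data.)

```python
stop_list = ['a', 'an', 'the', 'is', 'at', 'of', 'on']
stop_set = set(stop_list)  # built once; average O(1) membership tests

tokens = ['the', 'price', 'of', 'the', 'product', 'is', 'fair'] * 1000

# Both comprehensions produce the same result; the set version avoids
# rescanning the whole stopword collection for every token.
filtered_list = [t for t in tokens if t not in stop_list]
filtered_set = [t for t in tokens if t not in stop_set]
assert filtered_list == filtered_set
```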