[function] remove_stopwords() - P3chys/textmining GitHub Wiki
remove_stopwords()
Purpose
Filters out stopwords (common words with little semantic meaning) from tokenized text while preserving sentiment-important words for text mining analysis.
Syntax
```python
remove_stopwords(tokens, extra_stopwords=None, keep_sentiment_words=True)
```
Parameters
Parameter | Type | Default | Description
---|---|---|---
`tokens` | list | Required | List of word tokens to filter
`extra_stopwords` | list/set or None | None | Additional stopwords to remove beyond standard English stopwords
`keep_sentiment_words` | bool | True | Whether to preserve sentiment-bearing stopwords like "not", "very"
Returns
- Type: list
- Content: Filtered list of tokens with stopwords removed
Required Import
```python
from nltk.corpus import stopwords
```
Processing Logic
- Base Stopwords: Uses NLTK's English stopwords set (articles, prepositions, etc.)
- Custom Additions: Merges additional stopwords if provided via `extra_stopwords`
- Sentiment Preservation: Optionally retains sentiment-critical words that are normally treated as stopwords (`keep_sentiment_words=True`)

Sentiment Words Preserved (when `keep_sentiment_words=True`)
- Negation Words: 'no', 'not', 'nor', 'none', 'never', 'hardly', 'barely', 'scarcely', 'rarely', 'seldom'
- Intensifiers: 'very', 'really', 'quite', 'extremely', 'absolutely', 'completely', 'totally', 'utterly', 'entirely', 'somewhat'
Algorithm Steps
- Create base stopwords set from NLTK
- Add custom stopwords if provided
- Remove sentiment words from the stopwords set (if `keep_sentiment_words=True`)
- Filter tokens using a list comprehension
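The steps above can be sketched as follows. This is an illustrative reimplementation, not the repository's actual code: a small hardcoded `BASE_STOPWORDS` set stands in for NLTK's full English list so the sketch runs without the corpus download.

```python
# Illustrative sketch of remove_stopwords(); BASE_STOPWORDS is a small
# stand-in for set(nltk.corpus.stopwords.words('english')).
BASE_STOPWORDS = {"this", "is", "at", "all", "the", "a", "not", "very"}

SENTIMENT_WORDS = {
    # Negation words
    "no", "not", "nor", "none", "never", "hardly", "barely",
    "scarcely", "rarely", "seldom",
    # Intensifiers
    "very", "really", "quite", "extremely", "absolutely", "completely",
    "totally", "utterly", "entirely", "somewhat",
}

def remove_stopwords(tokens, extra_stopwords=None, keep_sentiment_words=True):
    # Step 1: create the base stopwords set
    stops = set(BASE_STOPWORDS)
    # Step 2: merge in custom stopwords, if any
    if extra_stopwords:
        stops |= set(extra_stopwords)
    # Step 3: carve sentiment-bearing words back out of the stopword set
    if keep_sentiment_words:
        stops -= SENTIMENT_WORDS
    # Step 4: filter with a list comprehension (set lookup is O(1))
    return [t for t in tokens if t not in stops]

tokens = ["this", "product", "is", "not", "very", "good", "at", "all"]
print(remove_stopwords(tokens))  # ['product', 'not', 'very', 'good']
```

Swapping `BASE_STOPWORDS` for the real NLTK set should recover the documented behavior.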
Usage Examples

```python
# Basic stopword removal
tokens = ["this", "product", "is", "not", "very", "good", "at", "all"]
filtered = remove_stopwords(tokens)
# Result: ["product", "not", "very", "good"]

# With custom stopwords
tokens = ["product", "amazon", "quality", "price"]
custom_stops = ["amazon", "product"]
filtered = remove_stopwords(tokens, extra_stopwords=custom_stops)
# Result: ["quality", "price"]

# Without sentiment word preservation
tokens = ["this", "product", "is", "not", "very", "good"]
filtered = remove_stopwords(tokens, keep_sentiment_words=False)
# Result: ["product", "good"]

# Apply to a DataFrame column of token lists
df['filtered_tokens'] = df['review_tokens'].apply(
    lambda x: remove_stopwords(x, keep_sentiment_words=True)
)
```
Prerequisites
- Requires the NLTK library and its stopwords corpus
- Download required data: `nltk.download('stopwords')`
- Input should be tokenized text (output from `tokenize_text()`)
Design Rationale
- Sentiment preservation: Crucial for sentiment analysis, since negations flip polarity and intensifiers change its strength ("not good" vs. "good")
- Customization: Allows domain-specific stopword removal
- Performance: Uses set operations for efficient filtering
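To illustrate the set-based filtering noted above: converting the stopword list to a set makes each membership check O(1) on average, so filtering stays linear in the number of tokens (the words here are a tiny illustrative sample, not NLTK's list).

```python
# Stopword filtering with a set: each `in` check is O(1) on average,
# versus O(n) for a list, so the comprehension stays linear overall.
stopword_list = ["the", "is", "at", "of", "on"]
stopword_set = set(stopword_list)  # one-time conversion

tokens = ["product", "is", "good", "the", "price"]
filtered = [t for t in tokens if t not in stopword_set]
print(filtered)  # ['product', 'good', 'price']
```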