Analyzing Word Frequency - codepath/compsci_guides GitHub Wiki
Unit 4 Session 2 Standard (Click for link to problem statements)
U-nderstand
Understand what the interviewer is asking for by using test cases and questions about the problem.
- Q: What is the goal of the problem?
- A: The goal is to analyze a given text to determine the frequency of each unique word and identify the most frequent word(s).
- Q: What are the inputs?
- A: The input is a string of text.
- Q: What are the outputs?
- A: The output is a pair: a dictionary mapping each unique word to its frequency, and a list of the most frequent word(s).
- Q: How should text be processed?
- A: The text should be treated as case-insensitive, and punctuation should be ignored.
- Q: What if there is a tie for the most frequent word?
- A: Return all words that have the highest frequency.
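For instance, with made-up counts where two words share the top frequency (the variable names here are illustrative, not part of the final solution), every top word should be returned:

```python
# Illustrative only: frequencies with a tie for the highest count
frequency_dict = {'cats': 2, 'dogs': 2, 'fish': 1}
max_frequency = max(frequency_dict.values())
most_frequent_words = [word for word, freq in frequency_dict.items() if freq == max_frequency]
print(most_frequent_words)  # ['cats', 'dogs']
```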
P-lan
Plan the solution with appropriate visualizations and pseudocode.
General Idea: Convert the text to lowercase and remove punctuation. Split the text into words and count the frequency of each word using a dictionary. Then, identify the word(s) with the highest frequency.
1) Convert the entire `text` to lowercase to ensure case insensitivity.
2) Remove punctuation from the text.
3) Split the `text` into individual words.
4) Initialize an empty dictionary `frequency_dict` to store word frequencies.
5) Iterate through the list of words:
a) If the word is already in `frequency_dict`, increment its count.
b) If the word is not in `frequency_dict`, add it with a count of 1.
6) Determine the maximum frequency in `frequency_dict`.
7) Initialize a list `most_frequent_words` to store words with the highest frequency.
8) Iterate through `frequency_dict` and add words with the maximum frequency to `most_frequent_words`.
9) Return `frequency_dict` and `most_frequent_words`.
**⚠️ Common Mistakes**
- Not handling punctuation correctly, leading to incorrect word counts.
- Forgetting to account for case insensitivity when counting word frequencies.
- Not correctly identifying all words with the highest frequency in case of ties.
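The case-insensitivity mistake is easy to reproduce: counting words without lowercasing first splits what should be one count across two keys (a quick sketch, not part of the solution):

```python
# Without lowercasing first, 'The' and 'the' are counted as different words
words = "The dog chased the cat".split()
naive_counts = {}
for word in words:
    naive_counts[word] = naive_counts.get(word, 0) + 1
print(naive_counts)  # {'The': 1, 'dog': 1, 'chased': 1, 'the': 1, 'cat': 1}
```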
I-mplement
```python
def word_frequency_analysis(text):
    # Convert the text to lowercase and remove punctuation manually
    text = text.lower()
    clean_text = ''
    for char in text:
        if char.isalnum() or char.isspace():
            clean_text += char

    # Split the text into words
    words = clean_text.split()

    # Dictionary to store word frequencies
    frequency_dict = {}
    for word in words:
        if word in frequency_dict:
            frequency_dict[word] += 1
        else:
            frequency_dict[word] = 1

    # Find the maximum frequency without using max()
    max_frequency = -1
    most_frequent_words = []
    for word, freq in frequency_dict.items():
        if freq > max_frequency:
            max_frequency = freq
            most_frequent_words = [word]
        elif freq == max_frequency:
            most_frequent_words.append(word)

    return frequency_dict, most_frequent_words
```
Example Usage:

```python
text = "The quick brown fox jumps over the lazy dog. The dog was not amused."
print(word_frequency_analysis(text))
# Output: ({'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 2, 'was': 1, 'not': 1, 'amused': 1}, ['the'])

text_2 = "Digital nomads love to travel. Travel is their passion."
print(word_frequency_analysis(text_2))
# Output: ({'digital': 1, 'nomads': 1, 'love': 1, 'to': 1, 'travel': 2, 'is': 1, 'their': 1, 'passion': 1}, ['travel'])

text_3 = "Stay connected. Stay productive. Stay happy."
print(word_frequency_analysis(text_3))
# Output: ({'stay': 3, 'connected': 1, 'productive': 1, 'happy': 1}, ['stay'])
```
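If library helpers are allowed, the same analysis can be sketched more compactly with `collections.Counter` (an alternative approach, not the required solution; `word_frequency_analysis_counter` is a hypothetical name chosen to avoid clashing with the version above):

```python
from collections import Counter

def word_frequency_analysis_counter(text):
    # Keep only letters, digits, and spaces; lowercase for case insensitivity
    clean_text = ''.join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    frequency_dict = Counter(clean_text.split())
    # default=0 guards against empty input
    max_frequency = max(frequency_dict.values(), default=0)
    most_frequent_words = [word for word, freq in frequency_dict.items() if freq == max_frequency]
    return dict(frequency_dict), most_frequent_words

print(word_frequency_analysis_counter("Stay connected. Stay productive. Stay happy."))
# Output: ({'stay': 3, 'connected': 1, 'productive': 1, 'happy': 1}, ['stay'])
```

Note that `Counter` only handles the counting step; the cleanup and tie-breaking logic are the same as in the manual version.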