Analyzing Word Frequency

Unit 4 Session 2 Standard

U-nderstand

Understand what the interviewer is asking for by using test cases and questions about the problem.

  • Q: What is the goal of the problem?
    • A: The goal is to analyze a given text to determine the frequency of each unique word and identify the most frequent word(s).
  • Q: What are the inputs?
    • A: The input is a string of text.
  • Q: What are the outputs?
    • A: The output is a pair of values: a dictionary mapping each word to its frequency, and a list of the most frequent word(s).
  • Q: How should text be processed?
    • A: Treat the text as case-insensitive and strip punctuation before counting, so that "Dog." and "dog" count as the same word.
  • Q: What if there is a tie for the most frequent word?
    • A: Return all words that have the highest frequency.

P-lan

Plan the solution with appropriate visualizations and pseudocode.

General Idea: Convert the text to lowercase and remove punctuation. Split the text into words and count the frequency of each word using a dictionary. Then, identify the word(s) with the highest frequency.

1) Convert the entire `text` to lowercase to ensure case insensitivity.
2) Remove punctuation from the text.
3) Split the `text` into individual words.
4) Initialize an empty dictionary `frequency_dict` to store word frequencies.
5) Iterate through the list of words:
   a) If the word is already in `frequency_dict`, increment its count.
   b) If the word is not in `frequency_dict`, add it with a count of 1.
6) Determine the maximum frequency in `frequency_dict`.
7) Initialize a list `most_frequent_words` to store words with the highest frequency.
8) Iterate through `frequency_dict` and add words with the maximum frequency to `most_frequent_words`.
9) Return `frequency_dict` and `most_frequent_words`.
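
Steps 1 to 3 can also be sketched with standard-library helpers. This is a minimal sketch rather than the guide's own implementation: the helper name `normalize_words` is hypothetical, and it assumes only ASCII punctuation needs removing, since `string.punctuation` does not cover characters such as curly quotes.

```python
import string

def normalize_words(text):
    """Lowercase the text, delete ASCII punctuation, and split on whitespace."""
    # Translation table that deletes every ASCII punctuation character
    remove_punct = str.maketrans('', '', string.punctuation)
    return text.lower().translate(remove_punct).split()

print(normalize_words("Stay connected. Stay productive!"))
# ['stay', 'connected', 'stay', 'productive']
```

The resulting list can be passed straight into the counting loop in step 5.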

**⚠️ Common Mistakes**

- Not handling punctuation correctly, leading to incorrect word counts.
- Forgetting to account for case insensitivity when counting word frequencies.
- Not correctly identifying all words with the highest frequency in case of ties.
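
To see why the first two mistakes matter, consider a counter that skips normalization. The function name `naive_word_count` is hypothetical and exists only to illustrate the problem.

```python
def naive_word_count(text):
    """Count whitespace-separated tokens with no lowercasing or punctuation removal."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

print(naive_word_count("The dog saw the dog."))
# {'The': 1, 'dog': 1, 'saw': 1, 'the': 1, 'dog.': 1}
```

Here "The" and "the" are counted separately, as are "dog" and "dog.", so every token appears once even though two words actually occur twice.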

I-mplement

```python
def word_frequency_analysis(text):
    # Lowercase the text, then keep only letters, digits, and whitespace
    # (this drops punctuation without relying on helper libraries)
    text = text.lower()
    clean_text = ''
    for char in text:
        if char.isalnum() or char.isspace():
            clean_text += char

    # Split the cleaned text into words on whitespace
    words = clean_text.split()

    # Dictionary mapping each word to its frequency
    frequency_dict = {}

    for word in words:
        if word in frequency_dict:
            frequency_dict[word] += 1
        else:
            frequency_dict[word] = 1

    # Track the maximum frequency and every word that reaches it in one pass,
    # without using max()
    max_frequency = -1
    most_frequent_words = []

    for word, freq in frequency_dict.items():
        if freq > max_frequency:
            # New highest count: restart the list with this word
            max_frequency = freq
            most_frequent_words = [word]
        elif freq == max_frequency:
            # Tie with the current highest count: keep both words
            most_frequent_words.append(word)

    return frequency_dict, most_frequent_words
```

Example Usage:

```python
text = "The quick brown fox jumps over the lazy dog. The dog was not amused."
print(word_frequency_analysis(text))
# Output: ({'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 2, 'was': 1, 'not': 1, 'amused': 1}, ['the'])

text_2 = "Digital nomads love to travel. Travel is their passion."
print(word_frequency_analysis(text_2))
# Output: ({'digital': 1, 'nomads': 1, 'love': 1, 'to': 1, 'travel': 2, 'is': 1, 'their': 1, 'passion': 1}, ['travel'])

text_3 = "Stay connected. Stay productive. Stay happy."
print(word_frequency_analysis(text_3))
# Output: ({'stay': 3, 'connected': 1, 'productive': 1, 'happy': 1}, ['stay'])
```
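
None of the examples above exercises a tie or an empty input. The sketch below covers both using `collections.Counter` as an alternative to the manual dictionary loop. It is not the guide's solution: the function name `word_frequency_counter` is hypothetical, and it assumes built-ins such as `max()` are allowed.

```python
from collections import Counter

def word_frequency_counter(text):
    """Alternative sketch: same cleaning rule, with counting delegated to Counter."""
    # Keep only letters, digits, and whitespace from the lowercased text
    cleaned = ''.join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    counts = Counter(cleaned.split())

    # No words at all: return empty results instead of calling max() on nothing
    if not counts:
        return {}, []

    top = max(counts.values())
    # Collect every word that reaches the highest count, so ties are preserved
    most_frequent = [word for word, count in counts.items() if count == top]
    return dict(counts), most_frequent

print(word_frequency_counter("Work hard. Play hard. Work smart."))
# ({'work': 2, 'hard': 2, 'play': 1, 'smart': 1}, ['work', 'hard'])

print(word_frequency_counter(""))
# ({}, [])
```

Both "work" and "hard" appear twice, so both are returned, in the order they first appear in the text.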