Z OLD Homework 07 - james-bern/CS136 GitHub Wiki

TODO (Jim): Test case that makes frequencies aren't all 1 (being overwritten incorrectly).

README (Before Lab)

Description

Imagine you want a computer to write you a book, one char at a time. Here is one way that could work.

  • You provide the computer with a String seed; which is how you want the book to start. For example, String seed = "four sco"; In this case, the length of the seed is 8 (the space counts).
  • The computer looks at this seed and asks, Hm, what char would I expect to follow "four sco"? Luckily, the computer has been reading like A LOT, and in its experience, 90% of the time "four sco" is followed by 'r' (four score and seven years ago...) and 10% of the time "four sco" is followed by 'n' (...four scones, four small coffees, for Jim.). In this example, the computer has never seen "four sco" followed by any character other than an 'r' or and 'n'. What now shall it do?
  • The computer could just choose the more likely character ('r'), but that would mean our book is never ever going to get to talk about those four scones, which would be heartbreaking. But wait! Instead of just choosing the most likely character every time, the computer can use ✨probability✨. In this particular case, the computer should choose 'r' 90% of the time and choose 'n' 10% of the time. For the sake of fun (and scones), let's say we get 'n'.
  • The computer appends 'n' to the book it's writing (which starts with the seed), so the book it's writing now says "four scon". Wow!
  • Next, the computer examines the last 8 characters of the book it's writing, and asks Hm, what character do I expect to follow "our scon"?
  • The process continues, and in no time at all, you could have "four scon-mediated cKO approach transforms object is a vertical matrix of NSFormCell objects" A masterpiece!

Functions

buildFrequencyTables(...)

  • String sourceText; is the text the computer read beforehand. It could be the entirety of Moby Dick, or it could literally just be the string "bananas". Remember that song Hollaback Girl? I sure do.
  • The process of the computer "reading" works like this. The computer samples the source text by moving a window of windowLength many characters along the source text. For example, if sourceText = "bananas" and windowLength = 3, then our first substring would be "ban", our second window substring would be "ana", our third substring would be "nan", and so on. Here's a picture:
    > bananas
    > ---
    
    > bananas
    >  ---
    
    > bananas
    >   ---
    
    > bananas
    >    ---
    
  • As the computer moves the window, it pays special attention to what character is immediately to the right of the substring currently in our window. Our first substring "ban" is followed by 'a', our second substring "ana" is followed by 'n', our third substring "nan" is followed by 'a', and so on. Here's a picture:
    > bananas
    > ---^
    
    > bananas
    >  ---^
    
    > bananas
    >   ---^
    
    > bananas
    >    ---^
    
  • The computer stores all this data in a table of tables that we can call frequencyTables.
    • Or HashMap<String, HashMap<Character, Integer>> frequencyTables to be precise.
      • It looks intimidating, but it's really not so bad! frequencyTables.get("ban").get('a') is the number of times the substring "ban" was followed by the character 'a' in the source text.
    > bananas
    > ---^
    
    frequencyTables = {ban={a=1}}
    
    > bananas
    >  ---^
    
    frequencyTables = {ana={n=1}, ban={a=1}}
    
    > bananas
    >   ---^
    
    frequencyTables = {ana={n=1}, nan={a=1}, ban={a=1}}
    
    > bananas
    >    ---^
    
    frequencyTables = {ana={s=1, n=1}, nan={a=1}, ban={a=1}}
    
  • ✨ Note: In practice, we must make our sampling "wrap around."
    • The easiest way to accomplish this is to modify the source text by appending its first windowLength characters to its end.
      • sourceText += sourceText.substring(0, windowLength);
        • For a window length of 3, this simple pre-processing step would turn "bananas" into "bananasban". We would actually create the frequency table for this modified string.
    • The reason we need to do this pre-processing step is very subtle. We will talk about it later on in the README in the context of a concrete example.

sampleFrequencyTable(...)

Imagine randomly sampling a character from the frequency table {a=5, b=3, z=2}. There are 10 = (5 + 3 + 2) total occurrences of characters in the table. The function sampleFrequencyTable(...), called on this particular frequency table, should return 'a' 5/10 times, 'b' 3/10 times, and 'z' 2/10 times.

A naive solution would be to actually build an array of characters ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'z', 'z'], and then choose a random int i uniformly from the interval [0, array.length), i.e., one of these numbers { 0, 1, 2, ..., array.length - 1 }. Then we just return array[i]. πŸ™‚

This naive solution will work, but unfortunately it's reaaaaally inefficient (especially in terms of space). It is so inefficient that it won't score points. πŸ˜”

✨ A better solution is as follows.

We will still choose the random int i, but we will figure out which char it corresponds to without building an array.

  a a a a a b b b z z
i 0 1 2 3 4 5 6 7 8 9 10
            ^     ^   ^
            |     |   | 
            |     |   cutoff for z
            |     |
            |     cutoff for b
            |
            cutoff for a

For this example...

  • if i < 5, then return 'a'
  • if i < 8 = (5 + 3), then return 'b'
  • if i < 10 = (5 + 3 + 2), then return 'z'
βœ¨πŸ‘€HINT (pseudocode) -- Click the triangle to show hint.
// if the frequency table is empty, something went wrong earlier!
assert frequency table is NOT empty

precompute (find beforehand) totalNumberOfOccurences of all characters

// pick i randomly from { 0, 1, ..., (totalNumberOfOccurences - 1) }
i = getRandomInt(totalNumberOfOccurrences) 
cutoff = 0

foreach key in frequencyTable:
    cutoff += frequencyTable[key]
    if (i < cutoff):
        return key

assert false; // code should never get here
return 0; // Java requires me to have a return
βœ¨πŸ‘€HINT How to iterate over the keys of a HashMap.
for (KeyType key : map.keySet()) {
    ...
}

generateTextFromSeed(...)

  • Initialize result (the text being generated) with a seed.
  • Repeat the following, numberOfCharactersToGenerate times:
    1. Grab the last windowLength characters (suffix) of the generated text.
    2. Lookup the frequency table keyed by that suffix.
    3. Sample a character from that frequency table.
    4. Append (add) that character to the generated text.
ban
---^

> frequencyTables["ban"] = {a=1}
> randomly chose character 'a' 

bana
 ---^

> frequencyTables["ana"] = {s=1, n=1}
> randomly chose character 'n' 

banan

Complete Example with Details

πŸ‘€ Click the eyes to show.
  • sourceText = "bananas"; (which we modify to "wrap around", as described above)
  • windowLength = 3;
  • seed = "ban";
  • numberOfCharactersToGenerate = 8;

Building Frequency Tables

> bananasban
> ---^

frequencyTables = {ban={a=1}}

> bananasban
>  ---^

frequencyTables = {ana={n=1}, ban={a=1}}

> bananasban
>   ---^

frequencyTables = {ana={n=1}, nan={a=1}, ban={a=1}}

> bananasban
>    ---^

frequencyTables = {ana={s=1, n=1}, nan={a=1}, ban={a=1}}

> bananasban
>     ---^

frequencyTables = {nas={b=1}, ana={s=1, n=1}, nan={a=1}, ban={a=1}}

> bananasban
>      ---^

frequencyTables = {nas={b=1}, asb={a=1}, ana={s=1, n=1}, nan={a=1}, ban={a=1}}

> bananasban
>       ---^

frequencyTables = {nas={b=1}, asb={a=1}, sba={n=1}, ana={s=1, n=1}, nan={a=1}, ban={a=1}}

Generating Text

ban
---^

> frequencyTables["ban"] = {a=1}
> randomly chose character 'a' 

bana
 ---^

> frequencyTables["ana"] = {s=1, n=1}
> randomly chose character 'n' 

banan
  ---^

> frequencyTables["nan"] = {a=1}
> randomly chose character 'a' 

banana
   ---^

> frequencyTables["ana"] = {s=1, n=1}
> randomly chose character 'n' 

bananan
    ---^

> frequencyTables["nan"] = {a=1}
> randomly chose character 'a' 

bananana
     ---^

> frequencyTables["ana"] = {s=1, n=1}
> randomly chose character 's' 

banananas
      ---^

> frequencyTables["nas"] = {b=1}
> randomly chose character 'b' 

banananasb
       ---^

> frequencyTables["asb"] = {a=1}
> randomly chose character 'a' 

banananasba

"Wrap-Around" Explanation

Imagine we didn't "wrap-around" when sampling, but instead just dragged our window across "bananas" like this:

> bananas
> ---^

frequencyTables = {ban={a=1}}

> bananas
>  ---^

frequencyTables = {ana={n=1}, ban={a=1}}

> bananas
>   ---^

frequencyTables = {ana={n=1}, nan={a=1}, ban={a=1}}

> bananas
>    ---^

frequencyTables = {ana={s=1, n=1}, nan={a=1}, ban={a=1}}

Now, imagine generating some text from this "non-wrap-around" frequencyTables.

ban
---^

> frequencyTables["ban"] = {a=1}
> randomly chose character 'a' 

bana
 ---^

> frequencyTables["ana"] = {s=1, n=1}
> randomly chose character 'n' 

banan
  ---^

> frequencyTables["nan"] = {a=1}
> randomly chose character 'a' 

banana
   ---^

> frequencyTables["ana"] = {s=1, n=1}
> randomly chose character 'n' 

bananan
    ---^

> frequencyTables["nan"] = {a=1}
> randomly chose character 'a' 

bananana
     ---^

> frequencyTables["ana"] = {s=1, n=1}
> randomly chose character 's' 

banananas
      ---^

> frequencyTables["nas"] = null
Exception in thread "main" java.lang.NullPointerException
        at Main.sampleFrequencyTable(HW07.java:36)
        at Main.generateTextFromSeed(HW07.java:62)
        at Main.main(HW07.java:79)

Do you see what went wrong? Everything went fine until the text we were generating had suffix/window "nas". frequencyTables does not contain the key "nas"! This is because "nas" only shows up at the very end of our source text (no characters follow it).

The cute solution to this problem is to make the source text wrap-around like I talked about earlier. If the source text is a big ol' loop, then it has no (dead) end. πŸ™‚πŸ‘

Specification

  • To get a A:
    • βœ… Implement the provided functions in order to make a working text generator. Write code in main(...) to test each function as you go!
      • βœ… buildFrequencyTables(...)
        • 🚨 You MUST make the source text "wrap around" before building the frequency tables
          • ✨ explanation in the README
      • βœ… sampleFrequencyTable(...)
        • 🚨 You must NOT build any massive arrays of characters in order to do this
          • ✨ explanation in the README
      • βœ… generateTextFromSeed(...)
        • Feel free to use either the + operator for string concatenation, or the more efficient StringBuilder class.
    • βœ… Test your full pipeline in main(...).
      • βœ… Generate 100 characters from sourceText = "aaaabbbb" with windowLength = 3 and verify for yourself it looks "right."
        • 🚨 If you get aaaaaaaaaaaaaaaaaaaaaa..., then something is wrong.
      • βœ… Generate 1000 characters sourceText = readFileIntoString("HW07.java", false) with windowLength = 6 and verify for yourself it looks "right."
  • To get an S:
    • What we implemented in this homework could be called a character n-gram language model.
    • βœ… In a separate file, implement a word n-gram language model.
      • ✨ You will need to use/build your Google skills!
      • βœ… Include a boolean smoothing; to toggle (your choice of) smoothing on/off. ✨ For the simplest version of smoothing, you will need a list of all (or at least a bunch of) words in the English language (maybe just all the words from the source text?). You will want there to be a low but nonzero probability of choosing any of these words. (Do NOT store the whole English language in every frequency table. While it would work in principle, this approach would be VERY space-inefficient.)
      • βœ… Include an int k; to make your model a word k-skip-n-gram language model.
        • Note: k = 0; for no skipping
      • βœ… Train your model on a large text (like all of Moby Dick) of your choice and add a comment at the top of your code explaining what you actually observe about the effects of smoothing and skipping.

Submission Instructions

  • Put the output of the tests specified in the Rubric in a comment at the top of your code file.
    • 🚨 You must do this in order to receive points.
    • This is how you do a block comment in Java
      /*
      lorem
      ipsum
      dolor
      */
  • Upload your code file to GradeScope by the due date.
  • Go over your code with Jim or a Grading TA in Lab.

Starter Code

HW07.cpp

import java.util.*;
import java.nio.file.*;

class HW07 {
    static HashMap<String, HashMap<Character, Integer>> buildFrequencyTables(String sourceText, int windowLength) {
        HashMap<String, HashMap<Character, Integer>> result = new HashMap<>();
        // TODO
        return result;
    }

    static char sampleFrequencyTable(HashMap<Character, Integer> frequencyTable) {
        // TODO
        return 'X';
    }

    static String generateTextFromSeed(int windowLength, HashMap<String, HashMap<Character, Integer>> frequencyTables, String seed, int numberOfCharactersToGenerate) {
        String result = seed;
        // TODO
        // NOTE: This function will call sampleFrequencyTable(...)
        return result;
    }


    public static void main(String[] arguments) {
        String sourceText = "aaaabbbb";
        // String sourceText = "bananas";
        // String sourceText = "red leather yellow leather ";
        // String sourceText = readFileIntoString("HW07.java", false);

        int windowLength = 3;
        int numberOfCharactersToGenerate = 100;
        String seed = sourceText.substring(0, windowLength);
        HashMap<String, HashMap<Character, Integer>> frequencyTables = buildFrequencyTables(sourceText, windowLength);
        String generatedText = generateTextFromSeed(windowLength, frequencyTables, seed, numberOfCharactersToGenerate);
        System.out.println(generatedText);
    }

    
    // return a random int in range [0, upperBoundExclusive)
    // e.g., getRandomInt(10) returns a random number from { 0, 1, 2, ..., 9 }
    static int getRandomInt(int upperBoundExclusive) {
        return _random.nextInt(upperBoundExclusive);
    }
    
    // NOTE: As written, the behavior of getRandomInt(...)
    //       will be the same each time you run your program.
    //       To make it different each time...
    //       ...change "new Random(0)" -> "new Random()" below
    static Random _random = new Random(0);

    
    // Reads the contens of a file into a String and returns it.
    // removeExcessWhiteSpace=true condenses multiple spaces and newlines into a single space
    static String readFileIntoString(String fileName, boolean removeExcessWhiteSpace) {
        String result = null;
        try { result = new String(Files.readAllBytes(Paths.get(fileName))); }
        catch (Exception exception) { System.out.println(exception); assert false; }
        if (removeExcessWhiteSpace) { result = result.replaceAll("[ \n]+", " "); }
        return result;
    }
}
⚠️ **GitHub.com Fallback** ⚠️