HW07 - james-bern/CS136 GitHub Wiki

README

Background

Imagine you want a computer to write you a book.

Here is one way that could work, called a character n-gram language model:

  • You provide the computer with a String seed; which is how you want the book to start. For example, String seed = "four sco"; In this case, the length of the seed is 8 (the space counts).
  • The computer looks at this seed and asks, Hm, what char would I expect to follow "four sco"? Luckily, the computer has been reading like A LOT, and in its experience, 90% of the time "four sco" is followed by 'r' (four score and seven years ago...) and 10% of the time "four sco" is followed by 'n' (...four scones, four small coffees, for Jim.). In this example, the computer has never seen "four sco" followed by any character other than an 'r' or and 'n'. What now shall it do?
  • The computer could just choose the more likely character ('r'), but that would mean our book is never ever going to get to talk about those four scones, which would be heartbreaking. But wait! Instead of just choosing the most likely character every time, the computer can use ✨probability✨. In this particular case, the computer should choose 'r' 90% of the time and choose 'n' 10% of the time. For the sake of fun (and scones), let's say we get 'n'.
  • The computer appends 'n' to the book it's writing (which starts with the seed), so the book it's writing now says "four scon". Wow!
  • Next, the computer examines the last 8 characters of the book it's writing, and asks Hm, what character do I expect to follow "our scon"?
  • The process continues, and in no time at all, you could have "four scon-mediated cKO approach transforms object is a vertical matrix of NSFormCell objects" A masterpiece!

Java

Iterating over the keys in a HashMap

// NOTE: The keys can be in any order
for (KeyType key : map.keySet()) {
    ...
}

+ vs. StringBuilder

TODO

Writing to a file

TODO

Pipeline

buildFrequencyTables(String sourceText, int windowLength)

  • The sourceText is the text the computer read beforehand. It could be the entirety of Moby Dick, or it could literally just be the string "bananas". Remember that song Hollaback Girl? I sure do.

  • The computer samples the source text by moving a window of windowLength many characters along the source text.

    • For example, if sourceText = "bananas" and windowLength = 2, then our first window would be "ba", our second window would be "an", our third window would be "na", and so on.
    > bananasba
    > --
    
    > bananasba
    >  --
    
    > bananasba
    >   --
    
  • As the computer moves the window, it pays special attention to what character is immediately to the right of the substring currently in our window.

    • For example, if sourceText = "bananas" and windowLength = 2, then our first substring "ba" is followed by 'n', our second substring "an" is followed by 'a', our third substring "na" is followed by 'n', and so on. Here's a picture:
    > bananasba
    > --^
    
    > bananasba
    >  --^
    
    > bananasba
    >   --^
    
  • The computer stores all this data in frequencyTables

    • Specifically, HashMap<String, HashMap<Character, Integer>> frequencyTables
    • This looks intimidating, but it's really not so bad!
      • frequencyTables.get("ba").get('n') is the number of times "ba" was followed by 'n' in the source text
  • ✨ We have to make our source text wrap around like a circle (see Appendix for explanation; this reading is non-optional)

    • The easiest way to accomplish this is to modify the source text by appending its first windowLength characters to its end.

sampleFrequencyTable(HashMap<Character, Integer> frequencyTable)

  • Imagine randomly sampling a character from the frequency table {a=5, b=3, z=2}.

    • There are 10 = (5 + 3 + 2) total occurrences of characters in the table.
    • sampleFrequencyTable({a=5, b=3, z=2}) should return:
      • 'a' 5/10 = 50% of the time
      • 'b' 3/10 = 30% of the time
      • 'z' 2/10 = 20% of the time
  • A naive solution would be to actually build an array of characters ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'z', 'z'], and then choose a random int i uniformly from the interval [0, array.length), i.e., one of these numbers { 0, 1, 2, ..., array.length - 1 }. Then we just return array[i]. 🙂

    • This naive solution will work, but unfortunately it's reaaaaally inefficient (especially in terms of space). It is so inefficient that it won't score points. 😔
  • ✨ A better solution is to figure out which char i corresponds to without ever actually building an array.

    • if i < 5, then return 'a'
    • if i < 8 = (5 + 3), then return 'b'
    • if i < 10 = (5 + 3 + 2), then return 'z'
      a a a a a b b b z z
    i 0 1 2 3 4 5 6 7 8 9 10
                ^     ^   ^
                |     |   | 
                |     |   cutoff for z
                |     |
                |     cutoff for b
                |
                cutoff for a
    

generateTextFromSeed(String seed, int numberOfCharactersToGenerate, ...)

  • Initialize result (the text being generated) with a seed.
  • Repeat the following, numberOfCharactersToGenerate times:
    1. Grab the last windowLength characters (suffix) of the generated text.
    2. Lookup the frequency table keyed by that suffix.
    3. Sample a character from that frequency table.
    4. Append (add) that character to the generated text.

Examples

buildFrequencyTables("bananas", 2)
--- BEGIN BUILDING FREQUENCY TABLES ----------
+++ sourceText = "bananas"
+++ windowLength = 2

> bananasba
> --^

frequencyTables = {ba={n=1}}

> bananasba
>  --^

frequencyTables = {an={a=1}, ba={n=1}}

> bananasba
>   --^

frequencyTables = {na={n=1}, an={a=1}, ba={n=1}}

> bananasba
>    --^

frequencyTables = {na={n=1}, an={a=2}, ba={n=1}}

> bananasba
>     --^

frequencyTables = {na={s=1, n=1}, an={a=2}, ba={n=1}}

> bananasba
>      --^

frequencyTables = {as={b=1}, na={s=1, n=1}, an={a=2}, ba={n=1}}

> bananasba
>       --^

frequencyTables = {as={b=1}, na={s=1, n=1}, an={a=2}, ba={n=1}, sb={a=1}}

--- END BUILDING FREQUENCY TABLES ------------
generateTextFromSeed("ba", 8, ...)
--- BEGIN GENERATING TEXT FROM SEED ----------
+++ seed = "ba"
+++ numberOfCharactersToGenerate = 8

ba
--^

> frequencyTables["ba"] = {n=1}
> randomly chose character 'n' 

ban
 --^

> frequencyTables["an"] = {a=2}
> randomly chose character 'a' 

bana
  --^

> frequencyTables["na"] = {s=1, n=1}
> randomly chose character 's' 

banas
   --^

> frequencyTables["as"] = {b=1}
> randomly chose character 'b' 

banasb
    --^

> frequencyTables["sb"] = {a=1}
> randomly chose character 'a' 

banasba
     --^

> frequencyTables["ba"] = {n=1}
> randomly chose character 'n' 

banasban
      --^

> frequencyTables["an"] = {a=2}
> randomly chose character 'a' 

banasbana
       --^

> frequencyTables["na"] = {s=1, n=1}
> randomly chose character 's' 

banasbanas

--- END GENERATING TEXT FROM SEED ------------

TODO's

  • A-
    • Implement a working text generator.
      • Implement buildFrequencyTables(...)
      • Implement sampleFrequencyTable(...)
      • Implement generateTextFromSeed(...)
    • Test your implementation
      • Test buildFrequencyTables(...) and generateTextFromSeed(...) on the (sourceText = "bananas", windowLength = 2) example
      • Test your full pipeline in main(...)
        • Generate 100 characters from sourceText = "aaaabbbb" with windowLength = 3 and verify for yourself it looks "right"
          • NOTE: If you get aaaaaaaaaaaaaaaaaaaaaa..., then something is wrong.
        • Generate 1000 characters sourceText = readFileIntoString("HW07.java", false) with windowLength = 6 and verify for yourself it looks "right"
  • A
    • Use StringBuilder instead of the +
    • Test sampleFrequencyTable(...) on the {a=5, b=3, z=2} example by generating a million samples and graphing them in Excel (or similar) using a bar chart
      • 💁 Include this graph in your Glow submission
  • A+
    • NOTE: What we implemented in this homework could be called a character n-gram language model.
    • In a separate file, implement a word n-gram language model.
      • ✨ You will need to use/build your Google skills!
      • Include a boolean smoothing; to toggle (your choice of) smoothing on/off. ✨ For the simplest version of smoothing, you will need a list of all (or at least a bunch of) words in the English language (maybe just all the words from the source text?). You will want there to be a low but nonzero probability of choosing any of these words. (Do NOT store the whole English language in every frequency table. While it would work in principle, this approach would be VERY space-inefficient.)
      • Include an int k; to make your model a word k-skip-n-gram language model.
        • Note: k = 0; for no skipping
      • Train your model on a large text (like all of Moby Dick) of your choice and add a comment at the top of your code explaining what you actually observe about the effects of smoothing and skipping.

NOTE's

  • You MUST make your source text "wrap around" before building the frequency tables
  • You may NOT build any massive arrays of characters when sampleing a frequency tabled
  • You may NOT build any massive arrays of characters when sampleing a frequency tabled

Starter Code

HW07.cpp

import java.util.*;
import java.nio.file.*;

class HW07 {
    static HashMap<String, HashMap<Character, Integer>> buildFrequencyTables(String sourceText, int windowLength) {
        HashMap<String, HashMap<Character, Integer>> result = new HashMap<>();
        // TODO
        return result;
    }

    static char sampleFrequencyTable(HashMap<Character, Integer> frequencyTable) {
        // TODO
        return 'X';
    }

    static String generateTextFromSeed(String seed, int numberOfCharactersToGenerate, ...) {
        String result = seed;
        // TODO
        // NOTE: This function will call sampleFrequencyTable(...)
        return result;
    }


    public static void main(String[] arguments) {
        String sourceText = "aaaabbbb";
        // String sourceText = "bananas";
        // String sourceText = "red leather yellow leather ";
        // String sourceText = readFileIntoString("HW07.java", false);

        int windowLength = 3;
        int numberOfCharactersToGenerate = 100;
        String seed = sourceText.substring(0, windowLength);
        HashMap<String, HashMap<Character, Integer>> frequencyTables = buildFrequencyTables(sourceText, windowLength);
        String generatedText = generateTextFromSeed(windowLength, frequencyTables, seed, numberOfCharactersToGenerate);
        System.out.println(generatedText);
    }

    
    // return a random int in range [0, upperBoundExclusive)
    // e.g., getRandomInt(10) returns a random number from { 0, 1, 2, ..., 9 }
    static int getRandomInt(int upperBoundExclusive) {
        return _random.nextInt(upperBoundExclusive);
    }
    
    // NOTE: As written, the behavior of getRandomInt(...)
    //       will be the same each time you run your program.
    //       To make it different each time...
    //       ...change "new Random(0)" -> "new Random()" below
    static Random _random = new Random(0);

    
    // Reads the contens of a file into a String and returns it.
    // removeExcessWhiteSpace=true condenses multiple spaces and newlines into a single space
    static String readFileIntoString(String fileName, boolean removeExcessWhiteSpace) {
        String result = null;
        try { result = new String(Files.readAllBytes(Paths.get(fileName))); }
        catch (Exception exception) { System.out.println(exception); assert false; }
        if (removeExcessWhiteSpace) { result = result.replaceAll("[ \n]+", " "); }
        return result;
    }
}

Hints

Make source text wrap around ```java sourceText += sourceText.substring(0, windowLength); ```
  • Example: For a window length of 2, this turns "bananas" into "bananasba".
sampleFrequencyTable(...)
// if the frequency table is empty, something went wrong earlier!
assert frequency table is NOT empty

precompute (find beforehand) totalNumberOfOccurences of all characters

// pick i randomly from { 0, 1, ..., (totalNumberOfOccurences - 1) }
i = getRandomInt(totalNumberOfOccurrences) 
cutoff = 0

foreach key in frequencyTable:
    cutoff += frequencyTable[key]
    if (i < cutoff):
        return key

ASSERT(false); // code should never get here
return 0; // Java requires me to have a return

Apendix

Explanation of why we make the source wrap-around

Imagine we didn't "wrap-around" when sampling, but instead just dragged our window across "bananas" like this:

--- BEGIN BUILDING FREQUENCY TABLES ----------
+++ sourceText = "bananas"
+++ windowLength = 2

> bananas
> --^

frequencyTables = {ba={n=1}}

> bananas
>  --^

frequencyTables = {an={a=1}, ba={n=1}}

> bananas
>   --^

frequencyTables = {na={n=1}, an={a=1}, ba={n=1}}

> bananas
>    --^

frequencyTables = {na={n=1}, an={a=2}, ba={n=1}}

> bananas
>     --^

frequencyTables = {na={s=1, n=1}, an={a=2}, ba={n=1}}

--- END BUILDING FREQUENCY TABLES ------------

Now, imagine generating some text from this "non-wrap-around" frequencyTables.

--- BEGIN GENERATING TEXT FROM SEED ----------
+++ seed = "ba"
+++ numberOfCharactersToGenerate = 8

ba
--^

> frequencyTables["ba"] = {n=1}
> randomly chose character 'n' 

ban
 --^

> frequencyTables["an"] = {a=2}
> randomly chose character 'a' 

bana
  --^

> frequencyTables["na"] = {s=1, n=1}
> randomly chose character 's' 

banas
   --^

> frequencyTables["as"] = null
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "java.util.HashMap.keySet()" because "<parameter1>" is null
        at HW07A.sampleFrequencyTable(HW07A.java:50)
        at HW07A.generateTextFromSeed(HW07A.java:87)
        at HW07A.main(HW07A.java:116)

Do you see what went wrong? Everything went fine until the text we were generating had suffix/window "as". frequencyTables does not contain the key "as"! This is because "as" only shows up at the very end of our source text (no characters follow it).

The cute solution to this problem is to make the source text wrap-around like I talked about earlier. If the source text is a big ol' loop, then it has no (dead) end. 🙂👍

⚠️ **GitHub.com Fallback** ⚠️