Home - Stella2019/study GitHub Wiki

Preprocessing names dataset (cont'd)

Now you have a DataFrame with two columns containing the names with the start and end tokens appended. The next step is to encode these as numeric values because machine learning models only accept numeric inputs.

In this exercise, you'll create two dictionaries, char_to_idx and idx_to_char. The first maps characters to integers, e.g., {'\t': 0, '\n': 1, 'a': 2, 'b': 3, ...}, and the second holds the reverse mapping of integers to characters, e.g., {0: '\t', 1: '\n', 2: 'a', 3: 'b', ...}.
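A minimal sketch of how such a pair of dictionaries could be built, assuming a small hand-written set of characters (sample_vocabulary below is illustrative, not the real vocabulary of the dataset). Sorting the characters first makes the index assignment deterministic; because '\t' and '\n' sort before the letters, they receive indices 0 and 1, matching the example above.

```python
# Illustrative character set (an assumption); the exercise uses the real
# vocabulary computed from the names dataset instead.
sample_vocabulary = {'\t', '\n', 'a', 'b', 'c'}

# Sets are unordered, so sort to give every character a stable index.
sorted_chars = sorted(sample_vocabulary)

# Forward mapping: character -> integer index.
char_to_idx = {char: idx for idx, char in enumerate(sorted_chars)}

# Reverse mapping: integer index -> character.
idx_to_char = {idx: char for idx, char in enumerate(sorted_chars)}

print(char_to_idx)
print(idx_to_char)
```

The two dictionaries are inverses of each other, so idx_to_char[char_to_idx[c]] returns c for every character c in the set. This is what later allows predicted indices to be decoded back into text.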

The dataset is available in names_df.

We also defined a helper function get_vocabulary() that takes a list of words as input and returns the vocabulary, which is the set of all the characters that appear in the dataset. We used this function and saved the result in the variable vocabulary.
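The body of get_vocabulary() is not shown here, so the following is only one plausible way to write a helper with that behavior, not necessarily the version used in the exercise. The sample names are hypothetical and assume '\t' as the start token and '\n' as the end token.

```python
def get_vocabulary(words):
    """Return the set of all distinct characters found in `words`."""
    vocabulary = set()
    for word in words:
        # A string is iterable, so update() adds each of its characters.
        vocabulary.update(word)
    return vocabulary


# Hypothetical names, already wrapped in the assumed start/end tokens.
sample_names = ['\tanna\n', '\tbob\n']

vocabulary = get_vocabulary(sample_names)
print(sorted(vocabulary))
```

Because the result is a set, each character appears only once no matter how often it occurs in the names, and the start and end tokens become part of the vocabulary like any other character.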