# Data preparation
Code snippet:

```rust
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file_path: &str = "data/psycho-pass-s01e01-jp.srt";
    let raw_content: String = fs::read_to_string(file_path)?;

    // Normalise Windows line endings so that splitting on "\n\n" works.
    let normalised_raw_content: String = raw_content.replace("\r\n", "\n");

    let raw_content_split: Vec<&str> = split_into_raw_subtitle_units(&normalised_raw_content);
    let subtitles: Vec<&str> = raw_content_split
        .iter()
        .flat_map(|x| get_subtitles_from_unit(x))
        .collect();
    let subtitles_concat = subtitles.join("");

    println!("{subtitles_concat:?}");
    Ok(())
}

/// Subtitle units in an .srt file are separated by blank lines.
fn split_into_raw_subtitle_units(raw: &str) -> Vec<&str> {
    raw.split("\n\n").collect()
}

/// Skip the first two lines of a unit (index and timestamps),
/// keeping only the subtitle text.
fn get_subtitles_from_unit(subtitle_unit: &str) -> Vec<&str> {
    subtitle_unit.split('\n').skip(2).collect()
}
```
This returns a single, concatenated string of all the subtitles from the episode’s subtitle units. A subtitle unit is a grouping of index, timestamps and subtitle(s) in an .srt file. Below are some examples of subtitle units.
```
8
00:01:54,405 --> 00:01:56,157
一目 見て
分かったはずだ―

9
00:01:57,242 --> 00:01:59,994
2人は
初めて出会うより 以前から…―

10
00:02:00,161 --> 00:02:01,955
ああなる運命だったんだろう―
```
A single string of text is probably not the best idea if I want to perform tokenisation. The reason is that the last character of one unit could be mistakenly tokenised with the first character of the subsequent unit (since Japanese doesn’t use spaces for word separation).
In this case, however, we’re calculating metrics at the character level, not the word level. We’re thus able to get away with concatenating everything into one string.
For cleaning, we focus on the following:
- Characters that are enclosed within any kind of parentheses
- Unwanted characters (any character that is not kanji, hiragana or katakana)
- Small versions of hiragana and katakana – these need to be ‘upgraded’ to their regular sizes
We make a distinction between string-level cleaning (first bullet point) and character-level cleaning (second and third bullet points). I reckon I could write one function for each level of cleaning, then put these inside a public function that I’ll use in `main`.
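As a sketch of that structure (all function names below are placeholders of my own, not code from the repository), the public entry point might look something like this:

```rust
/// Hypothetical sketch of the two-level cleaning pipeline described above.
pub fn clean_subtitles(raw: &str) -> String {
    // String-level cleaning: drop anything enclosed in parentheses.
    let string_cleaned = remove_parenthesised_text(raw);

    // Character-level cleaning: drop blacklisted characters and
    // upgrade small kana to their full-sized forms.
    string_cleaned
        .chars()
        .filter(|&c| !is_blacklisted(c))
        .map(upgrade_small_kana)
        .collect()
}

// Placeholder implementations so the sketch compiles; the real logic
// is worked out in the sections below.
fn remove_parenthesised_text(s: &str) -> String { s.to_string() }
fn is_blacklisted(_c: char) -> bool { false }
fn upgrade_small_kana(c: char) -> char { c }
```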
## Characters within parentheses

There are two kinds of parentheses used in the subtitles:

- The regular kind: `(` and `)`
- The Japanese kind: `（` and `）`

Note that the latter parenthesis type, being fullwidth, creates a ‘box’ of space around each parenthesis.
The reason we don’t want anything within these parentheses is that the enclosed text is annotation rather than spoken dialogue: it conveys nothing that a viewer with subtitles disabled would receive. I notice two applications of parentheses (a sketch of the removal step follows the list):

- To indicate who’s speaking or making a sound
- To show the hiragana representation of a word to aid in pronunciation
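Here’s a minimal sketch of the string-level step. It handles both parenthesis types in one pass and assumes every opening parenthesis is eventually closed:

```rust
/// Drop any text enclosed in ASCII or fullwidth parentheses.
/// Assumes well-formed (eventually closed) parentheses.
fn remove_parenthesised_text(s: &str) -> String {
    let mut result = String::with_capacity(s.len());
    let mut depth: usize = 0;
    for c in s.chars() {
        match c {
            '(' | '（' => depth += 1,
            ')' | '）' => depth = depth.saturating_sub(1),
            _ if depth == 0 => result.push(c),
            _ => {} // inside parentheses: skip
        }
    }
    result
}
```

For example, `remove_parenthesised_text("（男）一目 見て")` would return `"一目 見て"`.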
## Unwanted characters

This is standard filtering to remove anything that isn’t kanji, hiragana or katakana. It’s also an important first step in preparing a document for tokenisation (something I’d like to try in future analyses of subtitles).
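For reference, one way to make ‘anything that isn’t kanji, hiragana or katakana’ concrete is a range check against the standard Unicode blocks. This is only a sketch, not the blacklist approach I describe next; note that the katakana block also contains marks like ー (the long-vowel bar), which may or may not belong on a blacklist:

```rust
/// Whether a character falls in the standard Japanese Unicode blocks.
fn is_japanese_letter(c: char) -> bool {
    matches!(
        c,
        '\u{3040}'..='\u{309F}' // hiragana
            | '\u{30A0}'..='\u{30FF}' // katakana
            | '\u{4E00}'..='\u{9FFF}' // CJK unified ideographs (kanji)
    )
}
```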
The first step is to create a character blacklist, and this in turn requires a bit of data exploration. My plan is to retrieve a collection of unique characters and sort them, before printing them out.
With sorting, I’m hoping that their Unicode representation separates punctuation from the actual characters. This way I won’t have to manually scan the collection of unique characters to pick out what I’m looking for.
It’d be easier for alphabet-based languages, as I could simply create a character whitelist and exclude everything outside said list.
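A minimal sketch of that exploration step, using a `BTreeSet` (which keeps its elements in sorted order, so no separate sorting pass is needed):

```rust
use std::collections::BTreeSet;

/// Collect every distinct character in the text, in Unicode order.
fn unique_sorted_chars(text: &str) -> Vec<char> {
    text.chars().collect::<BTreeSet<char>>().into_iter().collect()
}

fn main() {
    let sample = "2人は 初めて出会うより 以前から…―";
    // Spaces, digits and dashes sort below the kana and kanji blocks,
    // so candidates for the blacklist cluster at the front.
    println!("{:?}", unique_sorted_chars(sample));
}
```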
## Small kana

There are certain hiragana and katakana characters that have small versions of themselves. These are used to create digraphs and to indicate long vowel sounds (such as in exaggerated or exclamatory utterances).

At first I wasn’t sure whether to delete such characters or include them. I decided to include them (upgraded to their regular sizes) because they play a meaningful role in creating sounds, as written above.
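Finally, a sketch of the ‘upgrade’ itself. The mapping below covers a handful of pairs only; a full version would enumerate every small kana (roughly a dozen each for hiragana and katakana):

```rust
/// Map small kana to their full-sized counterparts; pass everything
/// else through unchanged. Partial mapping, for illustration only.
fn upgrade_small_kana(c: char) -> char {
    match c {
        // Hiragana
        'ぁ' => 'あ', 'ぃ' => 'い', 'ぅ' => 'う', 'ぇ' => 'え', 'ぉ' => 'お',
        'っ' => 'つ', 'ゃ' => 'や', 'ゅ' => 'ゆ', 'ょ' => 'よ',
        // Katakana
        'ァ' => 'ア', 'ィ' => 'イ', 'ゥ' => 'ウ', 'ェ' => 'エ', 'ォ' => 'オ',
        'ッ' => 'ツ', 'ャ' => 'ヤ', 'ュ' => 'ユ', 'ョ' => 'ヨ',
        other => other,
    }
}
```

With this, `upgrade_small_kana('ょ')` returns `'よ'`, so a word like きょう would be counted as three regular-sized characters.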