# AI-24sp-2024-04-25-Afternoon
Today you will have the choice of
- Continuing to improve your MNIST classifier
- Training a synthetic voice, applying neural networks and backpropagation to another domain (audio instead of visual). We will call this voice-cloning, which is a form of just-in-time finetuning.
If you wish to choose voice-cloning, continue with the instructions below.
In either case, you'll continue working on
- Tuning the model through hyperparameters to increase its success rate from the initial value
- Understanding the model as a file that you can transmit
If you have not done so already, attempt an answer to the ethics questions from AI Homework 04 in a dev diary entry, in particular:
What standards or precedents will you use to judge your participation in this lab from an ethical point-of-view?
Installing and using the TTS voice synthesis software is very resource-intensive and may cause GitPod to throttle your workspace (this happened to us when training our MNIST classifier, for example).
For example, installing the TTS python package may take more than 20 minutes on GitPod. For this reason, we ask you to work in pairs or triples to reduce our class's footprint on GitPod's cloud servers.
Thanks for your cooperation.
You can choose to finetune the text-to-speech model on your own voice, or on anyone's publicly available recording that you consider it ethical to use.
(Be sure to justify your choice in your dev diary entries above).
Record your voice with your smartphone. You can step out of the lab or go outside to get a cleaner recording. You may need to download a third-party app that can export a WAV file.
On iPhone, we've had good luck with an app called Voice Recorder, but we are not affiliated with them and there may be better choices.
A 5 minute recording is sufficient, but you may experiment with longer recordings after your initial attempt to see if it increases the quality.
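If your recording app will only export a compressed format such as M4A or MP3, that is fine: the `ffmpeg` tool installed later in this lab can convert it to WAV. A minimal sketch, where `recording.m4a`, the mono channel, and the 22.05 kHz sample rate are placeholder assumptions you can adjust:

```
# Convert a compressed phone recording to a mono 22.05 kHz WAV
# (recording.m4a is a placeholder name; adjust channels and sample rate as needed)
ffmpeg -i recording.m4a -ac 1 -ar 22050 recording.wav
```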
You may use the LiveRecorder Firefox plugin to record from YouTube.
Convert from webm to audio-only ogg:

```
ffmpeg -i your.webm -vn -acodec copy ./your.ogg
```
We discard the video portion because
- it consumes more SSD space
- TTS is an audio-only training algorithm

Do video-training algorithms exist? They must, for deepfakes on YouTube to exist.
- We're not as familiar with what open-source software exists for this, if any.
- A survey paper for future reading: https://arxiv.org/ftp/arxiv/papers/2311/2311.06329.pdf
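Depending on how your sox build was compiled, it may not be able to read the Opus audio that YouTube webm files usually contain. As a hedge, you can have ffmpeg decode straight to a WAV file instead and substitute it for `your.ogg` in the splitting command below; the mono and 22.05 kHz settings here are assumptions, not requirements:

```
# Decode the webm's audio track directly to a mono 22.05 kHz WAV
# (skip this if the ogg from the previous step works fine with sox)
ffmpeg -i your.webm -vn -ac 1 -ar 22050 your.wav
```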
Split the recording by silence. This automated splitting of files is, in our opinion, the killer app for command-line audio tools.

```
sox ./your.ogg your-.wav silence 1 0.2 0.5% 1 0.2 0.5% : newfile : restart
```
Having a bunch of short files is most useful for training a model from scratch, which we are not doing today for lack of time and GPU compute resources.
However, splitting the sound into many small WAV files lets you re-concatenate some together to create a file of the right length (5 minutes, or whatever).
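For example, sox will concatenate whatever list of input files you give it into a single output; the numbered filenames below are placeholders for whichever clips you pick from the split step:

```
# Join a few of the split clips back into one reference recording
sox your-001.wav your-002.wav your-003.wav your-combined.wav
# Check the total duration (in seconds) of the combined file
soxi -D your-combined.wav
```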
You can serve these WAV files with a Python server and use your web browser to download and preview these clips.

```
python3 -m http.server
```
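By default this serves the current directory on port 8000. If your clips live in a subfolder, you can point the server there instead (the `clips` path below is a placeholder):

```
# Serve only the folder holding the split WAV clips, still on port 8000
python3 -m http.server --directory clips
```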
Your team has the choice of either using GitPod or using the Coqui Docker image.
If you don't have your own laptop, or don't have Docker available, we recommend using GitPod for this example to be able to spend more time on voice cloning and fine-tuning, rather than setting up a Python environment on your own laptop.
At a GitPod command prompt, install the TTS Python package.

```
pip3 install TTS
```
Leave this install process running and begin reading the TTS guide for synthesizing voices and inference.
Additionally, you will need to run these three commands:

```
sudo apt update
sudo apt install libsndfile1 ffmpeg sox
export PATH=$PATH:~/.local/bin
```
- `libsndfile1` is a library for generating the output WAV file and synthesizing sound.
- `ffmpeg` is a tool for manipulating encoded media files, especially MP3 for compressed sound.
- `sox` is a command-line tool for automated sound editing. It includes `play`, a command-line sound player which is useful for playing the sound files you generate (mostly WAV).
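Once the pip install finishes and your PATH export has taken effect, a quick sanity check is to make sure the `tts` command is visible; the block below only confirms the CLI is on your PATH and prints its usage text.

```
# Confirm the tts CLI is reachable and responds
which tts
tts --help
```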
If you have your own laptop with Docker installed, you may choose the Docker method of installing TTS:
https://docs.coqui.ai/en/latest/docker_images.html#basic-inference
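As a rough sketch of what that looks like, the commands below follow our reading of the Coqui Docker instructions; treat the image name as an assumption and double-check it against the page linked above before relying on it:

```
# Pull and enter the CPU-only Coqui TTS image (image name assumed from the Coqui docs)
docker run --rm -it --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
# Inside the container, the same tts CLI used below should be available
tts --list_models
```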
While you're waiting for the pip packages or Docker images to install, read some usage tips to prepare for what you'll run after the installation is complete.
https://docs.coqui.ai/en/latest/inference.html
Also read about YourTTS, a zero-shot voice-cloning model for TTS that is designed to make finetuning effective.
https://github.com/TheEvergreenStateCollege/upper-division-cs/wiki/AI%E2%80%90Homework%E2%80%9004#human-writing
You can list all models like this:

```
tts --list_models
```

This will list all available models, which are roughly named according to `<category>/<language>/<dataset>/<model_name>`, where most of the models are in the category `tts_models`.
You'll see a long list of available models. The first time you use any of them, it will be downloaded.
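The list is long. If you only want English models, you can filter it with ordinary grep using the naming scheme above (nothing here is TTS-specific):

```
# Show only the English-language entries from the model list
tts --list_models | grep "/en/"
```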
To start, we recommend the Tacotron2-DDC model, `tts_models/en/ljspeech/tacotron2-DDC`, as a good basic one.
Try generating some speech from the stock voice models that come included with TTS.

```
tts --text "if it doesn't come bursting out of you in spite of everything, don't do it." --model_name tts_models/en/ljspeech/tacotron2-DDC --out_path output.wav

tts --text "This is an example!" --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav your.wav --language_idx "en"
```

where `your.wav` is a WAV file from your chosen speaker, collected above.
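To hear a result, you can either download `output.wav` through the Python file server from earlier or, if your environment has a working audio device, use the `play` tool that ships with sox:

```
# Play the synthesized file locally (requires a working audio device)
play output.wav
```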
If you've been able to listen to synthesized outputs of your chosen voice, you've completed the main goal of this lab.
Here are stretch goals to consider:
- Provide a longer `speaker_wav` file to YourTTS. Does it increase the quality of the produced speech? Along what dimensions? (Human-ness, emotion, fidelity to the original speaker, legibility)
- Can you train on a different sample from the same speaker, expressing a wildly different emotion, or at a different age?
Document your work progress in your dev diary entry. Try to include enough detail for someone to reproduce your steps, including screenshots and code blocks if necessary.
Attempt these questions in your dev diary.
- Has listening to your chosen voice produced an emotional reaction for you? Say as much as you're comfortable.
- Has it changed your ethical considerations that you wrote about at the beginning of the lab?
- What would you like to understand better about the voice cloning process?