# AI-24sp-2024-04-25-Afternoon
Today you will have the choice of
- Continuing to improve your MNIST classifier
- Training a synthetic voice, applying neural networks and backpropagation to another domain (audio instead of visual). We will call this voice-cloning, which is a form of just-in-time finetuning.
If you wish to choose voice-cloning, continue with the instructions below.
In either case, you'll continue working on
- Tuning the model through hyperparameters to increase its success rate from the initial value
- Understanding the model as a file that you can transmit
If you have not done so already, attempt an answer to the ethics questions from AI Homework 04 in a dev diary entry, in particular:
What standards or precedents will you use to judge your participation in this lab from an ethical point-of-view?
Installing and using the TTS voice synthesis software is very resource-intensive and may cause GitPod to throttle your workspace (this happened to us when training our MNIST classifier, for example).
For example, installing the TTS python package may take more than 20 minutes on GitPod. For this reason, we ask you to work in pairs or triples to reduce our class's footprint on GitPod's cloud servers.
Thanks for your cooperation.
You can choose to finetune the text-to-speech model on your own voice, or on anyone's publicly available recording that you consider it ethical to use.
(Be sure to justify your choice in your dev diary entries above).
Record your voice with your smartphone. You can step out of the lab or go outside to get a cleaner recording. You may need to download a third-party app that can export a WAV file.
On iPhone, we've had good luck with an app called Voice Recorder, but we are not affiliated with them and there may be better choices.
A 5 minute recording is sufficient, but you may experiment with longer recordings after your initial attempt to see if it increases the quality.
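If your recording app will only export a compressed format such as M4A or MP3, that is fine: the `ffmpeg` tool installed later in this lab can convert it to WAV. A minimal sketch, where `recording.m4a`, the mono channel, and the 22.05 kHz sample rate are placeholder assumptions you can adjust:

```
# Convert a compressed phone recording to a mono 22.05 kHz WAV
# (recording.m4a is a placeholder name; adjust channels and sample rate as needed)
ffmpeg -i recording.m4a -ac 1 -ar 22050 recording.wav
```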
You may use the LiveRecorder Firefox plugin to record from YouTube.
Convert from webm to audio-only ogg:

```
ffmpeg -i your.webm -vn -acodec copy ./your.ogg
```
We discard the video portion because
- it consumes more SSD space
- TTS is an audio-only training algorithm

Do video-training algorithms exist? They must, for deepfakes on YouTube to exist.
- We're not as familiar with what open-source software exists for this, if any.
- A survey paper for future reading: https://arxiv.org/ftp/arxiv/papers/2311/2311.06329.pdf
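Depending on how your sox build was compiled, it may not be able to read the Opus audio that YouTube webm files usually contain. As a hedge, you can have ffmpeg decode straight to a WAV file instead and substitute it for `your.ogg` in the splitting command below; the mono and 22.05 kHz settings here are assumptions, not requirements:

```
# Decode the webm's audio track directly to a mono 22.05 kHz WAV
# (skip this if the ogg from the previous step works fine with sox)
ffmpeg -i your.webm -vn -ac 1 -ar 22050 your.wav
```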
Split the recording by silence. This automated splitting of files is, in our opinion, the killer app for command-line audio tools.

```
sox ./your.ogg your-.wav silence 1 0.2 0.5% 1 0.2 0.5% : newfile : restart
```
Having a bunch of short files is most useful for training a model from scratch, which we are not doing today for lack of time and GPU compute resources.
However, splitting the sound into many small WAV files lets you re-concatenate some together to create a file of the right length (5 minutes, or whatever).
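For example, sox will concatenate whatever list of input files you give it into a single output; the numbered filenames below are placeholders for whichever clips you pick from the split step:

```
# Join a few of the split clips back into one reference recording
sox your-001.wav your-002.wav your-003.wav your-combined.wav
# Check the total duration (in seconds) of the combined file
soxi -D your-combined.wav
```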
You can serve these WAV files with a Python server and use your web browser to download and preview these clips.

```
python3 -m http.server
```
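By default this serves the current directory on port 8000. If your clips live in a subfolder, you can point the server there instead (the `clips` path below is a placeholder):

```
# Serve only the folder holding the split WAV clips, still on port 8000
python3 -m http.server --directory clips
```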
Your team has the choice of either using GitPod or using the Coqui Docker image.
If you don't have your own laptop, or don't have Docker available, we recommend using GitPod for this example to be able to spend more time on voice cloning and fine-tuning, rather than setting up a Python environment on your own laptop.
At a GitPod command prompt, install the TTS Python package.

```
pip3 install TTS
```
Leave this install process running and begin reading the TTS guide for synthesizing voices and inference.
Additionally, you will need to run these three commands:

```
sudo apt update
sudo apt install libsndfile1 ffmpeg sox
export PATH=$PATH:~/.local/bin
```
- `libsndfile1` is a library for generating the output WAV file and synthesizing sound.
- `ffmpeg` is a tool for manipulating encoded media files, especially MP3 for compressed sound.
- `sox` is a command-line tool for automated sound editing. It includes `play`, a command-line sound player which is useful for playing the sound files you generate (mostly WAV).
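Once the pip install finishes and your PATH export has taken effect, a quick sanity check is to make sure the `tts` command is visible; the block below only confirms the CLI is on your PATH and prints its usage text.

```
# Confirm the tts CLI is reachable and responds
which tts
tts --help
```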
If you have your own laptop with Docker installed, you may choose the Docker method of installing TTS:
https://docs.coqui.ai/en/latest/docker_images.html#basic-inference
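As a rough sketch of what that looks like, the commands below follow our reading of the Coqui Docker instructions; treat the image name as an assumption and double-check it against the page linked above before relying on it:

```
# Pull and enter the CPU-only Coqui TTS image (image name assumed from the Coqui docs)
docker run --rm -it --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
# Inside the container, the same tts CLI used below should be available
tts --list_models
```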
While you're waiting for the pip packages or Docker images to install, read some usage tips to prepare for what you'll run after the installation is complete.
https://docs.coqui.ai/en/latest/inference.html
Also read about YourTTS, a zero-shot voice-cloning model for TTS that is designed to make finetuning effective.
https://github.com/TheEvergreenStateCollege/upper-division-cs/wiki/AI%E2%80%90Homework%E2%80%9004#human-writing
You can list all models like this:

```
tts --list_models
```

This will list all available models, which are roughly named according to `<category>/<language>/<dataset>/<model_name>`, where most of the models are in the category `tts_models`.
You'll see a long list of available models. The first time you use any of them, it will be downloaded.
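The list is long. If you only want English models, you can filter it with ordinary grep using the naming scheme above (nothing here is TTS-specific):

```
# Show only the English-language entries from the model list
tts --list_models | grep "/en/"
```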
To start, we recommend the Tacotron2-DDC model, `tts_models/en/ljspeech/tacotron2-DDC`, as a good basic one.
Try generating some speech from the stock voice models that come included with TTS.

```
tts --text "if it doesn't come bursting out of you in spite of everything, don't do it." --model_name tts_models/en/ljspeech/tacotron2-DDC --out_path output.wav

tts --text "This is an example!" --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav your.wav --language_idx "en"
```

where `your.wav` is a WAV file from your chosen speaker, collected above.
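To hear a result, you can either download `output.wav` through the Python file server from earlier or, if your environment has a working audio device, use the `play` tool that ships with sox:

```
# Play the synthesized file locally (requires a working audio device)
play output.wav
```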
If you've been able to listen to synthesized outputs of your chosen voice, you've completed the main goal of this lab.
Here are stretch goals to consider:
- Provide a longer `speaker_wav` file to YourTTS. Does it increase the quality of the produced speech? Along what dimensions? (Human-ness, emotion, fidelity to the original speaker, legibility)
- Can you train on a different sample from the same speaker, expressing a wildly different emotion, or at a different age?
Document your work progress in your dev diary entry. Try to include enough detail for someone to reproduce your steps, including screenshots and code blocks if necessary.
Attempt these questions in your dev diary.
- Has listening to your chosen voice produced an emotional reaction for you? Say as much as you're comfortable.
- Has it changed your ethical considerations that you wrote about at the beginning of the lab?
- What would you like to understand better about the voice cloning process?