AI-24sp-2024-04-24-Morning
Back to AI Self-Hosting Spring 2024
Review of Gradient Descent
The Hiking on a Landscape Metaphor
https://sebastianraschka.com/faq/docs/gradient-optimization.html
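As a quick refresher on the metaphor: each gradient descent step moves the parameters a small step downhill against the gradient of the loss,

θ ← θ − η ∇L(θ)

where η is the learning rate (how long each "stride" is) and ∇L(θ) points in the steepest uphill direction, so subtracting it walks you down the landscape.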
Questions
Finetuning Demo
What is finetuning? Any additional training that "improves" a model with a second dataset that is "close to" the original dataset; it is usually faster and less expensive than training from scratch.
For example: if you have a model that recognizes Spanish, you can finetune it to recognize Italian more easily than you could an unrelated language. For our MNIST classifier: we could probably finetune it to recognize handwritten letters as well as numeric digits, given the right dataset.
We will do some or most of this activity tomorrow in the Thursday afternoon lab.
Warning: you'll be hearing synthetically generated sounds that approximate a human voice. Some may fall into the "uncanny valley" and be disturbing to hear.
Using Mozilla's open source Text-To-Speech (TTS) library, which has since been commercialized as Coqui.
```mermaid
graph LR
Capturing ---> Training
Training ---> Synthesis
Synthesis ---> VoiceCloning
```
Step 0. Ethics
What are concerns about training a model on a public figure's voice?
What are some news events or incidents you've heard about when AI was used to impersonate a person's voice?
What about other aspects of their unique personhood?
How does this relate to our reading in Week 1 about the actor Peter Cushing?
Step 1. Capture a Large Dataset
Large means roughly 100 sound files, although you can judge later whether this is sufficient.
I captured the audio of Warren Buffett (and Charlie Munger) speaking at the 1994 Berkshire Hathaway annual meeting from this YouTube video.
Following https://stackoverflow.com/a/66307612
Use the LiveRecorder Firefox plugin to record from YouTube.
Convert from webm to audio-only ogg.
ffmpeg -i berkshire-1994.webm -vn -acodec copy ./berkshire-1994-00:00_10:00.ogg
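Here -vn drops the video stream and -acodec copy copies the audio stream into the .ogg container without re-encoding it.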
We discard the video portion because
- it consumes more SSD space
- TTS training uses audio only
Do video-training algorithms exist? They must, for deepfakes on YouTube to exist.
- I'm not familiar with what open source software exists for this, if any
- A survey paper for future reading (https://arxiv.org/ftp/arxiv/papers/2311/2311.06329.pdf)
Split by silence. This automated splitting of files is the killer app for command-line audio tools, imo.
sox ./berkshire-1994-00:00_10:00.ogg berkshire-1994-00:00_10:00-.wav silence 1 0.2 0.5% 1 0.2 0.5% : newfile : restart
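Roughly, the first triple (1 0.2 0.5%) trims leading silence until 0.2 seconds of audio rise above a 0.5% amplitude threshold, the second triple ends a clip once 0.2 seconds fall back below that threshold, and : newfile : restart makes sox open a new numbered output file at each detected silence and start over, which is what splits the recording into clips.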
Play with trimming (start and duration); once you've narrowed the clip down, you can use sox to trim it. This is where GUIs like Audacity are still better.
play ./berkshire-1994-00:00_10:00-149.wav trim 0:00 0:02.5
sox ./berkshire-1994-00:00_10:00-149.wav ./berkshire-1994-00:00_10:00-149a.wav trim 0:00 0:02.5
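In both commands, trim 0:00 0:02.5 starts at 0:00 and keeps 2.5 seconds of audio.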
https://stackoverflow.com/questions/9667081/how-do-you-trim-the-audio-files-end-using-sox
https://unix.stackexchange.com/questions/381890/play-audio-file-from-a-certain-time-step-in-terminal
Audio clips split by silence and renormalized:
ls /Users/cryptogoth/src/value-investors-tts/voice-training-sets/berkshire-meetings/1994/wavs/
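For training, each clip also needs a transcript. Assuming the dataset follows the LJSpeech-style layout that Coqui TTS's built-in formatter reads (an assumption about how this set was prepared; the metadata file isn't shown above), the directory looks roughly like:

```
berkshire-meetings/1994/
  metadata.csv    # one line per clip: file_id|transcript  (assumed format)
  wavs/
    berkshire-1994-00:00_10:00-001.wav
    ...
    berkshire-1994-00:00_10:00-149.wav
```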
Step 2. Training
I originally trained on Google Colab with GPU / TPU credits. I spent about $88 over April 2023 attempting to train the models you'll hear in Step 3, on NVIDIA H100s.
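For reference, a Coqui TTS training run is typically launched roughly like this; the config path below is a placeholder, and the exact recipe I used isn't reproduced here:

```bash
# install Coqui TTS (the PyPI package is named "TTS")
pip install TTS

# from a clone of the coqui-ai/TTS repo, start training from a JSON config
# that points at the wavs/ directory and its transcript metadata
# (config.json is a placeholder; see the repo's recipes/ directory for real examples)
CUDA_VISIBLE_DEVICES=0 python TTS/bin/train_tts.py --config_path config.json
```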
Here's the beginning of one such training run
Training Environment:
| > Current device: 0
| > Num. of GPUs: 1
| > Num. of CPUs: 2
| > Num. of Torch Threads: 1
| > Torch seed: 54321
| > Torch CUDNN: True
| > Torch CUDNN deterministic: False
| > Torch CUDNN benchmark: False
> Start Tensorboard: tensorboard --logdir=output/run-May-12-2023_06+01AM-0429ab9
> Model has 28259417 parameters
> EPOCH: 0/1000
--> output/run-May-12-2023_06+01AM-0429ab9
> TRAINING (2023-05-12 06:01:10)
--> STEP: 0/2 -- GLOBAL_STEP: 0
| > decoder_loss: 28.89919 (28.89919)
| > postnet_loss: 31.12150 (31.12150)
| > stopnet_loss: 0.77770 (0.77770)
| > loss: 15.78287 (15.78287)
| > align_error: 0.95760 (0.95760)
| > grad_norm: 7.37357 (7.37357)
| > current_lr: 0.00000
| > step_time: 2.27920 (2.27921)
| > loader_time: 1.91060 (1.91060)
The loss (cost function) at the beginning is around 15.78.
The example output below is from attempting to train again on an NVIDIA RTX 3080 in my office. Unfortunately it crashed after training began, so I'll have to do more debugging to figure out the issue before the class can run projects on it.
It's probably an insufficient or unstable power supply, or we may need a way to throttle the GPU clocks if the card has been overclocked or PyTorch drives it at unstable speeds for any reason.
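If it does turn out to be power or clock instability, one mitigation worth trying (an assumption on my part, not a confirmed fix for this crash) is to cap the card's power draw or clocks with nvidia-smi:

```bash
# show the current, default, and maximum power limits for the card
nvidia-smi -q -d POWER

# cap board power at e.g. 250 W (requires root; resets on reboot)
sudo nvidia-smi -pl 250

# optionally lock the GPU core clock to a conservative range (MHz)
sudo nvidia-smi -lgc 300,1500
```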
> Num. of GPUs: 1
| > Num. of CPUs: 12
| > Num. of Torch Threads: 6
| > Torch seed: 54321
| > Torch CUDNN: True
| > Torch CUDNN deterministic: False
| > Torch CUDNN benchmark: False
| > Torch TF32 MatMul: False
> Start Tensorboard: tensorboard --logdir=output/run-April-23-2024_11+16PM-8a7d076
> Model has 28259417 parameters
> EPOCH: 0/1000
--> output/run-April-23-2024_11+16PM-8a7d076
> DataLoader initialization
| > Tokenizer:
| > add_blank: False
| > use_eos_bos: False
| > use_phonemes: True
| > phonemizer:
| > phoneme language: en-us
| > phoneme backend: gruut
| > Number of instances : 100
| > Preprocessing samples
| > Max text length: 145
| > Min text length: 4
| > Avg text length: 47.47
|
| > Max audio length: 217876.0
| > Min audio length: 9063.0
| > Avg audio length: 47043.49
| > Num. instances discarded samples: 0
| > Batch group size: 0.
> TRAINING (2024-04-23 23:16:58)
fɚst əv ɔl fɹʌm ðə nəbɹæskə fɚnɪt͡ʃɚ mɑɹt
[!] Character '͡' not found in the vocabulary. Discarding it.
kʌm æz njuz tə t͡ʃɑɹli. aɪ hævənt toʊld hɪm jɛt bʌt
[!] Character '͡' not found in the vocabulary. Discarding it.
You'll notice from your experience with MNIST that you're able to understand a few key pieces of this training run, and indeed of any training run.
- How many epochs did this job run? (In MNIST, we started with 30 by default)
- How many audio files were in the input training set?
- This one's a little harder to figure out, but look for the term "Number of instances"
- (In MNIST, it was 60,000 images)
- What would it be for audio training?
- What is the length of typical input training examples?
- In MNIST, all training images were 28x28 pixels
Unfortunately, as you may have guessed, 100 training samples is not nearly enough.
I'll play a sample of what a Warren Buffett model trained on just 100 samples, after 22 epochs, sounds like.
(22 epochs was all that $88 would pay for, over about 3 or 4 days of leaving a Google Colab instance running).
The loss is now 0.14, down from 15.78.
/Users/cryptogoth/src/warren-buffett-speech/1994/loss_0_14342.wav
Step 3. Synthesis
This seemed much harder, so for a while I was content to just use pre-trained models (of which there are many high-quality, royalty-free ones that come with TTS), including various European, British, Irish, and Australian English models.
This is what Speechify and similar startups do: essentially turn any document into a professional audiobook on demand.
(As you may guess, voices for other languages exist as well, but they were trained separately since those speakers were speaking non-English words from a different vocabulary).
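Trying the bundled voices from the command line looks roughly like this (the model name is one example from the TTS catalog, and the output file name is just a placeholder):

```bash
# list the pre-trained models and voices that ship with Coqui TTS
tts --list_models

# synthesize a sentence with one of the pre-trained English voices
tts --text "Our gain in net worth during 1994 was 1.45 billion dollars." \
    --model_name "tts_models/en/ljspeech/tacotron2-DDC" \
    --out_path pretrained-demo.wav
```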
To test the quality, I had two different models speak the 1994 Berkshire Hathaway annual letter, a document that gets released before the annual meeting every year and is required by the SEC, the government body that regulates publicly-traded companies.
Warren Buffett writes this letter every year, but (to my knowledge) has never read it aloud on a recording. Therefore, my goal was to simulate what it would sound like if he had read his own words, at the same age as when he wrote the letter.
Most of the TTS voices, as you'll hear, sound like they are recorded from people ranging in age from early 20s to early 30s.
Ethics Discussion
- How does this compare to the "unjust enrichment" lawsuit against the estate of the actor Peter Cushing above?
- What concerns does it bring up for you, if any?
Female American
/Users/cryptogoth/src/warren-buffett-speech/1994/output-001.wav
Male Scottish/Irish
/Users/cryptogoth/src/value-investors-tts/berkshire-hathaway/1979/berkshire-hathaway-1979.mp3
Step 4. Finetuning
Finally, I found an option that lets us use TTS to do "zero-shot" finetuning, which essentially means just-in-time voice cloning from a single short voice sample rather than a long training process.
https://github.com/Edresson/YourTTS
Using a few-minute audio clip of Warren Buffett speaking, taken from the YouTube video, we can speak the example text.
/Users/cryptogoth/src/warren-buffett-speech/tts_output-short.wav
Given the text from the annual letter that you heard above, we can synthesize what he would sound like reading this new text.
This is the beginning text of the letter:
To the Shareholders of Berkshire Hathaway Inc.:
Our gain in net worth during 1994 was $1.45 billion or 13.9%.
Over the last 30 years (that is, since present management took
over) our per-share book value has grown from $19 to $10,083, or
at a rate of 23% compounded annually.
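The zero-shot cloning call with the YourTTS model bundled in TTS looks roughly like this (the file names are placeholders, not the exact ones used for the clips below, and the flags should be checked against tts --help):

```bash
# zero-shot voice cloning: --speaker_wav is the reference clip of the voice to imitate
tts --model_name "tts_models/multilingual/multi-dataset/your_tts" \
    --speaker_wav buffett-reference.wav \
    --language_idx en \
    --text "To the Shareholders of Berkshire Hathaway Inc.: Our gain in net worth during 1994 was 1.45 billion dollars, or 13.9 percent." \
    --out_path bh-1994-clone.wav
```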
Doing voice cloning on a longer sample
/Users/cryptogoth/src/warren-buffett-speech/1994/bh-1994-002.wav
and with a longer file
/Users/cryptogoth/src/warren-buffett-speech/1994/bh-1994.mp3
Compare again with the original
/Users/cryptogoth/src/value-investors-tts/voice-training-sets/berkshire-meetings/1994/long/berkshire-1994.mp3
/Users/cryptogoth/src/value-investors-tts/bpl/1957/bpl-1957.mp3
Ethics Discussion
- You can try recording your voice for a few minutes and do zero-shot voice cloning in the lab. What thoughts or feelings do you have right now, before hearing your synthesized voice?
- Should you have the right to synthesize your own voice? What are possible advantages or misuses of it?
- Photographs, video and audio recordings, and now possibly holograms are changing the way we remember loved ones who have died.
- If generative AI allows us another way to interact with a personality of a person in the past, how does this compare with historical re-enactors, or movies depicting fictional events with real people?