Audio & speech: TTS, Whisper, MusicGen

Audio is just another sequence: a 1D signal, typically sampled 16 000 times per second for speech or 44 100 times per second for music. Many modern audio models convert raw waveforms into a time-frequency representation called a spectrogram, then apply the same transformer architectures that power LLMs and image models. This lesson covers three major audio tasks and a representative model for each: automatic speech recognition (Whisper), text-to-speech (TTS), and music generation (MusicGen).
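To make the waveform-to-spectrogram step concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration under stated assumptions, not the preprocessing of any particular model: the helper names (`sine_wave`, `spectrogram`), the 400-sample frame (25 ms at 16 kHz), the 160-sample hop (10 ms), and the Hann window are all choices made for this example. A real pipeline would use an FFT library and usually a mel filterbank on top.

```python
import cmath
import math

SAMPLE_RATE = 16_000  # samples per second (assumed speech rate)
FRAME_LEN = 400       # 25 ms analysis frame (assumption for this sketch)
HOP = 160             # 10 ms step between frames (assumption for this sketch)


def sine_wave(freq_hz, sample_rate, n_samples):
    """Return n_samples of a pure tone as a list of floats in [-1, 1]."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


def spectrogram(signal, frame_len=FRAME_LEN, hop=HOP):
    """Magnitude spectrogram via a naive DFT.

    Returns one list per frame, each holding frame_len // 2 + 1 magnitudes
    (the non-negative frequency bins).
    """
    # Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    # Precomputed complex exponentials exp(-2*pi*j*m/N) for m = 0..N-1.
    twiddle = [cmath.exp(-2j * math.pi * m / frame_len)
               for m in range(frame_len)]
    n_bins = frame_len // 2 + 1

    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(n_bins):
            acc = sum(chunk[i] * twiddle[(k * i) % frame_len]
                      for i in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# 0.1 s of a 440 Hz tone: 1600 samples at 16 kHz.
wave = sine_wave(440.0, SAMPLE_RATE, 1600)
spec = spectrogram(wave)

# Locate the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), "frames x", len(spec[0]), "bins; peak at", peak_hz, "Hz")
```

The result is a 2D grid of time frames by frequency bins, which is the kind of input a transformer can treat as a sequence of frame vectors. With these settings the bins are spaced 16 000 / 400 = 40 Hz apart, so the 440 Hz test tone falls exactly on bin 11.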
