Audio and speech: Whisper, TTS and MusicGen
Audio is just another sequence: a 1D signal typically sampled 16 000 or 44 100 times per second. Many modern audio models convert the raw waveform into a time-frequency representation called a spectrogram, then apply the same transformer architectures that power LLMs and image models. This lesson covers three core audio tasks: automatic speech recognition (Whisper), text-to-speech (TTS), and music generation (MusicGen).
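To make the waveform-to-spectrogram step concrete, here is a minimal NumPy sketch of a magnitude spectrogram computed with a short-time Fourier transform. The function name `spectrogram`, the 400-sample frame, the 160-sample hop, and the 440 Hz test tone are illustrative assumptions, not settings taken from this lesson; real models usually add further steps such as mel filtering and log scaling.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform (sketch).

    frame_len and hop are assumed values: 25 ms frames and a 10 ms hop
    when the signal is sampled at 16 kHz.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the 1D signal into overlapping frames and taper each with the window.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate      # one second of sample times
tone = np.sin(2 * np.pi * 440.0 * t)          # a pure 440 Hz sine wave

spec = spectrogram(tone)                      # rows are time frames, columns are frequency bins
peak_bin = int(spec.mean(axis=0).argmax())    # strongest frequency bin on average
peak_hz = peak_bin * sample_rate / 400        # bin index to Hz (400 = frame_len)
print(spec.shape, peak_hz)
```

The result is a 2D array with time along one axis and frequency along the other, so a transformer can consume it as a sequence of frames, much as it consumes a sequence of text tokens or image patches.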