Audio generation using models from Hugging Face enables developers to create synthetic audio such as speech, music, or sound effects using pretrained AI models. These models learn patterns from large audio datasets to generate realistic and high‑quality audio outputs from text or other inputs.
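For a quick taste before the step-by-step walkthrough below, the high-level pipeline API wraps the whole flow in a few lines. This is a minimal sketch, assuming a recent transformers release that ships the text-to-speech pipeline task; the manual steps that follow build the same flow by hand.
from transformers import pipeline
import soundfile as sf

# Assumed quick-start: the "text-to-speech" pipeline task wraps the same
# model-loading, tokenization and generation steps covered manually below.
tts = pipeline("text-to-speech", model="facebook/mms-tts-eng")
result = tts("Artificial intelligence can now generate realistic speech.")

# The pipeline returns a dict with the raw waveform and its sampling rate.
audio = result["audio"].squeeze()  # collapse any leading batch/channel axis
sf.write("pipeline_generated.wav", audio, result["sampling_rate"])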

Implementation
Step 1: Set Up the Environment
First, install the required libraries. Run the following command in your command prompt.
pip install transformers torch torchaudio soundfile
- transformers: Loads and runs pretrained text-to-speech models
- torch: Core framework for tensor operations and inference
- torchaudio: Audio-related processing utilities
- soundfile: Saves generated audio to a file
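To confirm the installation succeeded, you can print the library versions. This is just a quick sanity check; the exact version numbers will vary by environment.
import transformers, torch, torchaudio, soundfile

# Print versions to verify all four libraries imported correctly.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("soundfile:", soundfile.__version__)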
Step 2: Import Required Libraries
from transformers import AutoProcessor, AutoModelForTextToWaveform
import torch
import soundfile as sf
Step 3: Load Model and Preprocessor
In this step, we load facebook/mms-tts-eng, a pretrained text-to-speech model developed by Meta under the Massively Multilingual Speech (MMS) project. It is designed to generate natural-sounding speech from English text input.
- Processor: Converts input text into numerical tokens formatted for the model.
- Model: Generates raw waveform audio from the processed text input.
model_name = "facebook/mms-tts-eng"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForTextToWaveform.from_pretrained(model_name)
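Optionally, you can run inference on a GPU when one is available. A small sketch; the remaining steps assume the default CPU placement, so if you use this, the inputs created in Step 4 must be moved to the same device.
# Optional: move the model to a GPU if one is available. Inputs created
# later must be moved to the same device with inputs.to(device).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)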
Step 4: Prepare Input Text
In this step, we provide the text that we want to convert into speech. The processor tokenizes the text and converts it into PyTorch tensors so it can be passed to the model for waveform generation.
text = "Artificial intelligence can now generate realistic speech."
inputs = processor(text=text, return_tensors="pt")
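You can inspect the tokenized result to see what the model actually receives. A small sketch; the exact tensor shape depends on the input text.
# inputs is a dict-like object holding PyTorch tensors.
print(inputs.keys())              # typically input_ids and attention_mask
print(inputs["input_ids"].shape)  # (batch_size, sequence_length)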
Step 5: Generate Audio
Now the model generates waveform audio from the processed text input. The torch.no_grad() context disables gradient computation to optimize inference and reduce memory usage during audio generation.
with torch.no_grad():
    output = model(**inputs)
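The returned object exposes the synthesized speech as a waveform tensor. A quick sketch to inspect its shape and estimate the clip length, assuming the model's 16 kHz output rate.
# The output holds the synthesized waveform as a float tensor.
print(output.waveform.shape)  # (batch_size, num_samples)
print(f"Duration: {output.waveform.shape[-1] / 16000:.2f} s")  # at 16 kHz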
Step 6: Convert to Audio and Save
In this step, the generated tensor is squeezed to drop the batch dimension, moved to the CPU, and converted into a NumPy array, giving a one-dimensional waveform. The soundfile library then saves the waveform as a WAV file at a 16 kHz sampling rate.
audio = output.waveform.squeeze().cpu().numpy()
sf.write("manual_generated.wav", audio, 16000)