Audio Generation using Hugging Face

Last Updated : 30 Mar, 2026

Audio generation using models from Hugging Face enables developers to create synthetic audio such as speech, music, or sound effects using pretrained AI models. These models learn patterns from large audio datasets to generate realistic and high‑quality audio outputs from text or other inputs.

Audio Generation

Implementation

Step 1: Set Up the Environment

First, install the required libraries. Run the following command in your terminal or command prompt.

pip install transformers torch torchaudio soundfile

  • transformers: loads and runs pre-trained text-to-speech models
  • torch: core framework for tensor operations and model inference
  • torchaudio: audio-processing utilities
  • soundfile: saves the generated audio to a file

Step 2: Import Required Libraries

Python
from transformers import AutoProcessor, AutoModelForTextToWaveform
import torch
import soundfile as sf

Step 3: Load Model and Preprocessor

In this step, we load facebook/mms-tts-eng, a pre-trained text-to-speech model developed by Meta under the Massively Multilingual Speech (MMS) project. It is designed to generate natural-sounding speech from English text input.

  • Processor: Converts input text into numerical tokens formatted for the model.
  • Model: Generates raw waveform audio from the processed text input.

Python
model_name = "facebook/mms-tts-eng"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForTextToWaveform.from_pretrained(model_name)

Output:

Loading the model and processor

Step 4: Prepare Input Text

In this step, we provide the text that we want to convert into speech. The processor tokenizes the text and converts it into PyTorch tensors so it can be passed to the model for waveform generation.

Python
text = "Artificial intelligence can now generate realistic speech."

inputs = processor(text=text, return_tensors="pt")
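To build intuition for what the processor does, the toy sketch below maps each character of the input text to an integer id. This is purely illustrative: the real MMS-TTS processor uses its own learned vocabulary and returns PyTorch tensors, and the `toy_tokenize` helper and its vocabulary are hypothetical, not part of the actual library.

```python
# Toy illustration of tokenization: map each character of the input
# text to an integer id. The real processor's vocabulary differs.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz .")}

def toy_tokenize(text: str) -> list:
    # Lowercase the text and keep only characters in the toy vocabulary
    return [vocab[ch] for ch in text.lower() if ch in vocab]

ids = toy_tokenize("Hi there")
print(ids)
```

The real processor performs the same kind of text-to-id mapping, then wraps the result in tensors (`return_tensors="pt"`) so it can be fed directly to the model.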

Step 5: Generate Audio

Now the model generates waveform audio from the processed text input. The torch.no_grad() context disables gradient computation to optimize inference and reduce memory usage during audio generation.

Python
with torch.no_grad():
    output = model(**inputs)

Step 6: Convert to Audio and Save

In this step, the generated tensor is squeezed to remove the batch dimension, moved to the CPU and converted into a NumPy array. The soundfile library then saves the waveform as a WAV file at a 16 kHz sampling rate, which is the model's native output rate.

Python
audio = output.waveform.squeeze().cpu().numpy()

sf.write("manual_generated.wav", audio, 16000)
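Before writing the file, it can also help to peak-normalize the waveform so its loudest sample sits just below full scale, avoiding clipping in some players. The sketch below is a minimal, self-contained example using NumPy and a synthetic sine tone in place of the model output; `peak_normalize` is a hypothetical helper, not part of the article's pipeline or the transformers API.

```python
import numpy as np

def peak_normalize(waveform, peak=0.95):
    # Scale the waveform so its loudest sample has amplitude `peak`
    max_amp = np.max(np.abs(waveform))
    if max_amp == 0:
        return waveform  # silent input: nothing to scale
    return waveform * (peak / max_amp)

# Stand-in for model output: a 1-second 440 Hz tone at 16 kHz
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = 0.3 * np.sin(2 * np.pi * 440 * t)

normalized = peak_normalize(tone)
# The normalized array could then be saved the same way:
# sf.write("manual_generated.wav", normalized, sr)
```

The same call would apply to the `audio` array produced above before passing it to `sf.write`.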

Output:

A manual_generated.wav file containing the spoken sentence is saved in the working directory.
