Audio Generation using Hugging Face

Last Updated : 30 Mar, 2026

Audio generation using models from Hugging Face enables developers to create synthetic audio such as speech, music, or sound effects using pretrained AI models. These models learn patterns from large audio datasets to generate realistic and high‑quality audio outputs from text or other inputs.

Audio Generation

Implementation

Step 1: Set Up the Environment

First, install the required libraries. Run the following command in your terminal or command prompt.

pip install transformers torch torchaudio soundfile

  • transformers: loads and runs pre-trained text-to-speech models
  • torch: core framework for tensor operations and model inference
  • torchaudio: audio-processing utilities
  • soundfile: saves the generated audio to a file

Step 2: Import Required Libraries

Python
from transformers import AutoProcessor, AutoModelForTextToWaveform
import torch
import soundfile as sf

Step 3: Load Model and Preprocessor

In this step, we load facebook/mms-tts-eng, a pre-trained text-to-speech model developed by Meta under the Massively Multilingual Speech (MMS) project. It is designed to generate natural-sounding speech from English text input.

  • Processor: Converts input text into numerical tokens formatted for the model.
  • Model: Generates raw waveform audio from the processed text input.

Python
model_name = "facebook/mms-tts-eng"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForTextToWaveform.from_pretrained(model_name)

Output:

Loading the model and processor

Step 4: Prepare Input Text

In this step, we provide the text that we want to convert into speech. The processor tokenizes the text and converts it into PyTorch tensors so it can be passed to the model for waveform generation.

Python
text = "Artificial intelligence can now generate realistic speech."

inputs = processor(text=text, return_tensors="pt")
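To build intuition for what the processor does, the toy sketch below maps each character of the input text to an integer id. This is purely illustrative: the real MMS-TTS processor uses its own learned vocabulary and returns PyTorch tensors, and the `toy_tokenize` helper and its vocabulary are hypothetical, not part of the actual library.

```python
# Toy illustration of tokenization: map each character of the input
# text to an integer id. The real processor's vocabulary differs.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz .")}

def toy_tokenize(text: str) -> list:
    # Lowercase the text and keep only characters in the toy vocabulary
    return [vocab[ch] for ch in text.lower() if ch in vocab]

ids = toy_tokenize("Hi there")
print(ids)
```

The real processor performs the same kind of text-to-id mapping, then wraps the result in tensors (`return_tensors="pt"`) so it can be fed directly to the model.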

Step 5: Generate Audio

Now the model generates waveform audio from the processed text input. The torch.no_grad() context disables gradient computation to optimize inference and reduce memory usage during audio generation.

Python
with torch.no_grad():
    output = model(**inputs)

Step 6: Convert to Audio and Save

In this step, the generated tensor is squeezed to remove the batch dimension, moved to the CPU and converted into a NumPy array. The soundfile library then saves the waveform as a WAV file at a 16 kHz sampling rate, which is the model's native output rate.

Python
audio = output.waveform.squeeze().cpu().numpy()

sf.write("manual_generated.wav", audio, 16000)
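Before writing the file, it can also help to peak-normalize the waveform so its loudest sample sits just below full scale, avoiding clipping in some players. The sketch below is a minimal, self-contained example using NumPy and a synthetic sine tone in place of the model output; `peak_normalize` is a hypothetical helper, not part of the article's pipeline or the transformers API.

```python
import numpy as np

def peak_normalize(waveform, peak=0.95):
    # Scale the waveform so its loudest sample has amplitude `peak`
    max_amp = np.max(np.abs(waveform))
    if max_amp == 0:
        return waveform  # silent input: nothing to scale
    return waveform * (peak / max_amp)

# Stand-in for model output: a 1-second 440 Hz tone at 16 kHz
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = 0.3 * np.sin(2 * np.pi * 440 * t)

normalized = peak_normalize(tone)
# The normalized array could then be saved the same way:
# sf.write("manual_generated.wav", normalized, sr)
```

The same call would apply to the `audio` array produced above before passing it to `sf.write`.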

Output:

A manual_generated.wav file containing the spoken sentence is saved in the working directory.
