OpenAI’s New Audio Models: How to Access, Applications, and More


OpenAI has recently unveiled a suite of next-generation audio models that enhance the capabilities of voice-enabled applications. The release includes new speech-to-text (STT) and text-to-speech (TTS) models, giving developers more tools to create sophisticated voice agents. Available through the API, these models make it much easier for developers worldwide to build flexible and reliable voice agents. In this article, we will explore the features and applications of OpenAI’s latest GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS models. We’ll also learn how to access OpenAI’s audio models and try them out ourselves. So let’s get started!

OpenAI’s New Audio Models

OpenAI has introduced a new generation of audio models designed to enhance speech recognition and voice synthesis capabilities. These models offer improvements in accuracy, speed, and flexibility, enabling developers to build more powerful AI-driven voice applications. The suite includes two speech-to-text models and one text-to-speech model:

  1. GPT-4o-Transcribe: OpenAI’s most advanced speech-to-text model, offering industry-leading transcription accuracy. It is designed for applications that require precise and reliable transcriptions, such as meeting and lecture transcriptions, customer service call logs, and content subtitling.
  2. GPT-4o-Mini-Transcribe: A smaller, lightweight, and more efficient version of the above transcription model. It is optimized for lower-latency applications such as live captions, voice commands, and interactive AI agents. It provides faster transcription speeds, lower computational costs, and a balance between accuracy and efficiency.
  3. GPT-4o-mini TTS: This model introduces the ability to instruct the AI to speak in specific styles or tones, making AI-generated voices sound more human-like. Developers can now tailor the agent’s voice tone to match different contexts like friendly, professional, or dramatic. It works well with OpenAI’s speech-to-text models, enabling smooth voice interactions.

The speech-to-text models come with advanced technologies such as noise cancellation. They are also equipped with a semantic voice activity detector that can accurately detect when the user has finished speaking. These innovations help developers handle many of the common issues that come up while building voice agents. Alongside the new models, OpenAI announced that its recently launched Agents SDK now supports audio, making it even easier for developers to build voice agents, as the sketch below illustrates.
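For a sense of what that looks like in practice, here is a rough sketch of a voice agent built on the Agents SDK’s voice support. It assumes the openai-agents package is installed with its voice extra; the class and event names follow the SDK’s voice pipeline documentation, so treat the exact signatures as illustrative rather than definitive.

import numpy as np
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# An ordinary text agent...
agent = Agent(
    name="Support Agent",
    instructions="You are a friendly, concise voice assistant.",
)

async def answer(audio_buffer: np.ndarray):
    # ...wrapped in a voice pipeline: speech-to-text -> agent -> text-to-speech
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
    result = await pipeline.run(AudioInput(buffer=audio_buffer))
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            yield event.data  # audio chunks ready to be played back to the user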

Learn More: How to Use OpenAI Responses API & Agent SDK?

Technical Innovations Behind OpenAI’s Audio Models

The advancements in these audio models are attributed to several key technical innovations:

  • Pretraining with Authentic Audio Datasets: Leveraging extensive and diverse audio data has enriched the models’ ability to understand and generate human-like speech patterns.
  • Advanced Distillation Methodologies: These techniques have been employed to optimize model performance, ensuring efficiency without compromising quality.
  • Reinforcement Learning Paradigm: Implementing reinforcement learning has contributed to the models’ improved accuracy and adaptability in various speech scenarios.

How to Access OpenAI’s Audio Models

The latest TTS model, GPT-4o-mini TTS, is available on OpenAI.fm, a new platform released by OpenAI. Here’s how you can access this model:

  1. Open the Website

    First, head to www.openai.fm.

  2. Choose the Voice and Vibe

    On the interface that opens up, choose your voice and set the vibe. If you can’t find the right character with the right vibe, click on the refresh button to get different options.

  3. Fine-tune the Voice

    You can further customize the chosen voice with a detailed prompt. Below the vibe options, you can type in details like accent, tone, pacing, etc. to get the exact voice you want.

  4. Add the Script and Play

Once set, just type your script into the text input box on the right and click the ‘PLAY’ button. If you like what you hear, you can download the audio or share it externally. If not, keep iterating until you get the result you want.

How to access GPT-4o-mini TTS on OpenAI.fm

The page requires no signup, and you can experiment with the model as much as you like. Moreover, in the top right corner, there’s a toggle that gives you the code for the model, configured with your choices.
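That generated snippet is built on the Python SDK’s speech endpoint. Here’s a minimal sketch of what it looks like, assuming the openai package is installed; the voice, instructions text, and output file name below are illustrative placeholders rather than the exact values the site produces.

from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Generate speech whose delivery is steered by a natural-language description
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="nova",
    input="Thank you for calling. How can I help you today?",
    instructions="Warm, friendly tone with relaxed pacing.",
) as response:
    # Stream the synthesized audio straight to an MP3 file
    response.stream_to_file("speech.mp3")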

Hands-on Testing of OpenAI’s Audio Models

Now that we know how to use the model, let’s give it a try! First, let’s try out the OpenAI.fm website.

1. Using GPT-4o-mini TTS on OpenAI.fm

Suppose I wish to build an “Emergency Services” voice support agent.

For this agent, I select the:

  • Voice – Nova
  • Vibe – Sympathetic

Use the Following Instructions:

Tone: Calm, confident, and authoritative. Reassuring to keep the caller at ease while handling the situation. Professional yet empathetic, reflecting genuine concern for the caller’s well-being.

Pacing: Steady, clear, and deliberate. Not too fast to avoid panic but not too slow to delay response. Slight pauses to give the caller time to respond and process information.

Clarity: Clear, neutral accent with a well-enunciated voice. Avoid jargon or complicated terms, using simple, easy-to-understand language.

Empathy: Acknowledge the caller’s emotional state (fear, panic, etc.) without adding to it.

Offer calm reassurance and support throughout the conversation.

Use the Following Script:

“Hello, this is Emergency Services. I’m here to help you. Please stay calm and listen carefully as I guide you through this situation.”

“Help is on the way, but I need a bit of information to make sure we respond quickly and appropriately.”

“Please provide me with your location. The exact address or nearby landmarks will help us get to you faster.”

“Thank you; if anyone is injured, I need you to stay with them and avoid moving them unless necessary.”

“If there’s any bleeding, apply pressure to the wound to control it. If the person is not breathing, I’ll guide you through CPR. Please stay with them and keep calm.”

“If there are no injuries, please find a safe place and stay there. Avoid danger, and wait for emergency responders to arrive.”

“You’re doing great. Stay on the line with me, and I will ensure help is on the way and keep you updated until responders arrive.”

OpenAI's GPT-4o-mini TTS audio model application

Output:

Wasn’t that great? OpenAI’s latest audio models are now accessible through OpenAI’s API as well, enabling developers to integrate them into various applications.

Now let’s test that out.

2. Using gpt-4o-audio-preview via API

We’ll be accessing the gpt-4o-audio-preview model via OpenAI’s API and trying out 2 tasks: one for text-to-speech, and the other for speech-to-text.

Task 1: Text-to-Speech

For this task, I’ll be asking the model to tell me a joke.

Code Input:

import base64
from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Ask for both a text and an audio response from the model
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Can you tell me a joke about an AI trying to tell a joke?"
        }
    ]
)

print(completion.choices[0])

# Decode the base64-encoded audio in the response and save it as a WAV file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("output.wav", "wb") as f:
    f.write(wav_bytes)

Response:

Task 2: Speech-to-Text

For our second task, let’s give the model this audio file and see if it can tell us about the recording.

Code Input:

import base64
import requests
from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Fetch the sample audio file and convert it to a base64-encoded string
url = "https://cdn.openai.com/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode('utf-8')

# Send the encoded audio along with a text question to the model
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(completion.choices[0].message)

Response:

gpt-4o-audio-preview output

Benchmark Results of OpenAI’s Audio Models

To assess the performance of its latest speech-to-text models, OpenAI conducted benchmark tests using Word Error Rate (WER), a standard metric in speech recognition. WER measures transcription accuracy by calculating the percentage of incorrect words compared to a reference transcript. A lower WER indicates better performance with fewer errors.
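For reference, WER is calculated as:

WER = (S + D + I) / N

where S, D, and I are the numbers of substituted, deleted, and inserted words needed to turn the model’s transcript into the reference, and N is the total number of words in the reference transcript.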

OpenAI's GPT-4o-Transcribe and GPT-4o-Mini-Transcribe, benchmarks

As the results show, the new speech-to-text models – gpt-4o-transcribe and gpt-4o-mini-transcribe – offer improved word error rates and enhanced language recognition compared to previous models like Whisper.
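If you want to try the new models yourself, they are drop-in replacements for Whisper in the transcriptions endpoint. Here’s a minimal sketch using the Python SDK; the file name meeting.wav is just a placeholder.

from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Transcribe a local audio file with the new speech-to-text model
with open("meeting.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower latency
        file=audio_file,
    )

print(transcription.text)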

Performance on FLEURS Benchmark

One of the key benchmarks used is FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech), which is a multilingual speech dataset covering over 100 languages with manually transcribed audio samples.

OpenAI's GPT-4o-Transcribe and GPT-4o-Mini-Transcribe, benchmarks

The results indicate that OpenAI’s new models:

  • Achieve lower WER across multiple languages, demonstrating improved transcription accuracy.
  • Show stronger multilingual coverage, making them more reliable for diverse linguistic applications.
  • Outperform Whisper v2 and Whisper v3, OpenAI’s previous-generation models, in all evaluated languages.

Cost of OpenAI’s Audio Models

 

Cost of OpenAI’s audio models

Conclusion

OpenAI’s latest audio models mark a significant shift from purely text-based agents to sophisticated voice agents, bridging the gap between AI and human-like interaction. These models don’t just understand what to say—they grasp how to say it, capturing tone, pacing, and emotion with remarkable precision. By offering both speech-to-text and text-to-speech capabilities, OpenAI enables developers to create AI-driven voice experiences that feel more natural and engaging.

The availability of these models via API means developers now have greater control over both the content and delivery of AI-generated speech. Additionally, OpenAI’s Agents SDK makes it easier to transform traditional text-based agents into fully functional voice agents, opening up new possibilities for customer service, accessibility tools, and real-time communication applications. As OpenAI continues to refine its voice technology, these advancements set a new standard for AI-powered interactions.

Frequently Asked Questions

Q1. What are OpenAI’s new audio models?

A. OpenAI has introduced three new audio models—GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS. These models are designed to enhance speech-to-text and text-to-speech capabilities, enabling more accurate transcriptions and natural-sounding AI-generated speech.

Q2. How are OpenAI’s new audio models different from Whisper?

A. Compared to OpenAI’s Whisper models, the new GPT-4o audio models offer improved transcription accuracy and lower word error rates. They also offer enhanced multilingual support and better real-time responsiveness. Additionally, the text-to-speech model provides more natural voice modulation, allowing users to adjust tone, style, and pacing for more lifelike AI-generated speech.

Q3. What are the key features of OpenAI’s new text-to-speech (TTS) model?

A. The new TTS model allows users to generate speech with customizable styles, tones, and pacing. It enhances human-like voice modulation and supports diverse use cases, from AI voice assistants to audiobook narration. The model also provides better emotional expression and clarity than previous iterations.

Q4. How are GPT-4o-Transcribe and GPT-4o-Mini-Transcribe different?

A. GPT-4o-Transcribe offers industry-leading transcription accuracy, making it ideal for professional use cases like meeting transcriptions and customer service logs. GPT-4o-Mini-Transcribe is optimized for efficiency and speed, catering to real-time applications such as live captions and interactive AI agents.

Q5. What is OpenAI.fm?

A. OpenAI.fm is a web platform where users can test OpenAI’s text-to-speech model without signing up. Users can select a voice, adjust the tone, enter a script, and generate audio instantly. The platform also provides the underlying API code for further customization.

Q6. Can OpenAI’s Agents SDK help developers build voice agents?

A. Yes, OpenAI’s Agents SDK now supports audio, allowing developers to convert text-based agents into interactive voice agents. This makes it easier to create AI-powered customer support bots, accessibility tools, and personalized AI assistants with advanced voice capabilities.

Sabreena is a GenAI enthusiast and tech editor who’s passionate about documenting the latest advancements that shape the world. She’s currently exploring the world of AI and Data Science as the Manager of Content & Growth at Analytics Vidhya.
