Source: TechCrunch, May 8, 2026

OpenAI Releases GPT-Realtime-2 and Whisper v4 with Sub-200ms Voice Latency


On May 8, OpenAI released GPT-Realtime-2, a new real-time voice model with dramatically reduced latency and improved emotional range, alongside Whisper v4, which sets a new accuracy benchmark for multilingual speech recognition.

Key points:

• GPT-Realtime-2 achieves sub-200ms end-to-end latency, making voice interaction feel genuinely conversational rather than transactional.

• Whisper v4 reduces word error rate by 31% across 20 languages and adds speaker diarization, enabling it to identify and label multiple speakers in a single recording.

• Both models are available via API immediately, with GPT-Realtime-2 priced at $0.06 per minute of audio.
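For teams budgeting a deployment, the quoted $0.06-per-minute rate makes cost estimation straightforward. The sketch below is a back-of-envelope calculator, not an official billing tool; the per-minute rate is the only figure taken from the announcement, and the session lengths are illustrative assumptions.

```python
# Back-of-envelope cost estimate for GPT-Realtime-2 audio usage,
# based on the $0.06-per-minute price quoted in the announcement.
# Illustrative only; actual billing granularity and terms may differ.

def realtime_audio_cost(minutes: float, rate_per_minute: float = 0.06) -> float:
    """Estimated audio cost in USD for a session of the given length."""
    return minutes * rate_per_minute

# Hypothetical usage pattern: one 30-minute voice session per workday.
session_cost = realtime_audio_cost(30)        # 30 min at $0.06/min
monthly_cost = session_cost * 22              # ~22 workdays per month
print(f"Per session: ${session_cost:.2f}, per month: ${monthly_cost:.2f}")
```

At that rate, even heavy daily use stays in the tens of dollars per user per month, which is the arithmetic behind the "enterprise-viable" framing below.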

Real-time voice AI at this quality level unlocks a new category of applications: live meeting assistants, real-time translation, and voice-first productivity tools that were previously too laggy to be useful. Whisper v4's multilingual accuracy improvements are particularly significant for global enterprise deployment, where transcription errors in non-English languages have been a persistent barrier.

For product builders, GPT-Realtime-2 is now the default recommendation for any voice-first application. Build voice interfaces into your tools before your competitors do. Evaluate Whisper v4 for meeting transcription pipelines, especially if your organization operates across multiple languages. The accuracy jump makes it enterprise-viable.

Why It Matters: Sub-200ms voice latency enables genuinely conversational AI applications that were previously impossible, while 31% accuracy improvements in multilingual transcription remove a major barrier to global enterprise deployment.
