OpenAI has unveiled a new iteration of its flagship language model, GPT-4o (the "o" is a letter, not a zero), which can process and generate audio, images, and text. The "o" stands for "omni," a prefix meaning "all."
GPT-4o (Omni)
OpenAI has introduced an advanced version of GPT-4, termed GPT-4o, designed to make interactions between humans and machines feel more natural. The new model responds to inputs at roughly the speed of a human-to-human conversation. It matches GPT-4 Turbo's performance on English text and surpasses it in other languages, and in the API it is significantly faster and 50% cheaper.
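For developers, the upgrade is mostly transparent: GPT-4o is reachable through the same chat completions endpoint as earlier GPT-4 models, with only the model identifier changing. A minimal sketch using the official openai Python SDK (the prompt and setup are illustrative, not from the announcement):

```python
from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# GPT-4o uses the same chat completions API as GPT-4 Turbo;
# switching models is just a change of identifier.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what GPT-4o changes in one sentence."},
    ],
)

print(response.choices[0].message.content)
```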
From the announcement:
“As measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities.”
Advanced Voice Processing
Previously, voice communication with AI relied on a pipeline of three separate models: one to transcribe speech to text, a second (such as GPT-3.5 or GPT-4) to process the text and generate a response, and a third to convert that text back to audio. Nuance was often lost at each of these conversion stages.
From the announcement:
“This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.”
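To make the older pipeline concrete, the sketch below wires the three stages together using OpenAI's separate transcription, chat, and text-to-speech endpoints. The model names, voice, and file paths here are illustrative assumptions; GPT-4o's point is that a single model replaces all three stages:

```python
from openai import OpenAI

client = OpenAI()

# Stage 1: transcribe the user's speech to plain text. Tone, speaker
# identity, and background sounds are discarded at this point.
with open("user_question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Stage 2: a text-only model generates a reply from the transcript alone.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# Stage 3: a separate TTS model synthesizes the reply back into audio.
# It cannot laugh, sing, or express emotion it never observed.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
speech.write_to_file("assistant_reply.mp3")
```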
These are the limitations of the older approach that the new model addresses: GPT-4o handles both audio inputs and outputs end-to-end with a single model. Notably, OpenAI has indicated that it has yet to fully explore the capabilities, or understand the limitations, of this integrated approach.
New Guardrails and an Iterative Release
GPT-4o introduces enhanced guardrails and filters to keep outputs safe and prevent unintended voice outputs. According to the announcement, capabilities at launch are limited to text and image inputs with text outputs, with audio output restricted to a selection of preset voices. GPT-4o will be available on both the free and paid tiers, with Plus users getting message limits five times higher.
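Those launch capabilities, text and image in, text out, map onto the existing chat completions API, where an image can be passed alongside a text prompt. A hedged sketch (the image URL and question are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# A single user message can mix text and image parts; GPT-4o replies in text.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```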
The rollout of audio capabilities is scheduled for a limited alpha phase, available to ChatGPT Plus and API users in the coming weeks.
From the announcement: