OpenAI upends the world: GPT-4o's real-time voice and video interaction stunned the audience and stepped straight into the science-fiction era

So shocking!

While other technology companies are still catching up on the multimodal capabilities of large models, stuffing text summarization, photo retouching, and similar features into phones, OpenAI, already far ahead, dropped a bombshell: a product that left even its own CEO Sam Altman marveling that it is just like in the movies.

In the early morning of May 14, at its first "Spring Update" event, OpenAI launched its new flagship model GPT-4o and a desktop app, and demonstrated a series of new capabilities. This time the technology itself upended the product form, and OpenAI taught technology companies around the world a lesson by example.

The host was Mira Murati, OpenAI's chief technology officer. She said she would mainly cover three things:

  • First, going forward, OpenAI's products will be made freely available first, so that more people can use them.
  • Second, OpenAI has released a desktop version of the program and an updated UI that is easier and more natural to use.
  • Third, after GPT-4, a new version of the large model came, named GPT-4o. What’s special about GPT-4o is that it brings GPT-4 level intelligence to everyone, including free users, in an extremely natural interactive way.

After this update, ChatGPT's underlying model can accept any combination of text, audio, and images as input and generate any combination of text, audio, and image output in real time. This is what future interaction looks like.

ChatGPT recently became usable without registration, and today a desktop app has been added. OpenAI's goal is to let people use it seamlessly, anytime and anywhere, weaving ChatGPT into your workflow. This AI is now genuine productivity.

GPT-4o is a new large model built for the future paradigm of human-computer interaction. It understands three modalities, text, voice, and images, responds very quickly, conveys emotion, and feels remarkably human.

On stage, OpenAI engineers took out an iPhone to demonstrate the new model's major capabilities, most importantly real-time voice conversation. Mark Chen said: "It's my first time at a live launch event, so I'm a little nervous." ChatGPT suggested he take a deep breath.

“Okay, I’ll take a deep breath.”

ChatGPT immediately replied: "Whoa, slow down, you're breathing way too hard."

If you've used a voice assistant like Siri, you'll notice the difference immediately. First, you can interrupt the AI at any time and keep the conversation going without waiting for it to finish. Second, there is no waiting: the model responds extremely quickly, at roughly human speed. Third, the model fully understands human emotion and can express plenty of emotion itself.

Next came visual ability. Another engineer wrote an equation on paper, and ChatGPT was asked not to give the answer outright but to walk him through solving it step by step. It shows real potential as a tutor.

ChatGPT said: whenever you're struggling with math, I'll be right there with you.

Next came GPT-4o's coding capabilities. With some code on screen, the demonstrators opened the desktop version of ChatGPT and interacted with it by voice, asking it to explain what the code is for and what a particular function does. ChatGPT answered fluently.

Running the code produced a temperature chart, and ChatGPT was then asked to respond, in a sentence, to any question about the chart.

It could answer which month was the hottest and whether the Y-axis was in degrees Celsius or Fahrenheit.

OpenAI also responded in real time to questions posted by netizens on X/Twitter. For example, real-time voice translation: the phone can be used as an interpreter, translating back and forth between Spanish and English.

Someone else asked whether ChatGPT can recognize your facial expressions. It seems GPT-4o is already capable of real-time video understanding.

Next, let us take a closer look at the blockbuster OpenAI dropped today.

Universal model GPT-4o

The first thing introduced was GPT-4o, where the "o" stands for "omni."

For the first time, OpenAI has integrated all of these modalities into a single model, greatly improving the practicality of large models. OpenAI CTO Mira Murati said that GPT-4o provides "GPT-4 level" intelligence while improving on GPT-4's text, visual, and audio capabilities, and that it will be rolled out "iteratively" across the company's products over the next few weeks.

"GPT-4o reasons across voice, text, and vision," said Mira Murati. "We know these models are getting more and more complex, but we want the interaction experience to become more natural and simpler, so that you don't have to pay attention to the user interface at all and can focus entirely on collaborating with GPT."

GPT-4o's performance on English text and code matches GPT-4 Turbo, but it improves significantly on non-English text, and the API is faster and 50% cheaper. Compared with existing models, GPT-4o is particularly strong at visual and audio understanding.
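
On the API side, a single chat completions request can already mix text and image input. Below is a minimal sketch, assuming the OpenAI Python SDK and the public "gpt-4o" model name; the image URL is only a placeholder, not part of the demo:

```python
# Minimal sketch: calling GPT-4o via the chat completions API with mixed
# text + image input. Assumes OPENAI_API_KEY is set in the environment;
# the image URL below is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which month looks hottest in this chart?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/temperature-chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```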

It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, comparable to human response times in conversation. Before GPT-4o, ChatGPT's voice mode had an average latency of 2.8 seconds (GPT-3.5) or 5.4 seconds (GPT-4).

That earlier voice mode was a pipeline of three separate models: a simple model transcribed audio to text, GPT-3.5 or GPT-4 took text in and put text out, and a third simple model converted that text back to audio. OpenAI found that this approach lost a great deal of information: the model could not directly observe pitch, multiple speakers, or background noise, and it could not output laughter, singing, or expressions of emotion.
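
To make the contrast concrete, here is a rough sketch of what such a cascaded pipeline looks like in code. This is only an illustration, using the OpenAI Python SDK's Whisper, chat, and text-to-speech endpoints as stand-ins; the actual models OpenAI chained together internally are not public:

```python
# Illustrative sketch of the old cascaded voice pipeline: ASR -> LLM -> TTS.
# Each stage passes only plain text forward, so pitch, speaker identity,
# background noise, laughter, and tone are lost between stages.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Transcribe the user's audio to text (tone and emotion are discarded here).
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Generate a text reply from the transcript alone.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = chat.choices[0].message.content

# 3. Synthesize speech from the reply text (expressiveness is bolted on afterwards).
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
with open("reply.mp3", "wb") as f:
    f.write(speech.content)
```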

On GPT-4o, OpenAI trained a new model end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network.

"From a technical perspective, OpenAI has found a way to map audio directly to audio as a first-class modality and to stream video to the transformer in real time. These require some new research on tokenization and architecture, but overall it is a matter of data and systems optimization (as most things are)," commented Jim Fan, a scientist at NVIDIA.

GPT-4o enables real-time reasoning across text, audio, and video, which is an important step toward more natural human-machine interaction (and even human-machine-machine interaction).

OpenAI President Greg Brockman also got in on the fun online, not only having two GPT-4o instances talk to each other in real time but also getting them to improvise a song. The melody was, shall we say, "moving," but the lyrics covered the decor of the room, what the people were wearing, and the little interruptions that happened along the way. Beyond that, GPT-4o understands and generates images far better than any existing model, making many previously impossible tasks "easy."

For example, you can ask it to render the OpenAI logo onto a coaster:

After this round of technical work, OpenAI should by now have solved the long-standing problem of rendering text in ChatGPT-generated images. GPT-4o can also generate 3D visual content and perform 3D reconstruction from six generated images:

Here’s a poem that GPT-4o can format in handwriting style:

More complex layout styles can also be handled:

With GPT-4o, a few paragraphs of text are enough to get a set of continuous comic storyboards:

The following tricks should surprise many designers:

This is a stylized poster evolved from two life photos:

There are also some niche functions, such as “Text to WordArt”:

GPT-4o performance evaluation results

Members of the OpenAI technical team noted on X that on the more difficult prompt sets, especially for coding, GPT-4o's improvement over OpenAI's previous best model is particularly significant.

Specifically, across multiple benchmarks, GPT-4o achieves GPT-4 Turbo-level performance in text, reasoning, and coding intelligence, while achieving new highs in multilingual, audio, and visual capabilities.

  • Reasoning improvements: GPT-4o set a new high score of 87.2% on 5-shot MMLU (general-knowledge questions). (Note: Llama 3 400B is still in training.)
  • Audio ASR performance: GPT-4o significantly improves speech recognition over Whisper-v3 across all languages, especially low-resource languages.
  • Speech translation: GPT-4o sets a new state of the art and outperforms Whisper-v3 on the MLS benchmark.
  • M3Exam: a benchmark that is both multilingual and visual, consisting of standardized-test multiple-choice questions from multiple countries, including figures and charts. GPT-4o is stronger than GPT-4 across all of its language subsets.

In the future, improvements in model capabilities will enable more natural, real-time voice conversations and the ability to talk to ChatGPT via real-time video. For example, a user can show ChatGPT a live sports match and ask it to explain the rules.

ChatGPT users will get more advanced features for free

More than 100 million people use ChatGPT every week, and OpenAI says GPT-4o’s text and image capabilities are starting to roll out in ChatGPT for free today, with up to 5x the message limit available to Plus users.

Opening ChatGPT now, we find that GPT-4o is already available.

When using GPT-4o, ChatGPT free users now have access to the following features:

  • GPT-4 level intelligence;
  • responses that draw on both the model and the web;
  • data analysis and chart creation;
  • conversations about photos you take;
  • file uploads for help with summarizing, writing, or analysis;
  • discovering and using GPTs and the GPT Store;
  • the memory feature, for a more helpful, personalized experience.

However, the number of messages free users can send with GPT-4o is limited based on usage and demand. When the limit is reached, ChatGPT automatically switches to GPT-3.5 so the conversation can continue. In addition, OpenAI will roll out a new alpha version of voice mode with GPT-4o to ChatGPT Plus in the coming weeks, and will open up GPT-4o's new audio and video capabilities via the API to a small group of trusted partners.

Of course, testing and iteration have shown that GPT-4o still has limitations in every modality, and OpenAI says it is working to improve on these imperfections.

Predictably, opening up GPT-4o's audio capabilities will bring a variety of new risks. On safety, GPT-4o has protections built into its cross-modal design through techniques such as filtering training data and refining the model's behavior after training, and OpenAI has also created a new safety system to safeguard speech outputs.

New desktop app streamlines user workflow

For free and paid users, OpenAI is also launching a new ChatGPT desktop app for macOS. Users can instantly ask ChatGPT questions with a simple keyboard shortcut (Option + Space), plus they can take screenshots and discuss them directly within the app.

Users can also now have voice conversations with ChatGPT directly from their computer by clicking the headphone icon in the lower-right corner of the desktop app; GPT-4o's audio and video capabilities will come to the app in the future.

OpenAI is rolling out the macOS app to Plus users starting today, and will make it more widely available in the coming weeks. In addition, OpenAI will launch a Windows version later this year.

“Her” is coming

Although Sam Altman did not appear at the event, he published a blog post afterward and posted a single word on X: her. This is obviously an allusion to the classic science-fiction movie "Her," and it was also the first image that came to mind while watching this presentation.

Samantha in "Her" is not just a product; she understands humans better than humans do and is, in a sense, more human than we are. Talking with her, you can genuinely start to forget that she is an AI.

This means that the human-computer interaction model may usher in a truly revolutionary update after the graphical interface, as Sam Altman said in his blog:

"The new voice (and video) mode is the best computer interface I've ever used. It feels like the AI in the movies, and I'm still a little surprised it's real. Getting to human-level response times and expressiveness turns out to be a big change."

The earlier ChatGPT gave us a first glimpse of natural user interfaces: simplicity above all else, because complexity is the enemy of a natural interface. Every interaction should be self-explanatory, requiring no instruction manual.

But the GPT-4o released today is something else entirely. It is nearly latency-free, smart, fun, and practical. Our interactions with computers have never felt this natural and fluid.

Huge possibilities still lie hidden here. Once more personalized features and collaboration with different devices are supported, we will be able to use phones, computers, smart glasses, and other computing terminals to do many things that were not possible before.

AI hardware will no longer be a game of piling up components. What is more exciting now is that if Apple officially announces its partnership with OpenAI at WWDC next month, the iPhone experience may improve more than it has from any launch in recent years.

NVIDIA senior scientist Jim Fan believes that cooperation between OpenAI and iOS 18, billed as the biggest update in history, could happen at three levels:

  • Ditch Siri. OpenAI distills a small GPT-4o for iOS that runs purely on-device, with the option of a paid upgrade to cloud services.
  • Native functionality that feeds camera or screen streams into the model, with chip-level support for neural audio and video codecs.
  • Integration with the iOS system-level action API and smart-home APIs. No one uses Siri Shortcuts, but it is time for a renaissance. This could become an AI agent product with a billion users right out of the gate, a Tesla-scale, full-blown data flywheel for smartphones.

Altman: You open source, we make it free

After the launch, OpenAI CEO Sam Altman published a blog post, his first in a long while, about the work behind GPT-4o: In our release today, I want to emphasize two things.

First, a key part of our mission is to make powerful AI tools available to people for free (or at a reduced price). I’m very proud to announce that we offer the best models in the world in ChatGPT for free, without ads or anything like that.

When we founded OpenAI, our original vision was: We were going to create artificial intelligence and use it to create a variety of benefits for the world. Now things have changed and it looks like we will create artificial intelligence and then other people will use it to create all kinds of amazing things and we will all benefit from it.

Of course, we are a business and will invent a lot of things for a fee that will help us deliver free, great AI services to billions of people (hopefully).

Second, the new voice and video modes are the best computing interfaces I’ve ever used. It feels like an AI in a movie, and I’m still a little surprised that it’s actually real. It turns out that reaching human-level response times and expressiveness is a giant leap.

The original ChatGPT hinted at the possibilities of a language interface, but this new thing (version GPT-4o) feels fundamentally different – it’s fast, smart, fun, natural, and helpful.

Talking to a computer has never felt really natural to me; now it does. As we add (optional) personalization, access to your personal information, the ability to have the AI take actions on your behalf, and more, I can really see an exciting future in which we can use computers to do far more than ever before.

Finally, a huge thank you to the team for working so hard to make this happen!

It is worth mentioning that Altman said in an interview last week that although universal basic income is hard to achieve, we could achieve free "universal basic compute." In the future, everyone would get free access to a share of GPT's compute, which they could use, resell, or donate.

"The idea is that as AI becomes more advanced and embedded in every aspect of our lives, owning a unit of a large language model like GPT-7 may be more valuable than money; you own a share of the productivity," Altman explained.

The release of GPT-4o may be the beginning of OpenAI’s efforts in this regard.

Yes, this is just the beginning.

Last but not least, the "Guessing May 13th's announcement." video posted on the OpenAI blog today collides almost head-on with the warm-up for Google's I/O conference tomorrow, an unmistakable jab at Google. One wonders how much pressure Google is feeling after seeing today's OpenAI release.

Reference content:
https://openai.com/index/hello-gpt-4o/
https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/
https://blog.samaltman.com/gpt-4o
https://www.businessinsider.com/openai-sam-altman-universal-basic-income-idea-compute-gpt-7-2024-5
