Movies often provide early insights into technological advances that have a good chance of becoming reality. A Trip to the Moon, inspired by Jules Verne’s literary works, depicted space travel as early as 1902. 2001: A Space Odyssey, released in 1968, introduced the concept of an advanced supercomputer with artificial intelligence capable of reasoning and communicating with humans in natural language.
More recently, in 2013, Joaquin Phoenix starred in Her as Theodore Twombly. The movie, written and directed by Spike Jonze, tells the story of a lonely man with hardly any social life who begins to interact with a virtual assistant named Samantha. She possesses several unusual characteristics for a machine, such as a good sense of humor, empathy, desire, and a growing need for self-discovery. Twombly ends up falling in love with her.
When Her was released, Siri was the closest thing we had to an AI voice assistant. Apple’s ads presented the feature as innovative and intuitive. At the time, we saw Samuel L. Jackson using natural language to ask an iPhone 4S to find a nearby store that sold organic mushrooms or to tell him how many ounces are in a cup. This technology promised to make our lives easier, but it didn’t.
People quickly realized that speaking to Siri or any other voice assistant in natural language was nearly impossible. The key to using them was to memorize a series of commands and pronounce them exactly as the system expected. Some people hoped this would improve as the technology evolved, but others were less optimistic in the short term. A decade later, things hadn’t changed much. Until now.
When Sci-Fi Starts to Become Reality
We currently use the voice assistants built into our phones to do things like play music and set timers, but little else. AI-powered devices such as the Rabbit R1 and the Humane AI Pin, which their creators claim have a lot to offer, are still in their early stages. However, OpenAI has recently demonstrated something that may reignite the hopes of those seeking a voice assistant that goes beyond simply being a virtual companion.
ChatGPT has had a conversation mode for quite some time, allowing us to talk to the chatbot by voice. However, this option has several drawbacks. The speech synthesis can feel too artificial, and latencies of between 2.8 and 5.4 seconds hinder smooth interaction, largely because the feature chains together separate models for transcription, text generation, and speech synthesis. OpenAI aims to overcome these limitations with its new model.
In the coming weeks, OpenAI will update ChatGPT to GPT-4o (the “o” stands for “omni,” a nod to the fact that it handles all modalities). We’re talking about a large language model that, unlike previous versions, has been trained end to end across text, vision, and audio. Presumably, we’re also looking at a Mixture of Experts (MoE) type model, which aims for efficiency without sacrificing capabilities. GPT-4o responds to audio with an average latency of 320 milliseconds.
With this recent development, we’re looking at a very different ChatGPT than the one that was unveiled to the world in November 2022. According to OpenAI, the chatbot with GPT-4o performs on par with GPT-4 Turbo in text intelligence, reasoning, and coding. It boasts a variety of human-like traits, such as conversing naturally, laughing, singing, recognizing images, and even picking up on the user’s sense of humor. In addition, it can interact in more than 50 languages.
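GPT-4o isn’t limited to the ChatGPT app, either: OpenAI also makes it available to developers through its API. As a rough sketch of what the text-plus-vision input described above looks like in practice, here’s what a request using the official Python library might look like. The prompt and image URL below are placeholders, and the real-time audio features go beyond what a simple call like this covers.

```python
# Minimal sketch of a multimodal (text + image) request to GPT-4o using
# OpenAI's official Python SDK. The prompt and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Plain-text part of the message
                {"type": "text", "text": "Describe what you see in this photo."},
                # Image part (the API also accepts base64-encoded images)
                {"type": "image_url", "image_url": {"url": "https://example.com/office.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```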
We’re rapidly approaching what Jonze proposed in Her. Or at least, that’s what it looked like after OpenAI’s live demonstrations on Monday. In one of the videos, one of the OpenAI team members holds his iPhone with the ChatGPT app open. “Hey, how’s it going?” he asks. ChatGPT replies in a female voice and describes quite accurately what it’s seeing, demonstrating its vision capabilities.
“I see you’re rocking an OpenAI hoodie. Nice choice,” the voice assistant adds. Then, something catches its eye (if that’s what we can call it), and it proceeds to ask “what’s up with the ceiling” and whether the user it’s interacting with is “in a cold industrial-style office or something.” The user then asks ChatGPT to guess what he’s doing there. “From what I can see, it looks like you’re in some kind of recording or production setup, with those lights, tripods, and possibly a mic. It seems like you might be getting up to shoot a video, or maybe even a livestream.”
The OpenAI team member responds that he’s making a new announcement. This appears to intrigue the AI, which starts speculating on the details. “Is this announcement related to OpenAI, perhaps?” it asks. “What if I was to say that you’re related to the announcement?” the young man replies. “Me? The announcement is about me?” the system asks, sounding surprised by what it’s just heard. The conversation is really interesting, especially considering that we’re talking to a multimodal AI model.
That’s not all. OpenAI President Greg Brockman went on to demonstrate two AIs interacting and even singing, showcasing the capabilities of GPT-4o. In the presentation, Brockman explains that one AI model will be able to communicate with another using natural language: one is equipped with a camera to perceive the world and describe it, while the other can ask questions but can’t see. When the first AI hears what’s about to happen, it replies, “Well, well, well, just when I thought things couldn’t get any more interesting.”
Then Brockman switches to the other phone.
“There’s going to be another AI who’s going to talk to you. This AI is not going to be able to see anything but can ask you questions. […] Your job is just to be helpful. Just be as punchy, direct, describe everything, do whatever that AI asks.” Moments later, the AI models engage in a conversation, which you can see in the video above. At some point, Brockman asks the AIs to sing a song about what they’ve just seen, completing each other’s lines.
OpenAI’s latest model can be used in a multitude of situations. GPT-4o is designed to detect sarcasm, solve mathematical problems, provide instant translation, and more. This advancement in AI technology brings machines closer to human-like capabilities, something we once thought was pure sci-fi, and marks a significant breakthrough in the field. Once again, OpenAI appears to be leading the way in AI development.
Monday’s livestream included several other announcements. First, the gradual rollout of GPT-4o to all ChatGPT users has just begun. Users of the paid versions will have higher usage limits. It’s presumed that GPT-3.5 and GPT-4 will still be available, and users will be able to switch between models. The new voice system, however, will be exclusive to the paid versions and will arrive in alpha status in the coming weeks.
The company also announced a new desktop app for ChatGPT, which will initially only be available on macOS. This app will allow users to launch the chatbot at any time and ask it to use its vision capabilities to gather information from the screen. Additionally, users can invite ChatGPT to join a video conference and interact with the participants.
Rumors suggest that Apple has finalized an agreement with OpenAI, led by CEO Sam Altman, to integrate the company’s technology into some iOS 18 features. Could this technology enhance the iPhone's voice assistant? Perhaps we’ll find out more at WWDC on June 10.
GPT-4o isn’t quite at the level of Her’s Samantha yet, which was able to carry out a variety of tasks for users. In the movie, Samantha could make phone calls, check emails, organize files, and even book an Uber. While features like this would be useful, they would also raise significant security and privacy concerns.
Image | Warner Bros. Pictures | OpenAI
Related | Spies Also Wanted to Use ChatGPT. Microsoft Created One for Them